Facial Expression Recognition with High Response-Based Local Directional Pattern (HR-LDP) Network

2024-03-13 13:21 Sherly Alphonse and Harshit Verma
Computers, Materials & Continua, 2024, Issue 2

Sherly Alphonse and Harshit Verma

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India

ABSTRACT Although much research has been done on recognizing facial expressions, there is still a need to increase the accuracy of facial expression recognition, particularly under uncontrolled conditions. Local Directional Patterns (LDP), which have good characteristics for emotion detection, have yielded encouraging results. In the proposed work, an end-to-end learnable High Response-based Local Directional Pattern (HR-LDP) network for facial emotion recognition is implemented by employing fixed convolutional filters. By combining learnable convolutional layers with fixed-parameter HR-LDP layers made up of eight Kirsch filters and differentiable simulated gate functions, this network considerably reduces the number of network parameters. The parameter cost of our fully connected layers is up to 64 times lower than that of currently used deep learning-based detection algorithms. On six well-known databases, JAFFE, CK+, MMI, SFEW, OULU-CASIA and MUG, the recognition rates for seven-class facial expression recognition are 99.36%, 99.2%, 97.8%, 60.4%, 91.1% and 90.1%, respectively. The results demonstrate the advantage of the proposed work over cutting-edge techniques.

KEYWORDS Emotion; classification; CNN; network; HR-LDP

1 Introduction

Human Computer Interaction (HCI) primarily consists of the study of interface design, with its applications concentrating on user-computer interaction. Since computers are used in almost every area of daily life, HCI applications are found in every industry, including social science, psychology, science, industrial engineering for computers, and many more. Facial Expression Recognition (FER) is a crucial field of research in pattern recognition, computer vision, and artificial intelligence, with profound implications for applications such as human-computer interaction, emotion-aware computing, and affective computing. Automatic emotion recognition from facial expressions has been used in healthcare, social networks, and human-machine interaction, among other domains. To improve computer prediction, researchers in this discipline are working on methods to decode, analyze, and extract these characteristics from facial expressions. The remarkable success of this technology has led to the use of numerous deep-learning architectures to boost performance [1].

Emotions have a natural influence on human behavior and are important in shaping communication and behavior patterns. Accurately analyzing and interpreting the emotional content of facial expressions is essential for a deeper understanding of human behavior. Computer systems still struggle to accurately identify facial expressions, even though a person needs little to no effort to recognize faces and decipher facial emotions. Analyzing a person’s facial features and determining their emotional state are considered very difficult tasks. The main obstacles are the irregularities of the human face and variations in elements such as direction, lighting, shadows, and facial posture. Research has also indicated that different individuals can identify distinct emotional states within an identical facial expression. FER involves many further hurdles, including the need for diverse training data with pictures featuring a range of ethnicities, genders, and nationalities, among others.

Deep learning methods have been researched as a stream of techniques to achieve resilience and provide the required scalability on new forms of data [2]. To recognize facial expressions under uncontrolled conditions, a classification model is needed that is both sensitive to minute differences in the appearance of facial emotions and resilient to larger variations. A variety of pre-trained deep neural networks can be used for recognizing facial expressions. These networks have a great number of learnable parameters, yet they were trained on and used for quite different applications, which makes it challenging to train them accurately for facial emotion recognition. To overcome this, a large neural network trained on extensive facial emotion recognition datasets is chosen in this study and is later used to train a small neural network, which has fewer parameters to learn than the large network. The suggested network is then built using its convolutional layers, and the complete structure is trained with facial expression images.

The main objective is to construct neural network models that accept input images in the right format and produce an output that can be mapped to an emotion class. After the model is built, testing and troubleshooting are carried out to maximize accuracy, and various metrics are used to cross-examine the efficiency and correctness of the model. Another major aim is to eliminate problems present in the datasets, such as cross-oriented images, wrong facial position, and alignment issues. This must be addressed because disoriented images lead to bad predictions due to unnecessary parallax error and wrong orientation. The next issue is edge detection, which is performed to enhance the facial features and boost the regions where emotion is displayed, such as the mouth, eyebrows, eyes, and even the nose. In the proposed work, the alignment problems are rectified using the face detection and alignment method “Chehra”. The proposed High Response-based Local Directional Pattern (HR-LDP) based classification method also uses the Kirsch filter, which eliminates the noise in images and accurately captures the sharp edges that represent the structure of the face. The major contributions of the proposed work are as follows:

• A novel HR-LDP network-based classification is proposed in this work with a module for eliminating noise using high responses obtained from Kirsch filters that reduce the computation while increasing accuracy.

• The proposed work suggests a novel learnable HR-LDP network that reduces the number of learnable parameters compared to the existing works.

• Compared to existing deep learning-based detection algorithms, the parameters in our fully connected layers cost up to 64 times less, while the network outperforms state-of-the-art techniques.

The paper is structured as follows: The state-of-the-art techniques for facial emotion recognition are reviewed in Section 2. The suggested learnable HR-LDP network is presented in Section 3. The experimental setting is described and the detection results are presented and analyzed in Section 4. Finally, the paper is concluded in Section 5 with directions for future research.

2 Related Works

This section presents a detailed survey of the existing works. Table 1 gives a summary of the latest works in the literature. In [3], the authors used the publicly available Cohn Kanade (CK+) dataset. They passed it through four different Convolutional Neural Networks (CNNs) that implement transfer learning: VGG-19, ResNet-50, MobileNet and Inception V3. After image pre-processing and feature extraction, they compared the performance of the four networks with one another. Reference [4] suggested a novel technique called Facial Emotion Recognition using Convolutional neural networks (FERC). FERC is a two-part CNN, with one part removing the background of the image and the other classifying the face into one of five emotions. They tested the algorithm with the CK, Caltech, CMU and NIST datasets. In [5], the authors used a deep CNN with two layers, each followed by dropout. The output of each layer is passed through an activation function and then to a pooling layer, and the same is repeated in the next layer. The final dense layer has five units, one for each emotion.

Table 1: Summary of the literature survey with the algorithms

In [6], the authors developed a FER system and verified it on eight different pre-trained deep CNN models with the Karolinska Directed Emotional Faces (KDEF) and Japanese Female Facial Expression (JAFFE) datasets. With 10-fold cross-validation, the best model uses DenseNet-161. CNN algorithms [7] are used by several works in the literature that have shown superior performance. Among these, the authors in [12] proposed a CNN-based single classifier that achieved high performance and also performed the necessary pre-processing. The model has two convolution layers, two sub-sampling layers and an output layer. They also used a max-pooling and flattening layer, with SoftMax as the final activation function, and obtained an accuracy of 97.6%. Reference [15] also did the necessary pre-processing by taking the mean shape and mapping the dataset according to closeness to the mean shape. Notably, the authors in [16,17] conducted a comprehensive review focused on CNNs for FER. Their study explored various CNN architectures and methodologies, showcasing their effectiveness in capturing spatial hierarchies within facial images. The studies in [18] and [19] have significantly transformed FER. These works highlight the proficiency of CNNs in capturing spatial hierarchies and achieving impressive performance, along with the critical contributions of data augmentation and feature extraction in improving FER accuracy and robustness.

Despite the remarkable strides made in FER, the field continues to grapple with substantial challenges and limitations that warrant thorough exploration. While FER algorithms [20–24] have shown proficiency in identifying basic emotions, the recognition of nuanced and subtle facial expressions remains an ongoing research frontier. The intricate interplay of various facial muscles and features, especially in complex emotional states, poses a significant challenge for current models. Inside a neural network, different combinations of layers can accomplish a task with high accuracy. This work proposes a novel HR-LDP network-based classification that attains good accuracy on six datasets while learning a smaller number of parameters. The proposed work is explained in Section 3.

3 The Proposed Work

The architecture of the suggested work is shown in Fig. 1. The three main elements of this network are the convolutional layers, a fully connected layer for HR-LDP computation, and another fully connected layer tied to a loss function. These correspond to the three modules in Fig. 1: the convolutional layer, the HR-LDP layer and the loss function layer. The network creates expression-related feature maps by applying convolutional layers to the input image. The convolutional and HR-LDP layers are used to extract simulated HR-LDP features, and the loss function module is used to train the parameters of the network. A classification layer is also used at the end to predict the emotions using a classification algorithm such as SVM. The main elements of the suggested neural network are described in this section.

3.1 Convolution Layer

The faces in the sample images from each dataset are detected and aligned using the ‘Chehra’ [20] face detection and alignment tool, which solves the alignment problems. The convolutional feature maps for the original images are created by forward-propagating the raw pixels through the initial module. More precisely, the initial module contains two convolution layers, one pooling layer, and one Rectified Linear Unit (ReLU) layer. Before the ReLU layer, a batch normalization (BN) layer is used, which also reduces the effect of filter parameter initialization. This is depicted as

$$\mathrm{BN}(I) = \gamma \, \frac{I - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \qquad (1)$$

where $I$ is the BN layer’s input, and $\mu$ and $\sigma^{2}$ are the mean and variance of $I$, respectively. Here, $\gamma$ and $\beta$ are scale and shift factors, respectively, while a constant $\epsilon$ is added to the variance for numerical stability.
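The batch normalization step described above can be illustrated with a minimal Python sketch. It assumes the standard formulation (subtract the mean, divide by the square root of the variance plus a small constant, then scale and shift); the function name `batch_norm` and the default values are illustrative and are not taken from the paper.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations, then scale by gamma and shift by beta.

    gamma, beta and eps play the roles of the scale factor, shift factor and
    stability constant; the defaults here are illustrative assumptions.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Activations with arbitrary offset and spread end up centred near zero
# with a variance close to one.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

In a trained network the scale and shift factors are learned, and the statistics are computed per channel over a mini-batch rather than over a single list.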

Figure 1: The architecture of the proposed facial expression recognition network

3.2 HR-LDP Layer

The HR-LDP layer performs convolution using Kirsch masks [24] and extracts only the high responses, which carry shape and texture information. These responses are normalized using the sigmoid function, and the histograms are then extracted using gate functions, as described in the subsequent sections.

3.2.1 Convolution Using Kirsch Filter Masks

The Kirsch masks in Fig. 2 are applied to the output of the convolution layer, yielding eight directional responses, on which max pooling is then applied.
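As a rough illustration of the Kirsch filtering step, the sketch below builds the eight standard 3×3 Kirsch compass masks by rotating the border values of one base mask, and computes the eight directional responses at each interior pixel. Keeping the maximum of the eight responses per pixel is an assumption about how the "high response" is selected, and the function names are hypothetical. The masks are applied as a correlation (no kernel flip), as is common in deep learning frameworks.

```python
# Border positions of a 3x3 mask, listed clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def kirsch_masks():
    """Return the eight Kirsch compass masks as 3x3 nested lists."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]
    masks = []
    for k in range(8):
        mask = [[0] * 3 for _ in range(3)]
        for idx, (r, c) in enumerate(RING):
            mask[r][c] = base[(idx - k) % 8]  # rotate the border by k steps
        masks.append(mask)
    return masks


def kirsch_responses(image, r, c):
    """Eight directional responses at interior pixel (r, c)."""
    return [
        sum(mask[i][j] * image[r - 1 + i][c - 1 + j]
            for i in range(3) for j in range(3))
        for mask in kirsch_masks()
    ]


def high_response_map(image):
    """Keep the strongest of the eight responses at every interior pixel."""
    rows, cols = len(image), len(image[0])
    return [[max(kirsch_responses(image, r, c)) for c in range(1, cols - 1)]
            for r in range(1, rows - 1)]


# A bright top row over a dark background gives a strong response.
edge = [[10, 10, 10], [0, 0, 0], [0, 0, 0]]
print(high_response_map(edge))  # [[150]]
```

Because every Kirsch mask sums to zero, a uniform region produces a zero response, so only edges and texture survive this step.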

Figure 2: Kirsch mask

Here, $\sigma_{\mathrm{argmax}}$ is obtained by max pooling. A MaxPool2D layer is created with pool_size=2 and strides=2. Applying the MaxPool2D layer to the matrix yields the max-pooled output in tensor form: the layer iteratively computes the maximum of each 2×2 window, moving with a stride of 2. The values are then normalized using the sigmoid function and passed to the gate functions for histogram formation.
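The pooling and normalization described above can be sketched in plain Python. This is not the MaxPool2D layer itself, only an equivalent computation for a single 2-D matrix with a 2×2 window and a stride of 2; the helper names are illustrative.

```python
import math


def max_pool_2x2(matrix):
    """Maximum of each non-overlapping 2x2 window (stride 2, no padding)."""
    rows, cols = len(matrix), len(matrix[0])
    return [[max(matrix[r][c], matrix[r][c + 1],
                 matrix[r + 1][c], matrix[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


def sigmoid(x):
    """Numerically stable logistic function mapping any real value to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


responses = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
pooled = max_pool_2x2(responses)
print(pooled)  # [[6, 8], [14, 16]]

# Squash the pooled values into (0, 1) before histogram formation.
normalized = [[sigmoid(v) for v in row] for row in pooled]
```

With an odd number of rows or columns, the last row or column is dropped, which matches the behavior of pooling without padding.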

3.2.2 Histogram Calculation

A histogram shows the probability distribution of a quantity over different bins. Various appearance-based feature extraction techniques have been developed that process the image using either handcrafted or learnable filters and then use a histogram to calculate statistical data. A CNN can be thought of as a collection of learnable filters whose feature maps are generated at the output of the convolutional layers. Normally, the feature maps are first flattened and then fed to a fully connected layer. A simple method for constructing the histograms of feature maps involves applying specific shifted step activation functions to the obtained feature maps and then aggregating each result as one bin of the histogram. However, step functions are unsuitable for gradient-based learning, since the derivative of a step function is infinite at its edges and zero everywhere else. A gate function is therefore used, which determines the variable’s histogram in the range [0,1].

where n denotes the histogram’s number of bins. The gradient of Eq. (3) in the backpropagation stage is taken to be 2n during 0

Here, the $i$th bin of the histogram $H$ is denoted $H_i$, FM is the current feature map from which the histogram is calculated, $E$ is the number of elements in FM, $m$ is the number of histogram bins, and $f$ is the gate activation function mentioned in Eq. (4). In the suggested CNN, this computation is executed with average pooling operators. For histogram calculation with the gate function, the input variable is expected to fall between 0 and 1. However, this presumption might not apply to feature maps. Consequently, the input of the gate activation function needs to be normalized to [0,1], for which the sigmoid function can be utilized. Nevertheless, the sigmoid saturates at very large or very small values. To solve this issue, batch normalization is applied before the sigmoid function.

Figure 3: The feature map and gate function

Figure 4: Gate function

As in Fig. 1, the feature values for the histogram computation layer are first constrained using batch normalization to prevent sigmoid function saturation. The values are then normalized to [0,1] using a sigmoid activation function. The output of the sigmoid function is then shifted n times. Finally, the n gate activation functions are applied and the n-bin histograms are calculated via average pooling. The computed histograms provide feature-map-specific statistical data for the input image. Convolutional neural networks can employ this feature map histogram computation approach without disrupting the learning process. The generated histograms are then fed into the fully connected layer of the proposed network, which is explained in the following section.
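The forward pass of this histogram layer can be sketched as follows. The exact gate function of the paper is not available in this text, so the sketch assumes a rectangular gate that equals 1 on [0, 1/n) and 0 elsewhere, shifted by i/n for the i-th bin; averaging the gate outputs over all feature-map elements plays the role of average pooling. The surrogate gradient used during backpropagation is not modeled here, and all names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping any real value to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate(x, n_bins):
    """Assumed rectangular gate: 1 on [0, 1/n_bins), 0 elsewhere."""
    return 1.0 if 0.0 <= x < 1.0 / n_bins else 0.0


def gate_histogram(feature_map, n_bins):
    """n-bin histogram of a 2-D feature map via shifted gates and averaging."""
    values = [sigmoid(v) for row in feature_map for v in row]
    histogram = []
    for i in range(n_bins):
        shift = i / n_bins  # the i-th gate covers [i/n, (i+1)/n)
        histogram.append(sum(gate(v - shift, n_bins) for v in values) / len(values))
    return histogram


feature_map = [[0.0, 0.0], [-10.0, 10.0]]
print(gate_histogram(feature_map, 4))  # [0.25, 0.0, 0.5, 0.25]
```

Each bin holds the fraction of feature-map elements whose normalized value falls in that bin, so for inputs that do not saturate the sigmoid the bins sum to 1.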

3.2.3 Loss Function

The widely used SoftMax loss function is utilized, as in Eq. (6), to quantify the classification error after the HR-LDP features are extracted. The SoftMax loss function can optimize the likelihood of the correct class during the training stage and fine-tune the network parameters based on Back Propagation (BP). Here, $i$ is the training sample index and $n$ is the number of training samples. $Y = [Y_1, Y_2, Y_3, \ldots, Y_n]$ is the label set and $Y_i = [y_{i1}, y_{i2}, y_{i3}, \ldots, y_{iv}]$ is the prediction vector of the $i$th training sample. The predicted value is denoted by $y_{iv}$, and the number of classes is indicated by $v$. To combine the data on facial movement during testing, the HR-LDP features are extracted from a video sequence and averaged into a feature vector. The averaged features are then classified using a Support Vector Machine (SVM) classifier. Algorithm 1 describes the basic flow of the classification module.
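For illustration, a generic softmax (cross-entropy) loss over a batch of prediction vectors can be written as below. This is the textbook form, offered as a sketch under the assumption that the paper's loss follows it; the function name and the integer-label convention are illustrative.

```python
import math


def softmax_loss(batch_logits, labels):
    """Mean negative log-likelihood of the correct class over a batch.

    batch_logits: one list of v raw class scores per training sample.
    labels: index of the correct class for each sample.
    """
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        peak = max(logits)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[label]
    return total / len(labels)


# Seven classes with equal scores: the loss is log(7), i.e. pure chance.
print(softmax_loss([[0.0] * 7], [3]))

# A confident, correct prediction gives a much smaller loss.
print(softmax_loss([[0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0]], [3]))
```

Minimizing this quantity increases the predicted probability of the correct class, which is what backpropagation exploits when updating the learnable convolutional filters.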

The given testing sample is classified and the results are given in the next section.

4 Results and Discussion

The suggested approach uses Matlab 2018a for its experiments.

4.1 Datasets

The research makes use of six datasets: JAFFE [26], Cohn Kanade (CK+) [27], Oulu-CASIA NIR&VIS facial expression database (OULU-CASIA) [28,29], Man Machine Interface (MMI) [30], Multimedia Understanding Group (MUG) [31], and Static Facial Expressions in the Wild (SFEW) [32,33].

4.2 Experimental Analysis

High computational complexity is a significant limitation of state-of-the-art descriptors like Gabor, while the accuracy of the other feature descriptors in the literature is far lower, especially under unrestricted conditions. HR-LDP is therefore incorporated into the proposed model, which achieves high accuracy at low complexity. The SFEW dataset poses significant challenges because it was collected under unrestricted conditions. Tables 2–7 demonstrate the effectiveness of the suggested strategy by listing both the number of samples that were correctly identified and the number that were misclassified.

As the confusion matrix in Table 2 shows, the neutral and sad expressions are confused with other expressions when the images of the JAFFE dataset are classified with the suggested method. As seen in Table 3, the classification accuracy for the anger and neutral emotions in the CK+ dataset is lower than for the other emotions. Expressions like neutral, happiness, and surprise are confused with other emotions in the MUG dataset, as shown in Table 4. The fundamental issues with the SFEW dataset are that the samples of the various classes are imbalanced and that the photographs were taken in an unrestricted environment. Therefore, as seen in Table 5, more training data is required to increase accuracy. In comparison to the other datasets, the classification accuracy on the SFEW dataset is lower. Nevertheless, the suggested method outperforms the other current descriptors in the literature in terms of accuracy on SFEW, owing to its capacity to identify crisp edges and its scale- and rotation-invariant characteristics. Most other facial expressions can be mistaken for the disgusted face. As the confusion matrices in Tables 6 and 7 show, the fear and sadness expressions are confused with the rest of the expressions in the Oulu-CASIA and MMI datasets.
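The quantities discussed above (correctly and incorrectly classified samples per class) can be read from a confusion matrix as in the following sketch. The 3×3 matrix is made-up illustrative data, not a result from the paper, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """For each true class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Hypothetical three-class example (rows: true class, columns: predicted).
confusion = [[8, 1, 1],
             [2, 7, 1],
             [0, 0, 10]]
print(overall_accuracy(confusion))   # 25 of 30 samples are correct
print(per_class_recall(confusion))   # the second class is confused most often
```

Off-diagonal entries in a row show which other expressions a given class is mistaken for, which is how statements such as "fear is confused with the rest of the expressions" are derived.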

Table 2: Confusion matrix for the JAFFE dataset

Table 3: Confusion matrix for the CK+ dataset

Table 4: Confusion matrix for the MUG dataset

Table 5: Confusion matrix for the SFEW dataset

Table 6: Confusion matrix for the Oulu-CASIA dataset

Table 7: Confusion matrix for the MMI dataset

Figs. 5–10 compare the recognition outcomes. In comparison to more recent methods such as the inter-category distinction feature fusion network [34–38] and the ROI-guided deep architecture [39–42], the suggested method attains greater accuracy. The proposed approach performs better because there is less likelihood of overfitting [43–45], less data noise, improved discrimination, and improved data visualization. The feature extraction technique automatically selects only the relevant data needed for this task. This work suggested a new HR-LDP network to tackle the detection of emotions. The suggested network combines deep learning with handcrafted features, and it can minimize the network parameters by producing statistical histograms. Numerous tests on the databases produced intriguing findings. Furthermore, unlike the majority of modern techniques, the suggested approach produces reliable performance.

The VGG-face network [46] is chosen as the reference network for comparing the efficiency of our network in terms of time and memory consumption. Because VGG-face was designed for face recognition, it is fine-tuned for facial expression recognition. For a fair comparison, identical training data, training parameters, and loss function are employed in both the VGG-face network and our suggested HR-LDP network. The comparison experiments are conducted on an HP workstation running the Windows 10 Enterprise Edition operating system with Matlab 2018a, 64 GB of RAM, two Intel E5-2620 v3 CPUs, and one NVIDIA GeForce GTX 1080 Ti GPU. The results of the time and memory comparison are shown in Table 8. The table shows that, during training, our suggested network requires just 132 MB of memory, which is up to 25 times less than the VGG-face network. Furthermore, in training rounds, the proposed network outperforms the VGG-face network. Depending on the input size, the suggested network should be significantly faster than the VGG-face network because of its smaller size.

Figure 5: Classification accuracy of JAFFE dataset

Figure 6: Classification accuracy of CK+ dataset

Table 8: Comparison of mean time and cost of memory

Table 9 lists the parameters used. The Stochastic Gradient Descent (SGD) approach is used for optimization during the training phase, with a learning rate of 0.01 and a momentum of 0.9. There are 100 training epochs, and from the thirty-first to the last epoch, the learning rate decays by a factor of 0.99 in each epoch. A dropout rate of 0.5 is used to prevent over-fitting. The margin hyper-parameter is set to 0.2. The dimension of the feature vector obtained at the output of the histogram computation layer is 512 × 10 = 5120, since the number of histogram bins in HR-LDP is set to 10. In each trial, ten percent of the training data is selected at random and used for validation. Table 10 presents the results obtained using different classifiers in the proposed work. The proposed work achieves high accuracy when using SVM, deep learning techniques and CNN in the final classification layer of the proposed architecture. However, the SVM in the final layer has fewer parameters, which saves computational cost while achieving higher accuracy.
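The training schedule above can be sketched as a small function. It assumes that "drops by 0.99" means the learning rate is multiplied by 0.99 once per epoch starting at epoch 31; the function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.99, decay_start=31):
    """Learning rate for a 1-indexed epoch.

    The rate stays at base_lr through epoch 30 and is then multiplied by
    `decay` once per epoch (an assumed reading of the schedule).
    """
    if epoch < decay_start:
        return base_lr
    return base_lr * decay ** (epoch - decay_start + 1)


# Feature vector size: 512 feature maps, each summarized by a 10-bin histogram.
FEATURE_DIM = 512 * 10

for epoch in (1, 30, 31, 100):
    print(epoch, learning_rate(epoch))
```

Under this reading, the learning rate at the final epoch is 0.01 × 0.99^70, roughly half of its initial value.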

Figure 7: Classification accuracy of MUG dataset

Figure 8: Classification accuracy of SFEW dataset

Figure 9: Classification accuracy of OULU-CASIA dataset

Table 9: Parameters used in the proposed model

Table 10: Classification outcomes using different classifiers in the classification layer of the proposed work

Figure 10: Classification accuracy of MMI dataset

4.3 Ablation Study: Analysis of Several Proposed Model Components

(i) Eliminating the histogram formation layer in the suggested work: In this experiment, a max pooling layer is used at the output of module 2 in Fig. 1 to construct a 5120-dimensional feature vector. Next, using a loss function, the network is trained for seven-class facial emotion identification.

(ii) Changing from the SoftMax loss function to a chi-squared distance-based loss function: The loss function is defined in Eq. (6) as a SoftMax function. This is replaced by an improved chi-squared distance-based loss function [47], as in Eq. (7).

(iii) Using the whole proposed work for emotion classification: In this experiment, facial expression recognition is accomplished by using the whole HR-LDP network and SVM. The results from the three cases are given in Fig. 11.
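As background for the second ablation case, the basic chi-squared distance between two histograms can be sketched as follows. This is only the plain distance on which such a loss is typically built; the improved loss of [47] and Eq. (7) are not reproduced here. The convention without a leading factor of 1/2 and the small constant `eps` are assumptions.

```python
def chi_squared_distance(p, q, eps=1e-12):
    """Chi-squared distance between two equally sized histograms.

    eps guards against division by zero when a bin is empty in both
    histograms; its value is an illustrative choice.
    """
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


same = chi_squared_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])
different = chi_squared_distance([1.0, 0.0], [0.0, 1.0])
print(same, different)
```

Identical histograms give a distance of zero, and histograms with no overlapping mass give the largest distance, which makes the measure a natural fit for the histogram features produced by the HR-LDP layer.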

Figure 11: Ablation study using three different cases

5 Conclusion

This novel HR-LDP network is proposed to tackle facial expression recognition. The network combines deep learning with handcrafted features, and it can minimize the network parameters by producing statistical histograms. Numerous tests on the six databases produced intriguing findings. Furthermore, unlike the majority of modern techniques, the suggested approach produces reliable performance. On SFEW images with significant blur and occlusions, the suggested technique obtains greater classification accuracy than other methodologies in the literature. The results show that the suggested strategy improves classification accuracy across six datasets. Future research will concentrate on micro-expressions and the analysis of dynamic emotions in videos.

Acknowledgement: We thank Vellore Institute of Technology, Chennai for supporting us with the APC.

Funding Statement: The authors received no specific funding for this study.

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design, draft manuscript preparation: Sherly Alphonse; analysis and interpretation of results: Harshit Verma. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: Both CK and JAFFE are openly accessible datasets. On request, more datasets from specific authors are available. Access the MUG dataset at https://mug.ee.auth.gr/fed/. Access the Oulu-CASIA dataset at https://paperswithcode.com/dataset/oulu-casia. Access the MMI dataset at https://mmifacedb.eu/. Visit https://paperswithcode.com/dataset/sfew to get the SFEW dataset.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.