Multi-Facial Emotion Recognition Using Fusion CNN on Static and Real-Time Inputs: A Deep Learning Approach

Vineet Gulab Singh 1 , Prof. Nitin Kumar Tripathi 1 , Dr. Rishi Jain 1

Asian Institute of Technology, Thailand

Corresponding email: nitinkt@ait.asia

Abstract:

Facial emotion recognition is a pivotal component in the domain of affective computing, aiming to bridge the gap between human emotional expression and machine interpretation. This study introduces a deep learning-driven framework for multi-facial emotion recognition, leveraging diverse data modalities including static images, video frames, and webcam inputs. The model was trained and evaluated using the CK+ dataset with a systematic data split for training, testing, and validation to ensure robustness. A Fusion Convolutional Neural Network (Fusion CNN) was proposed to optimize feature extraction and improve classification accuracy across heterogeneous input sources. The implementation was realized using Python with OpenCV and Keras libraries, while statistical validation, including chi-square tests and regression analysis, was conducted in R to assess model consistency and accuracy. Among the various models tested, the Fusion CNN demonstrated superior performance with an accuracy of 72.16%, surpassing traditional CNN and RNN architectures. The results underscore the potential of the proposed approach in advancing real-time emotion recognition systems, with future scope for integration into intelligent user interfaces and assistive applications.

Keywords: Facial Emotion Recognition, Fusion Convolutional Neural Network (Fusion CNN), Deep Learning, Multimedia Input Processing

  1. Introduction

While language plays a crucial role in facilitating human interaction, it is often accompanied by supplementary forms of expression, including gestures, posture, and vocal inflections, which enrich the communicative experience as a whole. These characteristics are commonly accompanied by physiological responses, such as an elevated heart rate, and are contingent upon the circumstances of the interaction (Sharma et al., 2013a).

Despite the significant role that human-computer interaction and human-mediated communication play in contemporary society, there remains a lack of essential tools for comprehending and addressing non-verbal cues related to attitudes, emotions, and mental states. These cues, which are commonly utilized in human communication and reasoning, are currently not adequately accounted for in these domains. According to observations by Yang and Ma (2021), individuals interact with computers in a manner similar to their interactions with other individuals. A few basic emotions can be seen in Figure 1.

Figure 1: State of Emotion (source: facial expression recognition Archives, Sefik Ilkin Serengil, sefiks.com)


Facial expressions have been studied since antiquity, in part because they are the most significant form of nonverbal communication. The earliest examinations came from renowned philosophers and thinkers such as Aristotle and Stewart.

The study of facial expressions evolved into an empirical field of research with Darwin, whose studies attracted great interest from researchers in psychology and cognitive science. Numerous studies linking facial expression to emotion and interpersonal communication were conducted in the 20th century. Most significantly, Paul Ekman re-examined Darwin's findings and asserted that there are six universal emotions that can be produced and understood regardless of cultural background (21_14_9, n.d.; Bebis et al., 2020; Gkotsis et al., 2017; Jafri & Arabnia, 2009).

Emotion recognition makes it possible to interpret human behavior wherever such data are available, and these data can be used in various computer-based environments for study and research purposes.

These calculations take place in a 2D normalized emotion model with a valence scale, as shown in Figure 2.

Figure 2: 2D Normalized Emotion Complex (source: The 2-D Emotion Wheel, ResearchGate)

Emotion is one of the most important aspects of human life to understand, yet many of us cannot read the emotions of others. This failure can cause harm, and the consequences can become very serious, sometimes leading to depression, suicide, and other outcomes.

Language helps us gauge the tone of speech and classify it accordingly, but when a person enters another region or country, communicating and reading the situation becomes a problem. An emotion classifier could help at least in judging whether or not to approach someone, based on that person's current mood or emotion.

Many studies are conducted on pre-approved datasets and are often not applicable in real time; therefore, in this work, after training the model, it was deployed in real time and its real-world accuracy was measured. For people who cannot speak or hear (deaf-mute), it is very difficult for others to adjust to their space; they may feel left out or receive improper treatment. Facial expressions convey a great deal, and processing this facial complexity can help others understand them.

Health is another factor to consider in emotion recognition, because a person's apparent expression may not match their condition. For example, a person may be unwell yet have a smile on their face, while body temperature or coughing clearly indicates otherwise. Such cues help recognize the true state of a person who is not actually happy, which in a further stage may point to mental health issues, trauma, or other disorders.

Differences in facial structure and complexion also make it challenging to say whether a person is happy or otherwise. In Asian countries, for instance, comparing facial structures (such as the eyes) between India and Thailand shows how tedious emotion interpretation can become. The available datasets are defined for specific populations and places, so training a model on them can change its accuracy and result in wrong emotion detection.

According to the literature, facial emotion study is continuously advancing as an emerging biomedical field, partly because of this lack of understanding. There is a growing demand that emotions be taken with the utmost care: sad emotional states affect mental health and can lead to disturbing outcomes such as depression, suicide, and broader harm to wellbeing. The objective of this study is therefore to develop a high-accuracy model for multi-facial emotion recognition across images, video, and live webcam input.

The main scope of this study is to bridge the gap between a facial emotion recognition model and real-world operation, and then to test its efficiency and accuracy. The study also aims to serve people in need, including hospitals, psychiatrists, and disabled people, and, not least, people facing difficulties, for whom this work may help prevent any further drastic actions.

2. Literature Review

Many researchers have concluded that people with good emotional wellbeing live longer, have less chronic disease, maintain healthier lifestyles, and show lower death rates (Gkotsis et al., 2017). Emotion therefore plays an important role in both health and wellbeing. Characterization here means the distinctive nature of a person.

This characterization helps analyze the relationship between emotion and health. Health is an important aspect: mortality figures clearly show that healthy people live longer than unhealthy people, and the WHO reports 12.6 million deaths each year due to unhealthy environments. Emotion can take control over a person and should also be considered, but many ignore it, and this lack of understanding creates vulnerability (Jeon et al., 2020; Khan et al., 2004; Lam et al., 2019; Li et al., 2021; Sharma et al., 2013a).

An article from Medical News Today notes that major emotional distress is seen at work for several primary reasons: concern about job security, long hours, low pay, poor working conditions, increasing responsibility, a lack of control over work, and relationships with colleagues and managers.

The initial step is to understand emotion, and relating that emotion to other behavior constitutes characterization. When a person is happy, better wellbeing and better health care tend to follow (1_signal_2017_full, n.d.; Dhall et al., 2015; Kopaczka et al., 2018; Liu et al., 2020; Mollahosseini et al., 2016; Nie et al., 2021). For Saravanan et al. (2019), the objective was to find a suitable database, and the FER2013 database is considered one of the certified datasets. They also note that this computer-vision task requires high-level image processing and that such projects are the need of the era. The pipeline runs from image acquisition through image preprocessing, face detection, and feature extraction. The emotions considered are happy, sad, angry, disgust, surprise, and fear, which are accepted by many other research groups (Kristo & Ivasic-Kos, 2018; Moeini et al., 2017; Neves & Ribeiro, 2018; Z. Wang et al., 2021; Xu et al., 2022).
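To make this pipeline concrete, the sketch below shows a minimal acquisition-to-detection chain using OpenCV's bundled Haar cascade; the image path is a placeholder, and the cascade choice is an illustrative assumption rather than the exact method of the cited works.

```python
# Minimal sketch: image acquisition -> preprocessing -> face detection,
# producing face crops ready for feature extraction.
import cv2

def detect_faces(image_path: str):
    img = cv2.imread(image_path)                  # image acquisition
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # preprocessing (grayscale)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # crop each detected face for downstream feature extraction
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]

crops = detect_faces("sample.jpg")  # placeholder path
print(f"{len(crops)} face(s) detected")
```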

Table 1: Emotion and Facial Action Units (Saravanan et al., 2019)

| Emotion  | Motion of Facial Parts                                                    |
|----------|---------------------------------------------------------------------------|
| Happy    | Open eyes, open mouth, lip corners pulled, cheeks raised                  |
| Sad      | Outer eyebrows down, inner eyebrows raised, eyes closed, lip corners down |
| Surprise | Eyebrows up, open eyes, jaw dropped                                       |
| Anger    | Eyebrows pulled down, open eyes, lips tightened                           |
| Fear     | Outer eyebrows down, inner eyebrows up, mouth open                        |
| Disgust  | Lip corner depressor, lower lip depressor, eyebrows down, nose wrinkled   |

(Vulpe-Grigorași & Grigore, n.d.) argue that the hyperparameter settings of convolutional neural networks must be changed to attain higher efficiency, with settings adjusted after training on the dataset and generating new candidates at the same point. Their CNN reached 72.16% accuracy on FER2013. The number of kernels in the first convolutional layer ranges from 32 to a maximum of 256 in steps of 32, while the deepest layer uses up to 512 kernels in steps of 64. The dropout values are 0.4 and 0.1 for the convolutional layers, and 0.1 for the fully connected layer.
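As an illustration of this tuning strategy, the sketch below samples random CNN candidates in Keras using the kernel ranges and dropout values quoted above; the overall layer arrangement, input size (48×48 FER2013 faces), and dense width are assumptions made only to produce a runnable example.

```python
# Hedged sketch of random-search hyperparameter sampling for a FER2013 CNN.
# Each candidate would be trained and the best validation accuracy kept.
import random
from tensorflow.keras import layers, models

def build_candidate():
    k1 = random.choice(range(32, 257, 32))   # first conv layer: 32..256, step 32
    k2 = random.choice(range(64, 513, 64))   # deepest conv layer: up to 512, step 64
    conv_drop = random.choice([0.1, 0.4])    # conv-block dropout per the text
    model = models.Sequential([
        layers.Conv2D(k1, 3, activation="relu", input_shape=(48, 48, 1)),
        layers.MaxPooling2D(),
        layers.Dropout(conv_drop),
        layers.Conv2D(k2, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(conv_drop),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),   # dense width is an assumption
        layers.Dropout(0.1),                     # fully connected dropout 0.1
        layers.Dense(7, activation="softmax"),   # seven FER2013 classes
    ])
    model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
    return model

candidates = [build_candidate() for _ in range(10)]
# for m in candidates: m.fit(x_train, y_train, validation_data=(x_val, y_val))
```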

Table 2: Summary of results on FER2013

| Network Type      | Accuracy (%) |
|-------------------|--------------|
| Ensemble CNN      | 75.8         |
| Fusion CNN + BOVW | 75.42        |
| Multitask CNN     | 75.2         |
| Hybrid CNN        | 73.73        |
| CNN               | 72.7         |
| Residual          | 72.4         |
| Deep CNN          | 71.6         |

Table 3: Summary of hyperparameter settings (– indicates not reported)

| Study                | Conv. Layers | Kernels                | Kernel Size | Dropout       | Dense      |
|----------------------|--------------|------------------------|-------------|---------------|------------|
| Khanzada Amil et al. | 4            | 32, 64                 | –           | 0.2           | 1024, 4096 |
| Kim et al.           | 3            | 32, 64                 | 4×4, 5×5    | –             | 1024       |
| Connie et al.        | 6            | 32, 64, 128            | 3×3         | 0.1, 0.4, 0.5 | 2048       |
| VGG                  | 8, 16        | 32, 64, 128, 256, 512  | 3×3, 5×5    | 0.5           | 1000, 4096 |
| Hua et al.           | 3, 4, 5      | 32, 64, 128            | 3×3         | –             | 2048       |
| Inception            | 32           | 16 to 384              | –           | 0.4           | –          |

(Kotsia & Pitas, 2007) use the Candide grid nodes as a facial landmark pattern. A grid is fitted over the face and analyzed; the geometric displacement of grid nodes captures the facial expression intensity per frame, which is then fed to an SVM. The Cohn-Kanade database was used; the multiclass SVM reached 99.7% accuracy, and the proposed SVM reached 95.1% for facial emotion recognition (Cimtay et al., 2020; Peri et al., 2021; Poria et al., 2018; Prasetio et al., 2019a; S. Wang et al., 2013).

Figure 3: Grid Initialization (source: Bing)

(Prasetio et al., 2019b) state that emotion is the primary stage and that emotion triggers both physical and mental fatigue, from which problems begin. Emotions are then classified into three categories (low, medium, high), and to serve this a new methodology is introduced using DoG (Difference of Gaussians) and HoG (Histogram of Oriented Gradients), in which pairs of eyes, nose, and lips are formed and facial features are extracted alongside them.

Figure 4: DoG and HoG (source: Bing)
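For concreteness, a minimal sketch of DoG and HoG feature extraction on a cropped face is given below; the blur sigmas, kernel sizes, and HoG cell settings are illustrative assumptions, not values from the cited paper.

```python
# Sketch: DoG edge emphasis plus HoG descriptor for a grayscale face crop.
import cv2
from skimage.feature import hog

gray = cv2.imread("face_crop.jpg", cv2.IMREAD_GRAYSCALE).astype("float32")

# DoG: the difference of two Gaussian blurs highlights edges (eyes, lips, brows)
dog = cv2.GaussianBlur(gray, (3, 3), 1.0) - cv2.GaussianBlur(gray, (9, 9), 2.0)

# HoG: histogram of gradient orientations over the face region
features = hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(dog.shape, features.shape)
```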

Table 4:Summary of Literature review

| No | Author | Name of Paper | Year | Comments | Publication |
|----|--------|---------------|------|----------|-------------|
| 1 | Soujanya Poria, Gautam Naik, Rada Mihalcea | A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation | 2019 | Limited datasets; weighted F-score of 65% for the text+audio combination; annotation required; RNN and LSTM used, with RNN proving useful and highly accurate; new dataset defined | IEEE |
| 2 | Ahmad Rabie, Britta Wrede, Thurid Vogt | Evaluation and Discussion of Multi-modal Emotion Recognition | 2009 | Feedback to be considered; DaFEx database used; bi-modal performance of 78.17%; multi-cue fusion recommended | ICCEE |
| 3 | Taejae Jeon, Sangyoun Lee, Youngju Lee | Stress Recognition using Face Images and Facial Landmarks | 2016 | Three ways to detect stress; 64.63% performance with facial landmarks and grey face images; own database used | IEEE |
| 4 | Nandita Sharma, Abhinav Dhall, Tom Gedeon | Modeling Stress using Thermal Facial Patterns: A Spatio-temporal Approach | 2013 | SVM technique used; stress recognition improved with HDTP features | IEEE |
| 5 | Barlian Prasetio, Hiroki Tamura, Koichi Tanno | The Facial Stress Recognition Based on Multi-histogram Features and Convolutional Neural Network | 2018 | Neutral, low and high stress classes identified; HOG and DWT used for extraction; FERET database; 5-fold validation | IEEE |
| 6 | Catherine Ordun, Edward Raff, Sanjay Purushotham | The Use of AI for Thermal Emotion Recognition | 2020 | Thermal and RGB images compared; RGB has good computer-vision support and works even on low-quality data | IEEE |
| 7 | R. Reshma | Emotional and Physical Stress Detection and Classification using Thermal Image Technique | 2021 | EEG, ECG and GSR signals used; SVM and CNN reached 89% accuracy | Annals of RSCB |
| 8 | Yucel Cimtay, Erhan Ekmekcioglu, Seyma Ozhan | Cross Subject Multimodal Emotion Recognition Based on Hybrid Fusion | 2020 | Only three classes used; hybrid-fusion accuracy of 81.2%, mean accuracy of 74.2%; some limitations addressed | IEEE |
| 9 | Soujanya Poria, Iti Chaturvedi, Amir Hussain | Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis | 2016 | Temporal CNN proposed; audio, visual and text combined with good results; image alone gives about 71% probability | IEEE |
| 10 | Xiaohua Huang, Antti Moilanen, Xiaobai Li, Matti Pietikainen | Computer Vision and Image Understanding | 2016 | Multimodal video-induced emotion recognition proposed, based on 27 participants | ELSEVIER |
| 11 | Rabia Jafri, Hamid Arabnia | A Survey of Face Recognition Techniques | 2009 | Applications of face recognition stated, along with difficulties and feature-based approaches | Journal of Information Processing Systems |
| 12 | Neehar Peri, Carlos Castillo, Vishal Patel, Rama Chellappa | A Synthesis Based Approach for Thermal to Visible Face Verification | 2021 | ARL-VTF and TUFTS multispectral face datasets used to develop algorithms; strong performance on the MILAB-VTF(B) dataset reported | IEEE |
| 13 | Renuka Deshmukh, Vandana Jagtap, Shilpa Paygude | Facial Emotion Recognition System through Machine Learning Approach | 2017 | Covers image acquisition, pre-processing and feature extraction; ISED database used | IEEE |
| 14 | Cuiting Xu, Fayadh Alenezi, Adi Alhudhaif | A Novel Facial Emotion Recognition Method for Stress Inference of Facial Nerve Paralysis Patients | 2022 | Facial nerve paralysis diagnosed with 66.59% accuracy; VGGNet model used | IEEE |
| 15 | Masayuki Kashima, Kiminori Sato, Mutsumi Watanabe | Mental Stress Recognition Based on Non-invasive and Non-contact Measurement from Stereo Thermal and Visible Sensors | 2015 | Internal emotional state; almost 98% correct measurement of ROI and temperature; SIFT achieved 88.6% correct matching | IJAE |
| 16 | Masood Khan, Robert Ward, Michael Ingleby | Automated Classification and Recognition of Facial Expressions using Infrared Thermal Imaging | 2004 | Visible spectrum and IRTI used to derive facial expressions; happiness, sadness and disgust recognized | IEEE |
| 17 | Marcin Kowalski, Artur Grudzien | High Resolution Thermal Face Dataset for Face and Expression Recognition | 2020 | Facial datasets established (Equinox, Carl DB); LBP patterns used | PAN M&MS |
| 18 | Daizong Liu, Hongting Zhang, Pan Zhou | Video-based Facial Expression Recognition using Graph Convolutional Networks | 2021 | Dynamic expression capture is the aim; graph convolutional networks used instead of a plain CNN; CASIA and CK+ datasets used | ICPR |
| 19 | Irene Kotsia, Ioannis Pitas | Facial Expression Recognition in Image Sequences using Geometric Deformation Features and Support Vector Machines | 2007 | Candide grid nodes used for face tracking; grid tracking and deformation system; multiclass SVM classifying five expressions | IEEE |
| 20 | Sherin Aly, A. Lynn Abbott, Marwan Torki | A Multimodal Features Fusion Framework for Kinect-based Facial Expression Recognition using Dual Kernel Discriminant Analysis | 2021 | Multimodal feature fusion framework for Kinect; LDA and KDA compared, with average accuracy improving by 10% | IEEE |
| 21 | Akash Saravanan, Gurudutt Perichetla, Dr. K. S. Gayathri | Facial Emotion Recognition using Convolutional Neural Networks | 2019 | Decision trees and neural networks used; after hyperparameter tuning the final accuracy was 0.60 | IEEE |
| 22 | Ali Mollahosseini, David Chan, Mohammad Mahoor | Going Deeper in Facial Expression Recognition using Deep Neural Networks | 2021 | Most approaches based on HOG, LBPH and Gabor features, with hyperparameters tuned for best accuracy | IEEE |
| 23 | Ramon Cabada, Hector Rangel, Maria Estrada, Hector Lopez | Hyperparameter Optimization in CNN for Learning-centred Emotion Recognition for Intelligent Tutoring Systems | 2019 | CNN hyperparameter problem defined; genetic algorithm proposed for tuning, yielding 8% improvement | IEEE |
| 24 | Adrian Grigorasi, Ovidiu Grigore | Convolutional Neural Network Hyperparameters Optimization for Facial Emotion Recognition | 2021 | Optimal network hyperparameters determined by generating and training models with a random-search algorithm; 72.16% accuracy on FER2013 | ATEE |
| 25 | Zhengning Wang, Shuaicheng Liu, Bing Zeng | Oriented Attention Ensemble for Accurate Facial Expression Recognition | 2020 | Weighted mask and correlation calculation fused to produce the output | ELSEVIER |
| 26 | Ali Moeini, Karin Faez, Hossein Moeini, Armon Safai | Facial Expression Recognition using Dual Dictionary Learning | 2017 | Dual dictionary method with SRC and CRC regression; CK+, CK, MMI and JAFFE databases for input and training | ELSEVIER |
| 27 | Sharmeen Saleem, Siddeeq Ameen, Mohammed Sadeeq, Subhi Zeebaree | Multimodal Emotional Recognition using Deep Learning | 2021 | Multimodal approaches studied against unimodal ones, offering higher accuracy; emotional-awareness problem addressed; different datasets studied with various methods | JASTT |
| 28 | Melissa Stolar, Margaret Lech, Robert Bolia, Michael Skinner | Real Time Speech Emotion Recognition using RGB Image Classification and Transfer Learning | 2021 | AlexNet-SVM and FTAlexNet investigated on the Berlin Emotional Speech Database; transfer learning used for a flexible model | IEEE |
| 29 | Xuan Nie, Haimin Zhang, Min Xu | Dual Stream Multi-task Gender-based Micro-expression Recognition | 2020 | GEME used for gender detection; unique micro-expressions allow age and gender to be inferred from expression | ELSEVIER |
| 30 | Abhinav Dhall, Tom Gedeon, Jyoti Joshi | Video and Image Based Emotion Recognition Challenges | 2015 | Audio-video challenge using the AFEW database, conducted as an open problem | IEEE |
| 31 | Marcin Kopaczka, Raphael Kolk, Dorit Merhof | A Fully Annotated Thermal Face Database and its Application for Thermal Facial Expression Recognition | 2018 | Manually annotated thermal face database created to address the shortage of data in this fast-growing computer-vision field; SVM gave 75% efficiency while KNN, BDT, NB and RF performed lower; further usable for medical parameters | IEEE |

3. Methodology

Dataset

Many datasets are already publicly available and have been widely used for facial emotion recognition.

After shortlisting a few of them, the Cohn-Kanade (CK+) dataset proved the most suitable. It is somewhat harder to preprocess, but many researchers have used both FER2013 and Cohn-Kanade, and the accuracy achieved on the Cohn-Kanade dataset gave the best results.

The model was trained for 200 epochs and achieved an accuracy of 71.1% on the FER2013 dataset and 99.31% on the CK+ database, with the data split into 20% training, 40% testing, and 40% validation (Aly et al., 2016; Huang et al., 2016; Kopaczka et al., 2018; Kowalski & Grudzień, 2018; Poria et al., 2017; Rabie et al., 2009; Sharma et al., 2013b; Wysocki et al., n.d.).
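A sketch of this stratified 20/40/40 split using scikit-learn is shown below; the .npy file names are placeholders for the preprocessed CK+ arrays.

```python
# Sketch: stratified 20% train / 40% test / 40% validation split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.load("ck_plus_faces.npy")   # placeholder: preprocessed face images
y = np.load("ck_plus_labels.npy")  # placeholder: integer emotion labels

# take 20% for training first, then halve the remainder into test/validation
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.20, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```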

Table 5: CK+ dataset instances

| No. | Expression | No. of Instances |
|-----|------------|------------------|
| 1   | Angry      | 527              |
| 2   | Contempt   | 47               |
| 3   | Disgust    | 389              |
| 4   | Fear       | 458              |
| 5   | Happy      | 614              |
| 6   | Normal     | 913              |
| 7   | Sad        | 540              |
| 8   | Surprised  | 602              |

Figure 5: Convolutional Neural Network (source: https://www.bing.com/images/blob?bcid=S7F-N9gpFaoEcw)

Geometrical and Grid Feature Extraction

Geometrical feature extraction is one of the simplest and most widely used methods for facial analysis, as it consumes very little space and has been adopted by many researchers; combined with a CNN, its efficiency is 73.98% (Jafri & Arabnia, 2009). It serves as the primary source of extraction: through simple data processing, the features are extracted and labelled as in the figure below.

Figure 6: Geometrical Feature Extraction (source: literature review)

Grid feature extraction is one of the most recent techniques, widely used with machine learning algorithms (Jafri & Arabnia, 2009). In this method each node carries a weight and a bias; a grid is drawn over the face, analysis proceeds over it, and facial features are extracted through image processing.

Figure 7: Grid feature Extraction (source: E. Pranav, S. Kamal, C. Satheesh Chandran and M. H. Supriya)
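As one practical way to obtain such a node grid (not necessarily the tool used in the cited work), the sketch below fits MediaPipe's face mesh and draws its 468 nodes over the face; the input file name is a placeholder.

```python
# Sketch: dense facial node grid via MediaPipe Face Mesh.
import cv2
import mediapipe as mp

img = cv2.imread("face.jpg")  # placeholder path
h, w = img.shape[:2]

with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as mesh:
    result = mesh.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

if result.multi_face_landmarks:
    # each of the 468 mesh nodes becomes one grid point in pixel coordinates
    nodes = [(int(p.x * w), int(p.y * h))
             for p in result.multi_face_landmarks[0].landmark]
    for x, y in nodes:
        cv2.circle(img, (x, y), 1, (0, 255, 0), -1)  # draw the grid
    cv2.imwrite("face_grid.jpg", img)
```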

Facial Analysis

Here the pattern of each emotion is analyzed through facial markings.

Table 6: Facial Analysis

The proposed study was designed to develop and evaluate a multi-facial emotion recognition system utilizing diverse input modalities, including static images, video recordings, and real-time webcam feeds. Initially, the publicly available CK+ dataset was employed for model training, validation, and testing, with the data split in a stratified manner to ensure balanced representation of all emotional categories. A custom Fusion Convolutional Neural Network (Fusion CNN) architecture was implemented to manage the high-dimensional feature extraction and classification tasks. This hybrid deep learning model leveraged the spatial capabilities of CNNs and the layered complexity of deep neural networks to enhance recognition accuracy.
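The exact Fusion CNN layer configuration is not reproduced here; the sketch below only illustrates the general idea under assumed layer sizes, with a pixel branch and a second feature branch (e.g., landmark coordinates) concatenated before classification over the eight CK+ classes of Table 5.

```python
# Hedged sketch of a two-branch fusion CNN in Keras: embeddings from raw
# pixels and from a second representation are concatenated before the head.
from tensorflow.keras import layers, models

pixels = layers.Input(shape=(48, 48, 1), name="pixels")   # assumed input size
x = layers.Conv2D(32, 3, activation="relu")(pixels)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

grid = layers.Input(shape=(136,), name="grid")  # e.g. 68 (x, y) landmarks
g = layers.Dense(128, activation="relu")(grid)

fused = layers.concatenate([x, g])              # fusion of both branches
fused = layers.Dense(256, activation="relu")(fused)
fused = layers.Dropout(0.4)(fused)
out = layers.Dense(8, activation="softmax")(fused)  # 8 CK+ classes (Table 5)

model = models.Model([pixels, grid], out)
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
model.summary()
```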


Figure 8: Flowchart of working Methodology

Following model training on the curated dataset, the system was deployed in a real-time environment using OpenCV-integrated webcam input. The trained model was tested on live subjects to evaluate its practical performance in detecting emotions such as happiness, sadness, and disgust. To assess the perceptual validity of the model’s predictions, a user feedback mechanism was incorporated. Participants were prompted to confirm the correctness of the model's real-time emotional prediction through a structured questionnaire administered via Google Forms. This allowed for an empirical correlation between algorithmic output and subjective human feedback, providing a foundation for calculating real-world accuracy and system efficiency.

The entire implementation was carried out in Python, with supporting libraries such as OpenCV for image processing, Pandas for data handling, and custom kernel functions for model tuning. Furthermore, the R programming language was employed for statistical modelling and hypothesis testing: chi-square tests and regression analyses were conducted to evaluate the relationship between predicted and reported emotions, offering insight into model reliability and generalizability.
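The statistical tests were run in R; for readers following the Python pipeline, an equivalent chi-square test of independence can be sketched with SciPy on a hypothetical contingency table of predicted versus reported emotions.

```python
# Sketch: chi-square test of independence between predicted and reported
# emotions; the counts below are hypothetical, for illustration only.
import numpy as np
from scipy.stats import chi2_contingency

# rows: predicted emotion, columns: emotion reported by the participant
table = np.array([[30, 5, 2],
                  [4, 25, 6],
                  [3, 7, 18]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```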

4. Results and Discussion

In the initial phase of this study, precise detection of facial landmark grid points was a critical step. The Candide grid node system enabled accurate masking and identification of Facial Action Units (FAUs), as demonstrated in Figures 9 and 10. These figures illustrate the landmark points, such as the nose tip and eye centres, mapped in a 2D grid format on facial structures prior to preprocessing.


Figure 9: Code to get the FAU points. (The facial landmarks are plotted here with each grid node, for example the nose tip and left-eye centre; all points are in 2D format.)


Figure 10: Grid node mask on face with FAUs. (Masking is performed, points are identified, and facial nodes are captured before preprocessing.)

When the same procedure was run over the dataset, the FAU detection was performed as shown in Figure 11 below; this entire process operates on the chosen dataset.

Figure 11: FAUs over dataset (CK+)

Next, hyperparameter tuning was one of the most important and crucial steps, since clear differences in accuracy were observed; the selection of tuning parameters was handled with particular care, as can be seen in Figure 12.


Figure 12: Hyperparameter tuning for accuracy.

When the model was trained as a CNN, an accuracy of 72.16% was obtained on the dataset used. Figures 13 and 14 show the accuracy for the CNN model alone.

Figure 13: CNN model accuracy plot from the model prepared by changing the hyperparameter settings.


Figure 14: Confusion Matrix for CNN
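A confusion matrix such as the one in Figure 14 can be reproduced with scikit-learn; the class indices below are hypothetical stand-ins for the model's predictions.

```python
# Sketch: building and plotting a confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# hypothetical true and predicted class indices, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 0, 2, 3]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
ConfusionMatrixDisplay(cm).plot()
plt.show()
```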

The Convolutional Neural Network (CNN) was first employed for classification, yielding an accuracy of 72.16%. The accuracy trends, shown in Figure 13, demonstrate the impact of tuned hyperparameters on performance. Further validation through the confusion matrix in Figure 14 revealed that the model was most effective at recognizing the "Happy" class, while the "Disgust" class had negligible representation, possibly due to class imbalance in the dataset. To compare model performance, VGG-16 was also trained under similar experimental conditions. The accuracy achieved was 68.13%, as shown in Figure 15, with training dynamics presented in Figure 16. While the CNN slightly outperformed VGG-16, the latter still exhibited competitive results, affirming its applicability for emotion recognition tasks.


Figure 15: Accuracy for VGG16.


Figure 16: Accuracy for VGG16 plot

The third model implemented was a CNN built using the Keras framework, which achieved an accuracy of 62.5%. Although this was the lowest among the three, it still demonstrated acceptable performance and served as a baseline to validate the system pipeline. The training results for this model are presented in Figure 17.


Figure 17: Plot for CNN with Keras

Table 7: Accuracy Comparison of Different Models for Emotion Recognition

| Model        | Accuracy (%) | Remarks                                                               |
|--------------|--------------|-----------------------------------------------------------------------|
| CNN (Custom) | 72.16        | Highest accuracy among the tested models after hyperparameter tuning. |
| VGG-16       | 68.13        | Slightly lower performance; still effective for emotion recognition.  |
| CNN (Keras)  | 62.5         | Moderate accuracy; performance lower than the other models.           |

Following model evaluations, the system was tested in real-time applications using images, videos, and webcam input. Figure 18 displays the output for a sample dataset image. In video-based testing, the system captured sequential frames and processed them individually; for instance, a 3-minute video produced 7241 frames, all processed for emotion classification (Figure 19). Frame-wise emotional transitions were captured and visualised, and a summary of the video processing was generated.
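The frame-by-frame video processing can be sketched as follows; classify_frame is a placeholder for the trained model's prediction, and the video path is illustrative.

```python
# Sketch: read a video frame by frame with OpenCV and label each frame.
import cv2

def classify_frame(frame):
    # placeholder for the trained emotion model's prediction on one frame
    return "happy"

cap = cv2.VideoCapture("input_video.mp4")  # placeholder path
labels = []
while True:
    ok, frame = cap.read()
    if not ok:
        break                              # end of video
    labels.append(classify_frame(frame))
cap.release()
print(f"processed {len(labels)} frames")   # e.g. 7241 for a 3-minute clip
```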

(The eyes have been masked to preserve the user's data privacy.)

Figure 18: Result for Objective 1 using dataset image.


Figure 19: Video output being captured for Objective 1

Lastly, the system was deployed in a live webcam demonstration, as shown in Figure 20. The DeepFace package was integrated for simplified real-time facial expression classification. Due to privacy concerns, user eye regions were masked.


Figure 20: Webcam output for Objective 1
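A minimal sketch of the live-webcam deployment with the DeepFace package is given below; the on-screen overlay and exit key are illustrative choices, not necessarily those used in the study.

```python
# Sketch: live webcam emotion classification with DeepFace.
import cv2
from deepface import DeepFace

cap = cv2.VideoCapture(0)                       # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = DeepFace.analyze(frame, actions=["emotion"],
                              enforce_detection=False)
    emotion = result[0]["dominant_emotion"]     # recent versions return a list
    cv2.putText(frame, emotion, (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):       # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```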

Overall, the proposed system demonstrated effective facial emotion recognition through accurate landmark detection and robust model training. The Candide grid node system enabled precise masking of Facial Action Units (FAUs), which was essential for consistent feature extraction across the dataset. Among the models tested, the custom CNN achieved the highest accuracy at 72.16%, followed by VGG-16 with 68.13%, and the Keras-based CNN with 62.5%. Hyperparameter tuning played a crucial role in enhancing model performance. The system's applicability was further validated through real-time emotion detection using images, videos, and webcam inputs, showcasing its potential for real-world emotion analysis tasks. These results affirm the feasibility of using deep learning frameworks for accurate and scalable facial emotion recognition.

5. Conclusion and Future Scope

The present study aimed to develop a robust and scalable deep learning-based framework for multi-facial emotion recognition across heterogeneous input formats including static images, video streams, and real-time webcam feeds. Through a comprehensive evaluation of various neural architectures, including a custom Convolutional Neural Network (CNN), VGG-16, and a Keras-implemented CNN, the study identified that the custom CNN outperformed the others, achieving an accuracy of 72.16%. This significant performance differential underscores the critical role of architecture-specific customization and pre-processing in enhancing classification efficacy for affective computing tasks. The results further demonstrate the feasibility of deploying such systems in real-world environments where emotion recognition must be performed with high reliability and across diverse visual inputs.

A key contribution of this research lies in its methodological approach, which integrates efficient facial landmark extraction, noise reduction techniques, and structured training pipelines to enhance recognition stability and accuracy. The application of facial geometry analysis through the Candide grid node system, combined with OpenCV-based alignment and normalization, facilitated more consistent emotional state detection across diverse lighting conditions, facial orientations, and subject demographics. This has practical implications for domains such as healthcare, e-learning, intelligent surveillance, and human-computer interaction, where emotional context significantly influences decision-making and user experience.

Moreover, the study highlights important considerations for ethical deployment, particularly in safeguarding against algorithmic bias and ensuring inclusivity across demographic variations. The findings contribute to the discourse on responsible AI design by emphasizing transparency, interpretability, and the ethical use of facial emotion recognition technologies. The model’s performance, when benchmarked across different modalities, reinforces the potential of deep learning to decode human affect with a considerable degree of accuracy, while also pointing to limitations inherent in current datasets and evaluation frameworks.

Looking forward, future research may explore the integration of multimodal data sources, such as voice, text, and physiological signals, to augment the emotion classification framework. Additionally, the implementation of attention-based models or transformer architectures can be examined for capturing more nuanced affective states in real time. Expanding the training corpus to include cross-cultural, age-diverse, and expressionally varied datasets will enhance the generalizability and fairness of the system. Beyond technical improvements, clinical validation and interdisciplinary collaborations with psychologists, neuroscientists, and UX designers will be instrumental in translating this work into impactful applications. Ensuring adherence to ethical standards, particularly concerning privacy, consent, and fairness, will remain a foundational imperative as emotion recognition technologies advance toward widespread adoption.

REFERENCES
1_signal_2017_full. (n.d.).
21_14_9. (n.d.).
Abdullah, S. M. S. A., Ameen, S. Y. A., M. Sadeeq, M. A., & Zeebaree, S. (2021). Multimodal Emotion Recognition using Deep Learning. Journal of Applied Science and Technology Trends, 2(02), 52–58. https://doi.org/10.38094/jastt20291
Aly, S., Abbott, A. L., & Torki, M. (2016, May 23). A multi-modal feature fusion framework for kinect-based facial expression recognition using Dual Kernel Discriminant Analysis (DKDA). 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016. https://doi.org/10.1109/WACV.2016.7477577
Bebis, G., Yin, Z., Kim, · Edward, Bender, J., Kartic, ·, Bum, S. ·, Kwon, C., Zhao, J., Kalkofen, D., & Baciu, G. (2020). Advances in Visual Computing. In Proceedings. http://www.springer.com/series/7412
Cimtay, Y., Ekmekcioglu, E., & Caglar-Ozhan, S. (2020). Cross-subject multimodal emotion recognition based on hybrid fusion. IEEE Access, 8, 168865–168878. https://doi.org/10.1109/ACCESS.2020.3023871
Dhall, A., Ramana Murthy, O. v., Goecke, R., Joshi, J., & Gedeon, T. (2015). Video and image based Emotion recognition challenges in the wild: EmotiW 2015. ICMI 2015 - Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 423–426. https://doi.org/10.1145/2818346.2829994
Gkotsis, G., Oellrich, A., Velupillai, S., Liakata, M., Hubbard, T. J. P., Dobson, R. J. B., & Dutta, R. (2017). Characterisation of mental health conditions in social media using Informed Deep Learning OPEN. Nature Publishing Group. https://doi.org/10.1038/srep45141
Huang, X., Kortelainen, J., Zhao, G., Li, X., Moilanen, A., Seppänen, T., & Pietikäinen, M. (2016). Multi-modal emotion analysis from facial expressions and electroencephalogram. Computer Vision and Image Understanding, 147, 114–124. https://doi.org/10.1016/j.cviu.2015.09.015
Jafri, R., & Arabnia, H. R. (2009). A Survey of Face Recognition Techniques. Journal of Information Processing Systems, 5(2), 41–68. https://doi.org/10.3745/jips.2009.5.2.041
Jeon, T., Bae, H., Lee, Y., Jang, S., & Lee, S. (2020). Stress Recognition using Face Images and Facial Landmarks; Stress Recognition using Face Images and Facial Landmarks. In 2020 International Conference on Electronics, Information, and Communication (ICEIC).
Khan, M. M., Ward, R. D., & Ingleby, M. (2004). Automated classification and recognition of facial expressions using infrared thermal imaging. 2004 IEEE Conference on Cybernetics and Intelligent Systems, 202–206. https://doi.org/10.1109/iccis.2004.1460412
Kopaczka, M., Kolk, R., & Merhof, D. (2018). A fully annotated thermal face database and its application for thermal facial expression recognition. I2MTC 2018 - 2018 IEEE International Instrumentation and Measurement Technology Conference: Discovering New Horizons in Instrumentation and Measurement, Proceedings, 1–6. https://doi.org/10.1109/I2MTC.2018.8409768
Kotsia, I., & Pitas, I. (2007). Facial expression recognition in image sequences using geometric deformation features and support vector machines. IEEE Transactions on Image Processing, 16(1), 172–187. https://doi.org/10.1109/TIP.2006.884954
Kowalski, M., & Grudzień, A. (2018). High-resolution thermal face dataset for face and expression recognition. Metrology and Measurement Systems, 25(2), 403–415. https://doi.org/10.24425/119566
Kristo, M., & Ivasic-Kos, M. (2018). An overview of thermal face recognition methods. 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2018 - Proceedings, 1098–1103. https://doi.org/10.23919/MIPRO.2018.8400200
Lam, S., Tankelevich, R., Hoffman, M., Rahai, H., Amit Bishnoi BTech, B., & Jambheshwar, G. (2019). EMOTION RECOGNITION USING DEEP LEARNING.
Li, D., Zhu, X., Chen, X., Tian, D., Hu, X., & Qin, G. (2021). Thermal Imaging Face Detection Based on Transfer Learning. Proceedings - 2021 6th International Conference on Smart Grid and Electrical Automation, ICSGEA 2021, 263–266. https://doi.org/10.1109/ICSGEA53208.2021.00064
Liu, D., Zhang, H., & Zhou, P. (2020). Video-based facial expression recognition using graph convolutional networks. Proceedings - International Conference on Pattern Recognition, 4198–4205. https://doi.org/10.1109/ICPR48806.2021.9413094
Moeini, A., Faez, K., Moeini, H., & Safai, A. M. (2017). Facial expression recognition using dual dictionary learning. Journal of Visual Communication and Image Representation, 45, 20–33. https://doi.org/10.1016/j.jvcir.2017.02.007
Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016, May 23). Going deeper in facial expression recognition using deep neural networks. 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016. https://doi.org/10.1109/WACV.2016.7477450
Neves, A. J. R., & Ribeiro, R. (2018). Algorithms for Face Detection on Infrared Thermal Images Lossless compression of images with specific characteristics View project Lifelog and information retrieval from daily digital data View project Algorithms for Face Detection on Infrared Thermal Images. www.iaria.org
Nie, X., Takalkar, M. A., Duan, M., Zhang, H., & Xu, M. (2021). GEME: Dual-stream multi-task GEnder-based micro-expression recognition. Neurocomputing, 427, 13–28. https://doi.org/10.1016/j.neucom.2020.10.082
Peri, N., Gleason, J., Castillo, C. D., Bourlai, T., Patel, V. M., & Chellappa, R. (2021). A Synthesis-Based Approach for Thermal-to-Visible Face Verification. Proceedings - 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021. https://doi.org/10.1109/FG52635.2021.9666943
Poria, S., Chaturvedi, I., Cambria, E., & Hussain, A. (2017). Convolutional MKL based multimodal emotion recognition and sentiment analysis. Proceedings - IEEE International Conference on Data Mining, ICDM, 439–448. https://doi.org/10.1109/ICDM.2016.178
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2018). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. http://arxiv.org/abs/1810.02508
Prasetio, B. H., Tamura, H., & Tanno, K. (2019a). The Facial Stress Recognition Based on Multi-histogram Features and Convolutional Neural Network. Proceedings - 2018 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018, 881–887. https://doi.org/10.1109/SMC.2018.00157
Prasetio, B. H., Tamura, H., & Tanno, K. (2019b). The Facial Stress Recognition Based on Multi-histogram Features and Convolutional Neural Network. Proceedings - 2018 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018, 881–887. https://doi.org/10.1109/SMC.2018.00157
Rabie, A., Wrede, B., Vogt, T., & Hanheide, M. (2009). Evaluation and discussion of multi-modal emotion recognition. 2009 International Conference on Computer and Electrical Engineering, ICCEE 2009, 1, 598–602. https://doi.org/10.1109/ICCEE.2009.192
Saravanan, A., Perichetla, G., & Gayathri, Dr. K. S. (2019). Facial Emotion Recognition using Convolutional Neural Networks. http://arxiv.org/abs/1910.05602
Sharma, N., Dhall, A., Gedeon, T., & Goecke, R. (2013a). Modeling stress using thermal facial patterns: A spatio-temporal approach. Proceedings - 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, 387–392. https://doi.org/10.1109/ACII.2013.70
Sharma, N., Dhall, A., Gedeon, T., & Goecke, R. (2013b). Modeling stress using thermal facial patterns: A spatio-temporal approach. Proceedings - 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, 387–392. https://doi.org/10.1109/ACII.2013.70
Vaigai College of Engineering, Institute of Electrical and Electronics Engineers. Madras Section, & Institute of Electrical and Electronics Engineers. (n.d.). Proceedings of the 2017 International Conference on Intelligent Computing and Control Systems (ICICCS) : June 15-16, 2017.
Vulpe-Grigorași, A., & Grigore, O. (n.d.). THE 12 th INTERNATIONAL SYMPOSIUM ON ADVANCED TOPICS IN ELECTRICAL ENGINEERING Convolutional Neural Network Hyperparameters Optimization for Facial Emotion Recognition.
Wang, S., Shen, P., & Liu, Z. (2013). Facial expression recognition from infrared thermal images using temperature difference by voting. Proceedings - 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems, IEEE CCIS 2012, 1, 94–98. https://doi.org/10.1109/CCIS.2012.6664375
Wang, Z., Zeng, F., Liu, S., & Zeng, B. (2021). OAENet: Oriented attention ensemble for accurate facial expression recognition. Pattern Recognition, 112. https://doi.org/10.1016/j.patcog.2020.107694
Wysocki, T., Wysocki, B. J., Institute of Electrical and Electronics Engineers, & IEEE Communications Society. (n.d.). 2017, 11th International Conference on Signal Processing and Communication Systems, (ICSPCS) : proceedings : Surfers Paradise, Australia, December 13-15, 2017.
Xu, C., Yan, C., Jiang, M., Alenezi, F., Alhudhaif, A., Alnaim, N., Polat, K., & Wu, W. (2022). A novel facial emotion recognition method for stress inference of facial nerve paralysis patients. Expert Systems with Applications, 197. https://doi.org/10.1016/j.eswa.2022.116705
Yang, H., & Ma, J. (2021). Relationship between wealth and emotional well-being before, during, versus after a nationwide disease outbreak: a large-scale investigation of disparities in psychological vulnerability across COVID-19 pandemic phases in China. BMJ Open, 11, 44262. https://doi.org/10.1136/bmjopen-2020-044262
Chand, S., Singh, A., Bhatia, R., Kaur, I., & Seeja, K. R. (2021). Real-time facial emotion recognition using deep learning. Intelligent Computing and Communication Systems, 219-226.
Sinha, A., & Aneesh, R. P. (2019). Real time facial emotion recognition using deep learning. International Journal of Innovations and Implementations in Engineering, 1.
Mellouk, W., & Handouzi, W. (2020). Facial emotion recognition using deep learning: review and insights. Procedia Computer Science, 175, 689-694.
Cheng, S., & Zhou, G. (2020). Facial expression recognition method based on improved VGG convolutional neural network. International Journal of Pattern Recognition and Artificial Intelligence, 34(07), 2056003.
Sarode, N., & Bhatia, S. (2010). Facial expression recognition. International Journal on computer science and Engineering, 2(5), 1552-1557.
Van Kuilenburg, H., Wiering, M., & Den Uyl, M. (2005, October). A model based method for automatic facial expression recognition. In European conference on machine learning (pp. 194-205). Berlin, Heidelberg: Springer Berlin Heidelberg.
Alexandre, G. R., Soares, J. M., & Thé, G. A. P. (2020). Systematic review of 3D facial expression recognition methods. Pattern Recognition, 100, 107108.
Canedo, D., & Neves, A. J. (2019). Facial expression recognition using computer vision: A systematic review. Applied Sciences, 9(21), 4678.
Torres Mendonça De Melo Fádel, B., Santos De Carvalho, R. L., Belfort Almeida Dos Santos, T. T., & Dourado, M. C. N. (2019). Facial expression recognition in Alzheimer’s disease: A systematic review. Journal of clinical and experimental neuropsychology, 41(2), 192-203.
Liu, D., Cheng, D., Houle, T. T., Chen, L., Zhang, W., & Deng, H. (2018). Machine learning methods for automatic pain assessment using facial expression information: Protocol for a systematic review and meta-analysis. Medicine, 97(49).
Leong, S. C., Tang, Y. M., Lai, C. H., & Lee, C. K. M. (2023). Facial expression and body gesture emotion recognition: A systematic review on the use of visual data in affective computing. Computer Science Review, 48, 100545.
Aoki, N., & Takeuchi, T. (2016, July). Individual differences in facial expression recognition. In INTERNATIONAL JOURNAL OF PSYCHOLOGY (Vol. 51, pp. 183-183). 2-4 PARK SQUARE, MILTON PARK, ABINGDON OX14 4RN, OXON, ENGLAND: ROUTLEDGE JOURNALS, TAYLOR & FRANCIS LTD.
Kumari, J., Rajesh, R., & Pooja, K. M. (2015). Facial expression recognition: A survey. Procedia computer science, 58, 486-491.
Li, S., & Deng, W. (2020). Deep facial expression recognition: A survey. IEEE transactions on affective computing, 13(3), 1195-1215.