A Multimodal Facial Emotion Recognition Framework through the Fusion of Speech with Visible and Infrared Images




Siddiqui, Mohammad Faridul Haque




The growing demand for emotion recognition is driving the development of meticulous strategies for discerning actual emotions through superior multimodal techniques. This work presents a multimodal automatic emotion recognition (AER) framework capable of differentiating between expressed emotions with high accuracy. The contribution is an ensemble-based approach to AER through the fusion of visible images and infrared (IR) images with speech. The framework is implemented in two layers: the first layer detects emotions using individual modalities, while the second layer combines the modalities and classifies emotions. Convolutional Neural Networks (CNNs) are used for feature extraction and classification. A hybrid fusion approach comprising early (feature-level) and late (decision-level) fusion was applied to combine the features and the decisions at different stages. The output of the CNN trained on voice samples from the RAVDESS database was combined with the image classifier's output using decision-level fusion to obtain the final decision. The framework achieved an accuracy of 86.36%, with comparable recall (0.86), precision (0.88), and F-measure (0.87) scores. A comparison with contemporary work confirmed the competitiveness of the framework, which is distinctive in attaining this accuracy in wild backgrounds and under light-invariant conditions.
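The second-layer, decision-level fusion described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the class ordering, the weighted-averaging rule, and the function names are assumptions introduced here for clarity.

```python
import numpy as np

# The five expressions covered by the framework; the ordering here is
# an assumption for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad", "surprised"]

def late_fusion(p_image, p_speech, w_image=0.5):
    """Hypothetical decision-level fusion: a weighted average of the
    per-class softmax outputs of the image-branch CNN and the
    speech-branch CNN, followed by an argmax over the fused scores."""
    p_image = np.asarray(p_image, dtype=float)
    p_speech = np.asarray(p_speech, dtype=float)
    fused = w_image * p_image + (1.0 - w_image) * p_speech
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: the image branch strongly favors "happy" and the speech
# branch weakly agrees, so the fused decision is "happy".
label, fused = late_fusion([0.1, 0.6, 0.1, 0.1, 0.1],
                           [0.2, 0.4, 0.2, 0.1, 0.1])
```

Averaging softmax outputs is one common late-fusion rule; weighted voting or training a meta-classifier on the concatenated decisions are alternatives consistent with the ensemble-based approach the abstract describes.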


Very few infrared (IR) emotional databases are available, and most did not meet the requirements of the framework we were developing. To address this, we created our own visible and IR image database, the VIRI database. Created at The University of Toledo, it was designed to overcome the limitations of existing IR databases and includes facial expressions captured in both visible and IR formats against uncontrolled, wild backgrounds. The images were collected from on-campus students who consented to participate in the study. The VIRI database comprises five expressions (happy, sad, angry, surprised, and neutral) captured from 110 subjects (70 males and 40 females), yielding 550 images in radiometric JPEG format. Each radiometric JPEG carries visible, infrared, and MSX representations, and the VIRI database retains all three forms.


2023 Faculty and Student Research Poster Session and Research Fair, West Texas A&M University, College of Engineering, Poster, Multimodal automatic emotion recognition, Emotions, Convolutional Neural Networks

