📌 Summary
This project presents a multimodal emotion recognition system that analyzes both human voice and its corresponding text to predict emotional states. Leveraging deep learning, feature extraction, and natural language processing (NLP), the model classifies emotions into five main categories: anger, happiness, sadness, surprise, and neutral.
🎯 Objective
The goal is to design a system capable of detecting human emotions from spoken audio and its textual transcript, thereby enhancing communication technologies and human-computer interaction.
🧠 Motivation & Literature Background
- Voice and text are fundamental to emotional expression.
- Tone of voice can completely alter the meaning of identical sentences.
- Prior research used either audio or text independently; combining both provides richer emotion understanding.
- Inspired by studies such as Yoon et al. (2018), which showed improved performance when using both modalities.
🔧 Feasibility & Tools
- Technical Stack: Python (for ML and NLP), Google Colab (for compute), open-source libraries.
- Libraries & Methods: openSMILE, Librosa, Scikit-learn, TF-IDF, SVC, CNN, RoBERTa, VADER.
- Hardware: Google Colab with a Tesla V100 GPU and 51 GB RAM.
- Cost: only 330 TL for Google Colab Premium.
- Legal: All datasets and libraries are open-source and ethically used.
🧪 Dataset
- Merged multiple multimodal datasets: CREMA-D, RAVDESS, SAVEE, TESS, IEMOCAP, MELD, JL-Corpus, ESD.
- Datasets include both audio and corresponding text for emotion labels.
- Preprocessing steps included cleaning, balancing labels, and removing unusable entries.
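The preprocessing scripts themselves are not part of this summary; the following is a minimal sketch, assuming the merged metadata lives in a pandas DataFrame with hypothetical columns `audio_path`, `text`, and `emotion`, of how the cleaning and label-balancing steps could look.

```python
import pandas as pd

# Hypothetical merged metadata: one row per utterance
# (column names "audio_path", "text", "emotion" are assumptions).
df = pd.read_csv("merged_dataset.csv")

# Cleaning: drop rows with missing audio or empty transcripts,
# and keep only the five target emotion classes.
target_emotions = ["anger", "happiness", "sadness", "surprise", "neutral"]
df = df.dropna(subset=["audio_path", "text"])
df = df[df["text"].str.strip() != ""]
df = df[df["emotion"].isin(target_emotions)]

# Balancing: downsample every class to the size of the smallest one.
min_count = df["emotion"].value_counts().min()
df_balanced = (
    df.groupby("emotion", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
      .reset_index(drop=True)
)

df_balanced.to_csv("merged_dataset_balanced.csv", index=False)
```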
🎙️ Audio Processing Pipeline
- Feature Extraction:
  - Used openSMILE with the eGeMAPSv02 feature set to extract low-level descriptors (e.g., MFCCs, ZCR, spectral contrast); see the extraction sketch after this list.
  - Librosa was used for comparative analysis but performed worse on test data.
- Model:
  - CNN outperformed LSTM in both speed and accuracy (a model sketch also follows this list):
    - CNN accuracy: 96%, loss: 0.13
    - LSTM accuracy: 89%, loss: 0.32
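The openSMILE step can be reproduced with the `opensmile` Python package; a minimal sketch, assuming per-utterance eGeMAPSv02 functionals are used (the report does not state whether functionals or frame-level LLDs were fed to the model).

```python
import opensmile
import pandas as pd

# eGeMAPSv02 functionals: 88 acoustic descriptors summarised per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(paths):
    """Return one row of eGeMAPSv02 functionals per audio file."""
    rows = [smile.process_file(p) for p in paths]
    return pd.concat(rows).reset_index(drop=True)

# Example: audio_features = extract_features(df_balanced["audio_path"])
```

The exact CNN and LSTM architectures behind the 96% / 89% figures are not given in this summary; below is a minimal 1-D CNN sketch in Keras, under the assumption that each clip is represented by its 88 eGeMAPSv02 functionals.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FEATURES = 88   # eGeMAPSv02 functionals per clip (assumption)
NUM_CLASSES = 5     # anger, happiness, sadness, surprise, neutral

audio_cnn = models.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),   # feature vector treated as a 1-D sequence
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
audio_cnn.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```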
📄 Text Processing Pipeline
- TF-IDF Vectorization: Transformed text into feature vectors.
- Label Encoding: Translated emotion labels into numerical form.
- Models Tested (see the sketch after this list):
  - NLTK with VADER
  - RoBERTa (transformer-based)
  - SVC (best-performing model, 86% accuracy)
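A minimal sketch of the TF-IDF + label-encoding + SVC pipeline with scikit-learn; the vectorizer settings and the SVC kernel are assumptions, and `df_balanced` refers to the hypothetical metadata frame from the dataset sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

texts = df_balanced["text"].tolist()
labels = df_balanced["emotion"].tolist()

# Label encoding: map emotion strings to integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# TF-IDF vectorization of the transcripts.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SVC was reported as the best text-only model (~86% accuracy).
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
print("Text-only test accuracy:", clf.score(X_test, y_test))
```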
🔁 Integrated Model
- Feature Fusion: Audio and text features were concatenated and scaled using MinMaxScaler.
- Model Architecture (a sketch follows this list):
  - CNN for the audio stream
  - Dense layer for text features
  - Shared layers for final classification
  - EarlyStopping and ModelCheckpoint callbacks used
- Training: Done with the balanced and cleaned dataset, supported by real-time validation.
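The fused architecture is described only at a high level, so the Keras sketch below is one possible reading: each stream is scaled with MinMaxScaler, the audio stream passes through a small 1-D CNN, the text stream through a dense layer, and shared layers produce the final classification, with EarlyStopping and ModelCheckpoint callbacks. The array names (`X_audio`, `X_text`, `y`) are assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import callbacks, layers, models

# X_audio: (n_samples, 88) eGeMAPS functionals;
# X_text: (n_samples, n_text_features) dense TF-IDF matrix (e.g. X.toarray()).
audio_scaled = MinMaxScaler().fit_transform(X_audio)
text_scaled = MinMaxScaler().fit_transform(X_text)

NUM_CLASSES = 5

# Audio branch: 1-D CNN over the acoustic feature vector.
audio_in = layers.Input(shape=(audio_scaled.shape[1], 1), name="audio")
a = layers.Conv1D(64, kernel_size=5, activation="relu")(audio_in)
a = layers.GlobalAveragePooling1D()(a)

# Text branch: dense layer over the TF-IDF vector.
text_in = layers.Input(shape=(text_scaled.shape[1],), name="text")
t = layers.Dense(128, activation="relu")(text_in)

# Shared layers: fuse both streams and classify.
x = layers.Concatenate()([a, t])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs=[audio_in, text_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

cbs = [
    callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    callbacks.ModelCheckpoint("best_multimodal.keras", save_best_only=True),
]
model.fit(
    {"audio": audio_scaled[..., np.newaxis], "text": text_scaled},
    y,                      # integer emotion labels
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=cbs,
)
```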
📈 Performance Evaluation
| Emotion  | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| Anger    | 0.76      | 0.57   | 0.65     |
| Happy    | 0.64      | 0.54   | 0.59     |
| Neutral  | 0.62      | 0.76   | 0.68     |
| Sad      | 0.69      | 0.72   | 0.70     |
| Surprise | 0.65      | 0.65   | 0.65     |
- Overall Accuracy: ~66–69% across test and validation sets.
- Confusion Matrix: Identified primary misclassifications (e.g., anger vs. neutral).
- Model Strengths: Neutral and Sad emotions are classified more accurately than the other classes.
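The per-class scores and the confusion matrix above can be reproduced with scikit-learn; a minimal sketch, assuming held-out arrays `audio_test` and `text_test`, integer labels `y_true`, and the fitted `encoder` from the text-pipeline sketch (all names are assumptions).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out set and take the argmax of the softmax outputs.
probs = model.predict({"audio": audio_test[..., np.newaxis], "text": text_test})
y_pred = probs.argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=encoder.classes_))
print(confusion_matrix(y_true, y_pred))
```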
📊 Sample Predictions
System output includes:
- Audio file path
- Spoken text
- Predicted vs. true emotion labels
Examples confirm that the model can consistently detect emotional context even in subtle expressions.
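A small sketch of how such a prediction table could be assembled, assuming hypothetical lists `test_paths` and `test_texts` aligned with the predictions from the evaluation sketch.

```python
import pandas as pd

# Per-utterance prediction records: audio path, transcript, true vs. predicted label.
results = pd.DataFrame({
    "audio_path": test_paths,
    "text": test_texts,
    "true_emotion": encoder.inverse_transform(y_true),
    "predicted_emotion": encoder.inverse_transform(y_pred),
})
print(results.head(10))
```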
🧩 Applications
This emotion recognition system can be applied in:
- 📞 Call centers (emotion tracking)
- 🏥 Healthcare (mental state analysis)
- 🧠 Education (student engagement)
- 🤖 Human-AI interaction (voice assistants)
- 📱 Social media monitoring
🧾 Conclusion
- A successful multimodal emotion recognition system was implemented using deep learning.
- The model's ability to learn from both audio and text enables more nuanced and accurate emotion classification.
- Future work may include expanding emotion classes, using real-time streaming data, and building a user-friendly interface for deployment.
📚 References
- Yoon et al. (2018) – Multimodal speech emotion recognition
- Hutto & Gilbert (2014) – VADER Sentiment Analysis
- Liu et al. (2019) – RoBERTa Pretraining
- Eyben et al. (2015) – openSMILE and GeMAPS
- Yamashita et al. (2018) – CNN in signal processing
(Full reference list in original paper)
✍️ Authors
- Güneşsu AÇIK — System Analyst & Developer
- Osman Yiğit SÖKEL — System Analyst & Developer
- Advisor: Prof. Dr. Banu DİRİ