📌 Summary
This project presents a multimodal emotion recognition system that analyzes both human voice and its corresponding text to predict emotional states. Leveraging deep learning, feature extraction, and natural language processing (NLP), the model classifies emotions into five main categories: anger, happiness, sadness, surprise, and neutral.
🎯 Objective
The goal is to design a system capable of detecting human emotions from spoken audio and its textual transcript, thereby enhancing communication technologies and human-computer interaction.
🧠 Motivation & Literature Background
- Voice and text are fundamental to emotional expression.
- Tone of voice can completely alter the meaning of identical sentences.
- Prior research used either audio or text independently; combining both provides richer emotion understanding.
- Inspired by studies such as Yoon et al. (2018), which showed improved performance when using both modalities.
🔧 Feasibility & Tools
- Technical Stack: Python (for ML and NLP), Google Colab (for compute), open-source libraries.
- Libraries & Methods: openSMILE, Librosa, Scikit-learn, TF-IDF, SVC, CNN, RoBERTa, VADER.
- Hardware: Google Colab with a Tesla V100 GPU and 51 GB RAM.
- Cost: only 330 TL for Google Colab Premium.
- Legal: All datasets and libraries are open-source and ethically used.
🧪 Dataset
- Merged multiple multimodal datasets: CREMA-D, RAVDESS, SAVEE, TESS, IEMOCAP, MELD, JL-Corpus, ESD.
- Datasets include both audio and corresponding text for emotion labels.
- Preprocessing steps included cleaning, balancing labels, and removing unusable entries.
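The preprocessing scripts themselves are not part of this summary; the following is a minimal sketch, assuming the merged metadata lives in a pandas DataFrame with hypothetical columns `audio_path`, `text`, and `emotion`, of how the cleaning and label-balancing steps could look.

```python
import pandas as pd

# Hypothetical merged metadata: one row per utterance
# (column names "audio_path", "text", "emotion" are assumptions).
df = pd.read_csv("merged_dataset.csv")

# Cleaning: drop rows with missing audio or empty transcripts,
# and keep only the five target emotion classes.
target_emotions = ["anger", "happiness", "sadness", "surprise", "neutral"]
df = df.dropna(subset=["audio_path", "text"])
df = df[df["text"].str.strip() != ""]
df = df[df["emotion"].isin(target_emotions)]

# Balancing: downsample every class to the size of the smallest one.
min_count = df["emotion"].value_counts().min()
df_balanced = (
    df.groupby("emotion", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=42))
      .reset_index(drop=True)
)

df_balanced.to_csv("merged_dataset_balanced.csv", index=False)
```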
🎙️ Audio Processing Pipeline
- Feature Extraction:
  - Used openSMILE with the eGeMAPSv02 feature set to extract low-level descriptors (e.g., MFCCs, ZCR, spectral contrast); see the extraction sketch after this list.
  - Librosa was used for comparative analysis but performed worse on test data.
- Model:
  - CNN outperformed LSTM in both speed and accuracy (a model sketch also follows this list):
    - CNN accuracy: 96%, loss: 0.13
    - LSTM accuracy: 89%, loss: 0.32
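The openSMILE step can be reproduced with the `opensmile` Python package; a minimal sketch, assuming per-utterance eGeMAPSv02 functionals are used (the report does not state whether functionals or frame-level LLDs were fed to the model).

```python
import opensmile
import pandas as pd

# eGeMAPSv02 functionals: 88 acoustic descriptors summarised per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(paths):
    """Return one row of eGeMAPSv02 functionals per audio file."""
    rows = [smile.process_file(p) for p in paths]
    return pd.concat(rows).reset_index(drop=True)

# Example: audio_features = extract_features(df_balanced["audio_path"])
```

The exact CNN and LSTM architectures behind the 96% / 89% figures are not given in this summary; below is a minimal 1-D CNN sketch in Keras, under the assumption that each clip is represented by its 88 eGeMAPSv02 functionals.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FEATURES = 88   # eGeMAPSv02 functionals per clip (assumption)
NUM_CLASSES = 5     # anger, happiness, sadness, surprise, neutral

audio_cnn = models.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),   # feature vector treated as a 1-D sequence
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
audio_cnn.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```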
📄 Text Processing Pipeline
- TF-IDF Vectorization: Transformed text into feature vectors.
- Label Encoding: Translated emotion labels into numerical form.
- Models Tested (see the sketch after this list):
  - NLTK with VADER
  - RoBERTa (transformer-based)
  - SVC (best-performing model, 86% accuracy)
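A minimal sketch of the TF-IDF + label-encoding + SVC pipeline with scikit-learn; the vectorizer settings and the SVC kernel are assumptions, and `df_balanced` refers to the hypothetical metadata frame from the dataset sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

texts = df_balanced["text"].tolist()
labels = df_balanced["emotion"].tolist()

# Label encoding: map emotion strings to integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# TF-IDF vectorization of the transcripts.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SVC was reported as the best text-only model (~86% accuracy).
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
print("Text-only test accuracy:", clf.score(X_test, y_test))
```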
🔁 Integrated Model
- Feature Fusion: Audio and text features were concatenated and scaled using MinMaxScaler.
- Model Architecture (a sketch follows this list):
  - CNN for the audio stream
  - Dense layer for text features
  - Shared layers for final classification
  - EarlyStopping and ModelCheckpoint callbacks used
- Training: Done with the balanced and cleaned dataset, supported by real-time validation.
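The fused architecture is described only at a high level, so the Keras sketch below is one possible reading: each stream is scaled with MinMaxScaler, the audio stream passes through a small 1-D CNN, the text stream through a dense layer, and shared layers produce the final classification, with EarlyStopping and ModelCheckpoint callbacks. The array names (`X_audio`, `X_text`, `y`) are assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import callbacks, layers, models

# X_audio: (n_samples, 88) eGeMAPS functionals;
# X_text: (n_samples, n_text_features) dense TF-IDF matrix (e.g. X.toarray()).
audio_scaled = MinMaxScaler().fit_transform(X_audio)
text_scaled = MinMaxScaler().fit_transform(X_text)

NUM_CLASSES = 5

# Audio branch: 1-D CNN over the acoustic feature vector.
audio_in = layers.Input(shape=(audio_scaled.shape[1], 1), name="audio")
a = layers.Conv1D(64, kernel_size=5, activation="relu")(audio_in)
a = layers.GlobalAveragePooling1D()(a)

# Text branch: dense layer over the TF-IDF vector.
text_in = layers.Input(shape=(text_scaled.shape[1],), name="text")
t = layers.Dense(128, activation="relu")(text_in)

# Shared layers: fuse both streams and classify.
x = layers.Concatenate()([a, t])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs=[audio_in, text_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

cbs = [
    callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    callbacks.ModelCheckpoint("best_multimodal.keras", save_best_only=True),
]
model.fit(
    {"audio": audio_scaled[..., np.newaxis], "text": text_scaled},
    y,                      # integer emotion labels
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=cbs,
)
```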
📈 Performance Evaluation
| Emotion  | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| Anger    | 0.76      | 0.57   | 0.65     |
| Happy    | 0.64      | 0.54   | 0.59     |
| Neutral  | 0.62      | 0.76   | 0.68     |
| Sad      | 0.69      | 0.72   | 0.70     |
| Surprise | 0.65      | 0.65   | 0.65     |
- Overall Accuracy: ~66–69% across test and validation sets.
- Confusion Matrix: Identified primary misclassifications (e.g., anger vs. neutral).
- Model Strengths: Neutral and Sad emotions are classified more accurately than the other classes.
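The per-class scores and the confusion matrix above can be reproduced with scikit-learn; a minimal sketch, assuming held-out arrays `audio_test` and `text_test`, integer labels `y_true`, and the fitted `encoder` from the text-pipeline sketch (all names are assumptions).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out set and take the argmax of the softmax outputs.
probs = model.predict({"audio": audio_test[..., np.newaxis], "text": text_test})
y_pred = probs.argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=encoder.classes_))
print(confusion_matrix(y_true, y_pred))
```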
📊 Sample Predictions
System output includes:
- Audio file path
- Spoken text
- Predicted vs. true emotion labels
Examples confirm that the model can consistently detect emotional context even in subtle expressions.
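A small sketch of how such a prediction table could be assembled, assuming hypothetical lists `test_paths` and `test_texts` aligned with the predictions from the evaluation sketch.

```python
import pandas as pd

# Per-utterance prediction records: audio path, transcript, true vs. predicted label.
results = pd.DataFrame({
    "audio_path": test_paths,
    "text": test_texts,
    "true_emotion": encoder.inverse_transform(y_true),
    "predicted_emotion": encoder.inverse_transform(y_pred),
})
print(results.head(10))
```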
🧩 Applications
This emotion recognition system can be applied in:
- 📞 Call centers (emotion tracking)
- 🏥 Healthcare (mental state analysis)
- 🧠 Education (student engagement)
- 🤖 Human-AI interaction (voice assistants)
- 📱 Social media monitoring
🧾 Conclusion
- A successful multimodal emotion recognition system was implemented using deep learning.
- The model's ability to learn from both audio and text enables more nuanced and accurate emotion classification.
- Future work may include expanding emotion classes, using real-time streaming data, and building a user-friendly interface for deployment.
📚 References
- Yoon et al. (2018) – Multimodal speech emotion recognition
- Hutto & Gilbert (2014) – VADER Sentiment Analysis
- Liu et al. (2019) – RoBERTa Pretraining
- Eyben et al. (2015) – openSMILE and GeMAPS
- Yamashita et al. (2018) – CNN in signal processing
(Full reference list in original paper)
✍️ Authors
- Güneşsu AÇIK — System Analyst & Developer
- Osman Yiğit SÖKEL — System Analyst & Developer
- Advisor: Prof. Dr. Banu DİRİ