June 25, 2025

Emotion Recognition From Human Voice Supported by Text

This project aims to detect human emotions from both voice and text. By combining acoustic features (such as tone and pitch) with the transcribed content of speech, a multimodal AI model identifies emotional states more accurately. CNN, LSTM, and transformer-based (RoBERTa) models were used. Results show that combining audio and text inputs significantly improves emotion recognition performance.

Tags: ML, AI, Emotion

📌 Summary

This project presents a multimodal emotion recognition system that analyzes both human voice and its corresponding text to predict emotional states. Leveraging deep learning, feature extraction, and natural language processing (NLP), the model can classify emotions into five main categories: anger, happiness, sadness, surprise, and neutral.


🎯 Objective

The goal is to design a system capable of detecting human emotions from spoken audio and its textual transcript, thereby enhancing communication technologies and human-computer interaction.


🧠 Motivation & Literature Background

  • Voice and text are fundamental to emotional expression.
  • Tone of voice can completely alter the meaning of identical sentences.
  • Prior research used either audio or text independently; combining both provides richer emotion understanding.
  • Inspired by studies like Yoon et al. (2018) which showed improved performance using both modalities.

🔧 Feasibility & Tools

  • Technical Stack: Python (for ML and NLP), Google Colab (for compute), open-source libraries.
  • Libraries & methods: openSMILE, Librosa, scikit-learn (TF-IDF, SVC), CNN, RoBERTa, VADER.
  • Hardware: Google Colab with Tesla V100 GPU, 51 GB RAM.
  • Cost: Only 330 TL for Google Colab Premium.
  • Legal: All datasets and libraries are open-source and ethically used.

🧪 Dataset

  • Merged multiple multimodal datasets: CREMA-D, RAVDESS, SAVEE, TESS, IEMOCAP, MELD, JL-Corpus, ESD.
  • Datasets include both audio and corresponding text for emotion labels.
  • Preprocessing steps included cleaning, balancing labels, and removing unusable entries (see the sketch after this list).
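
The preprocessing step can be illustrated with a minimal sketch. It assumes each corpus has been exported to a manifest CSV with path, text, and emotion columns; the file names, column names, and balancing strategy here are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch of merging and balancing the corpora, assuming hypothetical
# manifest CSVs with columns: path, text, emotion.
import pandas as pd

manifests = ["crema_d.csv", "ravdess.csv", "savee.csv", "tess.csv"]  # hypothetical file names

# Concatenate all corpora into a single table
df = pd.concat([pd.read_csv(m) for m in manifests], ignore_index=True)

# Keep only the five target classes and drop unusable rows
targets = {"anger", "happiness", "sadness", "surprise", "neutral"}
df = df[df["emotion"].isin(targets)].dropna(subset=["path", "text"])

# Naive balancing: downsample every class to the size of the smallest one
n_min = df["emotion"].value_counts().min()
df_balanced = df.groupby("emotion").sample(n=n_min, random_state=42).reset_index(drop=True)

print(df_balanced["emotion"].value_counts())
```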

🎙️ Audio Processing Pipeline

  • Feature Extraction:
    • Used openSMILE with the eGeMAPSv02 feature set to extract low-level descriptors (e.g., MFCCs, ZCR, spectral contrast); a minimal extraction sketch follows this list.
    • Librosa was used for a comparative analysis, but its features performed worse on the test data.
  • Model:
    • CNN performed better than LSTM in both speed and accuracy:
      • CNN Accuracy: 96%, Loss: 0.13
      • LSTM Accuracy: 89%, Loss: 0.32
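
The acoustic feature-extraction step can be sketched with the open-source opensmile Python package; the file path below is a placeholder, and the snippet is only illustrative rather than the project's exact pipeline.

```python
# Minimal sketch: extract eGeMAPSv02 functionals for one utterance with openSMILE.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Returns a DataFrame with one row of 88 summary features per audio file
features = smile.process_file("sample_utterance.wav")  # hypothetical path
print(features.shape)
```

Per-utterance vectors of this kind are what the audio-side model is trained on.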

📄 Text Processing Pipeline

  • TF-IDF Vectorization: Transformed text into feature vectors.
  • Label Encoding: Converted emotion labels into numerical form.
  • Models Tested:
    • NLTK with VADER
    • RoBERTa (transformer-based)
    • SVC (best-performing model, with 86% accuracy; see the sketch after this list)
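
A minimal sketch of this text branch (TF-IDF vectorization, label encoding, and an SVC classifier) is shown below; the example sentences and labels are hypothetical placeholders for the real transcripts.

```python
# Minimal sketch of the text pipeline: TF-IDF features + label encoding + SVC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

texts = [                                     # hypothetical transcripts
    "I can't believe this happened!",
    "Everything is fine, thanks.",
    "Leave me alone right now.",
    "What a wonderful surprise!",
    "I feel so empty today.",
    "The meeting starts at nine.",
]
labels = ["surprise", "neutral", "anger", "surprise", "sadness", "neutral"]  # hypothetical

# TF-IDF turns each transcript into a sparse feature vector
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Label encoding maps emotion names to integers
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Support Vector Classifier, the best-performing text model in this project
clf = SVC(kernel="linear")
clf.fit(X, y)

pred = clf.predict(vectorizer.transform(["This is absolutely unbelievable!"]))
print(encoder.inverse_transform(pred))
```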

🔁 Integrated Model

  • Feature Fusion: Audio and text features were concatenated and scaled using MinMaxScaler.
  • Model Architecture:
    • CNN for audio stream
    • Dense layer for text features
    • Shared layers for final classification
    • EarlyStopping and ModelCheckpoint callbacks used
  • Training: Performed on the balanced and cleaned dataset, with validation monitored during training; a sketch of the fusion architecture follows this list.
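
A minimal Keras sketch of such a fusion architecture is shown below. The input dimensions (an 88-dim acoustic vector, a 5000-dim TF-IDF vector), layer sizes, and checkpoint file name are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal sketch of an audio+text fusion model: CNN audio stream, dense text
# stream, shared classification layers, EarlyStopping and ModelCheckpoint.
from tensorflow import keras
from tensorflow.keras import layers

AUDIO_DIM, TEXT_DIM, NUM_CLASSES = 88, 5000, 5  # assumed dimensions

# Audio stream: 1-D CNN over the scaled acoustic feature vector
audio_in = keras.Input(shape=(AUDIO_DIM, 1), name="audio")
x = layers.Conv1D(64, 5, activation="relu")(audio_in)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)

# Text stream: dense layer over the TF-IDF features
text_in = keras.Input(shape=(TEXT_DIM,), name="text")
t = layers.Dense(256, activation="relu")(text_in)

# Shared layers for the final five-class prediction
merged = layers.concatenate([x, t])
h = layers.Dense(128, activation="relu")(merged)
h = layers.Dropout(0.3)(h)
out = layers.Dense(NUM_CLASSES, activation="softmax")(h)

model = keras.Model([audio_in, text_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("best_fusion_model.keras", save_best_only=True),
]
model.summary()
# model.fit([X_audio, X_text], y, validation_split=0.1, epochs=50, callbacks=callbacks)
```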

📈 Performance Evaluation

  Emotion     Precision   Recall   F1 Score
  Anger       0.76        0.57     0.65
  Happy       0.64        0.54     0.59
  Neutral     0.62        0.76     0.68
  Sad         0.69        0.72     0.70
  Surprise    0.65        0.65     0.65
  • Overall Accuracy: ~66–69% across test and validation sets.
  • Confusion Matrix: Revealed the main misclassifications (e.g., anger confused with neutral).
  • Model Strengths: Neutral and Sad were the most accurately classified emotions (an evaluation sketch follows this list).
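
The per-class metrics and the confusion matrix can be reproduced with scikit-learn, as in the minimal sketch below; y_true and y_pred stand in for the real test labels and model predictions.

```python
# Minimal sketch of the evaluation step with toy label arrays.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["anger", "happy", "neutral", "sad", "surprise"]

y_true = np.array([0, 1, 2, 3, 4, 2, 3, 0, 1, 2])  # hypothetical ground truth
y_pred = np.array([0, 2, 2, 3, 4, 2, 3, 2, 1, 2])  # hypothetical predictions

# Per-class precision, recall, and F1, as in the table above
print(classification_report(y_true, y_pred, target_names=class_names, zero_division=0))

# Confusion matrix to inspect typical mix-ups (e.g., anger predicted as neutral)
print(confusion_matrix(y_true, y_pred))
```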

📊 Sample Predictions

System output includes:

  • Audio file path
  • Spoken text
  • Predicted vs. true emotion labels

Examples show that the model often detects the emotional context correctly, even in subtle expressions.


🧩 Applications

This emotion recognition system can be applied in:

  • 📞 Call centers (emotion tracking)
  • 🏥 Healthcare (mental state analysis)
  • 🧠 Education (student engagement)
  • 🤖 Human-AI interaction (voice assistants)
  • 📱 Social media monitoring

🧾 Conclusion

  • A successful multimodal emotion recognition system was implemented using deep learning.
  • The model's ability to learn from both audio and text enables more nuanced and accurate emotion classification.
  • Future work may include expanding emotion classes, using real-time streaming data, and building a user-friendly interface for deployment.

📚 References

  • Yoon et al. (2018) – Multimodal speech emotion recognition
  • Hutto & Gilbert (2014) – VADER Sentiment Analysis
  • Liu et al. (2019) – RoBERTa Pretraining
  • Eyben et al. (2015) – openSMILE and GeMAPS
  • Yamashita et al. (2018) – CNN in signal processing

(Full reference list in original paper)


✍️ Authors

  • Güneşsu AÇIK — System Analyst & Developer
  • Osman Yiğit SÖKEL — System Analyst & Developer
  • Advisor: Prof. Dr. Banu DİRİ
