speech-recognition-system

The objective of this DLM (Deep Learning Model) is to recognize the emotions from speech.

https://github.com/codersacademy006/speech-recognition-system

Keywords

deep-learning emotion-detection emotion-recognition emotion-recognizer feature-extraction gradient-boosting keras kneighborsclassifier librosa machine-learning mfcc mlp-classifier neural-networks random-forest-classifier recurrent-neural-networks sklearn speech-emotion-recognition support-vector-machine



Basic Info

Host: GitHub
Owner: CodersAcademy006
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 59.6 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0


Created over 1 year ago · Last pushed over 1 year ago


README.md

Speech Emotion Recognition

Introduction

This repository covers building and training a speech emotion recognition system.
The basic idea behind this tool is to build, train, and test a suitable machine learning (as well as deep learning) algorithm that can recognize and detect human emotions from speech.
This is useful in many industry fields, such as product recommendation and affective computing.
Check this tutorial for more information.

Requirements

Python 3.6+

Python Packages

tensorflow
librosa==0.6.3
numpy
pandas
soundfile==0.9.0
wave
scikit-learn==0.24.2
tqdm==4.28.1
matplotlib==2.2.3
pyaudio==0.2.11
ffmpeg (optional): needed only if you want to add more sample audio; the samples are converted to a 16000 Hz sample rate and mono channel by convert_wavs.py

Install these libraries with the following command: `pip3 install -r requirements.txt`

Dataset

This repository uses 4 datasets (including this repo's custom dataset), which are already downloaded and formatted in the data folder:

- RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song, which contains 24 actors (12 male, 12 female) vocalizing two lexically-matched statements in a neutral North American accent.
- TESS: The Toronto Emotional Speech Set, modeled on the Northwestern University Auditory Test No. 6 (NU-6; Tillman & Carhart, 1966). A set of 200 target words were spoken in the carrier phrase "Say the word ____" by two actresses (aged 26 and 64 years).
- EMO-DB: A database of emotional utterances spoken by actors, recorded in 1997 and 1999 as part of the DFG-funded research project SE462/3-1. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. The director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.
- Custom: An unbalanced, noisy dataset located in data/train-custom for training and data/test-custom for testing. You can add or remove recording samples easily: convert the raw audio to a 16000 Hz sample rate and mono channel (the `convert_audio(audio_path)` method in the `create_wavs.py` script does this; it requires ffmpeg to be installed and in PATH), then add the emotion to the end of the audio file name, separated with '_' (e.g. "20190616125714_happy.wav" will be parsed automatically as happy).
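As a rough illustration of the naming convention for custom recordings, the emotion label can be recovered by taking the text after the last underscore in the file name. The helper name `parse_emotion` is hypothetical (it is not a function from this repository), and the underscore separator is assumed from the example file name above.

```python
from pathlib import Path


def parse_emotion(audio_path):
    """Return the emotion label encoded at the end of a file name.

    Hypothetical helper: assumes names of the form '<anything>_<emotion>.wav'.
    """
    stem = Path(audio_path).stem  # e.g. '20190616125714_happy'
    if "_" not in stem:
        raise ValueError(f"no emotion label found in {audio_path!r}")
    # Split once from the right so earlier underscores are left alone.
    return stem.rsplit("_", 1)[1]


print(parse_emotion("data/train-custom/20190616125714_happy.wav"))  # happy
```

Splitting from the right keeps the sketch tolerant of prefixes that themselves contain underscores.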

Emotions available

There are 9 emotions available: "neutral", "calm", "happy", "sad", "angry", "fear", "disgust", "ps" (pleasant surprise) and "boredom".

Feature Extraction

Feature extraction is the main part of the speech emotion recognition system. It converts the speech waveform into a parametric representation at a relatively lower data rate.

In this repository, we use the most common features available in the librosa library, including:

- MFCC
- Chromagram
- MEL Spectrogram Frequency (mel)
- Contrast
- Tonnetz (tonal centroid features)
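To make the "lower data rate" idea concrete, here is a minimal NumPy-only sketch that reduces one second of audio (16000 samples) to 8 numbers. It is a simplified stand-in, not the repository's method: it averages log energy over equal-width frequency bands, whereas the features listed above (MFCC, mel spectrogram, and so on) come from librosa and use perceptually motivated scales. The function name and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def log_band_energies(signal, frame_len=400, hop=160, n_bands=8):
    """Summarize a waveform as the mean log energy in a few frequency bands."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, frame_len // 2 + 1)
    # Equal-width bands (a simplification; mel bands are not equal-width).
    bands = np.array_split(power, n_bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in bands], axis=1)
    # Average over time to get one fixed-length vector per recording.
    return np.log(energies + 1e-10).mean(axis=0)


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone

features = log_band_energies(tone)
print(len(tone), "samples ->", features.shape[0], "features")
```

Averaging over time is what gives every recording the same feature length regardless of its duration, which is what the scikit-learn style classifiers below expect.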

Grid Search

Grid search results are already provided in the grid folder, but if you want to tune the grid search parameters in parameters.py, you can run the script grid_search.py with `python grid_search.py`. This may take several hours to complete. Once it finishes, the best estimators are pickled and stored in the grid folder.
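The general pattern behind such a script can be sketched with scikit-learn's `GridSearchCV`. This is an illustration under stated assumptions, not the contents of `grid_search.py`: the data is synthetic, and the parameter grid is a deliberately tiny, hypothetical one rather than the grid defined in parameters.py.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for extracted audio features and emotion labels.
X, y = make_classification(
    n_samples=120, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid, kept small so the search finishes in seconds.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)

# Pickle the refit best estimator so it can be reloaded without searching again.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
```

In practice the pickled bytes would be written to a file; keeping them in memory here avoids assuming any particular file layout.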

Example 1: Using 3 Emotions

The way to build and train a model for classifying 3 emotions is shown below:

```python
from emotion_recognition import EmotionRecognizer
from sklearn.svm import SVC

# init a model, let's use SVC
my_model = SVC()
# pass my model to EmotionRecognizer instance
# and balance the dataset
rec = EmotionRecognizer(model=my_model, emotions=['sad', 'neutral', 'happy'], balance=True, verbose=0)
# train the model
rec.train()
# check the test accuracy for that model
print("Test score:", rec.test_score())
# check the train accuracy for that model
print("Train score:", rec.train_score())
```

**Output:**

```
Test score: 0.8148148148148148
Train score: 1.0
```

Determining the best model

To determine the best model, run:

```python
# loads the best estimators from `grid` folder that was searched by GridSearchCV in `grid_search.py`,
# and set the model to the best in terms of test score, and then train it
rec.determine_best_model()
# get the determined sklearn model name
print(rec.model.__class__.__name__, "is the best")
# get the test accuracy score for the best estimator
print("Test score:", rec.test_score())
```

**Output:**

```
MLPClassifier is the best
Test score: 0.8958333333333334
```

Predicting

Just pass an audio path to the `rec.predict()` method as shown below:

```python
# this is a neutral speech from emo-db from the testing set
print("Prediction:", rec.predict("data/emodb/wav/15a04Nc.wav"))
# this is a sad speech from TESS from the testing set
print("Prediction:", rec.predict("data/validation/Actor_25/25_01_01_01_back_sad.wav"))
```

**Output:**

```
Prediction: neutral
Prediction: sad
```

You can pass any audio file. If it is not in the appropriate format (16000 Hz, mono channel), it will be converted automatically; make sure you have `ffmpeg` installed on your system and added to PATH.
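For reference, a conversion to 16000 Hz mono can be expressed as an ffmpeg command using the standard `-ar` (sample rate) and `-ac` (channel count) options. The sketch below only builds and prints the command; the function names are hypothetical and this is not the repository's own conversion code.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg command that resamples `src` and downmixes it into `dst`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", str(channels),     # number of audio channels (1 = mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert_to_16k_mono(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


print(" ".join(build_ffmpeg_command("input.mp3", "output.wav")))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.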

Example 2: Using RNNs for 5 Emotions

```python
from deep_emotion_recognition import DeepEmotionRecognizer

# initialize instance
# inherited from emotion_recognition.EmotionRecognizer
# default parameters (LSTM: 128x2, Dense:128x2)
deeprec = DeepEmotionRecognizer(emotions=['angry', 'sad', 'neutral', 'ps', 'happy'], n_rnn_layers=2, n_dense_layers=2, rnn_units=128, dense_units=128)
# train the model
deeprec.train()
# get the accuracy
print(deeprec.test_score())
# predict angry audio sample
prediction = deeprec.predict('data/validation/Actor_10/03-02-05-02-02-02-10_angry.wav')
print(f"Prediction: {prediction}")
```

**Output:**

```
0.7717948717948718
Prediction: angry
```

Predicting probabilities is also possible (for classification, of course):

```python
print(deeprec.predict_proba("data/emodb/wav/16a01Wb.wav"))
```

**Output:**

```
{'angry': 0.99878675, 'sad': 0.0009922335, 'neutral': 7.959707e-06, 'ps': 0.00021298956, 'happy': 8.3598025e-08}
```
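A probability dictionary of this shape is easy to post-process in plain Python. The short sketch below reuses the values printed above to pick the most likely emotion and to rank the alternatives; the variable names are illustrative only.

```python
# Values copied from the predict_proba output shown above.
probabilities = {
    'angry': 0.99878675,
    'sad': 0.0009922335,
    'neutral': 7.959707e-06,
    'ps': 0.00021298956,
    'happy': 8.3598025e-08,
}

# The most likely emotion is the key with the largest probability.
best = max(probabilities, key=probabilities.get)

# Rank every emotion from most to least likely.
ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

print(best)
for emotion, p in ranked:
    print(f"{emotion:>8}: {p:.6f}")
```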

Confusion Matrix

```python
print(deeprec.confusion_matrix(percentage=True, labeled=True))
```

**Output:**

```
              predicted_angry  predicted_sad  predicted_neutral  predicted_ps  predicted_happy
true_angry          80.769226       7.692308           3.846154      5.128205         2.564103
true_sad            12.820514      73.076920           3.846154      6.410257         3.846154
true_neutral         1.282051       1.282051          79.487183      1.282051        16.666668
true_ps             10.256411       3.846154           1.282051     79.487183         5.128205
true_happy           5.128205       8.974360           7.692308      8.974360        69.230774
```
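Each row of the matrix above sums to 100, which is what a count matrix looks like after every row is divided by its total. A minimal NumPy sketch of that computation follows; the function name is hypothetical and the labels are toy data, so this shows the idea rather than the repository's implementation.

```python
import numpy as np


def percentage_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predictions, values are row percentages."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for truth, predicted in zip(y_true, y_pred):
        counts[index[truth], index[predicted]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty rows so a class with no samples does not divide by zero.
    safe_totals = np.where(row_totals == 0, 1, row_totals)
    return 100 * counts / safe_totals


y_true = ["angry", "angry", "sad", "sad", "sad", "sad"]
y_pred = ["angry", "sad", "sad", "sad", "sad", "angry"]
matrix = percentage_confusion_matrix(y_true, y_pred, labels=["angry", "sad"])
print(matrix)
```

With row normalization, the diagonal entry of each row is that class's recall expressed as a percentage.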

Example 3: Not Passing any Model and Removing the Custom Dataset

The code below initializes `EmotionRecognizer` with 3 chosen emotions while removing the custom dataset and setting `balance` to `False`:

```python
from emotion_recognition import EmotionRecognizer

# initialize instance, this will take a bit the first time executed
# as it'll extract the features and call determine_best_model() automatically
# to load the best performing model on the picked dataset
rec = EmotionRecognizer(emotions=["angry", "neutral", "sad"], balance=False, verbose=1, custom_db=False)
# it will be trained, so no need to train this time
# get the accuracy on the test set
print(rec.confusion_matrix())
# predict angry audio sample
prediction = rec.predict('data/validation/Actor_10/03-02-05-02-02-02-10_angry.wav')
print(f"Prediction: {prediction}")
```

**Output:**

```
[+] Best model determined: RandomForestClassifier with 93.454% test accuracy

              predicted_angry  predicted_neutral  predicted_sad
true_angry          98.275864           1.149425       0.574713
true_neutral         0.917431          88.073395      11.009174
true_sad             6.250000           1.875000      91.875000

Prediction: angry
```

You can print the number of samples in each class:

```python
rec.get_samples_by_class()
```

**Output:**

```
         train  test  total
angry      910   174   1084
neutral    650   109    759
sad        862   160   1022
total     2422   443   2865
```

In this case, the dataset comes only from TESS and RAVDESS and is not balanced. You can pass `True` to `balance` on the `EmotionRecognizer` instance to balance the data.
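How the repository balances the data is not described here. One common approach, shown below purely as an assumption, is to downsample every class to the size of the smallest one. The sketch uses the training counts from the table above with placeholder file names; `balance_by_downsampling` is a hypothetical helper.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Keep the same number of (path, label) pairs for every label."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # seeded so the selection is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


# Training counts taken from the table above; file names are placeholders.
train_counts = {"angry": 910, "neutral": 650, "sad": 862}
samples = [
    (f"{label}_{i}.wav", label)
    for label, count in train_counts.items()
    for i in range(count)
]

balanced = balance_by_downsampling(samples)
print(len(samples), "->", len(balanced))  # 2422 -> 1950
```

Downsampling discards data from the larger classes; oversampling the smaller ones is the usual alternative when samples are scarce.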

Algorithms Used

This repository can be used to build machine learning classifiers as well as regressors for the case of 3 emotions, `{'sad': 0, 'neutral': 1, 'happy': 2}`, and the case of 5 emotions, `{'angry': 1, 'sad': 2, 'neutral': 3, 'ps': 4, 'happy': 5}`.
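With regressors, the model outputs a continuous number that has to be mapped back to one of these integer codes. The source does not say how the repository does this; the sketch below assumes the simplest option, rounding to the nearest code and clipping to the valid range, and the function name is hypothetical.

```python
# Mapping for the 3-emotion case, as given above.
EMOTION_TO_INT = {'sad': 0, 'neutral': 1, 'happy': 2}
INT_TO_EMOTION = {code: emotion for emotion, code in EMOTION_TO_INT.items()}


def regression_output_to_emotion(value):
    """Map a continuous regressor output to the nearest emotion label."""
    lowest, highest = min(INT_TO_EMOTION), max(INT_TO_EMOTION)
    nearest = int(round(value))
    # Clip so that out-of-range predictions still map to a valid label.
    clipped = max(lowest, min(highest, nearest))
    return INT_TO_EMOTION[clipped]


for output in (0.2, 1.4, 2.7, -0.8):
    print(output, "->", regression_output_to_emotion(output))
```

Note that treating emotions as points on a number line implies an ordering (sad < neutral < happy) that a classifier does not assume.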

Classifiers

SVC
RandomForestClassifier
GradientBoostingClassifier
KNeighborsClassifier
MLPClassifier
BaggingClassifier
Recurrent Neural Networks (Keras)

Regressors

SVR
RandomForestRegressor
GradientBoostingRegressor
KNeighborsRegressor
MLPRegressor
BaggingRegressor
Recurrent Neural Networks (Keras)

Testing

You can test your own voice by executing the following command: `python test.py`. Wait until the "Please talk" prompt appears, then start talking; the model will automatically detect your emotion when you stop talking.

You can change the emotions to predict, as well as the model. Pass --help for more information: `python test.py --help`

**Output:**

```
usage: test.py [-h] [-e EMOTIONS] [-m MODEL]

Testing emotion recognition system using your voice, please consider changing
the model and/or parameters as you wish.

optional arguments:
  -h, --help            show this help message and exit
  -e EMOTIONS, --emotions EMOTIONS
                        Emotions to recognize separated by a comma ',',
                        available emotions are "neutral", "calm", "happy"
                        "sad", "angry", "fear", "disgust", "ps" (pleasant
                        surprise) and "boredom", default is
                        "sad,neutral,happy"
  -m MODEL, --model MODEL
                        The model to use, 8 models available are: "SVC","AdaBo
                        ostClassifier","RandomForestClassifier","GradientBoost
                        ingClassifier","DecisionTreeClassifier","KNeighborsCla
                        ssifier","MLPClassifier","BaggingClassifier", default
                        is "BaggingClassifier"
```
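A command-line interface with this shape can be sketched with `argparse`. The parser below is consistent with the help text above but is not the actual contents of test.py; the validation helper `parse_emotions` is a hypothetical addition.

```python
import argparse

AVAILABLE_EMOTIONS = {
    "neutral", "calm", "happy", "sad", "angry", "fear", "disgust", "ps", "boredom",
}


def parse_emotions(text):
    """Split a comma-separated string and reject unknown emotion names."""
    emotions = [item.strip() for item in text.split(",") if item.strip()]
    unknown = [item for item in emotions if item not in AVAILABLE_EMOTIONS]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown emotions: {', '.join(unknown)}")
    return emotions


parser = argparse.ArgumentParser(
    description="Testing emotion recognition system using your voice"
)
# argparse also runs string defaults through `type`, so the default becomes a list.
parser.add_argument("-e", "--emotions", type=parse_emotions, default="sad,neutral,happy",
                    help="Emotions to recognize separated by a comma ','")
parser.add_argument("-m", "--model", default="BaggingClassifier",
                    help="The model to use")

args = parser.parse_args(["-e", "angry,sad", "-m", "SVC"])
print(args.emotions, args.model)
```

Validating inside the `type` callable means a misspelled emotion produces a normal usage error instead of a failure later during training.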

Plotting Histograms

This will only work if grid search has been performed.

```python
from emotion_recognition import plot_histograms

# plot histograms on different classifiers
plot_histograms(classifiers=True)
```

The output is a histogram showing each algorithm's metric results on different data sizes, as well as the time consumed to train and predict.

Citation

```bibtex
@software{speechemotionrecognition_2019,
  author = {Abdeladim Fadheli},
  title = {Speech Emotion Recognition},
  version = {1.0.0},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/x4nth055/emotion-recognition-using-speech}
}
```

speech-recognition-system

Science Score: 36.0%
