Latent Cluster Representation of Emotions from Speech Signals

January 5, 2020

A novel end-to-end formulation, based on Convolutional Neural Networks, for attribute-based speech emotion recognition.
- A supervised emotional attribute regressor is trained jointly with an unsupervised cluster classifier, so that the unlabeled data reinforces the learning of emotionally discriminative content under a maximum latent cluster separation constraint.
- The results provide evidence that the optimal number of clusters depends on the size of the unlabeled set and on the emotional attribute. The approach also creates latent clusters that depend on the emotional content, enriching the geometric interpretation of the clusters.
- Achieved significant improvements in recognition performance of up to 5.26%, 16.02% and 10.10% for arousal, valence and dominance, respectively.
- A novel model-driven curriculum metric (Deep Mutual Information), based on pseudo-labels obtained by running the training samples through a K-means cluster classifier, has been shown in this study to achieve significant improvements in attribute-based speech emotion recognition. This idea is currently being extended into a journal paper.
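The joint training idea above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the weight `alpha`, the use of 1 - CCC (concordance correlation coefficient) as the regression loss, and plain cross-entropy on K-means pseudo-labels are all assumptions made for illustration. The maximum latent cluster separation constraint and the Deep Mutual Information metric are not defined on this page, so they are left out.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels

def ccc_loss(pred, target):
    """1 - concordance correlation coefficient (assumed regression loss)."""
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    ccc = 2 * cov / (pred.var() + target.var() + (pm - tm) ** 2)
    return 1.0 - ccc

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy of cluster logits against pseudo-labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(attr_pred, attr_true, cluster_logits, pseudo_labels, alpha=0.5):
    """Supervised attribute loss plus weighted unsupervised cluster loss.

    `alpha` is a hypothetical trade-off weight, not a value from the paper.
    """
    return ccc_loss(attr_pred, attr_true) + alpha * cross_entropy(
        cluster_logits, pseudo_labels
    )
```

In a full system, the attribute predictions and the cluster logits would come from two heads on a shared CNN encoder: the regression term would be computed on labeled utterances and the cluster term on unlabeled ones, with pseudo-labels refreshed by re-running K-means on the encoder's features.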

Look for this at IEEE ICASSP 2021, Toronto, ON, Canada

[Conference presentation]

[Paper]

[Code]