Background Knowledge & Summary
Recognizing a speaker’s emotion from their speech can play a key role in seamless HCI (Human-Computer Interaction). The traditional approach begins by modeling what emotions are and how to represent them. Then, guided by that emotion model, we first divide the speech into clusters and label them, and second, extract speech/textual features that can reflect emotion. Finally, with the labelled data and features, we can train a machine learning model.
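The traditional pipeline above can be sketched in a few lines. This is a toy illustration, not a real SER system: the waveforms are synthetic, the two features (mean energy and zero-crossing rate) are crude stand-ins for MFCCs or prosodic features, and the labels and nearest-centroid classifier are assumptions chosen purely to keep the example self-contained.

```python
import math

def make_utterance(amplitude, freq, n=800):
    """Synthesize a toy waveform; stands in for a recorded utterance."""
    return [amplitude * math.sin(2 * math.pi * freq * i / n) for i in range(n)]

def extract_features(wave):
    """Two crude acoustic features: mean energy and zero-crossing rate."""
    energy = sum(s * s for s in wave) / len(wave)
    zcr = sum(1 for a, b in zip(wave, wave[1:]) if a * b < 0) / len(wave)
    return (energy, zcr)

# Steps 1-2: cluster and label the data (here, loud/fast utterances are
# labelled "angry" and soft/slow ones "calm" -- a purely hypothetical scheme).
dataset = [(make_utterance(1.0, 40), "angry") for _ in range(10)] + \
          [(make_utterance(0.2, 5), "calm") for _ in range(10)]

# Step 3: "machine learning" on the extracted features, reduced to a
# nearest-centroid classifier so the sketch needs no external libraries.
centroids = {}
for label in ("angry", "calm"):
    feats = [extract_features(w) for w, lab in dataset if lab == label]
    centroids[label] = tuple(sum(f[i] for f in feats) / len(feats) for i in (0, 1))

def predict(wave):
    f = extract_features(wave)
    return min(centroids,
               key=lambda lab: sum((f[i] - centroids[lab][i]) ** 2 for i in (0, 1)))

print(predict(make_utterance(0.9, 35)))  # a loud, fast utterance -> angry
```

In a real system each stage would be far heavier (e.g., learned acoustic features and a neural classifier), but the three-stage shape — label, featurize, train — is the same.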
The problem with this process is that emotions have no single ground truth, which makes labeling very difficult even for the speaker. Partly for this reason, good-quality labelled data is always scarce, and it becomes a bottleneck.
Further, the article explains the near-term and moonshot challenges of Speech Emotion Recognition, which include holistic speaker modeling and handling atypical speech. Refer to the following keynote presentation I’ve made for more detail: