Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

Main Authors: Atmaja, Bagus Tris, Akagi, Masato, Elbarougy, Reda
Format: Article (application/pdf, eJournal)
Language: English
Published: Universitas Komputer Indonesia, 2020
Online Access: https://search.unikom.ac.id/index.php/injiiscom/article/view/4023
https://search.unikom.ac.id/index.php/injiiscom/article/view/4023/2137
Contents:
  • Emotion can be inferred from tonal and verbal information, and both can be extracted from speech. While most researchers have studied categorical emotion recognition from a single modality, this research presents dimensional emotion recognition combining acoustic and text features. A total of 31 acoustic features are extracted from speech, while word vectors are used as text features. The initial results on single-modality emotion recognition motivate combining both feature sets to improve the recognition result. The combined system decreases the error of dimensional emotion score prediction by about 5% relative to the acoustic-only system and 1% relative to the text-only system. This smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks, the acoustic features with bidirectional LSTM networks, and concatenating the outputs of both systems with dense networks.
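
The fusion architecture described in the abstract (an LSTM text branch and a bidirectional LSTM acoustic branch, concatenated and passed through dense layers) can be sketched in Keras as follows. This is a minimal illustration, not the authors' implementation: all layer sizes, sequence lengths, the word-vector dimensionality, and the 3-dimensional output (assumed to be valence, arousal, and dominance scores) are assumptions; only the 31 acoustic features come from the abstract.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

N_ACOUSTIC = 31       # 31 acoustic features per frame (from the abstract)
EMBED_DIM = 300       # assumed word-vector dimensionality
ACOUSTIC_STEPS = 100  # assumed number of speech frames
TEXT_STEPS = 50       # assumed number of word tokens

# Acoustic branch: bidirectional LSTM over frame-level acoustic features
acoustic_in = layers.Input(shape=(ACOUSTIC_STEPS, N_ACOUSTIC))
a = layers.Bidirectional(layers.LSTM(64))(acoustic_in)

# Text branch: unidirectional LSTM over word vectors
text_in = layers.Input(shape=(TEXT_STEPS, EMBED_DIM))
t = layers.LSTM(64)(text_in)

# Fusion: concatenate both branch outputs, then dense layers
x = layers.Concatenate()([a, t])
x = layers.Dense(64, activation="relu")(x)
# Assumed 3 continuous outputs, e.g. valence, arousal, dominance
out = layers.Dense(3, activation="linear")(x)

model = Model(inputs=[acoustic_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="mse")

# Forward pass on random inputs to verify the shapes wire together
pred = model.predict(
    [np.random.randn(2, ACOUSTIC_STEPS, N_ACOUSTIC).astype("float32"),
     np.random.randn(2, TEXT_STEPS, EMBED_DIM).astype("float32")],
    verbose=0,
)
print(pred.shape)  # (2, 3): one 3-dimensional emotion score per utterance
```

The design choice illustrated here is late fusion: each modality is summarized by its own recurrent network, and only the fixed-length summaries are concatenated before the dense regression head.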