Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks
Main Authors: Atmaja, Bagus Tris; Akagi, Masato; Elbarougy, Reda
Format: Article (application/pdf, eJournal)
Language: English
Published: Universitas Komputer Indonesia, 2020
Online Access:
https://search.unikom.ac.id/index.php/injiiscom/article/view/4023
https://search.unikom.ac.id/index.php/injiiscom/article/view/4023/2137
Abstract:
- Emotion can be inferred from both tonal and verbal information, and both kinds of features can be extracted from speech. While most researchers have studied categorical emotion recognition from a single modality, this research presents dimensional emotion recognition combining acoustic and text features. A total of 31 acoustic features are extracted from speech, while word vectors are used as text features. The initial results on single-modality emotion recognition serve as a cue for combining both feature sets to improve the recognition result. The combined result shows that fusing acoustic and text features decreases the error of dimensional emotion score prediction by about 5% relative to the acoustic-only system and 1% relative to the text-only system. This smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks and the acoustic features with bidirectional LSTM networks, and concatenating both systems with dense networks.
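The concatenation fusion described in the abstract can be sketched in plain NumPy. This is an illustrative simplification, not the authors' implementation: the LSTM/BiLSTM encoders are omitted, and all variable names and dimensions (e.g. the 300-dim text representation and the three output scores for valence, arousal, dominance) are assumptions, except the 31 acoustic features stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance representations (dimensions are illustrative):
# 31 acoustic features, as stated in the abstract, and a 300-dim vector
# standing in for the output of the text (word-vector + LSTM) branch.
acoustic_repr = rng.standard_normal(31)
text_repr = rng.standard_normal(300)

def dense(x, w, b):
    """A single fully connected layer: y = w @ x + b."""
    return w @ x + b

# Concatenation (late) fusion: join both modality representations,
# then map to three dimensional-emotion scores (valence, arousal, dominance).
fused = np.concatenate([acoustic_repr, text_repr])   # shape (331,)
w = rng.standard_normal((3, fused.size)) * 0.01
b = np.zeros(3)
scores = dense(fused, w, b)
print(scores.shape)  # (3,)
```

In the paper's full system, dense layers like this one sit on top of the concatenated recurrent encoders and are trained to minimize the dimensional emotion prediction error.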