In this paper, we present a speech emotion recognition framework based on a latent Dirichlet allocation model. The method assumes that incoming speech frames are conditionally independent and exchangeable. Although this assumption discards temporal structure, the model still captures significant statistical information across frames. In contrast, a hidden Markov model-based approach captures the temporal structure of speech. Evaluated on the German emotional speech database EMO-DB, our method achieves an average classification accuracy of 80.7%, compared to 73% for hidden Markov models. This improvement comes at the cost of a slight increase in computational complexity. We map the proposed algorithm onto an FPGA platform and show that emotions in a speech utterance of duration 1.5 s can be identified in 1.8 ms while utilizing 70% of the FPGA resources. This further demonstrates the suitability of our approach for real-time applications on hand-held devices.
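
The exchangeability assumption can be illustrated with a minimal sketch. It assumes that each speech frame has already been quantized to a discrete codeword index, a preprocessing step the abstract does not specify, and the function names and the toy utterance are hypothetical. Under exchangeability, an utterance is summarized by its codeword counts, so reordering the frames leaves the representation unchanged, whereas an order-aware summary of the kind a hidden Markov model relies on does change.

```python
from collections import Counter

def bag_of_frames(codewords):
    """Order-free summary: how often each codeword occurs in the utterance."""
    return Counter(codewords)

def transition_counts(codewords):
    """Order-aware summary: counts of consecutive codeword pairs."""
    return Counter(zip(codewords, codewords[1:]))

# Hypothetical utterance: one codeword index per frame (assumed quantization).
utterance = [0, 0, 1, 2, 2, 2, 3, 1, 0, 3]

# Reordering the frames (here, reversing them) keeps the same set of frames.
reordered = utterance[::-1]

# The bag-of-frames view cannot tell the two orderings apart ...
same_bag = bag_of_frames(utterance) == bag_of_frames(reordered)

# ... while the transition view, which encodes temporal structure, can.
same_transitions = transition_counts(utterance) == transition_counts(reordered)

print("bag-of-frames identical:", same_bag)
print("transition counts identical:", same_transitions)
```

This only demonstrates the modeling assumption; it is not the inference procedure or the FPGA mapping described in the paper.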