Springer, 2016. — 328 p.

Automatic classification of speech and music has become an important topic with the advent of speech technologies in devices of our daily lives, such as smartphones and smartwatches. While commercial automatic speech recognition technologies with good accuracy are readily available, the classification of music and of paralinguistic information beyond the textual content, such as mood, emotion, voice quality, or personality, is a very young field, but possibly the next technological milestone for man–machine interfaces. This thesis advances the state of the art in the area by defining standard acoustic feature sets for real-time speech and music analysis and by proposing solutions to real-world problems: a multi-condition learning approach that increases noise robustness, noise-robust incremental segmentation of the input audio stream based on a novel, context-aware, data-driven voice activity detector, and a method for fully (time- and value-) continuous affect regression tasks are introduced.

Standard acoustic feature sets were defined and evaluated throughout a series of international research challenges. Further, a framework for incremental, real-time acoustic feature extraction is proposed, implemented, and published as an open-source toolkit (openSMILE). The toolkit implements all of the proposed baseline acoustic feature sets and has been widely adopted by the community: the publications introducing it have been cited over 400 times.

The proposed acoustic feature sets are systematically evaluated on 13 databases covering speech affect and music style classification tasks. Experiments are performed over a wide range of conditions, i.e., training instance balancing, feature value normalisation, and various classifier parameters. The proposed methods for real-time, noise-robust, incremental input segmentation and for noise-robust multi-condition classification are also extensively evaluated on several databases.
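The multi-condition learning mentioned above typically means augmenting clean training data with copies degraded by noise at several signal-to-noise ratios, so the classifier sees at training time the conditions it will face at test time. A minimal, illustrative sketch of such noise augmentation follows; the function names, the white-noise source, and the SNR values are assumptions for illustration, not the thesis's actual setup:

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal`, scaled so the result has the target SNR in dB."""
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve SNR_dB = 10*log10(sig_power / (gain^2 * noise_power)) for the gain.
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + gain * noise

def multi_condition_set(signal, noise, snrs_db=(0, 5, 10, 20)):
    """Return the clean signal plus one degraded copy per target SNR."""
    return [signal] + [add_noise_at_snr(signal, noise, s) for s in snrs_db]

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noise = rng.standard_normal(16000)                          # stand-in for real noise
augmented = multi_condition_set(clean, noise)
print(len(augmented))  # → 5: the clean signal plus four noisy conditions
```

In practice the noise would come from recordings of realistic environments rather than white noise, and features would then be extracted from every variant so that all conditions contribute to training.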
Finally, fully continuous (in time and value) automatic recognition of affect in five dimensions with long short-term memory recurrent neural networks is evaluated on a database of natural and spontaneous affective expressions (SEMAINE). The superior performance of the proposed large feature sets over smaller sets is shown for a multitude of tasks. All in all, this thesis is a significant contribution to the field of speech and music analysis and hopefully brings real-world speech and music analysis applications, such as robust emotion or music mood recognition, a bit closer to daily use.

Contents:
Introduction
Acoustic Features and Modelling
Standard Baseline Feature Sets
Real-time Incremental Processing
Real-Life Robustness
Evaluation
Discussion and Outlook
A: Detailed Descriptions of the Baseline Feature Sets
B: Mel-Frequency Filterbank Parameters