##### #1. On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
###### Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate, to be used for training is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
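To make the distinction between a mask target and a log magnitude target concrete, here is a minimal sketch. The abstract does not give the exact definitions used in the paper, so the amplitude-ratio mask, the clipping value, the mean squared error objective, and all function names below are assumptions chosen for illustration.

```python
import math

def ideal_amplitude_mask(clean_mag, noisy_mag, max_value=10.0, eps=1e-8):
    """Mask target: per-bin ratio |S| / |X|, clipped (clipping value is an assumption)."""
    return [min(s / (x + eps), max_value) for s, x in zip(clean_mag, noisy_mag)]

def log_magnitude(clean_mag, eps=1e-8):
    """Direct target: natural log of the clean magnitude spectrum."""
    return [math.log(s + eps) for s in clean_mag]

def mse(estimate, target):
    """Mean squared error, used here as a stand-in objective function."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

# Toy magnitude spectra for one frame with three frequency bins.
clean = [0.5, 1.0, 0.0]
noisy = [1.0, 1.0, 2.0]

mask = ideal_amplitude_mask(clean, noisy)
# Applying the ideal mask to the noisy magnitudes recovers the clean magnitudes.
enhanced = [m * x for m, x in zip(mask, noisy)]
log_target = log_magnitude(clean)
```

A network trained on the mask target would output `mask` and multiply it with the noisy spectrum, whereas a network trained on the direct target would output `log_target` itself.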
##### #2. Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems
###### Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as the Lombard effect. Current speech enhancement systems based on deep learning do not usually take into account this change in the speaking style, because they are trained with neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the effects that the Lombard reflex has on the performance of audio-visual speech enhancement systems based on deep learning. The results show a performance gap of up to approximately 5 dB between the systems trained on neutral speech and those trained on Lombard speech. This indicates the benefit of taking into account the mismatch between neutral and Lombard speech in the design of audio-visual speech enhancement systems.
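The phrase "noise is artificially added" usually means scaling a noise recording so that the mixture reaches a chosen signal-to-noise ratio. The sketch below shows one common way to do this; the function names and the toy signals are hypothetical, and the abstract does not state how the authors built their mixtures.

```python
import math

def power(signal):
    """Average power of a discrete-time signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / scaled noise power equals the target SNR."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain

# Toy signals: speech has power 1.0, noise has power 0.25.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

noisy, gain = mix_at_snr(speech, noise, snr_db=0.0)
```

Because the speech level is left untouched, a mixture built this way from neutral speech does not reproduce the changes in speaking style that a talker would make in real noise, which is the mismatch the abstract describes.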
##### #3. Robust universal neural vocoding
###### Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote
This paper introduces a robust universal neural vocoder trained with 74 speakers (covering both genders) coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker, style or recording condition seen during training or from an out-of-domain scenario. Together with the system, we present a full text-to-speech analysis of robustness of a number of implemented systems. The complexity of the systems tested ranges from a convolutional neural network-based system conditioned on linguistics to a recurrent neural network-based system conditioned on mel-spectrograms. The analysis shows that convolutional neural network-based systems are prone to occasional instabilities, while the recurrent approaches are significantly more stable and capable of providing universalizing robustness.
##### #4. A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content
###### Aniruddha Bhattacharya, K. V. Kadambari
We propose MoodNet, a deep convolutional neural network based architecture to effectively predict the emotion associated with a piece of music given its audio and lyrical content. We evaluate different architectures consisting of varying numbers of two-dimensional convolutional and subsampling layers, followed by dense layers. We use Mel-spectrograms to represent the audio content and word embeddings, specifically 100-dimensional word vectors, to represent the textual content of the lyrics. We feed input data from both modalities to our MoodNet architecture. The outputs from both modalities are then fused in a fully connected layer, and a softmax classifier is used to predict the category of emotion. Using F1-score as our metric, our results show excellent performance of MoodNet on the two datasets we experimented on: the MIREX Multimodal dataset and the Million Song Dataset. Our experiments reflect the hypothesis that more complex models perform better with more training data. We also observe that lyrics outperform audio as a...
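The fusion step described in this abstract can be sketched as concatenating the per-modality feature vectors, passing them through a dense layer, and applying a softmax. This is only an illustration in plain Python: the concatenation, the feature sizes, and the hand-picked weights are assumptions, not MoodNet's actual configuration.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(audio_features, lyric_features, weights, biases):
    """Concatenate both modalities (an assumed fusion rule), then classify."""
    fused = audio_features + lyric_features
    return softmax(dense(fused, weights, biases))

# Hypothetical outputs of the audio and lyrics branches.
audio_features = [0.2, 0.8]
lyric_features = [1.0, 0.0, 0.5]

# Toy parameters for two emotion classes over the five fused features.
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 1.0, 0.0, 0.0]]
biases = [0.0, 0.0]

probs = fuse_and_classify(audio_features, lyric_features, weights, biases)
```

In a trained model the weights would be learned jointly with both branches, so the classifier can weigh lyrical and audio evidence against each other.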
##### #5. HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty
###### Ishwarya Ananthabhotla, David B. Ramsay, Joseph A. Paradiso
The way we perceive a sound depends on many aspects: its ecological frequency, acoustic features, typicality, and most notably, its identified source. In this paper, we present the HCU400: a dataset of 402 sounds ranging from easily identifiable everyday sounds to intentionally obscured artificial ones. It aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings. We extend existing calculations of causal uncertainty, automating and generalizing them with word embeddings. Upon analysis we find that individuals will provide less polarized emotion ratings as a sound's source becomes increasingly ambiguous; individual ratings of familiarity and imageability, on the other hand, diverge as uncertainty increases despite a clear negative trend on average.
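A simple way to quantify causal uncertainty is the entropy of the source labels that listeners give for a sound. The sketch below assumes that formulation, since the abstract does not state the formula, and it does not attempt the word-embedding extension the authors describe. The function name and the example labels are hypothetical.

```python
import math
from collections import Counter

def causal_uncertainty(labels):
    """Shannon entropy (in bits) of the distribution of source labels for one sound."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All listeners agree on the source: no uncertainty.
agreed = causal_uncertainty(["dog bark"] * 4)

# Every listener names a different source: maximal uncertainty for four labels.
ambiguous = causal_uncertainty(["engine", "wind", "static", "rain"])
```

Exact string matching treats "dog bark" and "barking dog" as different sources, which is one reason to replace it with a similarity measure over word embeddings.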
##### #6. Comprehensive evaluation of statistical speech waveform synthesis
###### Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.
##### #7. Open-source platforms for fast room acoustic simulations in complex structures
###### Matthieu Aussal, Robin Gueguen
This article presents new numerical simulation tools, developed in Matlab and Blender, respectively. Available as open source under the GPL 3.0 license, they use a ray-tracing/image-sources hybrid method to calculate the room acoustics for large meshes. Performance is optimized to solve problems of significant size (typically more than 100,000 surface elements and about a million rays). For this purpose, a Divide and Conquer approach with a recursive binary tree structure has been implemented to reduce the quadratic complexity of the ray/element interactions to near-linear. Thus, execution times are less sensitive to the mesh density, which allows simulations of complex geometry. After ray propagation, a hybrid method leads to image-sources, which can be visually analyzed to localize sound map. Finally, impulse responses are constructed from the image-sources and FIR filters are proposed natively over 8 octave bands, taking into account material absorption properties and propagation medium. This algorithm is validated...
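The idea of building an impulse response from image-sources can be illustrated with a single first-order reflection. This sketch is not taken from the tools described above: it assumes one flat wall at a fixed x coordinate, a speed of sound of 343 m/s, spherical spreading, and a pressure reflection factor of sqrt(1 - absorption). All names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def mirror_across_wall(source, wall_x):
    """Image source for a wall lying in the plane x = wall_x."""
    x, y, z = source
    return (2.0 * wall_x - x, y, z)

def first_order_reflection(source, receiver, wall_x, absorption):
    """Return (delay in seconds, gain) of the reflection off the wall.

    Assumes the source and receiver are on the same side of the wall.
    """
    image = mirror_across_wall(source, wall_x)
    distance = math.dist(image, receiver)
    delay = distance / SPEED_OF_SOUND
    # Energy absorption coefficient -> pressure reflection factor, then 1/r spreading.
    gain = math.sqrt(1.0 - absorption) / distance
    return delay, gain

source = (1.0, 0.0, 0.0)
receiver = (2.0, 0.0, 0.0)
delay, gain = first_order_reflection(source, receiver, wall_x=0.0, absorption=0.19)
```

Each image source contributes one delayed, attenuated impulse. Repeating the calculation with a different absorption value per octave band gives a band-dependent response.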
##### #8. Speech Coding, Speech Interfaces and IoT - Opportunities and Challenges
###### Tom Bäckström
Recent speech and audio coding standards such as 3GPP Enhanced Voice Services match the foreseeable needs and requirements in transmission of speech and audio, when using current transmission infrastructure and applications. Trends in Internet-of-Things technology and development in personal digital assistants (PDAs), however, prompt us to consider future requirements for speech and audio codecs. The opportunities and challenges are here summarized in three concepts: collaboration, unification and privacy. First, an increasing number of devices will in the future be speech-operated, whereby the ability to focus voice commands to a specific device becomes essential. We therefore need methods which allow collaboration between devices, such that ambiguities can be resolved. Second, such collaboration can be achieved with a unified and standardized communication protocol between voice-operated devices. To achieve such collaboration protocols, we need to develop distributed speech coding technology for ad-hoc IoT networks. Finally...
##### #9. Gaussian-Constrained training for speaker verification
###### Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang
Neural models, in particular the d-vector and x-vector architectures, have produced state-of-the-art performance on many speaker verification tasks. However, two potential problems of these neural models deserve more investigation. Firstly, both models suffer from 'information leak', which means that some parameters participating in model training will be discarded during inference, i.e., the layers that are used as the classifier. Secondly, neither model regulates the distribution of the derived speaker vectors. This 'unconstrained distribution' may degrade the performance of the subsequent scoring component, e.g., PLDA. This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian. Our experiments on the VoxCeleb and SITW databases demonstrated that this new training approach produced more representative and regular speaker embeddings, leading to consistent performance improvement.
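One plausible way to encourage Gaussian-distributed speaker vectors is to penalize the squared distance between each vector and the mean of its speaker. Under an isotropic Gaussian with fixed variance, this penalty equals the negative log-likelihood up to a scale and a constant. The abstract does not give the loss the authors use, so the sketch below is an assumption for illustration only, and all names in it are hypothetical.

```python
from collections import defaultdict

def speaker_means(vectors, speakers):
    """Average the embedding vectors belonging to each speaker."""
    sums = {}
    counts = defaultdict(int)
    for vec, spk in zip(vectors, speakers):
        if spk not in sums:
            sums[spk] = [0.0] * len(vec)
        sums[spk] = [a + b for a, b in zip(sums[spk], vec)]
        counts[spk] += 1
    return {spk: [v / counts[spk] for v in total] for spk, total in sums.items()}

def gaussian_penalty(vectors, speakers):
    """Mean squared distance of each vector to its own speaker's mean."""
    means = speaker_means(vectors, speakers)
    total = 0.0
    for vec, spk in zip(vectors, speakers):
        total += sum((v - m) ** 2 for v, m in zip(vec, means[spk]))
    return total / len(vectors)

# Toy two-dimensional embeddings: two utterances of speaker "a", one of "b".
vectors = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
speakers = ["a", "a", "b"]
penalty = gaussian_penalty(vectors, speakers)
```

A penalty of this kind needs no parametric classification layer, which is consistent with point (1) of the abstract, although the training procedure in the paper may differ.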
##### #10. Acoustic Scene Classification: A Competition Review
###### Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen
In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course, by both the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.
