Top 10 arXiv Papers Today in Audio and Speech Processing


2.106 Mikeys
#1. On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement
Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform the best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
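To make the distinction between the two kinds of training target concrete, here is a minimal sketch for a single time frame. It assumes one common mask definition (the ratio of clean to noisy magnitude, often called an ideal amplitude mask), which is not necessarily the definition used in the paper. The function names, the clipping value `max_gain`, and the toy magnitudes are all hypothetical.

```python
import math

def ideal_amplitude_mask(clean_mag, noisy_mag, eps=1e-8, max_gain=10.0):
    """Mask target: per-bin ratio of clean to noisy magnitude, clipped at max_gain."""
    return [min(c / (n + eps), max_gain) for c, n in zip(clean_mag, noisy_mag)]

def apply_mask(mask, noisy_mag):
    """Mask-based enhancement: scale each noisy bin by the (estimated) mask."""
    return [m * n for m, n in zip(mask, noisy_mag)]

def log_magnitude_target(clean_mag, eps=1e-8):
    """Direct target: log magnitude spectrum of the clean speech."""
    return [math.log(c + eps) for c in clean_mag]

# Toy magnitude spectra for one frame (made-up numbers, not from the paper).
clean = [0.5, 1.0, 0.2, 0.0]
noisy = [1.0, 1.25, 0.8, 0.4]

mask = ideal_amplitude_mask(clean, noisy)   # what a mask-estimating model would regress
enhanced = apply_mask(mask, noisy)          # recovers the clean magnitudes (up to eps)
log_target = log_magnitude_target(clean)    # what a direct-estimation model would regress
```

With the oracle mask, the enhanced magnitudes match the clean ones up to the small `eps` term. A trained network would only approximate either target.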
Tweets
BrundageBot: On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, and Jesper Jensen https://t.co/nmv2Urbz9T
arxivml: "On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement", Daniel Miche… https://t.co/bJZForUUlW
Memoirs: On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement. https://t.co/Hgl0H4jlXb
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 5068
Unique Words: 1720

2.106 Mikeys
#2. Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems
Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as the Lombard effect. Current speech enhancement systems based on deep learning do not usually take this change in speaking style into account, because they are trained with neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the effects that the Lombard reflex has on the performance of audio-visual speech enhancement systems based on deep learning. The results show a performance gap of up to approximately 5 dB between the systems trained on neutral speech and the ones trained on Lombard speech. This indicates the benefit of taking into account the mismatch between neutral and Lombard speech in the design of audio-visual speech enhancement systems.
Tweets
BrundageBot: Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems. Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, and Jesper Jensen https://t.co/yU6DoIbqWs
arxivml: "Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems", Dani… https://t.co/UBzOw9VGL5
Memoirs: Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems. https://t.co/3BOk7sYXAG
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 4499
Unique Words: 1522

2.054 Mikeys
#3. Robust universal neural vocoding
Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote
This paper introduces a robust universal neural vocoder trained with 74 speakers (comprising both genders) coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker, style or recording condition seen during training or from an out-of-domain scenario. Together with the system, we present a full text-to-speech analysis of robustness of a number of implemented systems. The complexity of the systems tested ranges from a convolutional neural networks-based system conditioned on linguistics to a recurrent neural networks-based system conditioned on mel-spectrograms. The analysis shows that convolutional neural networks-based systems are prone to occasional instabilities, while the recurrent approaches are significantly more stable and capable of providing universalizing robustness.
Tweets
BrundageBot: Robust universal neural vocoding. Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, and Roberto Barra-Chicote https://t.co/zXdpwhZmWy
ComputerPapers: Robust universal neural vocoding. https://t.co/8gV2PT4zcc
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 6
Total Words: 3313
Unique Words: 1375

2.051 Mikeys
#4. A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content
Aniruddha Bhattacharya, K. V. Kadambari
We propose MoodNet, a deep convolutional neural network based architecture to effectively predict the emotion associated with a piece of music given its audio and lyrical content. We evaluate different architectures consisting of varying numbers of two-dimensional convolutional and subsampling layers, followed by dense layers. We use mel-spectrograms to represent the audio content and word embeddings, specifically 100-dimensional word vectors, to represent the textual content of the lyrics. We feed input data from both modalities to our MoodNet architecture. The outputs from both modalities are then fused in a fully connected layer, and a softmax classifier is used to predict the category of emotion. Using F1-score as our metric, our results show excellent performance of MoodNet on the two datasets we experimented on: the MIREX Multimodal dataset and the Million Song Dataset. Our experiments reflect the hypothesis that more complex models perform better with more training data. We also observe that lyrics outperform audio as a...
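The fusion step described in the abstract (the outputs of both modalities joined in a fully connected layer, followed by a softmax classifier) can be sketched roughly as below. This is only an illustration of the general idea, not the MoodNet architecture itself. The feature vectors, weights, biases, and the three-class output are all made-up placeholders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(audio_feat, lyric_feat, weights, biases):
    """Concatenate both modality outputs, apply one dense layer, then softmax."""
    fused = audio_feat + lyric_feat  # list concatenation of the two feature vectors
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical branch outputs and dense-layer parameters (three emotion classes).
audio_feat = [0.2, 0.7]
lyric_feat = [0.9, 0.1]
weights = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
biases = [0.0, 0.0, 0.0]

probs = fuse_and_classify(audio_feat, lyric_feat, weights, biases)
predicted_class = probs.index(max(probs))
```

In a real model the weights would be learned jointly with the convolutional branches. Here they are fixed so the example stays self-contained.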
Tweets
arxivml: "A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content", Aniruddha Bhattachary… https://t.co/F6tigv79Zx
nmfeeds: [CV] https://t.co/GERbfk0L0V A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content. W...
nmfeeds: [CL] https://t.co/GERbfk0L0V A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content. W...
nmfeeds: [O] https://t.co/GERbfk0L0V A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content. We...
arxiv_cscv: A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content https://t.co/4BLIjstgfc
arxiv_cscv: A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content https://t.co/4BLIjsbEQC
arxiv_cscl: A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content https://t.co/gghPBaFEBI
ComputerPapers: A Multimodal Approach towards Emotion Recognition of Music using Audio and Lyrical Content. https://t.co/6WFYIC92JC
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 3469
Unique Words: 1460

2.04 Mikeys
#5. HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty
Ishwarya Ananthabhotla, David B. Ramsay, Joseph A. Paradiso
The way we perceive a sound depends on many aspects: its ecological frequency, acoustic features, typicality, and most notably, its identified source. In this paper, we present the HCU400: a dataset of 402 sounds ranging from easily identifiable everyday sounds to intentionally obscured artificial ones. It aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings. We extend existing calculations of causal uncertainty, automating and generalizing them with word embeddings. Upon analysis we find that individuals will provide less polarized emotion ratings as a sound's source becomes increasingly ambiguous; individual ratings of familiarity and imageability, on the other hand, diverge as uncertainty increases despite a clear negative trend on average.
Tweets
arxiv_cscl: HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty https://t.co/uEauRnuAay
ComputerPapers: HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty. https://t.co/ISSHry9ruQ
Github
None.
Youtube
None.
Other stats
Sample Sizes: [5]
Authors: 3
Total Words: 3125
Unique Words: 1478

2.004 Mikeys
#6. Comprehensive evaluation of statistical speech waveform synthesis
Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.
Tweets
ComputerPapers: Comprehensive evaluation of statistical speech waveform synthesis. https://t.co/9xONeqKbDM
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 13
Total Words: 4587
Unique Words: 1641

2.0 Mikeys
#7. Open-source platforms for fast room acoustic simulations in complex structures
Matthieu Aussal, Robin Gueguen
This article presents new numerical simulation tools, developed in Matlab and Blender respectively. Available as open source under the GPL 3.0 license, they use a ray-tracing/image-sources hybrid method to calculate the room acoustics for large meshes. Performance is optimized to solve problems of significant size (typically more than 100,000 surface elements and about a million rays). For this purpose, a Divide and Conquer approach with a recursive binary tree structure has been implemented to reduce the quadratic complexity of the ray/element interactions to near-linear. Thus, execution times are less sensitive to the mesh density, which allows simulations of complex geometry. After ray propagation, a hybrid method leads to image-sources, which can be visually analyzed to localize sound map. Finally, impulse responses are constructed from the image-sources and FIR filters are proposed natively over 8 octave bands, taking into account material absorption properties and propagation medium. This algorithm is validated...
Tweets
ComputerPapers: Open-source platforms for fast room acoustic simulations in complex structures. https://t.co/anCA76ztAy
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 4995
Unique Words: 1835

1.995 Mikeys
#8. Speech Coding, Speech Interfaces and IoT - Opportunities and Challenges
Tom Bäckström
Recent speech and audio coding standards such as 3GPP Enhanced Voice Services match the foreseeable needs and requirements in transmission of speech and audio, when using current transmission infrastructure and applications. Trends in Internet-of-Things technology and developments in personal digital assistants (PDAs), however, prompt us to consider future requirements for speech and audio codecs. The opportunities and challenges are here summarized in three concepts: collaboration, unification and privacy. First, an increasing number of devices will in the future be speech-operated, whereby the ability to focus voice commands on a specific device becomes essential. We therefore need methods that allow collaboration between devices, such that ambiguities can be resolved. Second, such collaboration can be achieved with a unified and standardized communication protocol between voice-operated devices. To achieve such collaboration protocols, we need to develop distributed speech coding technology for ad-hoc IoT networks. Finally...
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 1
Total Words: 3691
Unique Words: 1389

0.0 Mikeys
#9. Gaussian-Constrained training for speaker verification
Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang
Neural models, in particular the d-vector and x-vector architectures, have produced state-of-the-art performance on many speaker verification tasks. However, two potential problems of these neural models deserve more investigation. Firstly, both models suffer from 'information leak', which means that some parameters participating in model training will be discarded during inference, i.e., the layers that are used as the classifier. Secondly, neither model regulates the distribution of the derived speaker vectors. This 'unconstrained distribution' may degrade the performance of the subsequent scoring component, e.g., PLDA. This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian. Our experiments on the VoxCeleb and SITW databases demonstrated that this new training approach produced more representative and regular speaker embeddings, leading to consistent performance improvement.
Tweets
BrundageBot: Gaussian-Constrained training for speaker verification. Lantian Li, Zhiyuan Tang, Ying Shi, and Dong Wang https://t.co/48bN5rRXSg
arxivml: "Gaussian-Constrained training for speaker verification", Lantian Li, Zhiyuan Tang, Ying Shi, Dong Wang https://t.co/n16EZ8Ow61
arxiv_cscl: Gaussian-Constrained training for speaker verification https://t.co/MkmQBBtW4y
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 3404
Unique Words: 1283

0.0 Mikeys
#10. Acoustic Scene Classification: A Competition Review
Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen
In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course, both by the students and by external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.
Tweets
nmfeeds: [O] https://t.co/6F1eQGb9Wg Acoustic Scene Classification: A Competition Review. In this paper we study the problem of aco...
nmfeeds: [CV] https://t.co/6F1eQGb9Wg Acoustic Scene Classification: A Competition Review. In this paper we study the problem of ac...
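nizumical: https://t.co/5hU4WaE3Hn “Acoustic Scene Classification: a competition review” After I took part in a Kaggle competition hosted by a Finnish university, they invited me to be a co-author. It was a good opportunity to write up the tricks I used. I'm also glad the results could be reproduced (phew)
sylvan5: RT @nizumical: https://t.co/5hU4WaE3Hn “Acoustic Scene Classification: a competition review” After I took part in a Kaggle competition hosted by a Finnish university, they invited me to be a co-author…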
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 8
Total Words: 4538
Unique Words: 1783

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 57,756 papers.
