Top 9 arXiv Papers Today in Audio and Speech Processing


2.048 Mikeys
#1. Bimodal Speech Emotion Recognition Using Pre-Trained Language Models
Verena Heusser, Niklas Freymuth, Stefan Constantin, Alex Waibel
Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion recognition, achieving an accuracy of 69.5% on Task 4A of SemEval 2017, improving upon the previous state of the art by over 3% absolute. We combine these language models with speech emotion recognition, achieving results of 73.5% accuracy when using provided transcriptions and speech data on a subset of four classes of the IEMOCAP dataset. The use of noise-induced transcriptions and speech data results in an accuracy of 71.4%. For our experiments, we created IEmoNet, a modular and adaptable bimodal framework for speech emotion recognition based on pre-trained language models. Lastly, we discuss the idea of using an emotional classifier as a reward for reinforcement learning as a step towards more successful and convenient human-machine interaction.
Figures
None.
Tweets
arxivml: "Bimodal Speech Emotion Recognition Using Pre-Trained Language Models", Verena Heusser, Niklas Freymuth, Stefan Con… https://t.co/4DCvYtcZsl
StatsPapers: Bimodal Speech Emotion Recognition Using Pre-Trained Language Models. https://t.co/gPYnjRy9Oy
arxiv_cscl: Bimodal Speech Emotion Recognition Using Pre-Trained Language Models https://t.co/tylyAEBNfI
arxiv_cscl: Bimodal Speech Emotion Recognition Using Pre-Trained Language Models https://t.co/tylyAEkbR8
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 0
Unique Words: 0

2.025 Mikeys
#2. Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders
Yin-Jyun Luo, Chin-Chen Hsu, Kat Agres, Dorien Herremans
We propose a flexible framework that deals with both singer conversion and singers' vocal technique conversion. The proposed model is trained on non-parallel corpora, accommodates many-to-many conversion, and leverages recent advances in variational autoencoders. It employs separate encoders to learn disentangled latent representations of singer identity and vocal technique, with a joint decoder for reconstruction. Conversion is carried out by simple vector arithmetic in the learned latent spaces. Both a quantitative analysis and a visualization of the converted spectrograms show that our model is able to disentangle singer identity and vocal technique and successfully perform conversion of these attributes. To the best of our knowledge, this is the first work to jointly tackle conversion of singer identity and vocal technique based on a deep learning approach.
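The abstract does not spell out the vector arithmetic, so the following is only a minimal sketch of one common form of it: shift a latent code by the difference between the target and source attribute means. The function names, the mean-shift rule, and the 2-D example values are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized latent vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def convert_latent(latent: Sequence[float],
                   source_mean: Sequence[float],
                   target_mean: Sequence[float]) -> Vector:
    """Assumed conversion rule: z - mean(source) + mean(target)."""
    return [z - s + t for z, s, t in zip(latent, source_mean, target_mean)]


# Hypothetical 2-D singer-identity latents for two singers.
singer_a_latents = [[1.0, 0.0], [3.0, 2.0]]
singer_b_latents = [[-1.0, 4.0], [-3.0, 6.0]]

mu_a = mean_vector(singer_a_latents)
mu_b = mean_vector(singer_b_latents)

# Move one of singer A's latent codes towards singer B.
converted = convert_latent([1.0, 0.0], mu_a, mu_b)
print(converted)
```

In a full system the converted identity latent would be passed, together with the unchanged vocal technique latent, to the joint decoder; that step is omitted here.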
Figures
None.
Tweets
roadrunning01: Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders pdf: https://t.co/knY0o17RY2 abs: https://t.co/Slgyo4dNF8 samples: https://t.co/gtXd2GfQZT https://t.co/e2wEKJfJsA
arxivml: "Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoenc… https://t.co/kp90grcEYU
StatsPapers: Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders. https://t.co/KunM72m0fa
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 0
Unique Words: 0

2.024 Mikeys
#3. Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras
Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, Chiara Bartolozzi
In this work, we propose a new method to address audio-visual target speaker extraction in multi-talker environments using event-driven cameras. All audio-visual speech separation approaches use frame-based video to extract visual features. However, frame-based cameras usually work at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras because of their high temporal resolution and low latency. Recent work showed that landmark motion features are very important for obtaining good results in audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.
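For readers unfamiliar with the training target mentioned above, here is a minimal sketch of an Ideal Amplitude Mask under its usual definition: the ratio of the clean magnitude spectrogram to the mixture magnitude spectrogram, per time-frequency bin. The `eps` guard, the clipping value, and the toy spectrogram values are assumptions for illustration; the abstract does not give the authors' exact settings.

```python
from typing import List

Spectrogram = List[List[complex]]


def ideal_amplitude_mask(clean: Spectrogram, mixture: Spectrogram,
                         eps: float = 1e-8,
                         max_value: float = 10.0) -> List[List[float]]:
    """Per-bin ratio |S| / |Y|, clipped to max_value (clipping is an assumption)."""
    return [
        [min(abs(c) / (abs(m) + eps), max_value) for c, m in zip(c_row, m_row)]
        for c_row, m_row in zip(clean, mixture)
    ]


def apply_mask(mask: List[List[float]], mixture: Spectrogram) -> List[List[float]]:
    """Estimate the clean magnitude by scaling the mixture magnitude."""
    return [
        [w * abs(m) for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(mask, mixture)
    ]


# Hypothetical 2x2 complex spectrograms (rows are frames, columns are bins).
clean = [[3 + 4j, 0j], [1 + 0j, 0 + 2j]]
mixture = [[6 + 8j, 1 + 0j], [2 + 0j, 0 + 2j]]

mask = ideal_amplitude_mask(clean, mixture)
estimated_magnitude = apply_mask(mask, mixture)
print(mask)
print(estimated_magnitude)
```

In a system like the one described, a network would predict `mask` from the mixture and visual features rather than compute it from the clean signal, which is only available at training time.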
Figures
None.
Tweets
arxivml: "Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras", Ander Arriandiaga,… https://t.co/xBczQCvUC4
arxiv_cs_LG: Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras. Ander Arriandiaga, Giovanni Morrone, Luca Pasa, Leonardo Badino, and Chiara Bartolozzi https://t.co/p0K6R8qtKw
Memoirs: Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras. https://t.co/KdvHwd4r6w
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 5
Total Words: 3714
Unique Words: 1400

2.013 Mikeys
#4. Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
Wim Boes, Hugo Van hamme
We tackle the task of environmental event classification by drawing inspiration from the transformer neural network architecture used in machine translation. We modify this attention-based feedforward structure so that the resulting model can use audio as well as video to compute sound event predictions. We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection. The employed multi-label data contains clip-level annotations indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information. We show that the proposed modified transformers strongly improve upon previously introduced models and in fact achieve state-of-the-art results. We also make a compelling case for devoting more attention to research in multimodal audiovisual classification by proving the usefulness of visual information for the task at hand, namely audio...
Figures
None.
Tweets
arxivml: "Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio E… https://t.co/4x15IRB8uE
StatsPapers: Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events. https://t.co/Qhf9WhPEk0
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 0
Unique Words: 0

2.013 Mikeys
#5. Predominant Musical Instrument Classification based on Spectral Features
Ankit Khairkar, Chaudhari Bhushan Jayant, Karthikeya Racharla, Paturu Harish, Vineet Kumar
This work aims to examine one of the cornerstone problems of Musical Instrument Recognition, in particular instrument classification. The IRMAS (Instrument Recognition in Musical Audio Signals) data set is chosen. The data includes music from various decades of the last century, and thus varies widely in audio quality. We present a very concise summary of past work in this domain. Among the supervised learning algorithms we implemented for this classification task, the SVM classifier outperformed the other state-of-the-art models with an accuracy of 79%. The classifier had major difficulty distinguishing between flute and organ. We also implemented unsupervised techniques, of which Hierarchical Clustering performed well. We have included most of the code (Jupyter notebook) for easy reproducibility.
Figures
None.
Tweets
arxivml: "Predominant Musical Instrument Classification based on Spectral Features", Ankit Khairkar, Chaudhari Bhushan Jayan… https://t.co/GYnkwYjFTA
StatsPapers: Predominant Musical Instrument Classification based on Spectral Features. https://t.co/1vVFWsJSPL
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 5
Total Words: 0
Unique Words: 0

2.013 Mikeys
#6. Investigating Deep Neural Transformations for Spectrogram-based Musical Source Separation
Woosung Choi, Minseok Kim, Jaehwa Chung, Daewon Lee, Soonyoung Jung
Musical Source Separation (MSS) is a signal processing task that tries to separate the mixed musical signal into each acoustic sound source, such as singing voice or drums. Recently many machine learning-based methods have been proposed for the MSS task, but there were no existing works that evaluate and directly compare various types of networks. In this paper, we aim to design a variety of neural transformation methods, including time-invariant methods, time-frequency methods, and mixtures of two different transformations. Our experiments provide abundant material for future works by comparing several transformation methods. We train our models on raw complex-valued STFT outputs and achieve state-of-the-art SDR performance in the MUSDB18 singing voice separation task by a large margin of 1.0 dB.
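To give the 1.0 dB figure some intuition, here is a minimal sketch of a signal-to-distortion ratio in its plain form, 10·log10(‖s‖² / ‖s − ŝ‖²). This is an assumption-laden simplification: MUSDB18 results are normally reported with the BSS Eval family of metrics, which involve extra projection steps not shown here. The function name and sample signals are hypothetical.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-12) -> float:
    """Plain SDR in dB: energy of the reference over energy of the error."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Hypothetical reference source and an estimate that is 10% too quiet.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]

score = sdr_db(reference, estimate)
print(round(score, 3))

# Under this plain definition, a 1.0 dB gain means the error energy
# shrinks by a factor of 10 ** 0.1 (about 1.26).
error_ratio_for_one_db = 10 ** 0.1
print(round(error_ratio_for_one_db, 3))
```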
Figures
None.
Tweets
arxivml: "Investigating Deep Neural Transformations for Spectrogram-based Musical Source Separation", Woosung Choi, Minseok … https://t.co/jTnprJgH6M
StatsPapers: Investigating Deep Neural Transformations for Spectrogram-based Musical Source Separation. https://t.co/2ptmXJSb3N
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 5
Total Words: 0
Unique Words: 0

2.01 Mikeys
#7. Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
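The "relative improvement" figures above are relative reductions in error rate, not absolute differences. A minimal sketch of that arithmetic follows; the word error rates used are hypothetical placeholders chosen only to produce a 42% relative reduction, and are not the paper's reported numbers.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error reduction: (baseline - new) / baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical word error rates in percent (not taken from the paper).
baseline_wer = 10.0
new_wer = 5.8

gain = relative_improvement(baseline_wer, new_wer)
print(f"{gain:.0%} relative improvement")
```

With these placeholder values, an absolute drop of 4.2 WER points corresponds to a 42% relative improvement, which is why relative figures can look much larger than the absolute change.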
Figures
None.
Tweets
BrundageBot: Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition. Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff https://t.co/6ck15ilnJL
arxivml: "Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition", Shaoshi Ling, Yuzong Liu, Ju… https://t.co/y97yrJ8w6t
JulianSlzr: our new work: Deep Contextual >>Acoustic<< Representations (DeCoAR) https://t.co/dRMFdJjhBU #SpeechRecognition #NLProc https://t.co/QPmuqZYjFX
Memoirs: Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition. https://t.co/CPCAWFOZ2Q
arxiv_cscl: Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition https://t.co/BNgWbZfESi
toannguyen177: RT @JulianSlzr: our new work: Deep Contextual >>Acoustic<< Representations (DeCoAR) https://t.co/dRMFdJjhBU #SpeechRecognition #NLProc http…
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 0
Unique Words: 0

2.005 Mikeys
#8. Integrating Whole Context to Sequence-to-sequence Speech Recognition
Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Because an attention-based sequence-to-sequence (Seq2Seq) speech recognition model decodes a token sequence in a left-to-right manner, it is non-trivial for the decoder to leverage the whole context of the target sequence. In this paper, we propose a self-attention-based language model called the causal cloze completer (COR), which models the left context and the right context simultaneously. We then use our previously proposed "Learn Spelling from Teachers" approach to integrate the whole-context knowledge from COR into the Seq2Seq model. We conduct experiments on the public Chinese dataset AISHELL-1. The experimental results show that leveraging the whole context can improve the performance of the Seq2Seq model.
Figures
None.
Tweets
BrundageBot: Integrating Whole Context to Sequence-to-sequence Speech Recognition. Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, and Shuai Zhang https://t.co/eQZmygxKQW
arxivml: "Integrating Whole Context to Sequence-to-sequence Speech Recognition", Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun … https://t.co/cUG6RJx8TZ
arxiv_cscl: Integrating Whole Context to Sequence-to-sequence Speech Recognition https://t.co/P3FqFSNfDa
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 6
Total Words: 0
Unique Words: 0

1.993 Mikeys
#9. Powerful Speaker Embedding Training Framework by Adversarially Disentangled Identity Representation
Jianwei Tai, Hang Zhou, Qingjia Huang, Xiaoqi Jia
The main challenges of speaker verification in the wild are the interference caused by irrelevant information in speech and the lack of speaker labels in speech datasets. To address these problems, we propose a novel speaker embedding training framework based on adversarially disentangled identity representation. Our key insight is to adversarially learn identity-purified features for speaker verification, and to learn an identity-unrelated feature from which speaker information cannot be distinguished. We improve existing state-of-the-art speaker verification models without adjusting the structure or hyper-parameters of any model. Experiments show that the proposed framework significantly improves speaker verification performance over the original model without any empirical adjustments, and that it is particularly useful for alleviating the lack of speaker labels.
Figures
None.
Tweets
None.
GitHub
None.
YouTube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 0
Unique Words: 0

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

Tracking 234,430 papers.
