Top 10 arXiv Papers Today in Audio and Speech Processing


2.129 Mikeys
#1. The phonetic bases of vocal expressed emotion: natural versus acted
Hira Dhamyal, Shahan A. Memon, Bhiksha Raj, Rita Singh
Can vocal emotions be emulated? This question has been a recurrent concern of the speech community, and has also been vigorously investigated. It has been fueled further by its link to the issue of validity of acted emotion databases. Much of the speech and vocal emotion research has relied on acted emotion databases as valid proxies for studying natural emotions. To create models that generalize to natural settings, it is crucial to work with valid prototypes -- ones that can be assumed to reliably represent natural emotions. More concretely, it is important to study emulated emotions against natural emotions in terms of their physiological and psychological concomitants. In this paper, we present an on-scale systematic study of the differences between natural and acted vocal emotions. We use a self-attention based emotion classification model to understand the phonetic bases of emotions by discovering the most attentive phonemes for each class of emotions. We then compare these attentive phonemes in their importance and...
Figures
None.
Tweets
arxiv_cscl: The phonetic bases of vocal expressed emotion: natural versus acted https://t.co/ELVBS6FNY4
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 0
Unique Words: 0

2.129 Mikeys
#2. Emotional Voice Conversion using multitask learning with Text-to-speech
Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee
Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving linguistic content. The previous state of the art in VC is based on sequence-to-sequence (seq2seq) models, which can misrepresent linguistic information. An earlier attempt to overcome this used textual supervision, but it requires explicit alignment, which loses the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS has abundant information on the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same as in VC. In the proposed model, the whole network is trained to minimize the losses of both VC and TTS. Through multitask learning, VC is expected to capture more linguistic information and to preserve training stability. Experiments on VC were performed on a male Korean emotional text-speech dataset, and they show that multitask learning helps preserve linguistic content in VC.
Figures
None.
Tweets
arxiv_cscl: Emotional Voice Conversion using multitask learning with Text-to-speech https://t.co/ZGg8QeDvVV
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 0
Unique Words: 0

2.053 Mikeys
#3. Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.
Figures
None.
Tweets
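ta_makino: I somehow managed to achieve this year's resolution, which was to finally publish a paper this year. It is a systems paper, so it has no theoretical novelty, but it is a breakthrough result for this field. https://t.co/V0hmtZzP3j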
arxivml: "Recurrent Neural Network Transducer for Audio-Visual Speech Recognition", Takaki Makino, Hank Liao, Yannis Assael,… https://t.co/eHXVTBH6sn
arxiv_cscv: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/DRlo3Qnz5w
arxiv_cscv: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/DRlo3QF9X4
arxiv_cscl: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/GpYqNSctz1
Retweets of @ta_makino's tweet (https://t.co/V0hmtZzP3j): TJO_datasci, siskw, btreetaiji, ballforest, heiga_zen, koh_t, twisugi, tsubosaka, cocomoff, ryo_masumura, kogecoo, tig33739130, yuizumi_y5i, ita_cora, hishiko79
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 7
Total Words: 6093
Unique Words: 2082

2.006 Mikeys
#4. Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification
Juliano Henrique Foleiss, Tiago Fernandes Tavares
We propose a method that quantifies the importance, namely relevance, of audio segments for classification in weakly-labelled problems. It works by drawing information from a set of class-wise one-vs-all classifiers. By selecting the classifiers used in each specific classification problem, the relevance measure adapts to different user-defined viewpoints without requiring additional neural network training. This characteristic allows the relevance measure to highlight audio segments that quickly adapt to user-defined criteria. Such functionality can be used for computer-assisted audio analysis. Also, we propose a neural network architecture, namely RELNET, that leverages the relevance measure for weakly-labelled audio classification problems. RELNET was evaluated in the DCASE2018 dataset and achieved competitive classification results when compared to previous attention-based proposals.
Figures
None.
Tweets
arxivml: "Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification", Juliano Henrique Foleiss, Tia… https://t.co/ZqXLN4bdn6
arxiv_cs_LG: Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification. Juliano Henrique Foleiss and Tiago Fernandes Tavares https://t.co/CZqA9n5U6p
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 2
Total Words: 3908
Unique Words: 1388

2.006 Mikeys
#5. Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing
Guillermo Cámbara, Jordi Luque, Mireia Farrús
The photoplethysmogram (PPG) signal is now commonly used for heart and sleep monitoring in smartphones and wrist wearables. Beyond these common uses, it has been proposed and reported that personal information can be extracted from PPG for other purposes, such as biometric tasks. In this work, we explore several end-to-end convolutional neural network architectures for detecting human characteristics such as gender or person identity. In addition, we evaluate whether speech/non-speech events can be inferred from the PPG signal, where speech might translate into fluctuations in the pulse signal. The obtained results are promising and clearly show the potential of fully end-to-end topologies for automatic extraction of meaningful biomarkers, even from a noisy signal sampled by a low-cost PPG sensor. The AUCs for the best architectures support the PPG wave as a biological discriminant, reaching 79% and 89.0% for the gender and person verification tasks, respectively. Furthermore, speech detection experiments reporting AUCs around...
Figures
None.
Tweets
arxivml: "Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing", G… https://t.co/rSnpfypbPT
arxiv_cs_LG: Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing. Guillermo Cámbara, Jordi Luque, and Mireia Farrús https://t.co/p6N29RJ0br
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

2.005 Mikeys
#6. 3-D Feature and Acoustic Modeling for Far-Field Speech Recognition
Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy
Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing the reverberation artifacts involves a beamforming based enhancement of the multi-channel speech signal, which is used to extract spectrogram based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multi variate autoregressive (MAR) modeling approach, where the correlations among all the three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs the joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the combination of multi-channel features that optimize the speech recognition cost compared to the traditional beamforming models that focus on the enhancement task. Experiments are conducted on the CHiME-3 and REVERB Challenge dataset using multi-channel reverberant speech. In these...
Figures
None.
Tweets
arxivml: "3-D Feature and Acoustic Modeling for Far-Field Speech Recognition", Anurenjan Purushothaman, Anirudh Sreeram, Sri… https://t.co/0vVBZ6ej1H
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

2.001 Mikeys
#7. 'Warriors of the Word' -- Deciphering Lyrical Topics in Music and Their Connection to Audio Feature Dimensions Based on a Corpus of Over 100,000 Metal Songs
Isabella Czedik-Eysenberg, Oliver Wieczorek, Christoph Reuter
We look into the connection between the musical and lyrical content of metal music by combining automated extraction of high-level audio features and quantitative text analysis on a corpus of 124,288 song lyrics from this genre. Based on this text corpus, a topic model was first constructed using Latent Dirichlet Allocation (LDA). For a subsample of 500 songs, scores for predicting perceived musical hardness/heaviness and darkness/gloominess were extracted using audio feature models. By combining both audio feature and text analysis, we (1) offer a comprehensive overview of the lyrical topics present within the metal genre and (2) are able to establish whether or not levels of hardness and other music dimensions are associated with the occurrence of particularly harsh (and other) textual topics. Twenty typical topics were identified and projected into a topic space using multidimensional scaling (MDS). After Bonferroni correction, positive correlations were found between musical hardness and darkness and textual topics dealing...
Figures
None.
Tweets
arxivml: "'Warriors of the Word' -- Deciphering Lyrical Topics in Music and Their Connection to Audio Feature Dimensions Bas… https://t.co/ikZ7Iu3m9N
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

2.001 Mikeys
#8. An End-to-end Approach for Lexical Stress Detection based on Transformer
Yong Ruan, Xiangdong Wang, Hong Liu, Zhigang Ou, Yun Gao, Jianfeng Cheng, Yueliang Qian
The dominant automatic lexical stress detection method splits the utterance into syllable segments using the phoneme sequence and its time-aligned boundaries. Features are then extracted from each syllable, and a classifier is used to classify the lexical stress. However, very accurate time boundaries for each phoneme cannot be obtained, and features have to be hand-designed on the syllable segments to classify the lexical stress. Therefore, we propose an end-to-end approach using a sequence-to-sequence transformer model to estimate lexical stress. For this, we train the transformer model using audio feature sequences and their phoneme sequences with lexical stress marks. During the recognition process, the recognized phoneme sequence is restricted according to the original standard phoneme sequence without lexical stress marks, but the lexical stress mark of each phoneme is not limited. We train the model on different subsets of LibriSpeech and perform lexical stress recognition on the TIMIT and L2-ARCTIC datasets. For all subsets, the end-to-end model...
Figures
None.
Tweets
arxivml: "An End-to-end Approach for Lexical Stress Detection based on Transformer", Yong Ruan, Xiangdong Wang, Hong Liu, Zh… https://t.co/f2aAWi9r8J
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 7
Total Words: 0
Unique Words: 0

2.001 Mikeys
#9. Non-Autoregressive Transformer Automatic Speech Recognition
Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak
Recently, very deep transformers have started to outperform traditional bi-directional long short-term memory networks by a large margin. However, for production use, inference computation cost and latency are still serious concerns in real scenarios. In this paper, we study a novel non-autoregressive transformer structure for speech recognition, which was originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token. The network is required to predict those mask tokens by taking both context and input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which starts from the easiest predictions and proceeds to the difficult ones. Some preliminary results on Aishell and CSJ benchmarks show the...
Figures
Tweets
arxivml: "Non-Autoregressive Transformer Automatic Speech Recognition", Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim … https://t.co/46d7a6bQ4d
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 3328
Unique Words: 1313

1.992 Mikeys
#10. Enhanced Voice Post Processing Using Voice Decoder Guidance Indicators
Phani Kumar Nyshadham, D R Shivakumar, Peter Kroon, Shmulik Markovich-Golan
Voice enhancement and voice coding are important functions in a voice-communication system. However, both functions are commonly treated independently, even though both utilize similar features of the underlying signals. Our proposal is to leverage information from one function to the benefit of the other. Specifically, our proposed changes focus on the voice enhancement at the downlink side, utilizing information from the voice decoding. Preliminary results show that such an approach results in improved quality. Additionally, suggestions are provided on future extensions of the proposed concept.
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 0
Unique Words: 0

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 222,744 papers.
