### Top 10 Arxiv Papers Today in Audio And Speech Processing

##### #1. The phonetic bases of vocal expressed emotion: natural versus acted
###### Hira Dhamyal, Shahan A. Memon, Bhiksha Raj, Rita Singh
Can vocal emotions be emulated? This question has been a recurrent concern of the speech community, and has also been vigorously investigated. It has been fueled further by its link to the issue of validity of acted emotion databases. Much of the speech and vocal emotion research has relied on acted emotion databases as valid proxies for studying natural emotions. To create models that generalize to natural settings, it is crucial to work with valid prototypes -- ones that can be assumed to reliably represent natural emotions. More concretely, it is important to study emulated emotions against natural emotions in terms of their physiological and psychological concomitants. In this paper, we present an on-scale systematic study of the differences between natural and acted vocal emotions. We use a self-attention based emotion classification model to understand the phonetic bases of emotions by discovering the most attentive phonemes for each class of emotions. We then compare these attentive phonemes in their importance and...
###### Tweets
arxiv_cscl: The phonetic bases of vocal expressed emotion: natural versus acted https://t.co/ELVBS6FNY4
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 0
Unique Words: 0

##### #2. Emotional Voice Conversion using multitask learning with Text-to-speech
###### Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee
Voice conversion (VC) is the task of transforming a person's voice into a different style while conserving linguistic content. The previous state of the art in VC is based on sequence-to-sequence (seq2seq) models, which can distort linguistic information. An earlier attempt to overcome this used textual supervision, but it requires explicit alignment, which loses the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS has abundant information on the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same as in VC. In the proposed model, the whole network is trained to minimize the losses of VC and TTS. Through multitask learning, VC is expected to capture more linguistic information and to preserve training stability. Experiments on VC were performed on a male Korean emotional text-speech dataset, and it is shown that multitask learning helps preserve linguistic content in VC.
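As a rough illustration of the joint objective described in this abstract, the training loss could be written as a weighted sum of the two task losses. The weight $\lambda$ is an assumption introduced here for illustration; the abstract only states that the network is trained to minimize the loss of VC and TTS, and does not say how the two terms are combined.

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{VC}} + \lambda \, \mathcal{L}_{\text{TTS}}$$

With $\lambda = 1$ this reduces to a plain sum of the two losses.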
###### Tweets
arxiv_cscl: Emotional Voice Conversion using multitask learning with Text-to-speech https://t.co/ZGg8QeDvVV
###### Other stats
Sample Sizes : None.
Authors: 5
Total Words: 0
Unique Words: 0

##### #3. Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
###### Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performances of audio-only, visual-only, and audio-visual systems are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.
###### Tweets
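ta_makino: I somehow managed to fulfill this year's resolution of finally getting a paper out this year. It is a system paper, so there is no theoretical novelty, but for this field it is a breakthrough result. https://t.co/V0hmtZzP3j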
arxivml: "Recurrent Neural Network Transducer for Audio-Visual Speech Recognition", Takaki Makino, Hank Liao, Yannis Assael,… https://t.co/eHXVTBH6sn
arxiv_cscv: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/DRlo3Qnz5w
arxiv_cscv: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/DRlo3QF9X4
arxiv_cscl: Recurrent Neural Network Transducer for Audio-Visual Speech Recognition https://t.co/GpYqNSctz1
RT @ta_makino (the tweet above), retweeted by: TJO_datasci, siskw, btreetaiji, ballforest, heiga_zen, koh_t, twisugi, tsubosaka, cocomoff, ryo_masumura, kogecoo, tig33739130, yuizumi_y5i, ita_cora, hishiko79 https://t.co/V0hmtZzP3j
###### Other stats
Sample Sizes : None.
Authors: 7
Total Words: 6093
Unique Words: 2082

##### #4. Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification
###### Juliano Henrique Foleiss, Tiago Fernandes Tavares
We propose a method that quantifies the importance, namely relevance, of audio segments for classification in weakly-labelled problems. It works by drawing information from a set of class-wise one-vs-all classifiers. By selecting the classifiers used in each specific classification problem, the relevance measure adapts to different user-defined viewpoints without requiring additional neural network training. This characteristic allows the relevance measure to highlight audio segments that quickly adapt to user-defined criteria. Such functionality can be used for computer-assisted audio analysis. We also propose a neural network architecture, namely RELNET, that leverages the relevance measure for weakly-labelled audio classification problems. RELNET was evaluated on the DCASE2018 dataset and achieved competitive classification results when compared to previous attention-based proposals.
###### Tweets
arxivml: "Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification", Juliano Henrique Foleiss, Tia… https://t.co/ZqXLN4bdn6
arxiv_cs_LG: Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification. Juliano Henrique Foleiss and Tiago Fernandes Tavares https://t.co/CZqA9n5U6p
###### Other stats
Sample Sizes : None.
Authors: 2
Total Words: 3908
Unique Words: 1388

##### #5. Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing
###### Guillermo Cámbara, Jordi Luque, Mireia Farrús
The use of the photoplethysmogram (PPG) signal for heart and sleep monitoring is commonly found nowadays in smartphones and wrist wearables. Besides common usages, it has been proposed and reported that person information can be extracted from PPG for other uses, like biometry tasks. In this work, we explore several end-to-end convolutional neural network architectures for detection of human characteristics such as gender or person identity. In addition, we evaluate whether speech/non-speech events may be inferred from the PPG signal, where speech might translate into fluctuations in the pulse signal. The obtained results are promising and clearly show the potential of fully end-to-end topologies for automatic extraction of meaningful biomarkers, even from a noisy signal sampled by a low-cost PPG sensor. The AUCs for the best architectures put forward the PPG wave as a biological discriminant, reaching $79\%$ and $89.0\%$, respectively, for gender and person verification tasks. Furthermore, speech detection experiments reporting AUCs around...
###### Tweets
arxivml: "Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing", G… https://t.co/rSnpfypbPT
arxiv_cs_LG: Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing. Guillermo Cámbara, Jordi Luque, and Mireia Farrús https://t.co/p6N29RJ0br
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

##### #6. 3-D Feature and Acoustic Modeling for Far-Field Speech Recognition
###### Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy
Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing the reverberation artifacts involves a beamforming based enhancement of the multi-channel speech signal, which is used to extract spectrogram based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multivariate autoregressive (MAR) modeling approach, where the correlations among all the three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs the joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the combination of multi-channel features that optimize the speech recognition cost compared to the traditional beamforming models that focus on the enhancement task. Experiments are conducted on the CHiME-3 and REVERB Challenge datasets using multi-channel reverberant speech. In these...
###### Tweets
arxivml: "3-D Feature and Acoustic Modeling for Far-Field Speech Recognition", Anurenjan Purushothaman, Anirudh Sreeram, Sri… https://t.co/0vVBZ6ej1H
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

##### #7. 'Warriors of the Word' -- Deciphering Lyrical Topics in Music and Their Connection to Audio Feature Dimensions Based on a Corpus of Over 100,000 Metal Songs
###### Isabella Czedik-Eysenberg, Oliver Wieczorek, Christoph Reuter
We look into the connection between the musical and lyrical content of metal music by combining automated extraction of high-level audio features and quantitative text analysis on a corpus of 124,288 song lyrics from this genre. Based on this text corpus, a topic model was first constructed using Latent Dirichlet Allocation (LDA). For a subsample of 500 songs, scores for predicting perceived musical hardness/heaviness and darkness/gloominess were extracted using audio feature models. By combining both audio feature and text analysis, we (1) offer a comprehensive overview of the lyrical topics present within the metal genre and (2) are able to establish whether or not levels of hardness and other music dimensions are associated with the occurrence of particularly harsh (and other) textual topics. Twenty typical topics were identified and projected into a topic space using multidimensional scaling (MDS). After Bonferroni correction, positive correlations were found between musical hardness and darkness and textual topics dealing...
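The Bonferroni correction mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the significance level `alpha=0.05`, and the example p-values are all assumptions made here. The idea is that when `m` correlations are tested at once, each p-value is compared against `alpha / m` rather than `alpha`.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which tests remain significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests.
    The function name and the default alpha are illustrative assumptions.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Made-up p-values for three hypothetical topic/audio-feature correlations.
example_p_values = [0.001, 0.02, 0.04]
flags = bonferroni_significant(example_p_values)
print(flags)
```

With three tests the per-test threshold drops from 0.05 to about 0.0167, so only the first of the made-up p-values survives the correction.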
###### Tweets
arxivml: "'Warriors of the Word' -- Deciphering Lyrical Topics in Music and Their Connection to Audio Feature Dimensions Bas… https://t.co/ikZ7Iu3m9N
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

##### #8. An End-to-end Approach for Lexical Stress Detection based on Transformer
###### Yong Ruan, Xiangdong Wang, Hong Liu, Zhigang Ou, Yun Gao, Jianfeng Cheng, Yueliang Qian
The dominant automatic lexical stress detection method splits the utterance into syllable segments using the phoneme sequence and its time-aligned boundaries, then extracts features from each syllable and applies a classification method to classify the lexical stress. However, very accurate time boundaries for each phoneme cannot be obtained, and features in the syllable segments have to be designed to classify the lexical stress. Therefore, we propose an end-to-end approach that uses a transformer sequence-to-sequence model to estimate lexical stress. For this, we train a transformer model using audio feature sequences and their phoneme sequences with lexical stress marks. During the recognition process, the recognized phoneme sequence is restricted according to the original standard phoneme sequence without lexical stress marks, but the lexical stress mark of each phoneme is not limited. We train the model on different subsets of Librispeech and perform lexical stress recognition on the TIMIT and L2-ARCTIC datasets. For all subsets, the end-to-end model...
###### Tweets
arxivml: "An End-to-end Approach for Lexical Stress Detection based on Transformer", Yong Ruan, Xiangdong Wang, Hong Liu, Zh… https://t.co/f2aAWi9r8J
###### Other stats
Sample Sizes : None.
Authors: 7
Total Words: 0
Unique Words: 0

##### #9. Non-Autoregressive Transformer Automatic Speech Recognition
###### Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak
Recently, very deep transformers have started to outperform traditional bi-directional long short-term memory networks by a large margin. However, for production use, inference computation cost and latency are still serious concerns in real scenarios. In this paper, we study a novel non-autoregressive transformer structure for speech recognition, which was originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token. The network is required to predict those mask tokens by taking both context and input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the difficult ones. Some preliminary results on Aishell and CSJ benchmarks show the...
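The masking and easy-first decoding described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `MASK`, `random_mask`, `easy_first_decode`, and the `toy_predict` stand-in are hypothetical names, and a real system would replace `toy_predict` with a transformer that also conditions on the input speech.

```python
import random
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"  # hypothetical mask symbol

# Hypothetical predictor interface: given the partially filled token list,
# return one (token, confidence) guess for every position.
Predictor = Callable[[Sequence[str]], List[Tuple[str, float]]]


def random_mask(tokens: Sequence[str], rate: float, rng: random.Random) -> List[str]:
    """Training-time corruption: replace each token by MASK with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def easy_first_decode(predict: Predictor, length: int, per_step: int = 1) -> List[str]:
    """Start from all-mask input and commit the most confident guesses first.

    Assumes per_step >= 1 and that the predictor never returns MASK as a token.
    """
    tokens = [MASK] * length
    while MASK in tokens:
        guesses = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # "Easiest" positions are taken to be those with the highest confidence.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = guesses[i][0]
    return tokens


def toy_predict(tokens: Sequence[str]) -> List[Tuple[str, float]]:
    """Stand-in for the network: fixed guesses with fixed confidences."""
    target = ["a", "b", "c", "d"]
    confidence = [0.2, 0.9, 0.5, 0.7]
    return list(zip(target, confidence))


decoded = easy_first_decode(toy_predict, length=4)
print(decoded)
```

Setting `per_step` above 1 fills several positions per pass, which is where a non-autoregressive decoder would save inference steps relative to left-to-right decoding.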
###### Tweets
arxivml: "Non-Autoregressive Transformer Automatic Speech Recognition", Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim … https://t.co/46d7a6bQ4d
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 3328
Unique Words: 1313

##### #10. Enhanced Voice Post Processing Using Voice Decoder Guidance Indicators
###### Phani Kumar Nyshadham, D R Shivakumar, Peter Kroon, Shmulik Markovich-Golan
Voice enhancement and voice coding are essential functions in a voice-communication system. However, both functions are commonly treated independently, even though both utilize similar features of the underlying signals. Our proposal is to leverage information from one function to the benefit of the other. Specifically, our proposed changes focus on the voice enhancement at the downlink side, utilizing information from the voice decoding. Preliminary results show that such an approach results in improved quality. Additionally, suggestions are provided on future extensions of the proposed concept.
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 0
Unique Words: 0

###### About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real time) based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 222,744 papers.
