Top 10 arXiv Papers Today in Audio and Speech Processing


0.0 Mikeys
#1. Twin Regularization for online speech recognition
Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio
Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that gather some future frames. This introduces a latency which depends on the number of employed look-ahead features. This paper explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. Inspired by a recently proposed technique called Twin Networks, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time. The experiments,...
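As a rough illustration of the regularization term described in the abstract, the sketch below measures how far forward hidden states are from the cotemporal backward ("twin") ones. The function name `twin_penalty`, the choice of a mean squared L2 distance, and the toy values are assumptions made for illustration; the abstract does not specify the exact distance or its weighting.

```python
def twin_penalty(forward_states, backward_states):
    """Mean squared L2 distance between cotemporal hidden states.

    forward_states[t] and backward_states[t] are the hidden vectors of the
    forward network and of the backward "twin" network at the same time step t.
    The squared-L2 form is an assumption, not taken from the paper.
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("both sequences must cover the same time steps")
    if not forward_states:
        return 0.0
    total = 0.0
    for h_fwd, h_bwd in zip(forward_states, backward_states):
        # Sum of squared differences over the hidden units at this time step.
        total += sum((f - b) ** 2 for f, b in zip(h_fwd, h_bwd))
    return total / len(forward_states)


# Toy example: two time steps, two hidden units each (made-up numbers).
forward = [[1.0, 0.0], [0.5, 0.5]]
backward = [[1.0, 1.0], [0.5, 0.5]]
penalty = twin_penalty(forward, backward)
```

In a training loop, a term like this would presumably be added to the main recognition loss with a weighting factor, which is left out here.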
Figures
None.
Tweets
mirco_ravanelli: For more info, you can find the paper here: https://t.co/CviBZh61VP
ahammami0: Twin Regularization for online speech recognition Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio :https://t.co/R8PSFX8ZRH @mirco_ravanelli @d_serdyuk #AI #ArtificialIntelligence #DataScience #speechrecognition #LANGUAGE #MachineLearning https://t.co/uSzApyP95X
machinelearn_d: RT @ahammami0: Twin Regularization for online speech recognition Mirco Ravanelli, Dmitriy Serdyuk, Yoshua Bengio :https://t.co/R8PSFX8ZRH…
jeffjosesvlj: RT @mirco_ravanelli: For more info, you can find the paper here: https://t.co/CviBZh61VP
Github

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

Repository: pytorch-kaldi
User: mravanelli
Language: Perl
Stargazers: 84
Subscribers: 7
Forks: 17
Open Issues: 3
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 3
Total Words: 4597
Unique Words: 1687

0.0 Mikeys
#2. Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018
Mattia Antonino Di Gangi, Roberto Dessì, Roldano Cattoni, Matteo Negri, Marco Turchi
This paper describes FBK's submission to the end-to-end English-German speech translation task at IWSLT 2018. Our system relies on a state-of-the-art model based on LSTMs and CNNs, where the CNNs are used to reduce the temporal dimension of the audio input, which is in general much higher than machine translation input. Our model was trained only on the audio-to-text parallel data released for the task, and fine-tuned on cleaned subsets of the original training corpus. The addition of weight normalization and label smoothing improved the baseline system by 1.0 BLEU point on our validation set. The final submission also featured checkpoint averaging within a training run and ensemble decoding of models trained during multiple runs. On test data, our best single model obtained a BLEU score of 9.7, while the ensemble obtained a BLEU score of 10.24.
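The checkpoint averaging mentioned in the abstract can be pictured as an element-wise mean of parameters saved at different points of one training run. The sketch below is a minimal version under assumed conventions: the name `average_checkpoints` and the representation of a checkpoint as a dict of flat float lists are hypothetical, and the submission's actual tooling is not described in the abstract.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name to a
    flat list of floats, with identical names and shapes in every checkpoint.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy example with two checkpoints (made-up numbers).
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
avg = average_checkpoints(ckpts)
```

Ensemble decoding, by contrast, combines the predictions of separately trained models at decoding time rather than merging their weights.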
Figures
Tweets
mdigangiPA: Our submission to the speech translation task @ #iwslt18 now on arxiv. Our goal was to squeeze out the small parallel data to obtain every bit of improvement for our model. https://t.co/YCOtJ3sZhQ #NLProc #machinetranslation @fbk_mt @robdessi @Turchi_Marco @negri_teo
StatsPapers: Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018. https://t.co/QqP1ZD5WM3
arxiv_cscl: Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018 https://t.co/CsAaoqDl28
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 5
Total Words: 4344
Unique Words: 1500

0.0 Mikeys
#3. Text-Independent Speaker Verification Using Long Short-Term Memory Networks
Aryan Mobiny
In this paper, an architecture based on Long Short-Term Memory Networks has been proposed for the text-independent scenario which is aimed to capture the temporal speaker-related information by operating over traditional speech features. For speaker verification, at first, a background model must be created for speaker representation. Then, in the enrollment stage, the speaker models will be created based on the enrollment utterances. For this work, the model will be trained in an end-to-end fashion to combine the first two stages. The main goal of end-to-end training is the model being optimized to be consistent with the speaker verification protocol. The end-to-end training jointly learns the background and speaker models by creating the representation space. The LSTM architecture is trained to create a discrimination space for validating the match and non-match pairs for speaker verification. The proposed architecture demonstrates its superiority in the text-independent scenario compared to other traditional methods.
Figures
Tweets
arxiv_cscl: Text-Independent Speaker Verification Using Long Short-Term Memory Networks https://t.co/rxHEIKsDek
_Artemisa_v: RT @arxiv_cscl: Text-Independent Speaker Verification Using Long Short-Term Memory Networks https://t.co/rxHEIKsDek
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 1
Total Words: 3437
Unique Words: 1272

0.0 Mikeys
#4. Concatenated Identical DNN (CI-DNN) to Reduce Noise-Type Dependence in DNN-Based Speech Enhancement
Ziyi Xu, Maximilian Strake, Tim Fingscheidt
Estimating time-frequency domain masks for speech enhancement using deep learning approaches has recently become a popular field of research. In this paper, we propose a mask-based speech enhancement framework by using concatenated identical deep neural networks (CI-DNNs). The idea is that a single DNN is trained under multiple input and output signal-to-noise power ratio (SNR) conditions, using targets that provide a moderate SNR gain with respect to the input and therefore achieve a balance between speech component quality and noise suppression. We concatenate this single DNN several times without any retraining to provide enough noise attenuation. Simulation results show that our proposed CI-DNN outperforms enhancement methods using classical spectral weighting rules w.r.t. total speech quality and speech intelligibility. Moreover, our approach shows similar or even a little bit better performance with much fewer trainable parameters compared with a noisy-target single DNN approach of the same size. A comparison to...
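The core idea of reusing one trained network several times in series, without retraining, can be sketched as plain function composition. Everything concrete below is an assumption for illustration: `concatenate_stages` and `toy_stage` are hypothetical names, and the stand-in stage simply adds a fixed 5 dB SNR gain instead of running a real mask-estimating DNN.

```python
def concatenate_stages(stage, num_stages):
    """Return a function that applies the same stage num_stages times in series."""
    def chained(x):
        for _ in range(num_stages):
            # The identical stage is reused; nothing is retrained between passes.
            x = stage(x)
        return x
    return chained


def toy_stage(snr_db):
    """Stand-in for the single trained DNN: a moderate, fixed SNR gain.

    The 5 dB figure is a made-up value, not a result from the paper.
    """
    return snr_db + 5.0


enhance = concatenate_stages(toy_stage, 3)
output_snr = enhance(0.0)
```

A real stage would not give a constant gain at every input SNR, which is presumably why the abstract stresses training under multiple input and output SNR conditions.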
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 3
Total Words: 5428
Unique Words: 1740

0.0 Mikeys
#5. Acoustic Scene Classification: A Competition Review
Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen
In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content. We describe the methods and results discovered during a competition organized in the context of a graduate machine learning course; both by the students and external participants. We identify the most suitable methods and study the impact of each by performing an ablation study of the mixture of approaches. We also compare the results with a neural network baseline, and show the improvement over that. Finally, we discuss the impact of using a competition as a part of a university course, and justify its importance in the curriculum based on student feedback.
Figures
Tweets
nmfeeds: [O] https://t.co/6F1eQGb9Wg Acoustic Scene Classification: A Competition Review. In this paper we study the problem of aco...
nmfeeds: [CV] https://t.co/6F1eQGb9Wg Acoustic Scene Classification: A Competition Review. In this paper we study the problem of ac...
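nizumical: https://t.co/5hU4WaE3Hn “Acoustic Scene Classification: a competition review” After I took part in a Kaggle competition hosted by a university in Finland, they invited me to be a co-author. It was a good opportunity to write up the tricks I used. I'm also relieved that the results could be reproduced.
sylvan5: RT @nizumical: https://t.co/5hU4WaE3Hn “Acoustic Scene Classification: a competition review” After I took part in a Kaggle competition hosted by a university in Finland, they invited me to be a co-author…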
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 8
Total Words: 4538
Unique Words: 1783

0.0 Mikeys
#6. Revisiting Synthesis Model of Sparse Audio Declipper
Pavel Záviška, Pavel Rajmic, Zdeněk Průša, Vítězslav Veselý
The state of the art in audio declipping has currently been achieved by the SPADE (SParse Audio DEclipper) algorithm by Kitić et al. Until now, the synthesis/sparse variant, S-SPADE, has been considered significantly slower than its analysis/cosparse counterpart, A-SPADE. It turns out that the opposite is true: by exploiting a recent projection lemma, individual iterations of both algorithms can be made equally computationally expensive, while S-SPADE tends to require considerably fewer iterations to converge. In this paper, the two algorithms are compared across a range of parameters such as the window length, window overlap and redundancy of the transform. The experiments show that although S-SPADE typically converges faster, the average performance in terms of restoration quality is not superior to A-SPADE.
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 6331
Unique Words: 2066

0.0 Mikeys
#7. Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling
Ryan M. Corey, Andrew C. Singer
We consider the problem of separating speech sources captured by multiple spatially separated devices, each of which has multiple microphones and samples its signals at a slightly different rate. Most asynchronous array processing methods rely on sample rate offset estimation and resampling, but these offsets can be difficult to estimate if the sources or microphones are moving. We propose a source separation method that does not require offset estimation or signal resampling. Instead, we divide the distributed array into several synchronous subarrays. All arrays are used jointly to estimate the time-varying signal statistics, and those statistics are used to design separate time-varying spatial filters in each array. We demonstrate the method for speech mixtures recorded on both stationary and moving microphone arrays.
Figures
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 4444
Unique Words: 1344

0.0 Mikeys
#8. Delay-Performance Tradeoffs in Causal Microphone Array Processing
Ryan M. Corey, Naoki Tsuda, Andrew C. Singer
In real-time listening enhancement applications, such as hearing aid signal processing, sounds must be processed with no more than a few milliseconds of delay to sound natural to the listener. Listening devices can achieve better performance with lower delay by using microphone arrays to filter acoustic signals in both space and time. Here, we analyze the tradeoff between delay and squared-error performance of causal multichannel Wiener filters for microphone array noise reduction. We compute exact expressions for the delay-error curves in two special cases and present experimental results from real-world microphone array recordings. We find that delay-performance characteristics are determined by both the spatial and temporal correlation structures of the signals.
Figures
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 3
Total Words: 4199
Unique Words: 1522

0.0 Mikeys
#9. Multimodal speech synthesis architecture for unsupervised speaker adaptation
Hieu-Thi Luong, Junichi Yamagishi
This paper proposes a new architecture for speaker adaptation of multi-speaker neural-network speech synthesis systems, in which an unseen speaker's voice can be built using a relatively small amount of speech data without transcriptions. This is sometimes called "unsupervised speaker adaptation". More specifically, we concatenate the layers to the audio inputs when performing unsupervised speaker adaptation while we concatenate them to the text inputs when synthesizing speech from text. Two new training schemes for the new architecture are also proposed in this paper. These training schemes are not limited to speech synthesis; other applications are also suggested. Experimental results show that the proposed model not only enables adaptation to unseen speakers using untranscribed speech but it also improves the performance of multi-speaker modeling and speaker adaptation using transcribed audio files.
Figures
None.
Tweets
BrundageBot: Multimodal speech synthesis architecture for unsupervised speaker adaptation. Hieu-Thi Luong and Junichi Yamagishi https://t.co/SkXTyQc3oR
SythonUK: Multimodal speech synthesis architecture for unsupervised speaker adaptation https://t.co/FNHaDRyJgf
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 4239
Unique Words: 1466

0.0 Mikeys
#10. A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
Azam Rabiee, Soo-Young Lee
This paper introduces a deep neural network model for a subband-based speech synthesizer. The model benefits from the short bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employed multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into subbands in the time domain. Inspired by WaveNet, a convolutional neural network (CNN) model predicts subband speech signals fully in the time domain. Due to the short bandwidth of the subbands, a simple network architecture is enough to train the simple patterns of the subbands accurately. In the ground truth experiments with teacher forcing, the subband synthesizer outperforms the fullband model significantly. In addition, by conditioning the model on the phoneme sequence using a pronunciation dictionary, we have achieved the first fully time-domain neural text-to-speech (TTS) system. The generated speech of the subband TTS shows quality comparable to the fullband one, with a slighter network architecture for each subband.
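To make the multi-level wavelet analysis/synthesis step concrete, the sketch below splits a signal into subbands entirely in the time domain and reconstructs it exactly. It uses the Haar wavelet and repeatedly splits only the low band; both choices are assumptions for illustration, since the abstract does not name the wavelet or the decomposition tree, and all function names are hypothetical.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_analysis(signal):
    """One Haar level: split an even-length signal into low and high subbands."""
    low = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Inverse of haar_analysis: interleave the reconstructed sample pairs."""
    signal = []
    for a, d in zip(low, high):
        signal.append((a + d) * SCALE)
        signal.append((a - d) * SCALE)
    return signal


def multilevel_analysis(signal, levels):
    """Repeatedly split the low band; returns [high_1, ..., high_L, low_L]."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    subbands = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_analysis(low)
        subbands.append(high)
    subbands.append(low)
    return subbands


def multilevel_synthesis(subbands):
    """Rebuild the signal by undoing the levels from coarsest to finest."""
    low = subbands[-1]
    for high in reversed(subbands[:-1]):
        low = haar_synthesis(low, high)
    return low


# Toy signal (made-up samples), decomposed over two levels.
x = [4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0]
bands = multilevel_analysis(x, 2)
x_rec = multilevel_synthesis(bands)
```

Each subband is shorter than the input, which matches the abstract's point that a simpler generator per subband may suffice.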
Figures
None.
Tweets
BrundageBot: A Fully Time-domain Neural Model for Subband-based Speech Synthesizer. Azam Rabiee and Soo-Young Lee https://t.co/0u5P9xwBpx
nmfeeds: [O] https://t.co/jHUDH2o2jS A Fully Time-domain Neural Model for Subband-based Speech Synthesizer. This paper introduces a...
ComputerPapers: A Fully Time-domain Neural Model for Subband-based Speech Synthesizer. https://t.co/boV8J26V8K
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 4102
Unique Words: 1486

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 72,893 papers.
