Top 10 arXiv Papers Today in Sound


0.0 Mikeys
#1. Automatic Chord Recognition with Higher-Order Harmonic Language Modelling
Filip Korzeniowski, Gerhard Widmer
Common temporal models for automatic chord recognition model chord changes on a frame-wise basis. Due to this fact, they are unable to capture musical knowledge about chord progressions. In this paper, we propose a temporal model that enables explicit modelling of chord changes and durations. We then apply N-gram models and a neural-network-based acoustic model within this framework, and evaluate the effect of model overconfidence. Our results show that model overconfidence plays only a minor role (but target smoothing still improves the acoustic model), and that stronger chord language models do improve recognition results, however their effects are small compared to other domains.
Figures
None.
Tweets
M157q_News_RSS: Automatic Chord Recognition with Higher-Order Harmonic Language Modelling. (arXiv:1808.05341v1 [https://t.co/m5YfPVL0l1]) https://t.co/KrmLllnRZ0 Common tempo
arxivml: "Automatic Chord Recognition with Higher-Order Harmonic Language Modelling", Filip Korzeniowski, Gerhard Widmer https://t.co/G9nz6otszy
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 4231
Unique Words: 1529

0.0 Mikeys
#2. Learning Transposition-Invariant Interval Features from Symbolic Music and Audio
Stefan Lattner, Maarten Grachten, Gerhard Widmer
Many music theoretical constructs (such as scale types, modes, cadences, and chord types) are defined in terms of pitch intervals (relative distances between pitches). Therefore, when computer models are employed in music tasks, it can be useful to operate on interval representations rather than on the raw musical surface. Moreover, interval representations are transposition-invariant, valuable for tasks like audio alignment, cover song detection and music structure analysis. We employ a gated autoencoder to learn fixed-length, invertible and transposition-invariant interval representations from polyphonic music in the symbolic domain and in audio. An unsupervised training method is proposed yielding an organization of intervals in the representation space which is musically plausible. Based on the representations, a transposition-invariant self-similarity matrix is constructed and used to determine repeated sections in symbolic music and in audio, yielding competitive results in the MIREX task "Discovery of Repeated Themes and Sections".
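The transposition invariance mentioned in the abstract can be illustrated with a very small sketch. This is not the gated autoencoder from the paper; it only shows, on a hypothetical melody given as MIDI note numbers, why a representation built from pitch differences does not change when every note is shifted by the same amount. The function names are illustrative assumptions.

```python
def intervals(pitches):
    """Return the semitone distance between each pair of consecutive pitches."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def transpose(pitches, semitones):
    """Shift every pitch by the same number of semitones."""
    return [p + semitones for p in pitches]


# Hypothetical melody as MIDI note numbers (C4, E4, G4, C5).
melody = [60, 64, 67, 72]
shifted = transpose(melody, 5)

print(intervals(melody))   # [4, 3, 5]
print(intervals(shifted))  # [4, 3, 5], identical despite the transposition
```

The raw pitch lists differ, but their interval sequences match, which is the property the learned features are reported to have.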
Tweets
arxiv_org: Learning Transposition-Invariant Interval Features from Symbolic Music and Audio. https://t.co/55ISnGJKDf https://t.co/A9Xjsvegxg
hiropon_matsu: "Learning Transposition-Invariant Interval Features from Symbolic Music and Audio." https://t.co/9r3s1GtGzM https://t.co/mIdACzoAXY
udmrzn: RT @arxiv_org: Learning Transposition-Invariant Interval Features from Symbolic Music and Audio. https://t.co/55ISnGJKDf https://t.co/A9Xjs…
heghbalz: RT @arxiv_org: Learning Transposition-Invariant Interval Features from Symbolic Music and Audio. https://t.co/55ISnGJKDf https://t.co/A9Xjs…
linkoffate: RT @arxiv_org: Learning Transposition-Invariant Interval Features from Symbolic Music and Audio. https://t.co/55ISnGJKDf https://t.co/A9Xjs…
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 3
Total Words: 5554
Unique Words: 2184

0.0 Mikeys
#3. Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Learning
Veronica Morfi, Dan Stowell
We propose a method to perform audio event detection under the common constraint that only limited training data are available. In training a deep learning system to perform audio event detection, two practical problems arise. Firstly, most datasets are 'weakly labelled' having only a list of events present in each recording without any temporal information for training. Secondly, deep neural networks need a very large amount of labelled training data to achieve good quality performance, yet in practice it is difficult to collect enough samples for most classes of interest. In this paper, we propose a data-efficient training of a stacked convolutional and recurrent neural network. This neural network is trained in a multi instance learning setting for which we introduce a new loss function that leads to improved training compared to the usual approaches for weakly supervised learning. We successfully test our approach on a low-resource dataset that lacks temporal labels, for bird vocalisation detection.
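As a rough illustration of the multi-instance learning setting described above, the sketch below pools frame-level probabilities into one clip-level prediction and compares it with the weak (clip-level) label. It uses max pooling with binary cross-entropy, which is one of the usual approaches for weakly supervised learning; it is not the new loss function the paper introduces, which the abstract does not specify. All names and numbers are hypothetical.

```python
import math


def clip_probability(frame_probs):
    """Pool frame-level probabilities into one clip-level probability (max pooling)."""
    return max(frame_probs)


def weak_label_loss(frame_probs, clip_label, eps=1e-7):
    """Binary cross-entropy between the pooled prediction and the clip-level label."""
    # Clamp to avoid log(0) when the pooled probability is exactly 0 or 1.
    p = min(max(clip_probability(frame_probs), eps), 1.0 - eps)
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))


# Hypothetical per-frame event probabilities for one recording.
frame_probs = [0.1, 0.2, 0.9, 0.3]

print(weak_label_loss(frame_probs, 1))  # small loss: one frame strongly predicts the event
print(weak_label_loss(frame_probs, 0))  # large loss: the clip is labelled as event-free
```

Only the clip-level label enters the loss, so no temporal annotation is needed during training.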
Tweets
mclduk: New research pre-print from Veronica Morfi & me, evaluating new loss-functions for training neural nets in difficult conditions: https://t.co/SEofWWyUpF "Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Learning" #deeplearning
nmfeeds: [O] https://t.co/THsRrTr3Sz Data-Efficient Weakly Supervised Learning for Low-Resource Audio Event Detection Using Deep Le...
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 6517
Unique Words: 1835

0.0 Mikeys
#4. Auto-adaptive Resonance Equalization using Dilated Residual Networks
Maarten Grachten, Emmanuel Deruty, Alexandre Tanguy
In music and audio production, attenuation of spectral resonances is an important step towards a technically correct result. In this paper we present a two-component system to automate the task of resonance equalization. The first component is a dynamic equalizer that automatically detects resonances and offers to attenuate them by a user-specified factor. The second component is a deep neural network that predicts the optimal attenuation factor based on the windowed audio. The network is trained and validated on empirical data gathered from an experiment in which sound engineers choose their preferred attenuation factors for a set of tracks. We test two distinct network architectures for the predictive model and find that a dilated residual network operating directly on the audio signal is on a par with a network architecture that requires a prior audio feature extraction stage. Both architectures predict human-preferred resonance attenuation factors significantly better than a baseline approach.
Figures
None.
Tweets
Memoirs: Auto-adaptive Resonance Equalization using Dilated Residual Networks. https://t.co/5pjrpehdRI
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 3
Total Words: 6106
Unique Words: 2115

0.0 Mikeys
#5. Monaural source enhancement maximizing source-to-distortion ratio via automatic differentiation
Hiroaki Nakajima, Yu Takahashi, Kazunobu Kondo, Yuji Hisaminato
Recently, deep neural network (DNN) has made a breakthrough in monaural source enhancement. Through a training step by using a large amount of data, DNN estimates a mapping between mixed signals and clean signals. At this time, we use an objective function that numerically expresses the quality of a mapping by DNN. In the conventional methods, L1 norm, L2 norm, and Itakura-Saito divergence are often used as objective functions. Recently, an objective function based on short-time objective intelligibility (STOI) has also been proposed. However, these functions only indicate similarity between the clean signal and the estimated signal by DNN. In other words, they do not show the quality of noise reduction or source enhancement. Motivated by the fact, this paper adopts signal-to-distortion ratio (SDR) as the objective function. Since SDR virtually shows signal-to-noise ratio (SNR), maximizing SDR solves the above problem. The experimental results revealed that the proposed method achieved better performance than the conventional methods.
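To make the idea of using SDR as an objective more concrete, here is a minimal sketch under a simplifying assumption: SDR is taken as the ratio of clean-signal energy to residual-error energy, in decibels. The paper's exact SDR definition may differ, and the sketch does not perform automatic differentiation or train a network; it only shows the quantity being maximized. The function names and sample values are hypothetical.

```python
import math


def sdr_db(clean, estimate):
    """Simplified SDR in dB: clean-signal energy over residual-error energy."""
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


def negative_sdr_loss(clean, estimate):
    """Loss to minimize: maximizing SDR is the same as minimizing its negative."""
    return -sdr_db(clean, estimate)


# Hypothetical clean signal and two estimates of differing quality.
clean = [1.0, -1.0, 1.0, -1.0]
good = [0.9, -0.9, 0.9, -0.9]
poor = [0.5, -0.5, 0.5, -0.5]

print(sdr_db(clean, good))  # roughly 20 dB
print(sdr_db(clean, poor))  # lower, because the residual error is larger
```

A perfect estimate makes the error energy zero, so a practical version would add a small constant to the denominator.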
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 3073
Unique Words: 1172

0.0 Mikeys
#6. Efficient Neural Audio Synthesis
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity...
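The abstract mentions a dual softmax layer and 16-bit audio without saying how the two are connected. One way to picture it, assuming each 16-bit sample is split into an 8-bit coarse part and an 8-bit fine part so that each softmax covers 256 classes instead of 65,536, is the sketch below. The split into two 8-bit halves is an assumption not stated in the abstract, and the function names are hypothetical.

```python
def split_sample(sample):
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts."""
    return sample >> 8, sample & 0xFF


def join_sample(coarse, fine):
    """Recombine the coarse (high) and fine (low) bytes into one 16-bit sample."""
    return (coarse << 8) | fine


# Hypothetical unsigned 16-bit sample value.
sample = 0xABCD
coarse, fine = split_sample(sample)

print(coarse, fine)               # 171 205
print(join_sample(coarse, fine))  # 43981
```

Under this assumption, one output distribution would predict the coarse byte and the other the fine byte.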
Figures
None.
Tweets
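hiho_karuta: The WaveRNN paper uses NLL values for its quantitative evaluation, but come to think of it, how do they get a single NLL out of the dual softmax? Since it is dual, my understanding was that you would get two NLLs... https://t.co/IFQMENUAWG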
Swall0wTech: [1802.08435] Efficient Neural Audio Synthesis https://t.co/FYt9Sof2p7
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 10
Total Words: 7272
Unique Words: 2122

0.0 Mikeys
#7. Conditioning Deep Generative Raw Audio Models for Structured Automatic Music
Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, Brian Kulis
Existing automatic music generation approaches that feature deep learning can be broadly classified into two types: raw audio models and symbolic models. Symbolic models, which train and generate at the note level, are currently the more prevalent approach; these models can capture long-range dependencies of melodic structure, but fail to grasp the nuances and richness of raw audio generations. Raw audio models, such as DeepMind's WaveNet, train directly on sampled audio waveforms, allowing them to produce realistic-sounding, albeit unstructured music. In this paper, we propose an automatic music generation methodology combining both of these approaches to create structured, realistic-sounding compositions. We consider a Long Short Term Memory network to learn the melodic structure of different styles of music, and then use the unique symbolic generations from this model as a conditioning input to a WaveNet-based raw audio generator, creating a model for automatic, novel music. We then evaluate this approach by showcasing results...
Tweets
satory074: » [1806.09905] Conditioning Deep Generative Raw Audio Models for Structured Automatic Music https://t.co/WTPZbyt0vG
nmfeeds: [O] https://t.co/4NUnvcnNyN Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. Existing automat...
Memoirs: Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. https://t.co/M3w47o3JSe
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 4998
Unique Words: 1675

0.0 Mikeys
#8. Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures
Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, Dong Yu
Speaker-aware source separation methods are promising workarounds for major difficulties such as arbitrary source permutation and unknown number of sources. However, it remains challenging to achieve satisfying performance provided a very short available target speaker utterance (anchor). Here we present a novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker. The proposed model is different from prior works in that the canonical embedding space encodes knowledges of both the anchor and the mixture during an end-to-end training phase: First, embeddings for the anchor and mixture speech are separately constructed in a primary embedding space, and then combined as an input to feed-forward layers to transform to a canonical embedding space which we discover more stable than the primary one. Experimental results show that given a very short utterance, the proposed model can...
Tweets
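_ty274: Last up is target speaker extraction. It derives a separation mask by using, as the embedding, the proximity to the target speaker in speaker space at each TF point (or some learned quantity along those lines). I think the idea of deriving a new embedding space from the relation between each input position and global information is fairly universal. https://t.co/sGUI3H2T4N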
SythonUK: Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures https://t.co/ttjNjuMJzh
ComputerPapers: Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures. https://t.co/jmComVtMX2
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 7
Total Words: 4301
Unique Words: 1462

0.0 Mikeys
#9. StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, (3) is able to generate converted speech signals quickly enough to allow real-time implementations and (4) requires only several minutes of training examples to generate reasonably realistic-sounding speech. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 4
Total Words: 6047
Unique Words: 1816

0.0 Mikeys
#10. Audio-to-Score Alignment using Transposition-invariant Features
Andreas Arzt, Stefan Lattner
Audio-to-score alignment is an important pre-processing step for in-depth analysis of classical music. In this paper, we apply novel transposition-invariant audio features to this task. These low-dimensional features represent local pitch intervals and are learned in an unsupervised fashion by a gated autoencoder. Our results show that the proposed features are indeed fully transposition-invariant and enable accurate alignments between transposed scores and performances. Furthermore, they can even outperform widely used features for audio-to-score alignment on 'untransposed data', and thus are a viable and more flexible alternative to well-established features for music alignment and matching.
Github
None.
Youtube
None.
Other stats
Sample Sizes: None.
Authors: 2
Total Words: 5817
Unique Words: 1959

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 72,893 papers.
