Top 10 bioRxiv Papers Today in Bioinformatics


2.116 Mikeys
#1. VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data
Arash Bayat, Piotr Szul, Aidan O'Brien, Robert Dunne, Oscar Luo, Yatish Jain, Brendan Hosking, Denis Bauer
The demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random-Forests are suitable for "wide" datasets, current implementations such as Google PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random-Forest. This paper introduces Cursed-Forest, a novel Random-Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. Cursed-Forest is 9 and up to 89 times faster than Google PLANET and Yggdrasil, respectively, and is the first method capable of scaling to millions of features.
Tweets
razoralign: VariantSpark: A Random Forest Machine Learning Implementation for Ultra High Dimensional Data https://t.co/IbJajkvCdJ
lynnlangit: RT @biorxiv_bioinfo: VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data https://t.co/5cT4O5f4Kv…
sigitpurnomo: RT @biorxiv_bioinfo: VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data https://t.co/5cT4O5f4Kv…
jameslz: RT @biorxiv_bioinfo: VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data https://t.co/5cT4O5f4Kv…
MischaLundberg: RT @biorxiv_bioinfo: VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data https://t.co/5cT4O5f4Kv…
Github

machine learning for genomic variants

Repository: VariantSpark
User: aehrc
Language: Scala
Stargazers: 54
Subscribers: 10
Forks: 21
Open Issues: 36
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 8
Total Words: 4633
Unique Words: 1531

2.024 Mikeys
#2. souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes
Haynes Heaton, Arthur M Talman, Andrew Knights, Maria Imaz, Richard Durbin, Martin Hemberg, Mara Lawniczak
A popular design for scRNAseq experiments is to multiplex cells from different donors, as this strategy avoids batch effects, reduces costs, and improves doublet detection. Using variants in the reads, it is possible to assign cells to genotypes. The first tool in this space, demuxlet, assigns cells based on genotypes known a priori, but more recently tools not requiring this information have become available including sc_split and vireo. However, none of these methods have been validated across a wide range of sample parameters, types, and species. Further, none of these tools model an important confounder of the data, ambient RNA caused by cell lysis prior to cell partitioning. We present souporcell, a robust method to cluster cells by their genetic variants without a genotype reference and show that it outperforms existing methods on clustering accuracy, doublet detection, and genotyping across a wide range of challenging scenarios while accurately estimating the amount of ambient RNA in the sample.
Figures
None.
Tweets
biorxivpreprint: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/oYW6xnxbWr #bioRxiv
biorxiv_bioinfo: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/3VHK0leJlz #biorxiv_bioinfo
razoralign: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/99LK13Ht2h https://t.co/naaARiuGgm
Eomesodermin: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/uiaUWLgZLI #immunotherapy #biorxiv #immunobot
PromPreprint: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/SSRg54pWvQ
BioRxivCurator: souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes https://t.co/9uy9vl5zAe
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 7
Total Words: 0
Unique Words: 0

2.021 Mikeys
#3. RNA-Bloom provides lightweight reference-free transcriptome assembly for single cells
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, Rene L Warren, Inanc Birol
We present RNA-Bloom, a de novo RNA-seq assembly algorithm that leverages the rich information content in single-cell transcriptome sequencing (scRNA-seq) data to reconstruct cell-specific isoforms. We benchmark RNA-Bloom's performance against leading bulk RNA-seq assembly approaches, and illustrate its utility in detecting cell-specific gene fusion events using sequencing data from HiSeq-4000 and BGISEQ-500 platforms. We expect RNA-Bloom to boost the utility of scRNA-seq data, expanding what is informatically accessible now.
Tweets
GagalovaK: RNA-Bloom provides lightweight reference-free transcriptome assembly for single cells https://t.co/CM8Ozy3C0g @BCCancer_GSC
WarrenRene: RT @biorxivpreprint: RNA-Bloom provides lightweight reference-free transcriptome assembly for single cells https://t.co/23CAhwkmV0 #bioRxiv
BohlmannLab: RT @GagalovaK: RNA-Bloom provides lightweight reference-free transcriptome assembly for single cells https://t.co/CM8Ozy3C0g @BCCancer_GSC
Github

🌺 fast and memory-efficient de novo assembler for bulk and single-cell RNA-seq data

Repository: RNA-Bloom
User: bcgsc
Language: Java
Stargazers: 14
Subscribers: 12
Forks: 1
Open Issues: 0
Youtube
None.
Other stats
Sample Sizes : [310, 185, 170]
Authors: 7
Total Words: 4915
Unique Words: 1483

2.02 Mikeys
#4. Janggu: Deep Learning for Genomics
Wolfgang Kopp, Remo Monti, Annalaura Tamburrini, Uwe Ohler, Altuna Akalin
In recent years, numerous applications have demonstrated the potential of deep learning for an improved understanding of biological processes. However, most deep learning tools developed so far are designed to address a specific question on a fixed dataset and/or by a fixed model architecture. Adapting these models to integrate new datasets or to address different hypotheses can lead to considerable software engineering effort. To address this aspect we have built Janggu, a python library that facilitates deep learning for genomics applications. Janggu aims to ease data acquisition and model evaluation in multiple ways. Among its key features are special dataset objects, which form a unified and flexible data acquisition and pre-processing framework for genomics data that enables streamlining of future research applications through reusable components. Through a numpy-like interface, dataset objects are directly compatible with popular deep learning libraries, including keras. Furthermore, Janggu offers the possibility to visualize...
Tweets
biorxivpreprint: Janggu: Deep Learning for Genomics https://t.co/eI39UeOt80 #bioRxiv
biorxiv_bioinfo: Janggu: Deep Learning for Genomics https://t.co/82PMKWQsfW #biorxiv_bioinfo
razoralign: Janggu: Deep Learning for Genomics https://t.co/9Xe3Xz5tVy https://t.co/wrVqs0qGmX
mtanichthys: RT @biorxivpreprint: Janggu: Deep Learning for Genomics https://t.co/eI39UeOt80 #bioRxiv
TJesse62: RT @biorxivpreprint: Janggu: Deep Learning for Genomics https://t.co/eI39UeOt80 #bioRxiv
wckdouglas: RT @biorxivpreprint: Janggu: Deep Learning for Genomics https://t.co/eI39UeOt80 #bioRxiv
hdeshmuk: RT @biorxiv_bioinfo: Janggu: Deep Learning for Genomics https://t.co/82PMKWQsfW #biorxiv_bioinfo
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 6010
Unique Words: 1877

2.019 Mikeys
#5. CoralP: Flexible visualization of the human phosphatome
Amit Min, Erika Deoudes, Marielle L Bond, Eric Davis, Douglas Phanstiel
Protein phosphatases and kinases play critical roles in a host of biological processes and diseases via the removal and addition of phosphoryl groups. While kinases have been extensively studied for decades, recent findings regarding the specificity and activities of phosphatases have generated an increased interest in targeting phosphatases for pharmaceutical development. This increased focus has created a need for methods to visualize this important class of proteins within the context of the entire phosphatase protein family. Here, we present CoralP, an interactive web application for the generation of customizable, publication-quality representations of human phosphatome data. Phosphatase attributes can be encoded through edge colors, node colors, and node sizes. CoralP is the first and currently the only tool designed for phosphatome visualization and should be of great use to the signaling community. The source code and web application are available at https://github.com/PhanstielLab/coralp and http://phanstiel-...
Figures
None.
Tweets
biorxivpreprint: CoralP: Flexible visualization of the human phosphatome https://t.co/thabJdtAXE #bioRxiv
biorxiv_bioinfo: CoralP: Flexible visualization of the human phosphatome https://t.co/jU8RlKZWBs #biorxiv_bioinfo
hdeshmuk: RT @biorxiv_bioinfo: CoralP: Flexible visualization of the human phosphatome https://t.co/jU8RlKZWBs #biorxiv_bioinfo
Github

Visualization platform for human phosphatases

Repository: coralp
User: PhanstielLab
Language: R
Stargazers: 0
Subscribers: 2
Forks: 0
Open Issues: 0
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 2727
Unique Words: 1320

2.019 Mikeys
#6. Improving interpretability of deep learning models: splicing codes as a case study
Anupama Jha, Joseph K. Aicher, Deependra Singh, Yoseph Barash
Despite the success and fast adaptation of deep learning models in a wide range of fields, lack of interpretability remains an issue, especially in biomedical domains. A recent promising method to address this limitation is Integrated Gradients (IG), which identifies features associated with a prediction by traversing a linear path from a baseline to a sample. We extend IG with nonlinear paths, embedding in latent space, alternative baselines, and a framework to identify important features which make it suitable for interpretation of deep models for genomics.
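
As a rough illustration of the baseline-to-sample path the abstract describes, here is a minimal sketch of standard (linear-path) Integrated Gradients in NumPy. It is not the authors' code and does not include their nonlinear-path or latent-space extensions. The toy logistic model, its weights `w`, the all-zeros baseline, and the function names are all assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w):
    """Toy differentiable model (hypothetical): logistic output of a linear score."""
    return sigmoid(np.dot(x, w))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w)
    return p * (1.0 - p) * w


def integrated_gradients(x, baseline, w, steps=200):
    """Approximate IG along the straight line from baseline to x (midpoint rule)."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # shape (steps, n_features)
    grads = np.array([model_grad(point, w) for point in path])
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grads.mean(axis=0)


w = np.array([1.5, -2.0, 0.0, 0.5])        # assumed model weights
x = np.array([1.0, 0.5, 3.0, 1.0])         # sample to explain
baseline = np.zeros_like(x)                # assumed all-zeros baseline

attributions = integrated_gradients(x, baseline, w)
print(attributions)
```

With a linear path, the attributions should sum approximately to `model(x, w) - model(baseline, w)`, which is a convenient sanity check when swapping in a different baseline or model.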
Tweets
biorxivpreprint: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/EkVUg6x2ch #bioRxiv
biorxiv_bioinfo: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/1SS4esOWsr #biorxiv_bioinfo
razoralign: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/I92zlV22qK https://t.co/Xet3zmQuxE
prashbio: RT @biorxiv_bioinfo: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/1SS4esOWsr #biorxiv_b…
AlanBOCallaghan: RT @biorxiv_bioinfo: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/1SS4esOWsr #biorxiv_b…
sbotlite: RT @biorxivpreprint: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/EkVUg6x2ch #bioRxiv
hdeshmuk: RT @biorxiv_bioinfo: Improving interpretability of deep learning models: splicing codes as a case study https://t.co/1SS4esOWsr #biorxiv_b…
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 6946
Unique Words: 1935

2.019 Mikeys
#7. Systematic assessment of commercially available low-input miRNA library preparation kits
Fatima Heinicke, Xiangfu Zhong, Manuela Zucknick, Johannes Breidenbach, Arvind Sundaram, Siri T Flåm, Magnus Leithaug, Marianne Dalland, Andrew Farmer, Jordana M. Henderson, Melanie A. Hussong, Pamela Moll, Loan Nguyen, Amanda McNulty, Jonathan M. Shaffer, Sabrina Shore, HoiChong Karen Yip, Jana Vitkovska, Simon Rayner, Benedicte A. Lie, Gregor D. Gilfillan
High-throughput sequencing is increasingly favoured to assay the presence and abundance of micro RNAs (miRNAs) in biological samples, even from low RNA amounts, and a number of commercial vendors now offer kits that allow miRNA sequencing from sub-nanogram (ng) inputs. However, although biases introduced during library preparation have been documented, the relative performance of current reagent kits has not been investigated in detail. Here, six commercial kits capable of handling <100 ng total RNA input were used for library preparation, performed by kit manufacturers, on synthetic miRNAs of known quantities and human biological total RNA samples. We compared the performance of miRNA detection sensitivity, reliability, titration response and the ability to detect differentially expressed miRNAs. In addition, we assessed the use of unique molecular identifier (UMI) sequence tags in one kit. We observed differences in detection sensitivity and ability to identify differentially expressed miRNAs between the kits, but none were able...
Tweets
biorxivpreprint: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/P8c8XXuftN #bioRxiv
biorxiv_bioinfo: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/dI38ld88a3 #biorxiv_bioinfo
FriedlanderLab: RT @biorxiv_bioinfo: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/dI38ld88a3 #bio…
joey0576: RT @biorxiv_bioinfo: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/dI38ld88a3 #bio…
joey0576: RT @biorxivpreprint: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/P8c8XXuftN #bio…
sbotlite: RT @biorxivpreprint: Systematic assessment of commercially available low-input miRNA library preparation kits https://t.co/P8c8XXuftN #bio…
Github
None.
Youtube
None.
Other stats
Sample Sizes : [4, 4]
Authors: 21
Total Words: 10427
Unique Words: 3306

2.016 Mikeys
#8. Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics
Mathieu Fourment, Aaron E Darling
Recent advances in statistical machine learning techniques have led to the creation of probabilistic programming frameworks. These frameworks enable probabilistic models to be rapidly prototyped and fit to data using scalable approximation methods such as variational inference. In this work, we explore the use of the Stan language for probabilistic programming in application to phylogenetic models. We show that many commonly used phylogenetic models including the general time reversible (GTR) substitution model, rate heterogeneity among sites, and a range of coalescent models can be implemented using a probabilistic programming language. The posterior probability distributions obtained via the black box variational inference engine in Stan were compared to those obtained with reference implementations of Markov chain Monte Carlo (MCMC) for phylogenetic inference. We find that black box variational inference in Stan is less accurate than MCMC methods for phylogenetic models, but requires far less compute time. Finally, we evaluate...
Figures
None.
Tweets
PhyDyn_Papers: Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics Mathieu Fourment @m4ment and Aaron E Darling @koadman https://t.co/4RuL2aZ61k
razoralign: Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics https://t.co/IoGksmLtLW
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 2
Total Words: 7449
Unique Words: 2050

2.012 Mikeys
#9. SINATRA: A Sub-Image Analysis Pipeline for Selecting Features that Differentiate Classes of 3D Shapes
Bruce Wang, Timothy Sudijono, Henry Kirveslahti, Tingran Gao, Doug M. Boyer, Sayan Mukherjee, Lorin Crawford
It has been a longstanding challenge in geometric morphometrics and medical imaging to infer the physical locations (or regions) of 3D shapes that are most associated with a given response variable (e.g., class labels) without needing common predefined landmarks across the shapes, computing correspondence maps between the shapes, or requiring the shapes to be diffeomorphic to each other. In this paper, we introduce SINATRA: the first statistical pipeline for sub-image analysis which identifies physical shape features that explain most of the variation between two classes without the aforementioned requirements. We also illustrate how the problem of 3D sub-image analysis can be mapped onto the well-studied problem of variable selection in nonlinear regression models. Here, the key insight is that tools from integral geometry and differential topology, specifically the Euler characteristic, can be used to transform a 3D mesh representation of an image or shape into a collection of vectors with minimal loss of geometric information....
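
The abstract is truncated and does not spell out how the Euler characteristic turns a mesh into vectors. One common construction, sketched below as an assumption rather than as the SINATRA implementation, sweeps a plane along a chosen direction and records the Euler characteristic (vertices minus edges plus faces) of the part of the mesh below each height. The function name, the tetrahedron test mesh, and the thresholds are all hypothetical.

```python
from itertools import combinations

import numpy as np


def euler_characteristic_curve(vertices, faces, direction, thresholds):
    """Euler characteristic of the sublevel sets of a triangle mesh along one direction."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    heights = vertices @ direction                       # height of every vertex

    # Collect each undirected edge once from the triangle list.
    edges = {tuple(sorted(e)) for f in faces for e in combinations(f, 2)}

    # An edge or face enters the sublevel set once its highest vertex is included.
    edge_heights = np.array([max(heights[i], heights[j]) for i, j in edges])
    face_heights = np.array([heights[list(f)].max() for f in faces])

    return np.array([
        (heights <= t).sum() - (edge_heights <= t).sum() + (face_heights <= t).sum()
        for t in thresholds
    ])


# Hypothetical test mesh: the surface of a tetrahedron.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

curve = euler_characteristic_curve(vertices, faces, direction=(0, 0, 1),
                                   thresholds=[-0.5, 0.5, 1.5])
print(curve)  # [0 1 2]
```

Repeating this over many directions and concatenating the resulting curves gives one fixed-length vector per shape, which is the kind of input a variable-selection model could then work with.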
Figures
None.
Tweets
biorxivpreprint: SINATRA: A Sub-Image Analysis Pipeline for Selecting Features that Differentiate Classes of 3D Shapes https://t.co/lLtnGcucGk #bioRxiv
biorxiv_bioinfo: SINATRA: A Sub-Image Analysis Pipeline for Selecting Features that Differentiate Classes of 3D Shapes https://t.co/HRpfZ547kc #biorxiv_bioinfo
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 7
Total Words: 0
Unique Words: 0

2.011 Mikeys
#10. MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets
Urminder Singh, Manhoi Hur, Karin Dorman, Eve Wurtele
The diverse and growing omics data in public domains provide researchers with a tremendous opportunity to extract hidden knowledge. However, the challenge of providing domain experts with easy access to these big data has resulted in the vast majority of archived data remaining unused. Here, we present MetaOmGraph (MOG), a free, open-source, standalone software for exploratory data analysis of massive datasets by scientific researchers. Using MOG, a researcher can interactively visualize and statistically analyze the data, in the context of its metadata. Researchers can interactively home in on groups of experiments or genes based on attributes such as expression values, statistical results, metadata terms, and ontology annotations. MOG's statistical tools include coexpression, differential expression, and differential correlation analysis, with permutation test-based options for significance assessments. Multithreading and indexing enable efficient data analysis on a personal computer, with no need for writing code. Data can be...
Tweets
biorxivpreprint: MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets https://t.co/Sslw9veEyt #bioRxiv
biorxiv_bioinfo: MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets https://t.co/oDvi78P3b6 #biorxiv_bioinfo
razoralign: MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets https://t.co/Er1paPQBFc
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 15705
Unique Words: 5191

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored in real time based on how verifiable they are (as determined by their GitHub repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 158,347 papers.
