Top 4 Arxiv Papers Today in Databases


2.114 Mikeys
#1. A Survey of Data Quality Measurement and Monitoring Tools
Lisa Ehrlinger, Elisa Rusz, Wolfram Wöß
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between research into data quality measurement and practical implementations by investigating the functional scope of current data quality tools. With a systematic search, we identified 667 software tools dedicated to "data quality", from which we evaluated 13 tools with respect to three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) continuous data quality monitoring. We selected the evaluated tools with regard to...
more | pdf | html
Figures
None.
Tweets
arxivml: "A Survey of Data Quality Measurement and Monitoring Tools", Lisa Ehrlinger, Elisa Rusz, Wolfram Wöß https://t.co/o1ENJqPv8q
arxiv_cs_LG: A Survey of Data Quality Measurement and Monitoring Tools. Lisa Ehrlinger, Elisa Rusz, and Wolfram Wöß https://t.co/DYDgbhoIQK
Memoirs: A Survey of Data Quality Measurement and Monitoring Tools. https://t.co/L8hqaIWvQ5
darmont_lyon2: RT @dbworld_: https://t.co/VcD6SLqXcL A Survey of Data Quality Measurement and Monitoring Tools. (arXiv:1907.08138v1 [cs.DB]) #databases
bimontesandro: RT @dbworld_: https://t.co/VcD6SLqXcL A Survey of Data Quality Measurement and Monitoring Tools. (arXiv:1907.08138v1 [cs.DB]) #databases
Github

:whale: Tool to automate data quality checks on data pipelines

Repository: mobydq
User: mobydq
Language: JavaScript
Stargazers: 37
Subscribers: 11
Forks: 15
Open Issues: 21
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 21624
Unqiue Words: 5058

2.003 Mikeys
#2. Effcient logging and querying for Blockchain-based cross-site genomic dataset access audit
Shuaicheng Ma, Yang Cao, Li Xiong
Background: Genomic data have been collected by different institutions and companies and need to be shared for broader use. In a cross-site genomic data sharing system, a secure and transparent access control audit module plays an essential role in ensuring the accountability. The 2018 iDASH competition first track provides us with an opportunity to design efficient logging and querying system for cross-site genomic dataset access audit. We designed a blockchain-based log system which can provide a light-weight and widely compatible module for existing blockchain platforms. The submitted solution won the third place of the competition. In this paper, we report the technical details in our system. Methods: We present two methods: baseline method and enhanced method. We started with the baseline method and then adjusted our implementation based on the competition evaluation criteria and characteristics of the log system. To overcome obstacles of indexing on the immutable Blockchain system, we designed a hierarchical timestamp...
more | pdf | html
Figures
Tweets
Github

Blockchain-based immutable logging and querying for cross-site genomic dataset access audit trial

Repository: Blockchain_med
User: mshuaic
Language: Python
Stargazers: 1
Subscribers: 3
Forks: 2
Open Issues: 0
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 7093
Unqiue Words: 2417

1.996 Mikeys
#3. In-Depth Benchmarking of Graph Database Systems with the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB)
Florin Rusu, Zhiyi Huang
In this study, we present the first results of a complete implementation of the LDBC SNB benchmark -- interactive short, interactive complex, and business intelligence -- in two native graph database systems---Neo4j and TigerGraph. In addition to thoroughly evaluating the performance of all of the 46 queries in the benchmark on four scale factors -- SF-1, SF-10, SF-100, and SF-1000 -- and three computing architectures -- on premise and in the cloud -- we also measure the bulk loading time and storage size. Our results show that TigerGraph is consistently outperforming Neo4j on the majority of the queries---by two or more orders of magnitude (100X factor) on certain interactive complex and business intelligence queries. The gap increases with the size of the data since only TigerGraph is able to scale to SF-1000---Neo4j finishes only 12 of the 25 business intelligence queries in reasonable time. Nonetheless, Neo4j is generally faster at bulk loading graph data up to SF-100. A key to our study is the active involvement of the...
more | pdf | html
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 2
Total Words: 8328
Unqiue Words: 2689

1.991 Mikeys
#4. A Differentially Private Algorithm for Range Queries on Trajectories
Soheila Ghane, Lars Kulik, Kotagiri Ramamohanarao
We propose a novel algorithm to ensure $\epsilon$-differential privacy for answering range queries on trajectory data. In order to guarantee privacy, differential privacy mechanisms add noise to either data or query, thus introducing errors to queries made and potentially decreasing the utility of information. In contrast to the state-of-the-art, our method achieves significantly lower error as it is the first data- and query-aware approach for such queries. The key challenge for answering range queries on trajectory data privately is to ensure an accurate count. Simply representing a trajectory as a set instead of \emph{sequence} of points will generally lead to highly inaccurate query answers as it ignores the sequential dependency of location points in trajectories, i.e., will violate the consistency of trajectory data. Furthermore, trajectories are generally unevenly distributed across a city and adding noise uniformly will generally lead to a poor utility. To achieve differential privacy, our algorithm adaptively adds noise...
more | pdf | html
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unqiue Words: 0

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real-time) based on how verifiable they are (as determined by their Github repos) and how interesting they are (based on Twitter).

To see top papers, follow us on twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 160,428 papers.

Search
Sort results based on if they are interesting or reproducible.
Interesting
Reproducible
Categories
All
Astrophysics
Cosmology and Nongalactic Astrophysics
Earth and Planetary Astrophysics
Astrophysics of Galaxies
High Energy Astrophysical Phenomena
Instrumentation and Methods for Astrophysics
Solar and Stellar Astrophysics
Condensed Matter
Disordered Systems and Neural Networks
Mesoscale and Nanoscale Physics
Materials Science
Other Condensed Matter
Quantum Gases
Soft Condensed Matter
Statistical Mechanics
Strongly Correlated Electrons
Superconductivity
Computer Science
Artificial Intelligence
Hardware Architecture
Computational Complexity
Computational Engineering, Finance, and Science
Computational Geometry
Computation and Language
Cryptography and Security
Computer Vision and Pattern Recognition
Computers and Society
Databases
Distributed, Parallel, and Cluster Computing
Digital Libraries
Discrete Mathematics
Data Structures and Algorithms
Emerging Technologies
Formal Languages and Automata Theory
General Literature
Graphics
Computer Science and Game Theory
Human-Computer Interaction
Information Retrieval
Information Theory
Machine Learning
Logic in Computer Science
Multiagent Systems
Multimedia
Mathematical Software
Numerical Analysis
Neural and Evolutionary Computing
Networking and Internet Architecture
Other Computer Science
Operating Systems
Performance
Programming Languages
Robotics
Symbolic Computation
Sound
Software Engineering
Social and Information Networks
Systems and Control
Economics
Econometrics
General Economics
Theoretical Economics
Electrical Engineering and Systems Science
Audio and Speech Processing
Image and Video Processing
Signal Processing
General Relativity and Quantum Cosmology
General Relativity and Quantum Cosmology
High Energy Physics - Experiment
High Energy Physics - Experiment
High Energy Physics - Lattice
High Energy Physics - Lattice
High Energy Physics - Phenomenology
High Energy Physics - Phenomenology
High Energy Physics - Theory
High Energy Physics - Theory
Mathematics
Commutative Algebra
Algebraic Geometry
Analysis of PDEs
Algebraic Topology
Classical Analysis and ODEs
Combinatorics
Category Theory
Complex Variables
Differential Geometry
Dynamical Systems
Functional Analysis
General Mathematics
General Topology
Group Theory
Geometric Topology
History and Overview
Information Theory
K-Theory and Homology
Logic
Metric Geometry
Mathematical Physics
Numerical Analysis
Number Theory
Operator Algebras
Optimization and Control
Probability
Quantum Algebra
Rings and Algebras
Representation Theory
Symplectic Geometry
Spectral Theory
Statistics Theory
Mathematical Physics
Mathematical Physics
Nonlinear Sciences
Adaptation and Self-Organizing Systems
Chaotic Dynamics
Cellular Automata and Lattice Gases
Pattern Formation and Solitons
Exactly Solvable and Integrable Systems
Nuclear Experiment
Nuclear Experiment
Nuclear Theory
Nuclear Theory
Physics
Accelerator Physics
Atmospheric and Oceanic Physics
Applied Physics
Atomic and Molecular Clusters
Atomic Physics
Biological Physics
Chemical Physics
Classical Physics
Computational Physics
Data Analysis, Statistics and Probability
Physics Education
Fluid Dynamics
General Physics
Geophysics
History and Philosophy of Physics
Instrumentation and Detectors
Medical Physics
Optics
Plasma Physics
Popular Physics
Physics and Society
Space Physics
Quantitative Biology
Biomolecules
Cell Behavior
Genomics
Molecular Networks
Neurons and Cognition
Other Quantitative Biology
Populations and Evolution
Quantitative Methods
Subcellular Processes
Tissues and Organs
Quantitative Finance
Computational Finance
Economics
General Finance
Mathematical Finance
Portfolio Management
Pricing of Securities
Risk Management
Statistical Finance
Trading and Market Microstructure
Quantum Physics
Quantum Physics
Statistics
Applications
Computation
Methodology
Machine Learning
Other Statistics
Statistics Theory
Feedback
Online
Stats
Tracking 160,428 papers.