Top 10 Arxiv Papers Today in Databases


2.055 Mikeys
#1. Model-based Approximate Query Processing
Moritz Kulessa, Alejandro Molina, Carsten Binnig, Benjamin Hilprecht, Kristian Kersting
Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer from various problems that limit the applicability to support the ad-hoc exploration of a new data set: (1) Classical AQP approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations. (2) Classical AQP approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often not realistic if users want to explore a new database. In this paper, we present a new approach to AQP called Model-based AQP that leverages generative models learned over the complete database to...
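The weakness of online sampling on rare subpopulations, noted in point (1) of the abstract, can be illustrated with a minimal sketch. This is not the paper's Model-based AQP method. It is plain uniform sampling, and the table contents, the function name `estimate_count`, and the sample size are hypothetical choices made for illustration.

```python
import random

def estimate_count(rows, predicate, sample_size, rng):
    """Estimate COUNT(*) WHERE predicate from a uniform sample without replacement."""
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for r in sample if predicate(r))
    # Scale the number of sampled hits up to the full table size.
    return hits * len(rows) / sample_size

# Hypothetical table: 100,000 rows, of which only 50 belong to a rare group.
rows = ["rare"] * 50 + ["common"] * 99_950
rng = random.Random(0)

# With a 1% sample, every estimate is a multiple of 100, so the true
# answer (50) can never be returned and estimates vary widely between runs.
estimates = [estimate_count(rows, lambda r: r == "rare", 1_000, rng) for _ in range(5)]
print(estimates)
```

Because each sampled hit stands in for 100 rows, the estimate for a group of 50 rows is forced to values such as 0, 100, or 200, which is the low-quality behavior the abstract describes.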
Tweets
arxivml: "Model-based Approximate Query Processing", Moritz Kulessa, Alejandro Molina, Carsten Binnig, Benjamin Hilprecht, K… https://t.co/MZ22xmTpuu
Memoirs: Model-based Approximate Query Processing. https://t.co/xhufNvCrEp
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 12564
Unique Words: 2366

0.0 Mikeys
#2. Measuring and Computing Database Inconsistency via Repairs
Leopoldo Bertossi
We propose a generic numerical measure of inconsistency of a database with respect to integrity constraints that is based on a repair semantics. A particular measure is investigated, with mechanisms for computing it via answer-set programs.
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 1
Total Words: 2463
Unique Words: 1011

0.0 Mikeys
#3. Budget-aware Online Task Assignment in Spatial Crowdsourcing
Jia-Xu Liu, Ke Xu
The prevalence of mobile internet techniques stimulates the emergence of various spatial crowdsourcing applications. Some of these applications serve requesters (budget providers), who submit a batch of tasks and a fixed budget to the platform, hoping to find suitable workers to complete as many tasks as possible. The platform focuses on optimizing assignment strategies by seeking worker-task pairs that consume less budget to meet requesters' demands. Existing research on task assignment with a budget constraint mostly focuses on static offline scenarios, where the spatiotemporal information of all workers and tasks is known in advance. However, workers usually appear dynamically on real spatial crowdsourcing platforms, a setting that existing solutions can hardly handle. In this paper, we formally define a novel problem, Budget-aware Online task Assignment (BOA), in spatial crowdsourcing applications. BOA aims to maximize the number of assigned worker-task pairs under a budget constraint where workers appear dynamically on...
Tweets
ComputerPapers: Budget-aware Online Task Assignment in Spatial Crowdsourcing. https://t.co/DPqX3M9sfI
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 2
Total Words: 8489
Unique Words: 2172

0.0 Mikeys
#4. Optimizing error of high-dimensional statistical queries under differential privacy
Ryan McKenna, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala
Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and...
Figures
None.
Tweets
M157q_News_RSS: Optimizing error of high-dimensional statistical queries under differential privacy. (arXiv:1808.03537v1 [cs.DB]) https://t.co/Q0rsIjy34D Di
ComputerPapers: Optimizing error of high-dimensional statistical queries under differential privacy. https://t.co/F8oKvBkzzu
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 16978
Unique Words: 3910

0.0 Mikeys
#5. Crowd-Powered Data Mining
Chengliang Chai, Ju Fan, Guoliang Li, Jiannan Wang, Yudian Zheng
Many data mining tasks cannot be completely addressed by automated processes, such as sentiment analysis and image classification. Crowdsourcing is an effective way to harness the human cognitive ability to process these machine-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, we can easily involve hundreds of thousands of ordinary workers (i.e., the crowd) to address these machine-hard tasks. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowd-powered data mining. We first give an overview of crowdsourcing, and then summarize the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data mining. Next we review crowd-powered data mining operations, including classification, clustering, pattern mining, outlier detection, knowledge base construction and enrichment. Finally, we provide the emerging challenges in crowdsourced data mining.
Figures
None.
Tweets
nmfeeds: [AI] https://t.co/jyYMZm8h7H Crowd-Powered Data Mining. Many data mining tasks cannot be completely addressed by auto- mat...
nmfeeds: [O] https://t.co/jyYMZm8h7H Crowd-Powered Data Mining. Many data mining tasks cannot be completely addressed by auto- mate...
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 4862
Unique Words: 1856

0.0 Mikeys
#6. Constant-Delay Enumeration for Nondeterministic Document Spanners
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth
We consider the information extraction approach known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the input document and in the VA; while ensuring the best possible data complexity bounds in the input document, in particular constant delay in the document. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document or with an exponential dependency in the (generally nondeterministic) input VA. Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic...
Figures
None.
Tweets
dbworld_: https://t.co/UmhZW6CCcZ Constant-Delay Enumeration for Nondeterministic Document Spanners. (arXiv:1807.09320v2 [cs.DB] UPDATED) #databases
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 13631
Unique Words: 2407

0.0 Mikeys
#7. Towards Semantically Enhanced Data Understanding
Markus Schröder, Christian Jilek, Jörn Hees, Andreas Dengel
In the field of machine learning, data understanding is the practice of getting initial insights into unknown datasets. Such knowledge-intensive tasks require a lot of documentation, which is necessary for data scientists to grasp the meaning of the data. Usually, documentation is kept separate from the data in various external documents, diagrams, spreadsheets and tools, which causes considerable lookup overhead. Moreover, other supporting applications are not able to consume and utilize such unstructured data. That is why we propose a methodology that uses a single semantic model that interlinks data with its documentation. Hence, data scientists are able to directly look up the connected information about the data by simply following links. Equally, they can browse the documentation, which always refers to the data. Furthermore, the model can be used by other approaches providing additional support, like searching, comparing, integrating or visualizing data. To showcase our approach we also demonstrate an early prototype.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 3102
Unique Words: 1271

0.0 Mikeys
#8. The Internals of the Data Calculator
Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo
Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary...
Tweets
arxiv_org: The Internals of the Data Calculator. https://t.co/iFESCNVNIK https://t.co/LbUvi5vRg7
Jose_A_Alonso: The internals of the data calculator. ~ S. Idreos et als. https://t.co/KrFmA6A1Zp #Algorithmic via @delimitry
delimitry: @CompSciFact Also check out the paper "The Internals of the Data Calculator" https://t.co/G45MSWK8Xr
MUKULBHALLA7: https://t.co/TjrrOOgfCD
vodkamomo: RT @arxiv_org: The Internals of the Data Calculator. https://t.co/iFESCNVNIK https://t.co/LbUvi5vRg7
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 5
Total Words: 29527
Unique Words: 5708

0.0 Mikeys
#9. Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Shawn Bowers, Timothy McPhillips, Bertram Ludäscher
An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding...
Figures
None.
Tweets
ludaesch: Kudos to Shawn and Tim for combining theory & practice (logic inference & #YesWorkflow) in powerful new ways! #IPAW2018 Best Paper is available here: https://t.co/te3ce7mV6Y https://t.co/WT0F0PcMz7
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 3
Total Words: 5891
Unique Words: 1520

0.0 Mikeys
#10. Diversification on Big Data in Query Processing
Meifan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao
Recently, in the area of big data, some popular applications, such as web search engines and recommendation systems, face the problem of diversifying results during query processing. In this sense, it is both significant and essential to propose methods to deal with big data in order to increase the diversity of the result set. In this paper, we first define a set's diversity and an element's ability to improve the set's overall diversity. Based on these definitions, we propose a diversification framework which has good performance in terms of effectiveness and efficiency. This framework also has a theoretical guarantee on the probability of success. Second, we design implementation algorithms based on this framework for both numerical and string data. Third, for numerical and string data respectively, we carry out extensive experiments on real data to verify the performance of our proposed framework, and also perform scalability experiments on synthetic data.
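The two notions named in the abstract, a set's diversity and an element's ability to improve it, can be made concrete with a small sketch for numerical data. The definitions below are assumptions chosen for illustration (diversity as the minimum pairwise distance, and a greedy farthest-point selection). The abstract does not state the paper's actual definitions or algorithm, and the function names are hypothetical.

```python
def set_diversity(items):
    """Diversity of a set, taken here as the minimum pairwise distance (an assumption)."""
    if len(items) < 2:
        return 0.0
    return min(abs(a - b) for i, a in enumerate(items) for b in items[i + 1:])

def gain(candidate, selected):
    """Distance from the candidate to its nearest already-selected item."""
    return min(abs(candidate - s) for s in selected)

def diversify(values, k):
    """Greedily pick up to k values, each time taking the one farthest from the current picks."""
    if k <= 0 or not values:
        return []
    remaining = list(values)
    selected = [remaining.pop(0)]  # seed with the first value
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda v: gain(v, selected))
        remaining.remove(best)
        selected.append(best)
    return selected

result = diversify([1, 2, 3, 10, 11, 20], 3)
print(result, set_diversity(result))
```

On this toy input the greedy pass skips the clustered values 2, 3, and 11, which would add little to a result set that already contains their neighbors.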
Figures
None.
Tweets
Github
None.
Youtube
None.
Other stats
Sample Sizes : None.
Authors: 4
Total Words: 12496
Unique Words: 2304

About

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real-time) based on how verifiable they are (as determined by their Github repos) and how interesting they are (based on Twitter).

To see top papers, follow us on Twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

Tracking 57,756 papers.
