Arguing for the need to combine declarative and probabilistic programming,
Bárány et al. (TODS 2017) recently introduced a probabilistic extension of
Datalog as a "purely declarative probabilistic programming language." We
revisit this language and propose a more principled approach towards defining
its semantics. It is based on standard notions from probability theory known as
stochastic kernels and Markov processes. This allows us to extend the semantics
to continuous probability distributions, thereby settling an open problem posed
by Bárány et al. We show that our semantics is fairly robust, allowing both
parallel execution and arbitrary chase orders when evaluating a program. We
cast our semantics in the framework of infinite probabilistic databases (Grohe
and Lindner, ICDT 2020), and we show that the semantics remains meaningful even
when the input of a probabilistic Datalog program is an arbitrary probabilistic
database.

dbworld_:
https://t.co/6op0z9yob5 Generative Datalog with Continuous Distributions. (arXiv:2001.06358v1 [cs.DB]) #databases

arxiv_cslo:
Generative Datalog with Continuous Distributions https://t.co/F8PpfMrIZZ

Authors: 4

Total Words: 18218

Unique Words: 3563

Keyword search on Knowledge Graphs (KGs) has recently become popular. Typical
keyword search approaches aim at finding a concise subgraph from a KG, which
can reflect a close relationship among all input keywords. The connection paths
between keywords are selected in a way that leads to a result subgraph with a
better semantic score. However, such a result may not meet the user's information
need because it relies on the scoring function to decide which keywords to link
more closely. As a consequence, it may miss close connections among some keywords
on which users intend to focus. In this paper, we propose a parallel keyword
search engine, called RAKS. It allows users to specify a query as two sets of
keywords, namely central keywords and marginal keywords. Specifically, central
keywords are those on which users focus more; their relationships are
desired in the results. Marginal keywords are those on which users focus less;
their connections to the central keywords are desired. In addition, they
provide additional information that...

dbworld_:
https://t.co/6aKA974OWv Efficient Radial Pattern Keyword Search on Knowledge Graphs in Parallel. (arXiv:2001.06770v1 [cs.DB]) #databases

Authors: 2

Total Words: 0

Unique Words: 0

The AI revolution is data driven. AI "data wrangling" is the process by which
unusable data is transformed to support AI algorithm development (training) and
deployment (inference). Significant time is devoted to translating diverse data
representations supporting the many query and analysis steps found in an AI
pipeline. Rigorous mathematical representations of these data enable data
translation and analysis optimization within and across steps. Associative
array algebra provides a mathematical foundation that naturally describes the
tabular structures and set mathematics that are the basis of databases.
Likewise, the matrix operations and corresponding inference/training
calculations used by neural networks are also well described by associative
arrays. More surprisingly, a general denormalized form of hierarchical formats,
such as XML and JSON, can be readily constructed. Finally, pivot tables, which
are among the most widely used data analysis tools, naturally emerge from
associative array constructors. A common foundation in...
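
As a rough illustration of the constructor idea in this abstract, the following Python sketch models an associative array as a dictionary keyed by (row, column) pairs. The names `assoc` and `matmul`, the choice to sum colliding entries, and the sample sales data are assumptions made for illustration; they are not taken from the paper or from any associative array library.

```python
def assoc(rows, cols, vals, combine=lambda a, b: a + b):
    """Build {(row, col): value}; entries with the same key are merged with `combine`."""
    out = {}
    for r, c, v in zip(rows, cols, vals):
        key = (r, c)
        out[key] = combine(out[key], v) if key in out else v
    return out


def matmul(a, b):
    """Sparse product of two associative arrays, joining a's column keys with b's row keys."""
    out = {}
    for (i, k), x in a.items():
        for (k2, j), y in b.items():
            if k == k2:
                out[(i, j)] = out.get((i, j), 0) + x * y
    return out


# Hypothetical sales records: one (region, quarter, amount) triple per sale.
regions = ["north", "north", "south"]
quarters = ["q1", "q1", "q2"]
amounts = [10, 5, 7]

# Constructing the array from the triples already yields a pivot table,
# because the two "north"/"q1" records collide and are summed.
pivot = assoc(regions, quarters, amounts)
# pivot == {("north", "q1"): 15, ("south", "q2"): 7}
```

In this sketch the same dictionary form serves both as a table (string keys) and as a sparse matrix (via `matmul`), which is one way to read the claim that databases and neural network calculations share a representation.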

dbworld_:
https://t.co/1n4hrkTSXy AI Data Wrangling with Associative Arrays. (arXiv:2001.06731v1 [cs.DB]) #databases

Authors: 5

Total Words: 0

Unique Words: 0

Industrial AI systems are mostly end-to-end machine learning (ML) workflows.
A typical recommendation or business intelligence system includes many online
micro-services and offline jobs. We describe SQLFlow for developing such
workflows efficiently in SQL. SQL enables developers to write short programs
focusing on the purpose (what) and ignoring the procedure (how). Previous
database systems extended their SQL dialect to support ML. SQLFlow
(https://sqlflow.org/sqlflow) takes a different strategy: it works as a bridge between
various database systems, including MySQL, Apache Hive, and Alibaba MaxCompute,
and ML engines such as TensorFlow, XGBoost, and scikit-learn. We carefully extended
SQL syntax so that the extension works with various SQL dialects. We
implement the extension by inventing a collaborative parsing algorithm. SQLFlow
is efficient and expressive across a wide variety of ML techniques -- supervised
and unsupervised learning; deep networks and tree models; visual model
explanation in addition to training and prediction; data...

BrundageBot:
SQLFlow: A Bridge between SQL and Machine Learning. Yi Wang, Yang Yang, Weiguo Zhu, Yi Wu, Xu Yan, Yongfeng Liu, Yu Wang, Liang Xie, Ziyao Gao, Wenjing Zhu, Xiang Chen, Wei Yan, Mingjie Tang, and Yuan Tang https://t.co/ShIinYGV8M

arxivml:
"SQLFlow: A Bridge between SQL and Machine Learning",
Yi Wang, Yang Yang, Weiguo Zhu, Yi Wu, Xu Yan, Yongfeng Liu, …
https://t.co/a1O3sboMKL

dbworld_:
https://t.co/fn79jTa4tq SQLFlow: A Bridge between SQL and Machine Learning. (arXiv:2001.06846v1 [cs.DB]) #databases

Memoirs:
SQLFlow: A Bridge between SQL and Machine Learning. https://t.co/Q2tT66LHXP

Authors: 14

Total Words: 3282

Unique Words: 1321

Significant amounts of data are currently being stored and managed on
third-party servers. It is impractical for many small-scale enterprises to own
their private datacenters, so renting third-party servers is a viable
solution for such businesses. But the increasing number of malicious attacks,
both internal and external, as well as buggy software on third-party servers is
causing clients to lose their trust in these external infrastructures. While
small enterprises cannot avoid using external infrastructures, they need the
right set of protocols to manage their data on untrusted infrastructures. In
this paper, we propose TFCommit, a novel atomic commitment protocol that
executes transactions on data stored across multiple untrusted servers. To our
knowledge, TFCommit is the first atomic commitment protocol to execute
transactions in an untrusted environment without using expensive Byzantine
replication. Using TFCommit, we propose an auditable data management system,
Fides, residing completely on untrustworthy infrastructure....

dbworld_:
https://t.co/PNqOLtNshK Fides: Managing Data on Untrusted Infrastructure. (arXiv:2001.06933v1 [cs.DB]) #databases

Authors: 4

Total Words: 12840

Unique Words: 3000

Cohesive subgraph mining in bipartite graphs has recently become a popular research
topic. An important structure, the k-bitruss, is the maximal cohesive subgraph
where each edge is contained in at least k butterflies (i.e., (2,
2)-bicliques). In this paper, we study the bitruss decomposition problem which
aims to find all the k-bitrusses for k >= 0. The existing bottom-up techniques
need to iteratively peel the edges with the lowest butterfly support. In this
peeling process, these techniques spend considerable time enumerating all the
supporting butterflies for each edge. To alleviate this issue, we first propose a
novel online index, the BE-Index, which compresses butterflies into k-blooms
(i.e., (2, k)-bicliques). Based on the BE-Index, the new bitruss decomposition
algorithm BiT-BU is proposed, along with two batch-based optimizations, to
accomplish the butterfly enumeration of the peeling process in an efficient
way. Furthermore, the BiT-PC algorithm is devised, which is more efficient
at handling the edges with high butterfly supports....
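
To make the bottom-up peeling idea in this abstract concrete, here is a minimal Python sketch of the naive baseline: count the butterflies that contain each edge, repeatedly remove an edge with the lowest support, and record the largest support level reached so far as that edge's bitruss number. It is not the BE-Index, BiT-BU, or BiT-PC. The function names and the tiny example graph are hypothetical, and supports are recomputed from scratch after every removal for clarity rather than speed.

```python
from itertools import combinations


def butterfly_support(edges):
    """Return {edge: number of butterflies containing it}; edges are (left, right) pairs."""
    edges = set(edges)
    left_nbrs = {}
    for u, v in edges:
        left_nbrs.setdefault(u, set()).add(v)
    support = {e: 0 for e in edges}
    # A butterfly is two left vertices that share two right neighbours.
    for u1, u2 in combinations(left_nbrs, 2):
        common = left_nbrs[u1] & left_nbrs[u2]
        for v1, v2 in combinations(common, 2):
            for e in ((u1, v1), (u1, v2), (u2, v1), (u2, v2)):
                support[e] += 1
    return support


def bitruss_numbers(edges):
    """Naive peeling: the bitruss number of e is the largest k whose k-bitruss contains e."""
    remaining = set(edges)
    result = {}
    k = 0
    while remaining:
        support = butterfly_support(remaining)  # recomputed each round (slow but simple)
        e = min(remaining, key=support.__getitem__)
        k = max(k, support[e])  # the peeling level never decreases
        result[e] = k
        remaining.remove(e)
    return result


# Hypothetical graph: one butterfly on {a, b} x {x, y} plus a pendant edge (c, x).
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x")]
numbers = bitruss_numbers(edges)
# The four butterfly edges get bitruss number 1; the pendant edge ("c", "x") gets 0.
```

The repeated enumeration of butterflies in each round is the kind of cost the abstract says its index is designed to reduce.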

dbworld_:
https://t.co/NEQqS452lq Efficient Bitruss Decomposition for Large-scale Bipartite Graphs. (arXiv:2001.06111v1 [cs.DB]) #databases

Authors: 5

Total Words: 11247

Unique Words: 2190
