Contraction Clustering (RASTER) is a very fast algorithm for density-based
clustering, which requires only a single pass. It can process arbitrary amounts
of data in linear time and in constant memory, quickly identifying approximate
clusters. It also exhibits good scalability in the presence of multiple CPU
cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an
adaptation of RASTER to the stream processing paradigm that is able to identify
clusters in evolving data streams. This algorithm retains the main benefits of
its parent algorithm, i.e. single-pass linear time cost and constant memory
requirements for each discrete time step in the sliding window. The sliding
window is efficiently pruned, and clustering is still performed in linear time.
Like RASTER, S-RASTER trades off an often negligible amount of precision for
speed. It is therefore very well suited to real-world scenarios where
clustering does not happen continually but only periodically. We describe the
algorithm, including a discussion of...

more |
pdf
| html
None.

arxiv_cs_LG:
S-RASTER: Contraction Clustering for Evolving Data Streams. Gregor Ulm, Simon Smith, Adrian Nilsson, Emil Gustavsson, and Mats Jirstrand https://t.co/9YS6UAvpBt

Memoirs:
S-RASTER: Contraction Clustering for Evolving Data Streams. https://t.co/vPjdsXnBbm

None.

None.

Sample Sizes : None.

Authors: 5

Total Words: 5320

Unqiue Words: 1625

In the field of algorithms and data structures analysis and design, most of
the researchers focus only on the space/time trade-off, and little attention
has been paid to energy consumption. Moreover, most of the efforts in the field
of Green Computing have been devoted to hardware-related issues, being green
software in its infancy. Optimizing the usage of computing resources,
minimizing power consumption or increasing battery life are some of the goals
of this field of research.
As an attempt to address the most recent sustainability challenges, we must
incorporate the energy consumption as a first-class constraint when designing
new compact data structures. Thus, as a preliminary work to reach that goal, we
first need to understand the factors that impact on the energy consumption and
their relation with compression. In this work, we study the energy consumption
required by several integer vector representations. We execute typical
operations over datasets of different nature. We can see that, as commonly
believed, energy...

more |
pdf
| html
None.

powturbo:
🆕🔥🗜️"Energy consumption in compact integer vectors: A study case"
🔗 https://t.co/DGLI1NP3Ic
#⃣ #DataCompression #IntegerCompression #DataStructures #Algorithms #EnergyConsumption https://t.co/TFoxg7lPb7

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 0

Unqiue Words: 0

We give new characterizations of the sample complexity of answering linear
queries (statistical queries) in the local and central models of differential
privacy:
*In the non-interactive local model, we give the first approximate
characterization of the sample complexity. Informally our bounds are tight to
within polylogarithmic factors in the number of queries and desired accuracy.
Our characterization extends to agnostic learning in the local model.
*In the central model, we give a characterization of the sample complexity in
the high-accuracy regime that is analogous to that of Nikolov, Talwar, and
Zhang (STOC 2013), but is both quantitatively tighter and has a dramatically
simpler proof.
Our lower bounds apply equally to the empirical and population estimation
problems. In both cases, our characterizations show that a particular
factorization mechanism is approximately optimal, and the optimal sample
complexity is bounded from above and below by well studied factorization norms
of a matrix associated with the queries.

more |
pdf
| html
None.

ccanonne_:
@AcharyaJayadev @thesasho I *knew* this would happen. Just not that soon.
https://t.co/u3mFllhr6j
The list grows. I'll end up buried under a stack of interesting papers.
(Joke aside: this does look great, @thesasho.) https://t.co/XsEZfHKK08

thesasho:
After years of talking about DP, Jon Ullman and I finally wrote a paper together https://t.co/WIaThvAFql. Also first paper with Alex Edmonds, supervised by me and Toni Pitassi. We give an essentially optimal algorithm to answer any set of statistical queries in local DP.

Memoirs:
The Power of Factorization Mechanisms in Local and Central Differential Privacy. https://t.co/IskrNmyvTo

heghbalz:
RT @ccanonne_: @AcharyaJayadev @thesasho I *knew* this would happen. Just not that soon.
https://t.co/u3mFllhr6j
The list grows. I'll end u…

bipr:
RT @thesasho: After years of talking about DP, Jon Ullman and I finally wrote a paper together https://t.co/WIaThvAFql. Also first paper wi…

None.

None.

Sample Sizes : None.

Authors: 3

Total Words: 0

Unqiue Words: 0

Motivated by applications such as discovering strong ties in social networks
and assembling genome subsequences in biology, we study the problem of
recovering a hidden $2k$-nearest neighbor (NN) graph in an $n$-vertex complete
graph, whose edge weights are independent and distributed according to $P_n$
for edges in the hidden $2k$-NN graph and $Q_n$ otherwise. The special case of
Bernoulli distributions corresponds to a variant of the Watts-Strogatz
small-world graph. We focus on two types of asymptotic recovery guarantees as
$n\to \infty$: (1) exact recovery: all edges are classified correctly with
probability tending to one; (2) almost exact recovery: the expected number of
misclassified edges is $o(nk)$. We show that the maximum likelihood estimator
achieves (1) exact recovery for $2 \le k \le n^{o(1)}$ if $ \liminf
\frac{2\alpha_n}{\log n}>1$; (2) almost exact recovery for $ 1 \le k \le
o\left( \frac{\log n}{\log \log n} \right)$ if $\liminf
\frac{kD(P_n||Q_n)}{\log n}>1$, where $\alpha_n \triangleq -2 \log \int \sqrt{d
P_n d...

more |
pdf
| html
None.

Memoirs:
Consistent recovery threshold of hidden nearest neighbor graphs. https://t.co/dIGIwjgbEv

SRoyLee:
Consistent recovery threshold of hidden nearest neighbor graphs - https://t.co/0nV3TOALuA

None.

None.

Sample Sizes : None.

Authors: 4

Total Words: 0

Unqiue Words: 0

We introduce a new family of compressed data structures to efficiently store
and query large string dictionaries in main memory. Our main technique is a
combination of hierarchical Front-coding with ideas from longest-common-prefix
computation in suffix arrays. Our data structures yield relevant space-time
tradeoffs in real-world dictionaries. We focus on two domains where string
dictionaries are extensively used and efficient compression is required: URL
collections, a key element in Web graphs and applications such as Web mining;
and collections of URIs and literals, the basic components of RDF datasets. Our
experiments show that our data structures achieve better compression than the
state-of-the-art alternatives while providing very competitive query times.

more |
pdf
| html
None.

powturbo:
🆕🗜️"Improved Compressed String Dictionaries" new family of CDS to efficiently store and query large string dictionaries in main memory...
📄https://t.co/Xk8PrxVAok
#⃣ #DataCompression #IntegerCompression #InformationRetrieval #DataStructures #Algorithms https://t.co/rXIPsxlnMd

okateim:
Improved Compressed String Dictionaries
https://t.co/YALkWfyOgC

None.

None.

Sample Sizes : None.

Authors: 4

Total Words: 0

Unqiue Words: 0

In this paper, we study the peak-aware energy scheduling problem using the
competitive framework with machine learning prediction. With the uncertainty of
energy demand as the fundamental challenge, the goal is to schedule the energy
output of local generation units such that the electricity bill is minimized.
While this problem has been tackled using classic competitive design with
worst-case guarantee, the goal of this paper is to develop learning-assisted
competitive algorithms to improve the performance in a provable manner. We
develop two deterministic and randomized algorithms that are provably robust
against the poor performance of learning prediction, however, achieve the
optimal performance as the error of prediction goes to zero. Extensive
experiments using real data traces verify our theoretical observations and show
15.13% improved performance against pure online algorithms.

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 3

Total Words: 10370

Unqiue Words: 2022

Given query access to a set of constraints $S$, we wish to quickly check if
some objective function $\varphi$ subject to these constraints is at most a
given value $k$. We approach this problem using the framework of property
testing where our goal is to distinguish the case $\varphi(S) \le k$ from the
case that at least an $\epsilon$ fraction of the constraints in $S$ need to be
removed for $\varphi(S) \le k$ to hold. We restrict our attention to the case
where $(S, \varphi)$ are LP-Type problems which is a rich family of
combinatorial optimization problems with an inherent geometric structure. By
utilizing a simple sampling procedure which has been used previously to study
these problems, we are able to create property testers for any LP-Type problem
whose query complexities are independent of the number of constraints. To the
best of our knowledge, this is the first work that connects the area of LP-Type
problems and property testing in a systematic way. Among our results is a tight
upper bound on the query complexity of...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 0

Unqiue Words: 0

We study the single-choice Prophet Inequality problem when the gambler is
given access to samples. We show that the optimal competitive ratio of $1/2$
can be achieved with a single sample from each distribution. When the
distributions are identical, we show that for any constant $\varepsilon > 0$,
$O(n)$ samples from the distribution suffice to achieve the optimal competitive
ratio ($\approx 0.745$) within $(1+\varepsilon)$, resolving an open problem of
Correa, D\"utting, Fischer, and Schewior.

more |
pdf
| html
None.

DO:
Optimal Single-Choice Prophet Inequalities from Samples. https://t.co/bvp49WvhGZ

None.

None.

Sample Sizes : None.

Authors: 3

Total Words: 0

Unqiue Words: 0

We study how to estimate a nearly low-rank Toeplitz covariance matrix $T$
from compressed measurements. Recent work of Qiao and Pal addresses this
problem by combining sparse rulers (sparse linear arrays) with frequency
finding (sparse Fourier transform) algorithms applied to the Vandermonde
decomposition of $T$. Analytical bounds on the sample complexity are shown,
under the assumption of sufficiently large gaps between the frequencies in this
decomposition. In this work, we introduce random ultra-sparse rulers and
propose an improved approach based on these objects. Our random rulers
effectively apply a random permutation to the frequencies in $T$'s Vandermonde
decomposition, letting us avoid frequency gap assumptions and leading to
improved sample complexity bounds. In the special case when $T$ is circulant,
we theoretically analyze the performance of our method when combined with
sparse Fourier transform algorithms based on random hashing. We also show
experimentally that our ultra-sparse rulers give significantly more robust...

more |
pdf
| html
None.

None.

Sample Sizes : None.

Authors: 4

Total Words: 5386

Unqiue Words: 1720

We study high-dimensional sparse estimation tasks in a robust setting where a
constant fraction of the dataset is adversarially corrupted. Specifically, we
focus on the fundamental problems of robust sparse mean estimation and robust
sparse PCA. We give the first practically viable robust estimators for these
problems. In more detail, our algorithms are sample and computationally
efficient and achieve near-optimal robustness guarantees. In contrast to prior
provable algorithms which relied on the ellipsoid method, our algorithms use
spectral techniques to iteratively remove outliers from the dataset. Our
experimental evaluation on synthetic data shows that our algorithms are
scalable and significantly outperform a range of previous approaches, nearly
matching the best error rate without corruptions.

more |
pdf
| html
None.

StatsPapers:
Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering. https://t.co/6a9KTJxHP1

paulportesi:
RT @StatsPapers: Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering. https://t.co/6a9KTJxHP1

None.

None.

Sample Sizes : None.

Authors: 5

Total Words: 15097

Unqiue Words: 3010

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real-time) based on how verifiable they are (as determined by their Github repos) and how interesting they are (based on Twitter).

To see top papers, follow us on twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

*Tracking 226,490 papers.*

Sort results based on if they are interesting or reproducible.

Interesting

Reproducible