Reverse time migration (RTM) is an algorithm widely used in the oil and gas
industry to process seismic data. It is a computationally intensive task that
suits well in parallel computers. Because of it being massive and regular, this
type of task is often equally and statically distributed among the available
parallel workers. However, this strategy is often not optimal. When the workers
are heterogeneous, and even when most have similar processing power, many of
them might still have to wait idly for the slower workers. In this paper, we
show that even small performance differences between homogeneous cores can
considerably affect the overall performance of a 3D RTM application. We show
that dynamic load distribution has a significant advantage over the
conventional static distribution. However, the granularity of the dynamically
distributed chunks of work plays a key role in harvesting this advantage. In
order to find the optimal granularity, we propose a coupled simulated annealing
(CSA) based auto-tuning strategy that adjusts...

more |
pdf
| html
None.

arxiv_org:
Auto-tuning of dynamic load balancing applied to 3D reverse time migration on multicore s... https://t.co/x6Iq7UvS5u https://t.co/lYwGp8iJba

None.

None.

Sample Sizes : None.

Authors: 4

Total Words: 7049

Unqiue Words: 2077

The paper studies the problem of reaching agreement in a distributed
message-passing system prone to crash failures. Crashes are generated by
\constrained\ adversaries - a \wadapt\ adversary, who has to fix in advance the
set of $f$ crash-prone processes, or a \chainadapt\ adversary, who orders all
the processes into $k$ disjoint chains and has to follow this pattern when
crashing them. Apart from these constraints, both of them may crash processes
in an adaptive way at any time. While commonly used \sadapt\ adversaries model
attacks and \noadapt\ ones -- pre-defined faults, the constrained adversaries
model more realistic scenarios when there are fault-prone dependent processes,
e.g., in hierarchical or dependable software/hardware systems. We propose
time-efficient consensus algorithms against such adversaries and also show how
to improve the message complexity of proposed solutions. Finally, we show how
to reach consensus against a \kthick\ adversary, limited by an arbitrary
partial order \dk{with a maximal anti-chain of length...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 10052

Unqiue Words: 2201

Science reproducibility is a cornerstone feature in scientific workflows. In
most cases, this has been implemented as a way to exactly reproduce the
computational steps taken to reach the final results. While these steps are
often completely described, including the input parameters, datasets, and
codes, the environment in which these steps are executed is only described at a
higher level with endpoints and operating system name and versions. Though this
may be sufficient for reproducibility in the short term, systems evolve and are
replaced over time, breaking the underlying workflow reproducibility. A natural
solution to this problem is containers, as they are well defined, have a
lifetime independent of the underlying system, and can be user-controlled so
that they can provide custom environments if needed. This paper highlights some
unique challenges that may arise when using containers in distributed
scientific workflows. Further, this paper explores how the Pegasus Workflow
Management System implements container support to...

more |
pdf
| html
None.

None.

Sample Sizes : None.

Authors: 10

Total Words: 8185

Unqiue Words: 2405

We develop techniques to prove lower bounds for the BCAST(log n) Broadcast
Congested Clique model (a distributed message passing model where in each
round, each processor can broadcast an O(log n)-sized message to all other
processors). Our techniques are built to prove bounds for natural input
distributions. So far, all lower bounds for problems in the model relied on
constructing specifically tailored graph families for the specific problem at
hand, resulting in lower bounds for artificially constructed inputs, instead of
natural input distributions.
One of our results is a lower bound for the directed planted clique problem.
In this problem, an input graph is either a random directed graph (each
directed edge is included with probability 1/2), or a random graph with a
planted clique of size k. That is, k randomly chosen vertices have all of the
edges between them included, and all other edges in the graph appear with
probability 1/2. The goal is to determine whether a clique exists. We show that
when k = n^(1/4 - eps), this...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 22767

Unqiue Words: 3436

Scientific applications often contain large and computationally intensive
parallel loops. Dynamic loop self scheduling (DLS) is used to achieve a
balanced load execution of such applications on high performance computing
(HPC) systems. Large HPC systems are vulnerable to processors or node failures
and perturbations in the availability of resources. Most self-scheduling
approaches do not consider fault-tolerant scheduling or depend on failure or
perturbation detection and react by rescheduling failed tasks. In this work, a
robust dynamic load balancing (rDLB) approach is proposed for the robust self
scheduling of independent tasks. The proposed approach is proactive and does
not depend on failure or perturbation detection. The theoretical analysis of
the proposed approach shows that it is linearly scalable and its cost decrease
quadratically by increasing the system size. rDLB is integrated into an MPI DLS
library to evaluate its performance experimentally with two computationally
intensive scientific applications. Results show...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 3

Total Words: 8495

Unqiue Words: 2063

We present a new lock-free safe memory reclamation algorithm, Hyaline, which
is fast, scalable, and transparent to the underlying data structures. Due to
very low contention, Hyaline is generally faster than existing approaches,
including epoch-based reclamation. Moreover, Hyaline easily handles virtually
unbounded number of threads (or any concurrent entities) that can be created
and deleted dynamically, while retaining O(1) reclamation cost. Hyaline also
does not depend on any OS abstractions or compiler support. Hyaline's full
transparency and simple API enable easy integration into unmanaged C/C++ code,
while not causing any undue burden on programmers, potentially making the
entire process fully automatic through special language constructs. We also
extend Hyaline to avoid situations where stalled threads prevent others from
reclaiming newly allocated objects, a common problem with epoch-based
reclamation. We have implemented and tested Hyaline on the x86(-64), ARM32/64,
PPC, and MIPS architectures. The general approach...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 12548

Unqiue Words: 3001

We introduce the Adaptive Massively Parallel Computation (AMPC) model, which
is an extension of the Massively Parallel Computation (MPC) model. At a high
level, the AMPC model strengthens the MPC model by storing all messages sent
within a round in a distributed data store. In the following round, all
machines are provided with random read access to the data store, subject to the
same constraints on the total amount of communication as in the MPC model. Our
model is inspired by the previous empirical studies of distributed graph
algorithms using MapReduce and a distributed hash table service.
This extension allows us to give new graph algorithms with much lower round
complexities compared to the best known solutions in the MPC model. In
particular, in the AMPC model we show how to solve maximal independent set in
$O(1)$ rounds and connectivity/minimum spanning tree in $O(\log\log_{m/n} n)$
rounds both using $O(n^\delta)$ space per machine for constant $\delta < 1$. In
the same memory regime for MPC, the best known algorithms for...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 6

Total Words: 0

Unqiue Words: 0

In a prior work (ICDCS 2017), we presented a distributed self-adjusting
algorithm DSG for skip graphs. DSG performs topological adaption to
communication pattern to minimize the average routing costs between
communicating nodes. In this work, we present a distributed self-adjusting
algorithm (referred to as DyHypes) for topological adaption in hypercubic
networks. One of the major differences between hypercubes and skip graphs is
that hypercubes are more rigid in structure compared skip graphs. This property
makes self-adjustment significantly different in hypercubic networks than skip
graphs. Upon a communication between an arbitrary pair of nodes, DyHypes
transforms the network to place frequently communicating nodes closer to each
other to maximize communication efficiency, and uses randomization in the
transformation process to speed up the transformation and reduce message
complexity. We show that, as compared to DSG, DyHypes reduces the
transformation cost by a factor of $O(\log n)$, where $n$ is the number of
nodes involved...

more |
pdf
| html
None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 13487

Unqiue Words: 2031

Graph analytics for large scale graphs has gained interest in recent years.
Many graph algorithms have been designed for vertex-centric distributed graph
processing frameworks to operate on large graphs with 100 M vertices and edges,
using commodity clusters and Clouds. Subgraph-centric programming models have
shown additional performance benefits than vertex-centric models. But direct
mapping of vertex-centric and shared-memory algorithms to subgraph-centric
frameworks are either not possible, or lead to inefficient algorithms. In this
paper, we present three subgraph-centric distributed graph algorithms for
triangle counting, clustering and minimum spanning forest, using variations of
shared and vertex-centric models. These augment existing subgraph-centric
algorithms that exist in literature, and allow a broader evaluation of these
three classes of graph processing algorithms and platforms.

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 2

Total Words: 0

Unqiue Words: 0

The last decade has seen a shift in the computer systems industry where
heterogeneous computing has become prevalent. Nowadays, Graphics Processing
Units (GPUs) are in a variety of systems from supercomputers to mobile phones
and tablets. They are not only used for graphics operations but rather as
general-purpose special hardware (GPGPUs) to boost the performance of
compute-intensive applications. However, the percentage of undisclosed
characteristics beyond what vendors provide is small. In this paper, we propose
a very low overhead and portable analysis for exposing the hidden latency of
each individual instruction executing in the pipeline and different access
latencies of the various memory hierarchies at the microarchitecture level. We
also show the impact of the possible optimizations a CUDA compiler have over
the various latencies. We run our evaluation on seven different high-end NVIDIA
GPUs from five different generations/architectures namely: Kepler, Maxwell,
Pascal, Volta, and Turing. We believe that this work would...

more |
pdf
| html
None.

None.

None.

Sample Sizes : None.

Authors: 5

Total Words: 0

Unqiue Words: 0

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real-time) based on how verifiable they are (as determined by their Github repos) and how interesting they are (based on Twitter).

To see top papers, follow us on twitter @assertpub_ (arXiv), @assert_pub (bioRxiv), and @assertpub_dev (everything else).

To see beautiful figures extracted from papers, follow us on Instagram.

*Tracking 129,963 papers.*

Sort results based on if they are interesting or reproducible.

Interesting

Reproducible