Rethinking Data-Intensive Science Using Scalable Analytics Systems

Latest revision as of 04:57, 2 February 2021

Abstract

"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8–8.9x improvement over the state-of-the-art MPI-based system.

Original document

The different versions of the original document can be found in:

http://dl.acm.org/ft_gateway.cfm?id=2742787&type=pdf

https://dblp.uni-trier.de/db/conf/sigmod/sigmod2015.html#NothaftMDZLYKAH15

https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/adam.pdf

https://dl.acm.org/citation.cfm?id=2742787

https://dl.acm.org/citation.cfm?id=2723372.2742787

http://fnothaft.net/docs/papers/adam-sigmod-2015.pdf

https://doi.org/10.1145/2723372.2742787

https://academic.microsoft.com/#/detail/1989017925

http://dl.acm.org/ft_gateway.cfm?id=2742787&ftid=1586788&dwn=1

http://dx.doi.org/10.1145/2723372.2742787 under the license http://www.acm.org/publications/policies/copyright_policy#Background
