Large-scale text processing pipeline with Apache Spark

Latest revision as of 00:32, 2 February 2021

Abstract

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case is the detection of policy diffusion across state legislatures in the United States over time. Previous work on policy diffusion has been unable to compare all pairs of bills because of the computational cost. Instead, scholars have restricted their studies to single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline using Spark DataFrames and the Scala application programming interface (API). We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
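To make the cost of an all-pairs comparison concrete, the following is a minimal single-machine sketch in plain Python. It is not the authors' Spark pipeline: the similarity measure (Jaccard similarity over word 3-gram shingles), the function names, and the toy bill texts are all assumptions chosen for illustration. It shows only that the number of comparisons grows quadratically with the number of bills.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def all_pairs_similarity(docs, n=3):
    """Compare every unordered pair of documents (quadratic in len(docs))."""
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    return {
        (i, j): jaccard(sets[i], sets[j])
        for i, j in combinations(sorted(sets), 2)
    }


def pair_count(num_docs):
    """Number of unordered pairs among num_docs documents: n * (n - 1) / 2."""
    return num_docs * (num_docs - 1) // 2


# Hypothetical toy "bills", invented for this example only.
example_bills = {
    "a": "the state shall regulate the sale of tobacco products",
    "b": "the state shall regulate the sale of alcohol products",
    "c": "an act relating to public school funding",
}

sims = all_pairs_similarity(example_bills)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))

# One million documents already imply roughly half a trillion comparisons.
print(pair_count(1_000_000))  # 499999500000
```

At this growth rate a naive loop stops being practical well before a corpus reaches millions of documents, which is the kind of workload the abstract describes distributing with Spark.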

Original document

The different versions of the original document can be found in:

http://arxiv.org/abs/1912.00547

http://arxiv.org/pdf/1912.00547

http://xplorestaging.ieee.org/ielx7/7818133/7840573/07841068.pdf?arnumber=7841068

http://dx.doi.org/10.1109/bigdata.2016.7841068

https://dblp.uni-trier.de/db/journals/corr/corr1912.html#abs-1912-00547

https://academic.microsoft.com/#/detail/2584392848
