== Abstract ==
Distributed data processing systems are the standard means of large-scale data analysis in the Big Data field. These systems are built around processing pipelines, in which processing is performed by a composition of multiple elements or steps. In current distributed data processing systems, the code and parameters that define the pipeline are fixed at design time, before the application starts processing any data. Any change to the pipeline after it has started requires the entire pipeline to be restarted. When a system must be operational 24/7 or respond in a timely fashion, such restarts and the resulting downtime are not acceptable. Instead, the processing system should operate autonomously, continuously picking up changes from its environment and adjusting its processing steps, parameters, etc. on the fly. In this paper, we address this problem by allowing changes to be made to a processing pipeline without restarting it. We focus on two aspects of the problem: switching to another data source used as input, and changing the functional code and variables within the elements of a pipeline. Our system is built on top of Apache Spark, a framework widely used for distributed data processing.
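To make the second aspect concrete, the sketch below shows one common way to make the per-record function of a Spark Streaming job replaceable at runtime. It is a minimal sketch, not the implementation from the paper: the socket source on port 9999, the object and variable names, and the use of a driver-side <code>@volatile var</code> re-read on every micro-batch are all assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HotSwapSketch {
  // Hypothetical mutable holder for the per-record function; assigning a new
  // function to it changes the pipeline's behavior without a restart.
  @volatile var stepFunction: Int => Int = x => x * 2

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("hot-swap-sketch")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Assumed test source: whitespace-separated integers over a local socket.
    val lines   = ssc.socketTextStream("localhost", 9999)
    val numbers = lines.flatMap(_.split("\\s+")).map(_.toInt)

    // DStream.transform re-evaluates this closure on the driver for every
    // micro-batch, so the current stepFunction is captured per batch and
    // shipped to the executors with that batch's closure.
    val processed = numbers.transform { rdd =>
      val f = stepFunction // snapshot the current version on the driver
      rdd.map(f)
    }
    processed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In practice, the new function would arrive over some control channel (for example, a REST endpoint or a configuration watcher) that assigns to <code>stepFunction</code>; the update then takes effect from the next micro-batch onward. The first aspect, switching the input source, needs a similar level of indirection and is not covered by this sketch, since Spark Streaming does not allow new input streams to be attached after the context has started.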
The different versions of the original document can be found via the DOI below.
Published on 31/12/16
Accepted on 31/12/16
Submitted on 31/12/16
Volume 2017 (2017)
DOI: 10.1109/iccac.2017.11
Licence: CC BY-NC-SA