Exploring Genomic Datasets

Abstract

Genomic data management focuses on achieving high performance over big datasets using batch, cloud-based architectures. This enables the execution of massive pipelines, but it hampers exploration of the solution space when that space is not well defined, for example when trying different experimental samples or query extraction parameters. We present PyGMQL, a Python-based interoperability software layer that enables testing of experimental pipelines. PyGMQL resolves the impedance mismatch between a batch execution environment and the agile programming style of Python, and provides transparent access when exploration requires integrating local and remote resources. Wrapping PyGMQL and Python primitives within Jupyter notebooks guarantees reproducibility of the pipeline when it is used in different contexts or by different scientists. The software is freely available at https://github.com/DEIB-GECO/PyGMQL.

Original document

The different versions of the original document can be found in:

http://hdl.handle.net/11311/1095264

https://re.public.polimi.it/retrieve/handle/11311/1095264/392193/Nanni.pdf

http://dl.acm.org/ft_gateway.cfm?id=3214710&ftid=2052771&dwn=1

http://dx.doi.org/10.1145/3214708.3214710 under the license http://www.acm.org/publications/policies/copyright_policy#Background

https://dblp.uni-trier.de/db/conf/sigmod/exploredb2018.html#NanniPCC18

https://re.public.polimi.it/handle/11311/1095264

https://academic.microsoft.com/#/detail/2943444417
