Abstract

Modern sequencing machines produce on the order of a terabyte of data per day, which must subsequently go through a complex processing pipeline. The conventional workflow begins with a few independent, shared-memory tools that communicate through intermediate files. This approach lacks robustness and scalability, making it ill-suited to exploiting the full potential of sequencing in healthcare, where large-scale, population-wide applications are the norm. In this work we propose the adoption of stream computing to simplify the genomic resequencing pipeline, boosting its performance and improving its fault tolerance. We decompose the first steps of genomic processing into two distinct, specialized modules (preprocessing and alignment) and loosely couple them via communication through Kafka streams, allowing for easy composability and integration into existing YARN-based pipelines. The proposed solution is experimentally validated on real data and shown to scale almost linearly.
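The decomposition described in the abstract — independent preprocessing and alignment stages coupled only through a stream — can be illustrated with a minimal in-memory sketch. Here `queue.Queue` stands in for a Kafka topic, and the record contents, stage functions, and trimming logic are purely illustrative, not the paper's actual implementation:

```python
import queue
import threading

def preprocess(raw_reads):
    """Toy preprocessing: trim a fixed 3-base adapter suffix from each read."""
    return [r[:-3] for r in raw_reads]

def align(read):
    """Toy stand-in for an aligner: emit the read together with its length."""
    return (read, len(read))

def preprocessing_stage(raw_topic, clean_topic):
    # Consume batches of raw reads, publish preprocessed reads downstream.
    while True:
        batch = raw_topic.get()
        if batch is None:              # sentinel: end of stream
            clean_topic.put(None)      # propagate shutdown to the next stage
            return
        for read in preprocess(batch):
            clean_topic.put(read)

def alignment_stage(clean_topic, results):
    # Consume preprocessed reads one at a time, emit alignment records.
    while True:
        read = clean_topic.get()
        if read is None:
            return
        results.append(align(read))

# Wire the two stages together through their shared "topics".
raw_topic, clean_topic = queue.Queue(), queue.Queue()
results = []
stages = [
    threading.Thread(target=preprocessing_stage, args=(raw_topic, clean_topic)),
    threading.Thread(target=alignment_stage, args=(clean_topic, results)),
]
for t in stages:
    t.start()
raw_topic.put(["ACGTACGTTTT", "GGGCCCAAATT"])  # one batch of raw reads
raw_topic.put(None)                            # close the stream
for t in stages:
    t.join()
print(results)  # [('ACGTACGT', 8), ('GGGCCCAA', 8)]
```

Because each stage only reads from and writes to its topics, either module can be restarted, replaced, or scaled out independently — the property the streaming design exploits for fault tolerance and near-linear scaling.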


Original document

The different versions of the original document can be found in:

http://dx.doi.org/10.1109/bhi.2018.8333418
https://academic.microsoft.com/#/detail/2949216712
http://dx.doi.org/10.1101/182030
http://www.biorxiv.org/content/early/2017/08/29/182030.full.pdf
https://academic.microsoft.com/#/detail/2752195791


DOIs: 10.1101/182030, 10.1109/bhi.2018.8333418


Document information

Published on 01/01/2017

Volume 2017, 2017
DOI: 10.1101/182030
Licence: CC BY-NC-SA
