== Abstract ==

Accepted manuscript version. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-24462-4_22.

Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems that are well suited to processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update a large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.
 
Document type: Part of book or chapter of book
 
 
== Full document ==
 
<pdf>Media:Draft_Content_294469788-beopen445-5616-document.pdf</pdf>
 
  
  
The different versions of the original document can be found in:

* [https://hdl.handle.net/10037/8816 https://hdl.handle.net/10037/8816]
* [https://munin.uit.no/bitstream/10037/8816/2/article.pdf https://munin.uit.no/bitstream/10037/8816/2/article.pdf]
* [http://link.springer.com/content/pdf/10.1007/978-3-319-24462-4_22 http://link.springer.com/content/pdf/10.1007/978-3-319-24462-4_22]
* [http://dx.doi.org/10.1007/978-3-319-24462-4_22 http://dx.doi.org/10.1007/978-3-319-24462-4_22] under the license http://www.springer.com/tdm
* [https://link.springer.com/chapter/10.1007/978-3-319-24462-4_22 https://link.springer.com/chapter/10.1007/978-3-319-24462-4_22]
* [https://core.ac.uk/display/141562029 https://core.ac.uk/display/141562029]
* [https://munin.uit.no/handle/10037/8816 https://munin.uit.no/handle/10037/8816]
* [https://dblp.uni-trier.de/db/conf/cibb/cibb2014.html#BongoPE14 https://dblp.uni-trier.de/db/conf/cibb/cibb2014.html#BongoPE14]
* [https://rd.springer.com/chapter/10.1007/978-3-319-24462-4_22 https://rd.springer.com/chapter/10.1007/978-3-319-24462-4_22]
* [https://academic.microsoft.com/#/detail/2296514683 https://academic.microsoft.com/#/detail/2296514683]

Latest revision as of 17:10, 21 January 2021




Document information

Published on 01/01/2015

Volume 2015, 2015
DOI: 10.1007/978-3-319-24462-4_22
Licence: CC BY-NC-SA
