Abstract

Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove invaluable for bioinformaticians. To this end, we have developed an open-source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies. In TPOT-MDR, we implement Multifactor Dimensionality Reduction (MDR) as a feature construction method for modeling higher-order feature interactions, and combine it with a new expert knowledge-guided feature selector for large biomedical data sets. We demonstrate TPOT-MDR's capabilities using a combination of simulated and real-world data sets from human genetics and find that TPOT-MDR significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline discovered by TPOT-MDR for a real-world problem and highlight TPOT-MDR's ability to produce a high-accuracy solution that is also easily interpretable.
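
To make the MDR feature construction step mentioned above more concrete, the following is a minimal sketch of the general MDR idea, not TPOT-MDR's actual implementation, which the abstract does not show. Each observed combination of discrete feature values (for example, SNP genotypes) is labeled high risk if its case:control ratio exceeds the ratio in the whole data set, and the combination is then collapsed into a single binary feature. The function names `fit_mdr` and `transform_mdr`, the tie-breaking rule (ties count as low risk), and the treatment of unseen combinations (low risk) are assumptions made for illustration.

```python
from collections import defaultdict


def fit_mdr(rows, labels):
    """Return the set of feature-value combinations labeled high risk.

    rows   : list of tuples of discrete feature values (e.g. genotypes 0/1/2)
    labels : list of class labels, 0 (control) or 1 (case)
    Assumes both classes are present in `labels`.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [controls, cases]
    for row, label in zip(rows, labels):
        counts[tuple(row)][label] += 1

    n_cases = sum(labels)
    n_controls = len(labels) - n_cases

    high_risk = set()
    for cell, (controls, cases) in counts.items():
        # Equivalent to cases/controls > n_cases/n_controls,
        # cross-multiplied to avoid dividing by zero in empty cells.
        if cases * n_controls > controls * n_cases:
            high_risk.add(cell)
    return high_risk


def transform_mdr(rows, high_risk):
    """Collapse each row into one binary feature: 1 = high risk, 0 = low risk."""
    # Assumption: combinations never seen during fitting are treated as low risk.
    return [1 if tuple(row) in high_risk else 0 for row in rows]


# Toy data with a pure interaction (XOR-like) between two features:
# neither feature alone separates cases from controls, but the pair does.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0, 0, 1, 1, 0]

high_risk = fit_mdr(rows, labels)
constructed = transform_mdr(rows, high_risk)
print(sorted(high_risk))
print(constructed)
```

In this toy example the constructed feature reproduces the class label exactly, which illustrates why a downstream classifier can benefit from an MDR-built feature when the signal lies in a feature interaction rather than in any single feature.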

Comment: 9 pages, 4 figures, submitted to GECCO 2017 conference and currently under review


Original document

The different versions of the original document can be found in:

http://dx.doi.org/10.1145/3071178.3071212 under the license http://www.acm.org/publications/policies/copyright_policy#Background
https://dblp.uni-trier.de/db/journals/corr/corr1702.html#SohnOM17
https://dl.acm.org/citation.cfm?id=3071178.3071212
https://ui.adsabs.harvard.edu/abs/2017arXiv170201780S/abstract
https://dl.acm.org/citation.cfm?id=3071212
http://dblp.uni-trier.de/db/journals/corr/corr1702.html#SohnOM17
https://academic.microsoft.com/#/detail/2586298664

Document information

Published on 01/01/2017

Volume 2017, 2017
DOI: 10.1145/3071178.3071212
Licence: Other
