Abstract

We present Hippo, a system that enables the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. For every transformation it records the input datasets, the output datasets, and the cell-level mapping between them, along with the information needed to reproduce the computation. Hippo efficiently supports common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting metadata separation and high-order function encoding, we observe an O(10^3)x overall improvement in lineage storage efficiency over a baseline that records cell-wise mappings, while maintaining lineage integrity. Hippo answers lineage queries from real use cases within a few seconds, which is fast enough to enable interactive diagnosis of ML pipelines.
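
The storage argument above can be illustrated with a minimal sketch. It is not Hippo's API: the class names, the `backward` method, and the row-sum transformation are all hypothetical, chosen only to contrast a baseline that stores one lineage entry per output cell with an encoding that stores a single function describing the same cell-level mapping.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) index of a cell in a dataset


@dataclass
class ExplicitLineage:
    """Hypothetical baseline: one stored entry per output cell."""
    mapping: Dict[Cell, List[Cell]]

    def backward(self, out_cell: Cell) -> List[Cell]:
        # Look up the input cells that produced this output cell.
        return self.mapping[out_cell]

    def stored_entries(self) -> int:
        return len(self.mapping)


@dataclass
class FunctionLineage:
    """Hypothetical compact form: one function covers every output cell."""
    fn: Callable[[Cell], List[Cell]]

    def backward(self, out_cell: Cell) -> List[Cell]:
        # Compute the input cells on demand instead of storing them.
        return self.fn(out_cell)

    def stored_entries(self) -> int:
        return 1


# Assumed transformation: a row sum over an n_rows x n_cols input,
# so output cell (i, 0) depends on input cells (i, 0) .. (i, n_cols - 1).
n_rows, n_cols = 1000, 8

explicit = ExplicitLineage(
    {(i, 0): [(i, j) for j in range(n_cols)] for i in range(n_rows)}
)
encoded = FunctionLineage(lambda cell: [(cell[0], j) for j in range(n_cols)])

# Both forms answer the same backward lineage query.
same_answer = explicit.backward((2, 0)) == encoded.backward((2, 0))
```

In this toy setting the explicit form stores `n_rows` entries while the function form stores one, and both return the same answer to a backward query. The gain reported in the abstract comes from the paper's own encoding strategies, which this sketch does not reproduce.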


Original document

The different versions of the original document can be found in:

https://doi.acm.org/10.1145/3078597.3078603,
https://dl.acm.org/doi/pdf/10.1145/3078597.3078603,
https://dl.acm.org/ft_gateway.cfm?id=3078603&type=pdf,
https://doi.org/10.1145/3078597.3078603,
https://academic.microsoft.com/#/detail/2705062272
http://dx.doi.org/10.1145/3078597.3078603 under the license http://www.acm.org/publications/policies/copyright_policy#Background

Document information

Published on 01/01/2017

Volume 2017, 2017
DOI: 10.1145/3078597.3078603
Licence: Other
