We present Hippo, a system that enables the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets, and the cell-level mapping between them, and it collects the information needed to reproduce the computation. Hippo efficiently supports common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting metadata separation and high-order function encoding, we observe an O(10^3)x total improvement in lineage storage efficiency over the baseline of recording cell-wise mappings, while maintaining lineage integrity. Hippo answers lineage queries from real use cases within a few seconds, which is fast enough to enable interactive diagnosis of ML pipelines.
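To make the storage argument concrete, the following is a minimal sketch of the difference between recording a cell-wise mapping explicitly and encoding the same mapping as a function. It is not Hippo's API: the names `cellwise_lineage` and `encoded_lineage`, the `(row, column)` cell representation, and the row-wise-sum transformation are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cell address: (row, column). Not taken from Hippo.
Cell = Tuple[int, int]


def cellwise_lineage(n_rows: int, n_cols: int) -> Dict[Cell, List[Cell]]:
    """Baseline: store every output-cell -> input-cells pair explicitly.

    Illustrative transformation: a row-wise sum, so output cell (r, 0)
    depends on every cell in input row r.
    """
    return {(r, 0): [(r, c) for c in range(n_cols)] for r in range(n_rows)}


def encoded_lineage(n_cols: int) -> Callable[[Cell], List[Cell]]:
    """Compact alternative: keep one function that regenerates the mapping.

    Only n_cols has to be stored; the mapping is computed on demand.
    """
    def backward(out_cell: Cell) -> List[Cell]:
        r, _ = out_cell
        return [(r, c) for c in range(n_cols)]

    return backward


explicit = cellwise_lineage(n_rows=1000, n_cols=50)
backward = encoded_lineage(n_cols=50)

# Both representations answer the same backward lineage query.
assert backward((7, 0)) == explicit[(7, 0)]

# The explicit form stores one entry per (output cell, input cell) pair.
stored_pairs = sum(len(inputs) for inputs in explicit.values())
```

In this toy case the explicit mapping grows with the number of cells, while the encoded form stores a single parameter. The sketch only illustrates the general idea of replacing enumerated mappings with a function; it makes no claim about how Hippo implements its encoding.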
Published on 01/01/2017
Volume 2017, 2017
DOI: 10.1145/3078597.3078603
Licence: Other