Harnessing Lineage for Continuous Improvement of Deep Learning Datasets
Machine learning at scale requires a robust data engine geared towards continuous data quality improvement. In this article I share my insights on how to build such a system with a fine-grained lineage tracking solution.
The Problem
Deep learning models’ success rests on the datasets they are trained on. One of the most reliable sources of performance is better training data.
However, the journey to model excellence doesn’t end at dataset creation; the process of continuously improving dataset quality is equally vital.
Continuous Improvement
Applying machine learning in real-world solutions requires iteration. Just like software solutions, over time we need to address new functionality, refinements to existing functionality, regressions, bugs and so on. These are all critical for the success and usability of any application.
However, unlike most software projects, machine learning models rely heavily on datasets for most of these improvements. This is why we need the ability to continuously improve the training datasets and fix quality issues as we find them, while keeping track of what has changed.
Reproducibility
Improving dataset quality usually means correcting labels that may have been collected some time ago. We build datasets from various sources such as archives, medical records, public information, the internet and more.
Over time, as processes, techniques, tools and knowledge improve, these datasets and their labels need to evolve and improve, while older versions are still maintained for reproducibility.
Reproducibility Challenge
Why should we care about reproducibility in industrial applications?
Historically, building machine learning models has been mostly an academic and scientific exercise; as such, you may have come across the reproducibility challenge in this field.
This is simply a precursor to the reproducibility challenge we face in industry, as AI and machine learning models become prevalent in every application, ranging from chatbots to copilots to medical assistants.
Training models for real-world applications is an iterative exercise with a long evolutionary chain of experiments, tweaks and architectural changes. It is always a challenge to pick the best model for the application at hand. To compare and evaluate models built across time, access to frozen training datasets is crucial. To track and explain improvements in models, access to the evolutionary history of your labelled dataset is equally important.
Beyond the need for an evolutionary framework, many applications of machine learning warrant deeper scrutiny and regulation of the data that was used to train the model. Applications in fields like medicine, insurance and law enforcement can have massive real-world repercussions due to small biases in the training dataset. As such, these datasets should be subject to a higher degree of analysis and tracking, one that ensures we eliminate biases as and when we discover them over time.
Needless to say, any serious machine learning operation needs to focus on reliable and reproducible datasets.
Tracking Versions
Traditionally, open datasets have solved this problem by simply publishing new versions of the entire dataset as separate copies. This works well for relatively slow-moving datasets. COCO is an example of this approach, where each snapshot is available to download independently.
For industrial applications, where we need to gather tens of millions of images with hundreds of millions of labels, this approach falls short. Typically, we expect labels to be improved continuously, on a daily basis, by an expert human workforce of hundreds of people meticulously refining them. We expect to train new models every month, if not every week. The speed of iteration, combined with the scale of the data, means that whole-dataset snapshots simply cannot keep up.
We need granular lineage tracking that allows large-scale datasets to evolve at speed without compromising on traceability, reproducibility and flexibility.
What is Lineage?
Lineage refers to the historical record of the origin, transformation, and evolution of data. It encompasses the entire lifecycle of data, including its creation, processing, and any changes it undergoes over time.
In the context of this article, I will use the word lineage to describe the transformation or correction of the ground truth attached to a feature over time. In practice, this may look like the following example:
Imagine the case of semantic segmentation to detect buildings on aerial imagery. We select a region on the map, and send it to be labelled by an expert human labeller:
While the experts labelled the image to the best of their knowledge, there is always a propensity for systemic issues and human error. For example, the labelling instructions may not cover how to label a small portion of tiles that look like, but really aren’t, part of any building.
Eventually, as we detect such systemic issues and train our workforce to handle these cases, we need another human to verify and fix the label.
While this seems simple enough, the time between the original label and its correction can span months or years. Tracking this lineage graph is going to be instrumental.
A DAG emerges
As you can imagine, tracking lineage like this can result in many different structures.
You can represent these as a series of Directed Acyclic Graphs, or “DAGs of labels”, that sit within the larger graph of the feature and label dataset. Capturing this information with your labels is only part of the challenge; what you really need is a way to transform it into a training dataset, which requires feature and label pairs.
A naive approach would be to aggregate all the labels ever gathered.
This approach makes some sense, as you are building consensus from all human labels.
However, this approach has a massive problem: it ignores the fact that new information is more likely to be correct than old. As we improve tooling, knowledge and processes over time, labels resulting from corrections are, by definition, “better” than the original labels. Aggregating new, corrected labels with older labels therefore dilutes, and in some cases reverses, the improvement.
A better approach is to favour the data that has passed through the most human attention. We can do this easily by finding the leaf nodes of these DAGs and discarding the parent nodes.
This approach guarantees that every correction to our dataset results in an improvement.
However, we may still end up with many leaf nodes, and we can use different strategies to tackle this case based on the problem at hand.
A couple of example strategies can be considered:
Human-in-the-loop consolidation of many labels into one.
Simple automatic aggregation of the leaf nodes, as sketched below.
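To make the second strategy concrete, here is a minimal sketch of leaf-node aggregation in SQL. It assumes a hypothetical leaf_labels table with one row per feature and label pair for each leaf node, and it resolves disagreements with a simple majority vote; a real pipeline for a task like segmentation would need a task-specific merge instead.

-- for each feature, keep the label that the most leaf nodes agree on
select feature_id, label
from (
  select
    feature_id,
    label,
    row_number() over (
      partition by feature_id
      order by count(*) desc
    ) as rnk
  from leaf_labels
  group by feature_id, label
) ranked
where rnk = 1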
These approaches finally produce what the model is trained on: a feature and label pair.
However, a single feature and label pair does not a dataset make. We need millions of such DAG traversals and consolidations to freeze a trainable dataset. Scale is a challenge.
Tackling Lineage at scale
A key challenge for this solution is: how do we tackle these DAG traversals in a way that scales, while still maintaining reliability and transparency?
Computing over large-scale relational data has been solved by relational databases and data warehouses for many years now; it is battle-hardened, well-understood technology. SQL engines are excellent at handling relationship structures, and they are extremely well optimised to scan and compute over arbitrary relational data.
Consider any modern data warehouse such as AWS Athena, Snowflake or Google BigQuery. All of these technologies can scale to compute millions, even billions, of such graph traversals, and do it cheaply.
SQL lifts the heavy load
To leverage SQL engines optimally, we need to serialise the graph of metadata into a structure that databases understand and are optimised for.
We serialise only the lineage information needed for this purpose: the nodes and their parent links.
Immutable data is important for maintaining integrity and reproducibility over time, so only the children record their parent relationships; existing rows are never updated. As new children are introduced, the table grows linearly.
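As an illustration, the serialised lineage can be as small as a single append-only table. The schema below is hypothetical, with illustrative column names, purely to make the idea concrete:

-- each label version is a node; only children record a parent
create table nodes (
  node       varchar,   -- unique id of this label version
  parent     varchar,   -- id of the label this one corrects; null for originals
  feature_id varchar,   -- the feature this label belongs to, e.g. an image tile
  payload    varchar,   -- pointer to the actual label data, e.g. a mask file
  created_at timestamp  -- when this version was recorded; rows are never updated
);

-- an original label has no parent
insert into nodes values ('label_a1', null, 'tile_42', 's3://labels/a1.png', current_timestamp);
-- a later correction points back at the node it supersedes
insert into nodes values ('label_a2', 'label_a1', 'tile_42', 's3://labels/a2.png', current_timestamp);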
Using simple SQL semantics, we can traverse the graph and compute the children of every node with a single query. Here is example pseudo-SQL (the actual implementation will change slightly based on the underlying engine):
-- for every node, collect the nodes that record it as their parent
select
  p.node,
  p.parent,
  array_agg(c.node) as children
from nodes as p
left outer join nodes as c
  on p.node = c.parent
group by 1, 2
The lack of any children provides an easy marker for finding leaf nodes.
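Equivalently, the leaf nodes can be selected directly with an anti-join: a leaf is any node that never appears as another node’s parent. A minimal sketch against the same hypothetical nodes table:

-- leaves: nodes with no corrections pointing back at them
select n.node, n.feature_id, n.payload
from nodes as n
left outer join nodes as c
  on n.node = c.parent
where c.node is null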
From this point, implementing flexible solutions to get to the final dataset is trivial.
This also allows easy access to all of the history of your dataset in exactly the same way you would access its latest version.
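For example, assuming the created_at column from the hypothetical schema above, reproducing the dataset exactly as it stood on a given date is the same leaf-node query, with the table filtered to the rows that existed at that time:

-- leaves as of 2023-01-01: nodes that existed by then
-- and had not yet been corrected by then
select n.node, n.feature_id, n.payload
from nodes as n
left outer join nodes as c
  on n.node = c.parent
 and c.created_at <= timestamp '2023-01-01'
where n.created_at <= timestamp '2023-01-01'
  and c.node is null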
Conclusion
When building training datasets for real-world applications of ML models, consider the long-term evolution of your dataset. Capturing parent lineage information with your labels is best practice. Leveraging modern SQL warehouses provides a powerful solution for publishing datasets and selecting optimal labels from the lineage. Treating rich historical data as a first-class citizen allows easy navigation of the iterative path that is machine learning.
You can also read this article on Medium.