Harnessing Lineage for Continuous Improvement of Deep Learning Datasets
Machine learning at scale requires a robust data engine geared towards continuous data quality improvement. In this article I share my insights on how to build such a system with a fine-grained lineage tracking solution.
The Problem
Deep learning models’ success rests on the datasets they are trained on. One of the most reliable sources of performance is better training data.
However, the journey to model excellence doesn’t end at dataset creation; the process of continuously improving dataset quality is equally vital.
Continuous Improvement
Applying machine learning in real-world solutions requires iteration. Just like software solutions, over time we need to address new functionality, refinements to existing functionality, regressions, bugs and so on. These are all critical for the success and usability of any application.
However, unlike most software projects, machine learning models rely heavily on datasets for most of these improvements. This is why we need the ability to continuously improve the training datasets and fix quality issues as we find them, while keeping track of what has changed.
Reproducibility
Improving dataset quality usually means correcting labels that may have been collected some time ago. We build datasets from various sources such as archives, medical records, public information, the internet and more.
Over time, as processes, techniques, tools and knowledge improve, these datasets and their labels need to evolve and improve, while older versions are still maintained for reproducibility.
Reproducibility Challenge
Why should we care about reproducibility in industrial applications?
Historically, building machine learning models has been mostly an academic and scientific exercise; as such, you may have come across the reproducibility challenge in this field.
This is simply a precursor to the reproducibility challenge we face in industry, as AI and machine learning models become prevalent in every application, ranging from chatbots to copilots to medical assistants.
Training models for real-world applications is an iterative exercise with a long evolutionary chain of experiments, tweaks and architectural changes. It is always a challenge to pick the best model for the application at hand. To compare and evaluate models built across time, access to frozen training datasets is crucial. To track and explain improvements in models, access to the evolutionary history of your labelled dataset is equally important.
Beyond the need for an evolutionary framework, many applications of machine learning warrant deeper scrutiny and regulation of the data that was used to train the model. Applications in fields like medicine, insurance and law enforcement can have massive real-world repercussions due to small biases in the training dataset. As such, these datasets should be subject to a higher degree of analysis and tracking, one that ensures we eliminate biases as and when we discover them over time.
Needless to say, any serious machine learning operation needs to focus on reliable and reproducible datasets.
Tracking Versions
Traditionally, open datasets have solved this problem by simply publishing new versions of the entire dataset as separate copies. This works well for relatively slow-moving datasets. COCO is an example of this approach, where each snapshot is available to download independently.
For industrial applications, where we need to gather tens of millions of images with hundreds of millions of labels, this approach falls short. Typically, we expect labels to be improved continuously, on a daily basis, by an expert human workforce of hundreds of people meticulously refining them. We expect to train new models every month, if not every week. The speed of iteration, combined with the scale of the data, means that whole-dataset snapshots simply cannot keep up.
We need granular lineage tracking that allows large-scale datasets to evolve at speed without compromising on traceability, reproducibility and flexibility.
What is Lineage?
Lineage refers to the historical record of the origin, transformation, and evolution of data. It encompasses the entire lifecycle of data, including its creation, processing, and any changes it undergoes over time.
In the context of this article, I will use the word lineage to describe the transformation or correction of the ground truth attached to a feature over time. In practice, this may look like the following example:
Imagine the case of semantic segmentation to detect buildings on aerial imagery. We select a region on the map, and send it to be labelled by an expert human labeller:
While the experts labelled the image to the best of their knowledge, there is always a propensity for systemic issues and human error. For example, the labelling instructions may not cover how to label a small portion of tiles that look like, but really aren’t, part of any building.
Eventually, as we detect such systemic issues and train our workforce to handle these cases, we need another human to verify and fix the label.
While this seems simple enough, the time between the original label and its correction can span months or years. Tracking this lineage graph is going to be instrumental.
A DAG emerges
As you can imagine, tracking lineage like this can result in many different structures.
You can represent these as a series of Directed Acyclic Graphs, or “DAGs of labels”, that sit within the larger graph of the feature and label dataset. Capturing this information with your labels is only part of the challenge; what you really need is a way to transform it into a training dataset, which requires feature and label pairs.
A naive approach would be to aggregate all the labels ever gathered.
This approach makes some sense, as you are building consensus from all human labels.
However, this approach has a massive problem: it ignores the fact that new information is more likely to be correct than old. As we improve tooling, knowledge and processes over time, labels resulting from corrections are, by definition, “better” than the original labels. Aggregating new, corrected labels with older labels therefore dilutes, and in some cases reverses, the improvement.
A better approach is to favour the data that has passed through the most human attention. We can do this easily by finding the leaf nodes of these DAGs and discarding the parent nodes.
This approach guarantees that every correction to our dataset results in an improvement.
However, we may still end up with many leaf nodes, and we can use different strategies to tackle this case based on the problem at hand.
A couple of example strategies can be considered:
Human-in-the-loop consolidation of many labels into one.
Simple automatic aggregation of the leaf nodes, as sketched below.
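To make the second strategy concrete, here is a minimal sketch of leaf-node aggregation in SQL. It assumes a hypothetical leaf_labels table with one row per feature and label pair for each leaf node, and it resolves disagreements with a simple majority vote; a real pipeline for a task like segmentation would need a task-specific merge instead.

-- for each feature, keep the label that the most leaf nodes agree on
select feature_id, label
from (
  select
    feature_id,
    label,
    row_number() over (
      partition by feature_id
      order by count(*) desc
    ) as rnk
  from leaf_labels
  group by feature_id, label
) ranked
where rnk = 1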
These approaches finally produce what the model is trained on: a feature and label pair.
However, a single feature and label pair does not a dataset make. We need millions of such DAG traversals and consolidations to freeze a trainable dataset. Scale is a challenge.
Tackling Lineage at scale
A key challenge for this solution is: how do we tackle these DAG traversals in a way that scales, while still maintaining reliability and transparency?
Computing over large-scale relational data has been solved by relational databases and data warehouses for many years now; it is battle-hardened, well-understood technology. SQL engines are excellent at handling relationship structures, and they are extremely well optimised to scan and compute over arbitrary relational data.
Consider any modern data warehouse such as AWS Athena, Snowflake or Google BigQuery. All of these technologies can scale to compute millions, even billions, of such graph traversals, and do it cheaply.
SQL lifts the heavy load
To leverage SQL engines optimally, we need to serialise the graph of metadata into a structure that databases understand and are optimised for.
We serialise only the lineage information needed for this purpose: the nodes and their parent links.
Immutable data is important for maintaining integrity and reproducibility over time, so only the children record their parent relationships; existing rows are never updated. As new children are introduced, the table grows linearly.
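As an illustration, the serialised lineage can be as small as a single append-only table. The schema below is hypothetical, with illustrative column names, purely to make the idea concrete:

-- each label version is a node; only children record a parent
create table nodes (
  node       varchar,   -- unique id of this label version
  parent     varchar,   -- id of the label this one corrects; null for originals
  feature_id varchar,   -- the feature this label belongs to, e.g. an image tile
  payload    varchar,   -- pointer to the actual label data, e.g. a mask file
  created_at timestamp  -- when this version was recorded; rows are never updated
);

-- an original label has no parent
insert into nodes values ('label_a1', null, 'tile_42', 's3://labels/a1.png', current_timestamp);
-- a later correction points back at the node it supersedes
insert into nodes values ('label_a2', 'label_a1', 'tile_42', 's3://labels/a2.png', current_timestamp);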
Using simple SQL semantics, we can traverse the graph and compute the children of every node with a single query. Here is example pseudo-SQL (the actual implementation will change slightly based on the underlying engine):
-- for every node, collect the nodes that record it as their parent
select
  p.node,
  p.parent,
  array_agg(c.node) as children
from nodes as p
left outer join nodes as c
  on p.node = c.parent
group by 1, 2
The lack of any children provides an easy marker for finding leaf nodes.
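Equivalently, the leaf nodes can be selected directly with an anti-join: a leaf is any node that never appears as another node’s parent. A minimal sketch against the same hypothetical nodes table:

-- leaves: nodes with no corrections pointing back at them
select n.node, n.feature_id, n.payload
from nodes as n
left outer join nodes as c
  on n.node = c.parent
where c.node is null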
From this point, implementing flexible solutions to get to the final dataset is trivial.
This also allows easy access to all of the history of your dataset in exactly the same way you would access its latest version.
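For example, assuming the created_at column from the hypothetical schema above, reproducing the dataset exactly as it stood on a given date is the same leaf-node query, with the table filtered to the rows that existed at that time:

-- leaves as of 2023-01-01: nodes that existed by then
-- and had not yet been corrected by then
select n.node, n.feature_id, n.payload
from nodes as n
left outer join nodes as c
  on n.node = c.parent
 and c.created_at <= timestamp '2023-01-01'
where n.created_at <= timestamp '2023-01-01'
  and c.node is null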
Conclusion
When building training datasets for real-world applications of ML models, consider the long-term evolution of your dataset. Capturing parent lineage information with your labels is best practice. Leveraging modern SQL warehouses provides a powerful solution for publishing datasets and selecting optimal labels from the lineage. Treating rich historical data as a first-class citizen allows easy navigation of the iterative path that is machine learning.
You can also read this article on Medium.