Data Council 2022: Building Lakehouse with Delta Lake
Workshop at Data Council Austin

"I am building machine learning models, but my data is siloed." "I need to ensure that the models I am building are based on reliable data, so my company can make quality decisions." "I need to ensure that I am serving the right data to the right audience." "I need to ensure governance so I can be prepared for audits and GDPR." "I also need to ensure that I build efficient, performant pipelines as the data volume grows." So say data engineers, data scientists, data architects, ML practitioners, and many others.

Does any of this sound like the considerations you face when building data architectures? They resonate with me, too. So I want to take you through a journey of how we can solve these data engineering problems. Come learn at this workshop at Data Council why Delta Lake checks the boxes on all of these problems, and why the lakehouse has become the modern architecture for companies building analytics and AI applications.

Delta Lakehouse architecture explained by Vini Jaiswal

There is no shortage of challenges associated with building data pipelines, and this workshop walks through how to tackle them and make data pipelines robust and reliable. That allows downstream users to realize significant value from their data and to rely on it for critical data-driven decisions.

Given the location of the event and the attention the Austin housing market has been getting, it is fitting to use Lending Club data for our workshop and see how we would qualify for a loan. For the workshop, we will use the Databricks Community Edition so that the data and storage are easily accessible for the hands-on lab.

Features of a data lake with the Delta protocol

We will go through the following cool features of Delta Lake:

  • Unified batch + streaming data processing with multiple concurrent readers and writers: To demonstrate this functionality, we will write two different data streams into our Delta Lake table at the same time. We will create two continuous streaming readers of our Delta Lake table to illustrate streaming progress. And we will add a batch query, for good measure, as sketched below.
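
As a preview, here is a minimal sketch of what that looks like in PySpark. It assumes a notebook where a Spark session with Delta Lake is already configured (as on Databricks, where spark is predefined); the table path /tmp/delta/loans, the checkpoint locations, and the built-in rate source standing in for real loan data are all illustrative.

    from pyspark.sql.functions import rand

    # Two concurrent streaming writers appending to the same Delta table.
    # The "rate" source stands in for real loan events here.
    def start_writer(checkpoint):
        return (spark.readStream.format("rate").load()
                .withColumn("loan_amnt", rand() * 10000)
                .writeStream.format("delta")
                .option("checkpointLocation", checkpoint)
                .start("/tmp/delta/loans"))

    writer_1 = start_writer("/tmp/delta/loans/_ckpt1")
    writer_2 = start_writer("/tmp/delta/loans/_ckpt2")

    # A continuous streaming reader of the very same table ...
    reader = (spark.readStream.format("delta").load("/tmp/delta/loans")
              .writeStream.format("memory").queryName("loans_stream").start())

    # ... and a plain batch query against it, all at the same time.
    print(spark.read.format("delta").load("/tmp/delta/loans").count())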

  • ACID transactions: So how is unified batch and streaming possible simultaneously? It is because of ACID transactions. Delta Lake uses a transaction log that serves as a master record of all changes made to each table. You can view the transaction log at any time by running the DESCRIBE HISTORY command (see the first snippet after this list).
  • Medallion architecture: Working with many customers in the data and AI space, I have found that many of them are able to simplify and streamline their data architectures using a tiered architectural approach. In Delta Lake terminology, we also call it the multi-hop architecture, which is composed of Bronze, Silver, and Gold tables (see the second sketch after this list).
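
Peeking at the transaction log takes one statement. A minimal sketch, assuming a Delta table registered as loans (the table name and path are illustrative):

    # Every commit to the table shows up here: version, timestamp,
    # operation (WRITE, MERGE, DELETE, ...), and operation metrics.
    spark.sql("DESCRIBE HISTORY loans").show(truncate=False)

    # The same works for a path-based table.
    spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/loans`").show(truncate=False)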
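
And here is a minimal sketch of a Bronze-to-Silver-to-Gold hop. The paths and column names (loan_id, loan_amnt, addr_state) are illustrative placeholders, not the workshop dataset's exact schema:

    from pyspark.sql import functions as F

    # Bronze: land the raw data as-is, with no transformations.
    (spark.read.json("/data/raw/loans")
        .write.format("delta").mode("append").save("/delta/bronze/loans"))

    # Silver: filtered, cleaned, and de-duplicated records.
    (spark.read.format("delta").load("/delta/bronze/loans")
        .dropDuplicates(["loan_id"])
        .filter(F.col("loan_amnt").isNotNull())
        .write.format("delta").mode("overwrite").save("/delta/silver/loans"))

    # Gold: business-level aggregates ready for analytics.
    (spark.read.format("delta").load("/delta/silver/loans")
        .groupBy("addr_state")
        .agg(F.sum("loan_amnt").alias("total_loan_amnt"))
        .write.format("delta").mode("overwrite").save("/delta/gold/loans_by_state"))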

How can you architect data lakes to be efficient for AI workloads?

  • Schema Enforcement: Delta Lake does a lot more than just use ACID transactions to combine batch and streaming. It also offers features like schema enforcement to protect the quality of the data in your Delta tables. Without schema enforcement, data with a mismatching schema can break your entire pipeline, causing cascading failures downstream (see the first sketch after this list).
  • Schema Evolution: In the event that we do need to change our table schema, we also need schema evolution. By using the mergeSchema option, we can quickly and easily evolve the schema of our Delta tables.
  • Time Travel: Another major feature we will look at is Delta's ability to travel back in time, also called data versioning or Time Travel. Because every change is recorded as an atomic transaction in the Delta transaction log, we can use this information to recreate the exact state of our table at any point in time.
  • Rollbacks, Reproducibility, and Governance: We will also explore use cases like rollbacks, governance, and scaling machine learning experiments. Time Travel helps you avoid making irreversible changes to your tables. Using RESTORE, we can completely undo changes and simply roll back to a previous version of our data (see the second sketch after this list). Taking Time Travel one step further makes your data sets and experiments reproducible and offers verifiable data lineage for audit and governance purposes.
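
Here is a minimal sketch of both schema behaviors, again with illustrative paths and columns:

    # Schema enforcement: appending a DataFrame whose schema does not match
    # the table is rejected with an exception instead of corrupting the data.
    new_data = spark.createDataFrame([(1, "TX")], ["loan_id", "state"])
    try:
        new_data.write.format("delta").mode("append").save("/delta/loans")
    except Exception as e:
        print(f"Rejected by schema enforcement: {e}")

    # Schema evolution: opt in explicitly, and the new column is added.
    (new_data.write.format("delta")
        .option("mergeSchema", "true")
        .mode("append").save("/delta/loans"))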
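
And a sketch of Time Travel and rollback. RESTORE assumes a Delta Lake release that supports it (recent open-source releases and Databricks do); the path is illustrative:

    # Time Travel: read the table exactly as it was at version 0.
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("/delta/loans"))

    # Timestamps work too.
    earlier = (spark.read.format("delta")
               .option("timestampAsOf", "2022-03-22")
               .load("/delta/loans"))

    # Rollback: rewind the live table to a previous version.
    spark.sql("RESTORE TABLE delta.`/delta/loans` TO VERSION AS OF 0")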

How to use time travel to roll back changes.

  • Full DML support: Another feature we will explore is the full support for transactional DML commands like UPDATE, MERGE, and DELETE. These are the SQL commands that make manipulating big data tables quick and easy with minimal code.
  • GDPR use case: Before Delta Lake, deleting a user's data from a data lake to comply with a GDPR request was difficult to perform without running the risk of data loss and corruption. With Delta Lake, we can delete a user's data transactionally in just one line of code.
  • Merge Operation: Finally, Delta Lake supports merge operations, which are a mix of inserts and updates (upserts). Normally, a merge is a difficult, expensive operation that involves several intermediate steps. With Delta Lake, we can skip all that complexity and simply use the MERGE command, as shown below.
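
A minimal sketch of both, using an illustrative loans table, a hypothetical loan_updates source of changed records, and a hypothetical user_id column:

    # GDPR: delete one user's data transactionally, in a single statement.
    spark.sql("DELETE FROM loans WHERE user_id = '12345'")

    # Upsert: update matching loans and insert new ones in one atomic MERGE.
    spark.sql("""
        MERGE INTO loans AS t
        USING loan_updates AS s
        ON t.loan_id = s.loan_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)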

  • Performance: Before wrapping up, I will also demonstrate some performance features of Delta Lake.
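
For instance, here is a sketch of two commonly demonstrated performance features, assuming a Databricks runtime or a Delta Lake release where OPTIMIZE and ZORDER are available; the table and column are illustrative:

    # Compact many small files into fewer, larger ones, and co-locate
    # rows that share addr_state values for faster data skipping.
    spark.sql("OPTIMIZE loans ZORDER BY (addr_state)")

    # Remove data files no longer referenced by the table,
    # keeping 7 days of history for Time Travel.
    spark.sql("VACUUM loans RETAIN 168 HOURS")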

Conclusion

Delta Lake is used in production by 5,000+ organizations to power their lakehouse reliably. This workshop is curated so that you can leave feeling good about getting started with Delta Lake and its benefits. We will also have a Q&A at the end to give you the opportunity to ask us questions.

I am very excited to see you and the data community at Data Council on March 23rd at 11 AM CST. Here's the link to the event: https://www.datacouncil.ai/austin.

If you are not attending the Data Council conference, I will provide the notebooks and useful links afterward. Also, if you would like to stay up to date with innovations in Delta Lake or want to contribute to the project, please reach the community on Slack, the Google Group, LinkedIn, or GitHub, or you can find us on YouTube doing an AMA or a community event. Thank you!




