Data Council 2022: Building Lakehouse with Delta Lake
Workshop at Data Council Austin

"I am building machine learning models, but my data is siloed." "I need to ensure that the models I am building are based on reliable data, so my company can make quality decisions." "I need to ensure that I am serving the right data to the right audience." "I need to ensure governance so I can be prepared for audits and GDPR." "I also need to ensure that I build efficient, performant pipelines as the data volume grows." So say data engineers, data scientists, data architects, ML practitioners, and many others.

Does any of this sound like the considerations you face when building data architectures? They resonate with me, too. So I want to take you through a journey of how we can solve these data engineering problems. Come learn at this workshop at Data Council why Delta Lake checks the boxes on all of these problems, and why the lakehouse has become the modern architecture for companies building analytics and AI applications.

Delta Lakehouse architecture explained by Vini Jaiswal

There is no shortage of challenges associated with building data pipelines, and this workshop walks through how to tackle them and make data pipelines robust and reliable. That allows downstream users to realize significant value from their data and to rely on it for critical data-driven decisions.

Given the location of the event and the attention the Austin housing market has been getting, it is fitting to use Lending Club data for our workshop and see how we would qualify for a loan. For the workshop, we will use the Databricks Community Edition so that the data and storage are easily accessible for the hands-on lab.

Features of a data lake with the Delta protocol

We will go through the following cool features of Delta Lake:

  • Unified batch + streaming data processing with multiple concurrent readers and writers: To demonstrate this functionality, we will write two different data streams into our Delta Lake table at the same time. We will create two continuous streaming readers of our Delta Lake table to illustrate streaming progress. And we will add a batch query, for good measure, as sketched below.
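
As a preview, here is a minimal sketch of what that looks like in PySpark. It assumes a notebook where a Spark session with Delta Lake is already configured (as on Databricks, where spark is predefined); the table path /tmp/delta/loans, the checkpoint locations, and the built-in rate source standing in for real loan data are all illustrative.

    from pyspark.sql.functions import rand

    # Two concurrent streaming writers appending to the same Delta table.
    # The "rate" source stands in for real loan events here.
    def start_writer(checkpoint):
        return (spark.readStream.format("rate").load()
                .withColumn("loan_amnt", rand() * 10000)
                .writeStream.format("delta")
                .option("checkpointLocation", checkpoint)
                .start("/tmp/delta/loans"))

    writer_1 = start_writer("/tmp/delta/loans/_ckpt1")
    writer_2 = start_writer("/tmp/delta/loans/_ckpt2")

    # A continuous streaming reader of the very same table ...
    reader = (spark.readStream.format("delta").load("/tmp/delta/loans")
              .writeStream.format("memory").queryName("loans_stream").start())

    # ... and a plain batch query against it, all at the same time.
    print(spark.read.format("delta").load("/tmp/delta/loans").count())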

  • ACID transactions: So how is unified batch and streaming possible simultaneously? It is because of ACID transactions. Delta Lake uses a transaction log that serves as a master record of all changes made to each table. You can view the transaction log at any time by running the DESCRIBE HISTORY command (see the first snippet after this list).
  • Medallion architecture: Working with many customers in the data and AI space, I have found that many of them are able to simplify and streamline their data architectures using a tiered architectural approach. In Delta Lake terminology, we also call it the multi-hop architecture, which is composed of Bronze, Silver, and Gold tables (see the second sketch after this list).
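
Peeking at the transaction log takes one statement. A minimal sketch, assuming a Delta table registered as loans (the table name and path are illustrative):

    # Every commit to the table shows up here: version, timestamp,
    # operation (WRITE, MERGE, DELETE, ...), and operation metrics.
    spark.sql("DESCRIBE HISTORY loans").show(truncate=False)

    # The same works for a path-based table.
    spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/loans`").show(truncate=False)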
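
And here is a minimal sketch of a Bronze-to-Silver-to-Gold hop. The paths and column names (loan_id, loan_amnt, addr_state) are illustrative placeholders, not the workshop dataset's exact schema:

    from pyspark.sql import functions as F

    # Bronze: land the raw data as-is, with no transformations.
    (spark.read.json("/data/raw/loans")
        .write.format("delta").mode("append").save("/delta/bronze/loans"))

    # Silver: filtered, cleaned, and de-duplicated records.
    (spark.read.format("delta").load("/delta/bronze/loans")
        .dropDuplicates(["loan_id"])
        .filter(F.col("loan_amnt").isNotNull())
        .write.format("delta").mode("overwrite").save("/delta/silver/loans"))

    # Gold: business-level aggregates ready for analytics.
    (spark.read.format("delta").load("/delta/silver/loans")
        .groupBy("addr_state")
        .agg(F.sum("loan_amnt").alias("total_loan_amnt"))
        .write.format("delta").mode("overwrite").save("/delta/gold/loans_by_state"))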

How can you architect data lakes to be efficient for AI workloads?

  • Schema Enforcement: Delta Lake does a lot more than just use ACID transactions to combine batch and streaming. It also offers features like schema enforcement to protect the quality of the data in your Delta tables. Without schema enforcement, data with a mismatching schema can break your entire pipeline, causing cascading failures downstream (see the first sketch after this list).
  • Schema Evolution: In the event that we do need to change our table schema, we also need schema evolution. By using the mergeSchema option, we can quickly and easily evolve the schema of our Delta tables.
  • Time Travel: Another major feature we will look at is Delta's ability to travel back in time, also called data versioning or Time Travel. Because every change is recorded as an atomic transaction in the Delta transaction log, we can use this information to recreate the exact state of our table at any point in time.
  • Rollbacks, Reproducibility, and Governance: We will also explore use cases like rollbacks, governance, and scaling machine learning experiments. Time Travel helps you avoid making irreversible changes to your tables. Using RESTORE, we can completely undo changes and simply roll back to a previous version of our data (see the second sketch after this list). Taking Time Travel one step further makes your data sets and experiments reproducible and offers verifiable data lineage for audit and governance purposes.
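
Here is a minimal sketch of both schema behaviors, again with illustrative paths and columns:

    # Schema enforcement: appending a DataFrame whose schema does not match
    # the table is rejected with an exception instead of corrupting the data.
    new_data = spark.createDataFrame([(1, "TX")], ["loan_id", "state"])
    try:
        new_data.write.format("delta").mode("append").save("/delta/loans")
    except Exception as e:
        print(f"Rejected by schema enforcement: {e}")

    # Schema evolution: opt in explicitly, and the new column is added.
    (new_data.write.format("delta")
        .option("mergeSchema", "true")
        .mode("append").save("/delta/loans"))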
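
And a sketch of Time Travel and rollback. RESTORE assumes a Delta Lake release that supports it (recent open-source releases and Databricks do); the path is illustrative:

    # Time Travel: read the table exactly as it was at version 0.
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("/delta/loans"))

    # Timestamps work too.
    earlier = (spark.read.format("delta")
               .option("timestampAsOf", "2022-03-22")
               .load("/delta/loans"))

    # Rollback: rewind the live table to a previous version.
    spark.sql("RESTORE TABLE delta.`/delta/loans` TO VERSION AS OF 0")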

How to use time travel to roll back changes.

  • Full DML support: Another feature we will explore is the full support for transactional DML commands like UPDATE, MERGE, and DELETE. These are the SQL commands that make manipulating big data tables quick and easy with minimal code.
  • GDPR use case: Before Delta Lake, deleting a user's data from a data lake to comply with a GDPR request was difficult to perform without running the risk of data loss and corruption. With Delta Lake, we can delete a user's data transactionally in just one line of code.
  • Merge Operation: Finally, Delta Lake supports merge operations, which are a mix of inserts and updates (upserts). Normally, a merge is a difficult, expensive operation that involves several intermediate steps. With Delta Lake, we can skip all that complexity and simply use the MERGE command, as shown below.
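
A minimal sketch of both, using an illustrative loans table, a hypothetical loan_updates source of changed records, and a hypothetical user_id column:

    # GDPR: delete one user's data transactionally, in a single statement.
    spark.sql("DELETE FROM loans WHERE user_id = '12345'")

    # Upsert: update matching loans and insert new ones in one atomic MERGE.
    spark.sql("""
        MERGE INTO loans AS t
        USING loan_updates AS s
        ON t.loan_id = s.loan_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)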

  • Performance: Before wrapping up, I will also demonstrate some performance features of Delta Lake.
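
For instance, here is a sketch of two commonly demonstrated performance features, assuming a Databricks runtime or a Delta Lake release where OPTIMIZE and ZORDER are available; the table and column are illustrative:

    # Compact many small files into fewer, larger ones, and co-locate
    # rows that share addr_state values for faster data skipping.
    spark.sql("OPTIMIZE loans ZORDER BY (addr_state)")

    # Remove data files no longer referenced by the table,
    # keeping 7 days of history for Time Travel.
    spark.sql("VACUUM loans RETAIN 168 HOURS")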

Conclusion

Delta Lake is used in production by 5,000+ organizations to power their lakehouse reliably. This workshop is curated so that you can leave feeling good about getting started with Delta Lake and its benefits. We will also have a Q&A at the end to give you the opportunity to ask us questions.

I am very excited to see you and the data community at Data Council on March 23rd at 11 AM CST. Here's the link to the event: https://www.datacouncil.ai/austin.

If you are not attending the Data Council conference, I will provide the notebooks and useful links afterward. Also, if you would like to stay up to date with innovations in Delta Lake or want to contribute to the project, please reach the community on Slack, the Google Group, LinkedIn, or GitHub, or you can find us on YouTube doing an AMA or a community event. Thank you!




