GroupBy #11: Python at Meta, Netflix Incremental Processing with Apache Iceberg, 2023 AI year in brief
Plus: No-cost Generative AI courses, data streaming pipeline project
NOTE
This issue was originally published at the GroupBy newsletter.
GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.
So, if you find my work valuable and want to receive a weekly issue, subscribe here:
Hope this issue finds you well.
Side Project
40+ hours of debugging and you still want more?
To get your hands dirty (even more), this week I bring you a project:
In this guide, we’ll delve deep into constructing a robust data pipeline, leveraging a combination of Kafka for data streaming, Spark for processing, Airflow for orchestration, Docker for containerization, S3 for storage, and Python as our primary scripting language.
Airflow┆Kafka┆Zookeeper┆Kafka Connect┆Schema Registry┆Spark
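To see the shape of the flow before wiring up the real stack, here is a minimal, pure-Python sketch of the same produce → process → store stages. It is only an illustration of the flow, not the project's code: the stdlib queue stands in for Kafka, the transform for Spark, and a plain dict for S3.

```python
import json
import queue

# Toy stand-ins for the real components: the queue plays Kafka's role,
# the transform plays Spark's, and the sink dict plays S3's.
events = queue.Queue()

def produce(records):
    """Producer: serialize records and push them onto the stream."""
    for r in records:
        events.put(json.dumps(r))

def process(raw):
    """Processor: parse and enrich each event (Spark's job in the real pipeline)."""
    rec = json.loads(raw)
    rec["name_upper"] = rec["name"].upper()
    return rec

def run_pipeline(sink):
    """Orchestrator: drain the stream, process, and write to the sink."""
    while not events.empty():
        rec = process(events.get())
        sink[rec["id"]] = rec

produce([{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}])
storage = {}
run_pipeline(storage)
print(storage[1]["name_upper"])  # ADA
```

In the guide, each of these stages becomes its own containerized service, with Airflow triggering the run instead of a function call.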
Learning resources
If the world ends up like The Terminator, we should prepare some knowledge about our enemy, right? (just kidding)
Resources from Microsoft and Google Cloud for you to get started in the world of Generative AI:
From Microsoft:
A 12-lesson course teaching everything you need to know to start building Generative AI applications
…
From Google Cloud:
These will help you gain critical skills as generative AI becomes more widely available.
Introduction to Generative AI and Large Language Models
Engineering
I have to believe in a world outside my own mind. — Memento (2000)
This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you would use them, and how they work.
Jun He, Yingyi Zhang, and Pawan Dixit
We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg.
How Meta’s Python Foundation Team works to improve the developer experience of everyone working with Python at Meta; Fixit 2, Meta’s recently open-sourced linter framework; and what exactly the role of production engineer at Meta entails.
Backfill is the process of filling in missing data from the past on a new table that didn’t exist before, or replacing old data with new records.
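The backfill idea can be sketched in a few lines of plain Python: recomputed rows are merged into a table, filling in dates that were missing and replacing stale values for dates that already exist. This is only an illustration on dicts; in a real warehouse the same merge happens via partition overwrites or MERGE statements, and the field names here are made up.

```python
# Backfill sketch: merge recomputed historical rows into a table,
# inserting missing days and overwriting stale ones (keyed by date).
existing = {
    "2023-01-02": {"date": "2023-01-02", "clicks": 10},  # stale value
    "2023-01-03": {"date": "2023-01-03", "clicks": 7},
}
recomputed = [
    {"date": "2023-01-01", "clicks": 4},   # previously missing day
    {"date": "2023-01-02", "clicks": 12},  # corrected value
]

def backfill(table, rows, key="date"):
    """Insert missing rows and overwrite rows that share a key."""
    for row in rows:
        table[row[key]] = row
    return table

backfill(existing, recomputed)
print(sorted(existing))  # ['2023-01-01', '2023-01-02', '2023-01-03']
```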
In this article, I plan to cover the basic idea of how objects (i.e., the data types) are implemented and represented within CPython. If you look at the CPython code, you will see a lot of references to PyObject, which plays a central role in the implementation of objects in CPython.
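You can glimpse the two header fields every PyObject carries (a type pointer and a reference count) from Python itself, without reading any C:

```python
import sys

x = [1, 2, 3]

# Every Python value is a PyObject under the hood: it knows its own type...
print(type(x))  # <class 'list'>

# ...and carries a reference count. (getrefcount itself holds one extra
# temporary reference while it inspects the object.)
before = sys.getrefcount(x)
y = x  # binding a new name adds one reference to the same PyObject
after = sys.getrefcount(x)
print(after - before)  # 1
```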
Data
The one thing that this job has taught me is that truth is stranger than fiction.
Unfortunately for the developers of semantic layers, there is an ever-expanding set of technologies that customers expect to integrate with. One of my colleagues recently remarked, “No one said it was going to be easy,” and while I agree with him, there is something we can adopt from other areas of technology with competing implementations: standardization.
My belief is that Data Contracts are the key to building a production-grade Data Warehouse and breaking the silo between data producers and data consumers. But what exactly is a data contract and why would you need one?
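At its simplest, a data contract is a schema the producer commits to and validates against before records reach consumers. A minimal sketch of that idea follows; the field names and types are illustrative, and real implementations typically use JSON Schema, Avro, or protobuf rather than hand-rolled checks.

```python
# A data contract, at its simplest: the producer declares the schema,
# and every record is validated before it is published to consumers.
ORDERS_CONTRACT = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def validate(record, contract):
    """Reject records that break the contract instead of passing them on."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = {"order_id": 1, "amount": 9.99, "currency": "USD"}
bad = {"order_id": "1", "amount": 9.99}

print(validate(good, ORDERS_CONTRACT))  # []
print(validate(bad, ORDERS_CONTRACT))   # ['order_id: expected int', 'missing field: currency']
```

The point is less the validation code than the agreement: producers can change anything that keeps the contract satisfied, and consumers can rely on it.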
Trying to define something that needs definition but has a history that can't be changed easily.
…Many data issues are manually detected by users weeks or even months after they start. Data regressions are hard to catch because the most impactful ones are generally silent. They do not impact metrics and ML models in an obvious way until someone notices something is off, which finally unearths the data issue.
As a data engineer, I always feel less confident about the quality of the data I handle than about the quality of the code I write. Code, at least, I can run interactively and test before deploying to production. With data, I most often have to wait for it to flow through the system and be used before quality issues surface.
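One way to narrow that gap is to treat data like code and run assertions on each batch before anything downstream consumes it. A minimal sketch, with made-up checks and thresholds:

```python
# Minimal data-quality checks, run on a batch before downstream use.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 29},
    {"user_id": 3, "age": 41},
]

def check_batch(batch):
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []
    if not batch:
        failures.append("batch is empty")
        return failures
    ids = [r["user_id"] for r in batch]
    if len(ids) != len(set(ids)):
        failures.append("duplicate user_id")
    if any(r["age"] is None or not (0 <= r["age"] <= 120) for r in batch):
        failures.append("age out of range")
    return failures

print(check_batch(rows))  # []
```

Tools such as Great Expectations or dbt tests industrialize this pattern, but even hand-written checks like these catch silent regressions far earlier than a user report would.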
AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse.
This article is a brief recap of the most interesting trends and events that defined 2023.
With the boom in generative AI, the size of foundational large language models (LLMs) has grown exponentially, utilizing hundreds of billions of parameters and trillions of training tokens.
Daniel Parker and Amit Chudasma
At Dropbox, AI-powered tools and features are quickly transforming the way our customers find, organize, and understand their data. Dropbox Dash brings AI-powered universal search to all your apps, browser tabs, and cloud docs, while Dropbox AI can summarize and answer questions about the content of your files.
How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.
At Netflix, we want our viewers to easily find TV shows and movies that resonate and engage. Our creative team helps make this happen by designing promotional artwork that best represents each title featured on our platform. What if we could use machine learning and computer vision to support our creative team in this process?
Catch up
…Next Saturday night, we're sending you back to the future!
OneTable┆Microsoft and Google join forces on OneTable, an open-source solution for data lake challenges
It will steal 7 seconds from you
Random thoughts, ideas.
I'm drowning in deadlines.
(Trying to save my annual performance review.)
So, I will leave you guys alone this week and will be back blabbing next time.
“Hasta la vista, baby”
-T800, Terminator 2: Judgment Day (1991)
Before you leave...
I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.
These resources are compiled every week in the form of the GroupBy newsletter, which I first publish on Substack.
Then, I deliver it again on LinkedIn to make it more accessible to all of you.
So, if you want to learn and grow with me, subscribe to my Substack here:
It will motivate me a lot.