GroupBy #11: Python at Meta, Netflix Incremental Processing with Apache Iceberg, 2023 AI year in brief
Plus: No-cost Generative AI courses, data streaming pipeline project
NOTE
This issue was originally published at the GroupBy newsletter.
GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.
So, if you find my work valuable and want to receive a weekly issue, subscribe here:
Hope this issue finds you well.
Side Project
40+ hours of debugging and you still want more?
To get your hands dirty (even more), this week I bring you a project:
In this guide, we’ll delve deep into constructing a robust data pipeline, leveraging a combination of Kafka for data streaming, Spark for processing, Airflow for orchestration, Docker for containerization, S3 for storage, and Python as our primary scripting language.
Airflow┆Kafka┆Zookeeper┆Kafka Connect┆Schema Registry┆Spark
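To see the shape of the flow before wiring up the real stack, here is a minimal, pure-Python sketch of the same produce → process → store stages. It is only an illustration of the flow, not the project's code: the stdlib queue stands in for Kafka, the transform for Spark, and a plain dict for S3.

```python
import json
import queue

# Toy stand-ins for the real components: the queue plays Kafka's role,
# the transform plays Spark's, and the sink dict plays S3's.
events = queue.Queue()

def produce(records):
    """Producer: serialize records and push them onto the stream."""
    for r in records:
        events.put(json.dumps(r))

def process(raw):
    """Processor: parse and enrich each event (Spark's job in the real pipeline)."""
    rec = json.loads(raw)
    rec["name_upper"] = rec["name"].upper()
    return rec

def run_pipeline(sink):
    """Orchestrator: drain the stream, process, and write to the sink."""
    while not events.empty():
        rec = process(events.get())
        sink[rec["id"]] = rec

produce([{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}])
storage = {}
run_pipeline(storage)
print(storage[1]["name_upper"])  # ADA
```

In the guide, each of these stages becomes its own containerized service, with Airflow triggering the run instead of a function call.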
Learning resources
If the world ends up like The Terminator, we should prepare some knowledge about our enemy, right? (just kidding)
Resources from Microsoft and Google Cloud for you to get started in the world of Generative AI:
From Microsoft:
A 12-lesson course teaching everything you need to know to start building Generative AI applications
…
From Google Cloud:
These will help you gain critical skills as generative AI becomes more widely available.
Introduction to Generative AI and Large Language Models
Engineering
I have to believe in a world outside my own mind. — Memento (2000)
This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you would use them, and how they work.
Jun He, Yingyi Zhang, and Pawan Dixit
We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg.
How Meta’s Python Foundation Team works to improve the developer experience of everyone working with Python at Meta; Fixit 2, Meta’s recently open-sourced linter framework; and what exactly the role of production engineer at Meta entails.
Backfill is the process of filling in missing data from the past on a new table that didn’t exist before, or replacing old data with new records.
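The backfill idea can be sketched in a few lines of plain Python: recomputed rows are merged into a table, filling in dates that were missing and replacing stale values for dates that already exist. This is only an illustration on dicts; in a real warehouse the same merge happens via partition overwrites or MERGE statements, and the field names here are made up.

```python
# Backfill sketch: merge recomputed historical rows into a table,
# inserting missing days and overwriting stale ones (keyed by date).
existing = {
    "2023-01-02": {"date": "2023-01-02", "clicks": 10},  # stale value
    "2023-01-03": {"date": "2023-01-03", "clicks": 7},
}
recomputed = [
    {"date": "2023-01-01", "clicks": 4},   # previously missing day
    {"date": "2023-01-02", "clicks": 12},  # corrected value
]

def backfill(table, rows, key="date"):
    """Insert missing rows and overwrite rows that share a key."""
    for row in rows:
        table[row[key]] = row
    return table

backfill(existing, recomputed)
print(sorted(existing))  # ['2023-01-01', '2023-01-02', '2023-01-03']
```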
In this article, I plan to cover the basic idea of how objects (i.e., the data types) are implemented and represented within CPython. If you look at the CPython code, you will see a lot of references to PyObject, which plays a central role in the implementation of objects in CPython.
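You can glimpse the two header fields every PyObject carries (a type pointer and a reference count) from Python itself, without reading any C:

```python
import sys

x = [1, 2, 3]

# Every Python value is a PyObject under the hood: it knows its own type...
print(type(x))  # <class 'list'>

# ...and carries a reference count. (getrefcount itself holds one extra
# temporary reference while it inspects the object.)
before = sys.getrefcount(x)
y = x  # binding a new name adds one reference to the same PyObject
after = sys.getrefcount(x)
print(after - before)  # 1
```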
Data
The one thing that this job has taught me is that truth is stranger than fiction.
Unfortunately for the developers of semantic layers, there is an ever-expanding set of technologies that customers expect to integrate with. One of my colleagues recently remarked, “No one said it was going to be easy,” and while I agree with him, there is something we can adopt from other areas of technology with competing implementations: standardization.
My belief is that Data Contracts are the key to building a production-grade Data Warehouse and breaking the silo between data producers and data consumers. But what exactly is a data contract and why would you need one?
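At its simplest, a data contract is a schema the producer commits to and validates against before records reach consumers. A minimal sketch of that idea follows; the field names and types are illustrative, and real implementations typically use JSON Schema, Avro, or protobuf rather than hand-rolled checks.

```python
# A data contract, at its simplest: the producer declares the schema,
# and every record is validated before it is published to consumers.
ORDERS_CONTRACT = {
    "order_id": int,
    "amount": float,
    "currency": str,
}

def validate(record, contract):
    """Reject records that break the contract instead of passing them on."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = {"order_id": 1, "amount": 9.99, "currency": "USD"}
bad = {"order_id": "1", "amount": 9.99}

print(validate(good, ORDERS_CONTRACT))  # []
print(validate(bad, ORDERS_CONTRACT))   # ['order_id: expected int', 'missing field: currency']
```

The point is less the validation code than the agreement: producers can change anything that keeps the contract satisfied, and consumers can rely on it.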
Trying to define something that needs definition but has a history that can't be changed easily.
…Many data issues are manually detected by users weeks or even months after they start. Data regressions are hard to catch because the most impactful ones are generally silent. They do not impact metrics and ML models in an obvious way until someone notices something is off, which finally unearths the data issue.
As a data engineer, I always feel less confident about the quality of the data I handle than about the quality of the code I write. Code, at least, I can run interactively and test before deploying to production. With data, I most often have to wait for it to flow through the system and be used before quality issues surface.
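One way to narrow that gap is to treat data like code and run assertions on each batch before anything downstream consumes it. A minimal sketch, with made-up checks and thresholds:

```python
# Minimal data-quality checks, run on a batch before downstream use.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 29},
    {"user_id": 3, "age": 41},
]

def check_batch(batch):
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []
    if not batch:
        failures.append("batch is empty")
        return failures
    ids = [r["user_id"] for r in batch]
    if len(ids) != len(set(ids)):
        failures.append("duplicate user_id")
    if any(r["age"] is None or not (0 <= r["age"] <= 120) for r in batch):
        failures.append("age out of range")
    return failures

print(check_batch(rows))  # []
```

Tools such as Great Expectations or dbt tests industrialize this pattern, but even hand-written checks like these catch silent regressions far earlier than a user report would.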
AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse.
This article is a brief recap of the most interesting trends and events that defined 2023.
With the boom in generative AI, the size of foundational large language models (LLMs) has grown exponentially, utilizing hundreds of billions of parameters and trillions of training tokens.
Daniel Parker and Amit Chudasma
At Dropbox, AI-powered tools and features are quickly transforming the way our customers find, organize, and understand their data. Dropbox Dash brings AI-powered universal search to all your apps, browser tabs, and cloud docs, while Dropbox AI can summarize and answer questions about the content of your files.
How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.
At Netflix, we want our viewers to easily find TV shows and movies that resonate and engage. Our creative team helps make this happen by designing promotional artwork that best represents each title featured on our platform. What if we could use machine learning and computer vision to support our creative team in this process?
Catch up
…Next Saturday night, we're sending you back to the future!
OneTable┆Microsoft and Google join forces on OneTable, an open-source solution for data lake challenges
It will steal 7 seconds from you
Random thoughts, ideas.
I'm drowning in deadlines.
(Trying to save my annual performance review.)
So, I will leave you guys alone this week and will be back blabbing next time.
“Hasta la vista, baby”
-T800, Terminator 2: Judgment Day (1991)
Before you leave...
I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.
These resources are compiled every week in the form of the GroupBy newsletter, which I first publish on Substack.
Then, I deliver it again on LinkedIn to make it more accessible to all of you.
So, if you want to learn and grow with me, subscribe to my Substack here:
It will motivate me a lot.