GroupBy #11: Python at Meta, Netflix Incremental Processing with Apache Iceberg, 2023 AI year in brief

Plus: No-cost Generative AI courses, data streaming pipeline project


NOTE

This issue was originally published in the GroupBy newsletter.

GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.

So, if you find my work valuable and want to receive a weekly issue, subscribe here:

vutr.substack.com


I hope this issue finds you well.

Side Project

40+ hours of debugging and you still want some more?

To get your hands dirty (some more), this week I bring you a project:

Building a Data Streaming Pipeline: Leveraging Kafka, Spark, Airflow, and Docker

Simardeep Singh

In this guide, we’ll delve deep into constructing a robust data pipeline, leveraging a combination of Kafka for data streaming, Spark for processing, Airflow for orchestration, Docker for containerization, S3 for storage, and Python as our primary scripting language.
[Architecture diagram: Airflow, Kafka, Zookeeper, Kafka Connect, Schema Registry, Spark. From the author's original post.]

Learning resource

If the world ends up like The Terminator, we should arm ourselves with knowledge about our enemy, right? (just kidding)

Here are resources from Microsoft and Google Cloud to help you get started in the world of Generative AI:

From Microsoft:

Generative AI for Beginners - A Course

A 12-lesson course teaching everything you need to know to start building Generative AI applications

Introduction to Generative AI and LLMs

Exploring and comparing different LLMs

Using Generative AI Responsibly

Prompt Engineering Fundamentals

Creating Advanced prompts

Building Text Generation Applications

From Google Cloud:

Seven new no-cost generative AI training courses to advance your cloud career

These will help you gain critical skills as generative AI becomes more widely available.

Introduction to Generative AI and Large Language Models

Attention Mechanism

Transformer Models and BERT Model

Introduction to Image Generation

Create Image Captioning Models

Encoder-Decoder Architecture


Engineering

I have to believe in a world outside my own mind. — Memento (2000)

What is an Open Table Format? & Why to use one?

Joseph Machado | startdataengineering

This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you use them, and how they work.

Incremental Processing using Netflix Maestro and Apache Iceberg

Jun He, Yingyi Zhang, and Pawan Dixit

We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg.

Writing and linting Python at scale (Meta)

Pascal Hartig

How Meta’s Python Foundation Team works to improve the developer experience of everyone working with Python at Meta; Fixit 2, Meta’s recently open-sourced linter framework; and what exactly the role of production engineer at Meta entails.

Demystify Data Backfilling

Xiaoxu Gao

Backfill is the process of filling in missing data from the past on a new table that didn’t exist before, or replacing old data with new records.
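
To make the definition concrete, here is a minimal sketch of a date-partitioned backfill in Python. It is my own illustration, not code from the article: the in-memory `warehouse` dict, the `load_partition` function, and the date range are all hypothetical stand-ins for a real table and a real extract step.

```python
from datetime import date, timedelta

# Hypothetical in-memory "table", keyed by partition date.
warehouse: dict[date, list[dict]] = {}

def load_partition(day: date) -> list[dict]:
    """Hypothetical extract step: recompute one day's records from the source."""
    return [{"day": day.isoformat(), "value": day.day}]

def backfill(start: date, end: date) -> int:
    """Rebuild every daily partition in [start, end], overwriting old data.

    Overwriting whole partitions keeps the job idempotent: rerunning it
    for the same range leaves the table in the same state.
    """
    day, written = start, 0
    while day <= end:
        warehouse[day] = load_partition(day)  # replace, never append
        written += 1
        day += timedelta(days=1)
    return written

n = backfill(date(2023, 12, 1), date(2023, 12, 7))
print(n, len(warehouse))
```

Replacing whole partitions instead of appending is one common design choice, because running the same backfill twice then cannot create duplicates.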

CPython Object System Internals: Understanding the Role of PyObject

Abhinav Upadhyay

In this article, I plan to cover a basic idea behind how objects (or the data types) are implemented and represented within CPython. If you look at the CPython code, you will see a lot of references to PyObject; it plays a central role in the implementation of objects in CPython.

Data

The one thing that this job has taught me is that truth is stranger than fiction.

The Need for an Open Standard for the Semantic Layer (Cube)

Artyom Keydunov, Brian Bickell

Unfortunately for the developers of semantic layers, there is an ever-expanding set of technologies that customers expect to integrate with. One of my colleagues recently remarked “No one said it was going to be easy” and while I agree with him, there is something we can adopt from other areas of technology with competing implementations: standardization.

The Rise of Data Contracts

Chad Sanderson

My belief is that Data Contracts are the key to building a production-grade Data Warehouse and breaking the silo between data producers and data consumers. But what exactly is a data contract and why would you need one?
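
As a rough illustration of the idea, a data contract can start as a schema that the producer promises to honor and the consumer can check. The sketch below is my own and not from the article: the `ORDERS_CONTRACT` fields and the `violations` helper are hypothetical, and a real contract usually covers more than types (semantics, SLAs, ownership).

```python
# Hypothetical contract for an "orders" dataset: field name -> expected type.
ORDERS_CONTRACT = {"order_id": int, "amount": float, "currency": str}

def violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems

good = {"order_id": 1, "amount": 9.99, "currency": "USD"}
bad = {"order_id": "1", "amount": 9.99}

print(violations(good, ORDERS_CONTRACT))  # no violations
print(violations(bad, ORDERS_CONTRACT))   # wrong type and a missing field
```

Running a check like this on the producer side, before data lands in the warehouse, is what lets a broken record fail fast instead of surfacing weeks later in a dashboard.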

Tracking/Measurement/Collection/Creation - what was the question again?

Timo Dechau

Trying to define something that needs definition but has a history that can't be changed easily.

D3: An Automated System to Detect Data Drifts

Uber Engineering Blog

…Many data issues are manually detected by users weeks or even months after they start. Data regressions are hard to catch because the most impactful ones are generally silent. They do not impact metrics and ML models in an obvious way until someone notices something is off, which finally unearths the data issue.

Why is data quality harder than code quality? (Airbyte)

Ari Bajo Rouvinen

As a data engineer, I always feel less confident about the quality of data I handle than the quality of code I write. Code, at least, I can run it interactively and write tests before deploying to production. Data, I most often have to wait for it to flow through the system and be used to encounter data quality issues.

AI┆ML┆Data Science

You know, Burke, I don’t know which species is worse.

The 2023 AI year in brief

Salvatore Raieli

This article is a brief recap of the most interesting trends and events that defined 2023.

Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips

Rajesh Anantharaman

With the boom in generative AI, the size of foundational large language models (LLMs) has grown exponentially, utilizing hundreds of billions of parameters and trillions of training tokens.

From AI to sustainability, why our latest data centers use 400G networking (Dropbox)

Daniel Parker and Amit Chudasma

At Dropbox, AI-powered tools and features are quickly transforming the way our customers find, organize, and understand their data. Dropbox Dash brings AI-powered universal search to all your apps, browser tabs, and cloud docs, while Dropbox AI can summarize and answer questions about the content of your files.

Wisdom of Unstructured Data: Building Airbnb’s Listing Knowledge from Big Text Data

Hongwei Harvey Li

How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.

Causal Machine Learning for Creative Insights

Billur Engin, Yinghong Lan, Grace Tang, Cristina Segalin, Kelli Griggs, Vi Iyengar

At Netflix, we want our viewers to easily find TV shows and movies that resonate and engage. Our creative team helps make this happen by designing promotional artwork that best represents each title featured on our platform. What if we could use machine learning and computer vision to support our creative team in this process?

Catch up

…Next Saturday night, we're sending you back to the future!

OneTable: Microsoft and Google join forces on OneTable, an open-source solution for data lake challenges

Soda: Releases OSS Data Contract Engine

Kafka: The marriage of Parquet and Kafka

Flink: Now generally available for Amazon EMR on EKS

dbt: dbt Cloud is now available for Microsoft Fabric


It will steal 7 seconds from you

Random thoughts, ideas.

I'm drowning in deadlines.

(Trying to save my annual performance review.)

So, I will leave you guys alone this week and will be back blabbing next time.


“Hasta la vista, baby”

-T800, Terminator 2: Judgment Day (1991)


Before you leave...

I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.

I compile these resources every week into the GroupBy newsletter, which I publish first on Substack.

Then, I deliver it again on LinkedIn to make it more accessible to all of you.

So, if you want to learn and grow with me, subscribe to my Substack here:

vutr.substack.com

It will motivate me a lot.



