GroupBy #13: Explaining Kubernetes To My Uber Driver, Data Modelling For Data Engineers
Plus: Data Engineering Design Patterns Book Release, Reddit DE project
NOTE
This issue was first published in the GroupBy newsletter.
Original issue: Here
GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.
So, if you find my work valuable and want to receive a weekly issue, subscribe here:
Book
If you are a Data Engineer, don’t skip this book. That’s all I want to say.
Side Project
40+ hours of debugging and you still want some more?
In this article, we’ll walk through the process of creating a data pipeline that fetches data from Reddit, uses Apache Airflow for orchestration, stores the data in Amazon S3, processes it with AWS Glue, queries with Amazon Athena, and finally, loads it into Amazon Redshift for analysis.
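The stage order described above can be sketched as a small dependency graph. This is not the linked article's code: the stage names are hypothetical labels, and the standard-library `graphlib` module stands in for the scheduling that Apache Airflow would do in the real pipeline.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names. Each stage maps to the set of stages
# that must finish before it can start.
PIPELINE = {
    "extract_reddit": set(),
    "upload_to_s3": {"extract_reddit"},
    "process_with_glue": {"upload_to_s3"},
    "query_with_athena": {"process_with_glue"},
    "load_to_redshift": {"query_with_athena"},
}


def run_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(run_order(PIPELINE), start=1):
        print(step, stage)
```

In an orchestrator, each of these names would be a task and each dependency an edge between tasks; the sketch only shows how the ordering falls out of the declared dependencies.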
Learning resource
I love to learn, and I assume you do too.
All you need to learn ML in 2024 is a laptop and a list of the steps you need to take.
Engineering
I have to believe in a world outside my own mind. — Memento (2000)
A study of Google's code review tooling (Critique), AI-powered improvements, and recent statistics
DoorDash’s Engineering teams revamped Kafka Topic creation by replacing a Terraform/Atlantis based approach with an in-house API, Infra Service. This has reduced real-time pipeline onboarding time by 95% and saved countless developer hours.
This blog post addresses two different subjects:
TL;DR: OneTable provides a seamless way to interoperate between different table formats by translating table format metadata.
Data
The one thing that this job has taught me is that truth is stranger than fiction. — Predestination (2014)
Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time.
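As a rough illustration of what "understanding the state of a pipeline" can mean in practice, here is a minimal sketch of two common checks, freshness and volume. The function name, thresholds, and alert labels are all assumptions made for this example and do not come from the linked guide.

```python
from datetime import datetime, timedelta


def check_run(
    last_success: datetime,
    row_count: int,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed freshness threshold
    min_rows: int = 1,                         # assumed volume threshold
) -> list[str]:
    """Return alert labels for a pipeline's latest run; empty means healthy."""
    alerts = []
    # Freshness: has the pipeline succeeded recently enough?
    if now - last_success > max_age:
        alerts.append("stale")
    # Volume: did the last run produce at least the expected number of rows?
    if row_count < min_rows:
        alerts.append("low_volume")
    return alerts
```

Real observability tooling tracks many more signals (schema changes, lineage, distribution drift); these two checks are only the simplest starting point.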
The definitive guide for beginners
In this post, we’ll explore some of these challenges in detail, offering insights into how they can be effectively managed to ensure your MDM strategy delivers the most value.
In this piece, we examine the Data Quality Maturity Curve—a representation of how data quality works itself out at different stages of your organizational and analytical maturity…
This is the consumer-defined data contract. The consumer-defined contract is created by the owners of data applications, with requirements derived from their needs and use cases.
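A consumer-defined contract can be pictured as a list of field requirements that the consuming team writes down and checks incoming records against. The sketch below is a hypothetical, minimal version: `FieldRule`, `violations`, and the `ORDERS_CONTRACT` fields are invented for illustration and are not taken from the linked post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """One requirement the consumer places on a field."""
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical contract written by the owners of a downstream data application.
ORDERS_CONTRACT = [
    FieldRule("order_id", int),
    FieldRule("amount", float),
    FieldRule("coupon_code", str, nullable=True),
]


def violations(record: dict, contract: list[FieldRule]) -> list[str]:
    """Return a human-readable list of ways the record breaks the contract."""
    problems = []
    for rule in contract:
        if rule.name not in record:
            problems.append(f"missing field: {rule.name}")
            continue
        value = record[rule.name]
        if value is None:
            if not rule.nullable:
                problems.append(f"null not allowed: {rule.name}")
            continue
        if not isinstance(value, rule.dtype):
            problems.append(f"wrong type for {rule.name}")
    return problems
```

The point of the design is that the requirements live with the consumer, so a producer change that breaks them shows up as an explicit violation instead of a silent downstream failure.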
AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse. — Ripley, Aliens (1986)
After a few years and with the hype gone, it has become apparent that MLOps overlap more with Data Engineering than most people believed.
Semantic layers provide both a knowledge graph and a constrained interface for an LLM.
Learn how we're experimenting with generative AI models to extend GitHub Copilot across the developer lifecycle.
Catch up
…Next Saturday night, we're sending you back to the future! — Dr. Emmett Brown, Back to the Future (1985)
→ Everyone is excited but the truth is …
“Hasta la vista, baby”
-T800, Terminator 2: Judgment Day (1991)
Before you leave...
I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.
I compile these resources every week into the GroupBy newsletter, which I publish first on Substack.
Then I republish it on LinkedIn to make it more accessible to all of you.
So, if you want to learn and grow with me, subscribe to my Substack here:
Subscribing will motivate me a lot.