GroupBy #16: Uber's Anomaly Detection & Alerting System, many layers of data lineage
Plus: Data modeling side project, Data Engineer roadmap 2024.
NOTE
This issue was first published in the GroupBy newsletter.
Original issue: HERE
GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.
So, if you find my work valuable and want to receive a weekly issue, subscribe here:
It will steal 37 seconds from you
NEWSLETTER UPDATE.
FOR READERS WHO HAVE ALREADY SUBSCRIBED:
THIS UPDATE WILL NOT AFFECT YOUR READING EXPERIENCE OR THE NUMBER OF EMAILS YOU RECEIVE EACH WEEK.
You will still receive only ONE EMAIL EVERY WEEK:
The GROUPBY WEEKLY issue.
(like the one you’re reading)
From the beginning of 2024, I will launch a sub-newsletter that will co-exist with this newsletter. This means my newsletter will contain two sub-newsletters:
Subscribers who subscribed:
Subscribers have control over which newsletters they want to receive:
Side Project
40+ hours of debugging and you still want some more?
This project's central goal is creating a structured database design that includes a central table of facts and the required dimension tables to establish connections between different elements. This will enable meaningful comparisons and analysis.
I am always looking for a data modeling project. Finally, I found one.
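To make the "central table of facts plus dimension tables" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the linked project's actual schema: the table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# In-memory database; every table, column, and row below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

-- The fact table stores measurements plus keys pointing at each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 40.0);
""")


def sales_by_category_and_year(connection):
    """Join the fact table to its dimensions and aggregate the amounts."""
    return connection.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()


rows = sales_by_category_and_year(conn)
for category, year, total in rows:
    print(category, year, total)
```

Because the facts only hold keys and numbers, comparing sales across categories or years is just a matter of joining to the relevant dimension and grouping.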
Learning resource
I love to learn, and I assume you do too.
In this blog, we'll reveal the layers of the ultimate roadmap, guiding eager newcomers through the essential skills that define data engineering.
I agree with most steps in this roadmap; I would just add data modeling and dbt to it.
Engineering
I have to believe in a world outside my own mind. — Memento (2000)
I've heard a lot about Avro, Parquet, ORC, Arrow and Feather, but I also keep hearing about Iceberg and Delta Lake. As a "database person", I’ve been struggling to understand all of these different things, and how they relate to Data Lakes and Data Lakehouses (and what exactly are these?). So, I’ve decided to study them, and consolidate my knowledge in writing.
In this post, we'll explain how we built our RU (rolling update) framework to power a frictionless deployment experience on a large-scale Hadoop cluster, achieving a >99% success rate free from interruptions or downtime and reducing significant toil for our SRE and Dev teams.
But what about the long tail of issues that lurk in the shadows, sometimes remaining undetected until they cause chaos? For these, traditional strategies may not suffice.
In this blogpost, we shared a few challenges that we encountered while aiming to achieve reliability at scale at Adyen with Airflow.
In this article, I wish to share with you the ten most valuable lessons I've learned as a Kubernetes cluster manager.
Data
The one thing that this job has taught me is that truth is stranger than fiction. — Predestination (2014)
Super Tables (ST) are pre-computed, denormalized, and consistently consolidated attributes and insights of entities or events that are optimized for common and efficient analytic use cases.
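As a rough illustration of what "pre-computed and denormalized" can mean in practice, the sketch below materializes one wide row per user from two normalized tables, again with Python's sqlite3. This is only an assumption about the general pattern, not the linked article's implementation; the users, orders, and user_super_table names and the sample data are hypothetical.

```python
import sqlite3

# In-memory database; all names and rows here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);

INSERT INTO users VALUES (1, 'VN'), (2, 'US');
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 20.0);

-- Pre-compute one wide, denormalized row per user so that analysts
-- can query a single table instead of repeating the join.
CREATE TABLE user_super_table AS
SELECT u.user_id,
       u.country,
       COUNT(o.order_id) AS order_count,
       COALESCE(SUM(o.amount), 0) AS total_spent
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.user_id
GROUP BY u.user_id, u.country;
""")

super_rows = conn.execute(
    "SELECT user_id, country, order_count, total_spent "
    "FROM user_super_table ORDER BY user_id"
).fetchall()
for row in super_rows:
    print(row)
```

In a real pipeline a table like this would be rebuilt on a schedule, so the trade-off is freshness versus cheaper, simpler reads.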
...I wanted to provide some tips to help those either in leadership positions or who want to break into these positions plan out their data roadmap for 2024.
In this post we’ll discuss how we can learn from the field of cartography and Google Maps to extract the untapped potential of data lineage, and build this ideal interface to improve data literacy and observability.
In this blog, we will discuss the higher-level design and usage of Data Access Level, how it fits in within the overall data platform ecosystem, and share some observations and lessons learned.
AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse. — Ripley, Aliens (1986)
Andrej Karpathy
And so now, we return to the original question that took us down this long and winding path - should we even care about connecting enterprise data to natural language queries by LLMs?
If I was to summarize the goal of this article, it's that we're going to learn to light a campfire with a lighter (GPT2) and not a flamethrower (GPT3.5).
This blog post delves into the learnings and challenges on our journey towards implementing and scaling state-of-the-art deep learning approaches. We’ll shed light on how to use the newest machine-learning approaches in a controlled and reliable manner.
Airbnb had a significant presence at KDD 2023 with two papers accepted into the main conference proceedings and 11 talks and presentations. In this blog post, we’ll summarize our team’s contributions and share highlights from an exciting week of research talks, workshops, panel discussions, and more.
This article will be an exploration of prompt techniques we’ve used for our internal productivity tooling at Instacart.
Before you leave...
I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.
Every week, I compile these resources into the GroupBy newsletter, which I first publish on Substack.
Then, I deliver it again on LinkedIn to make it more accessible to all of you.
So, if you want to learn and grow with me, subscribe to my Substack here:
That will motivate me a lot.
“Hasta la vista, baby”
-T800, Terminator 2: Judgment Day (1991)