GroupBy #9: FDAP stack, Iceberg and Hudi ACID Guarantees, Data Driven Management
Plus: Uber data analytics side project, dbt learning resources
NOTE
This issue was originally published in the GroupBy newsletter.
GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.
So, if you find my work valuable and want to receive a weekly issue, subscribe here:
Side Project
40+ hours of debugging and you still want some more?
To get your hands dirty (some more), this week I bring you a project:
Uber Data Analytics | End-To-End Data Engineering Project
Click here | Darshil Parmar
The goal of this project is to perform data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.
Suggestions from me to make life harder:
Self-learn data modeling: concepts like SCD types, the Kimball data modeling approach, the differences between the Kimball and Inmon approaches,…
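If "SCD type" is new to you, here is a minimal sketch of the Type 2 idea (keep history by closing the old row and opening a new one instead of overwriting). It is only an illustration: the `DriverRow` class, the `dim_driver`-style columns, and the `apply_scd2` helper are names I made up, not something taken from the project above.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DriverRow:
    """One version of a driver in a hypothetical dimension table."""
    driver_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still the current version"

    @property
    def is_current(self) -> bool:
        return self.valid_to is None


def apply_scd2(dim: List[DriverRow], driver_id: int,
               new_city: str, change_date: date) -> List[DriverRow]:
    """Return a new dimension list with an SCD Type 2 change applied."""
    current = next(
        (r for r in dim if r.driver_id == driver_id and r.is_current), None
    )
    # Nothing changed: keep the dimension as it is.
    if current is not None and current.city == new_city:
        return list(dim)

    result = []
    for row in dim:
        if row is current:
            # Close the old version instead of overwriting it.
            result.append(replace(row, valid_to=change_date))
        else:
            result.append(row)
    # Open a new current version starting on the change date.
    result.append(DriverRow(driver_id, new_city, change_date))
    return result


dim = [DriverRow(1, "Austin", date(2023, 1, 1))]
dim = apply_scd2(dim, 1, "Boston", date(2023, 6, 1))
for row in dim:
    print(row)
```

A Type 1 change would simply overwrite `city` and lose the history; comparing the two on the same data is a good first exercise.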
Learning resource
I love to learn, and I assume you do too.
dbt is a popular tool for transforming and modeling data.
Learning dbt is essential for streamlining data processes, ensuring data quality, and accelerating analytics development, making it a valuable skill for anyone involved in data analysis and management.
Here are some (FREE) learning resources:
dbt Fundamentals
Click HERE
Learn the Fundamentals of dbt including modeling, sources, testing, documentation, and deployment. (approximately 5 hours)
Jinja, Macros, Packages
Click HERE
Extend the functionality of dbt with Jinja/macros and leverage models and macros from packages. (approximately 2 hours)
Advanced Materializations
Click HERE
Learn about the advanced materializations built into dbt Core - ephemeral models, incremental models, and snapshots. (approximately 2 hours)
Refactoring SQL for Modularity
Click HERE
Learn with the analytics engineers of dbt Labs how to migrate legacy transformation code into modular dbt data models. Useful if you're porting stored procedures or SQL scripts into your dbt project. (approximately 3.5 hours)
Advanced Testing
Click HERE
Learn more about the theory of data testing and the practice of creating custom generic tests, leveraging tests in packages, and applying test configurations. (approximately 4 hours).
Approximately 16.5 hours for you to understand that “dbt is not just a SQL generator”.
Thanks for scrolling this far (not so far)! Subscribe to my weekly newsletter: vutr.substack.com in case you want to scroll my newsletter right in your mailbox :D
Engineering
Engineering is the practice of using natural science, mathematics, and the engineering design process to solve technical problems, increase efficiency and productivity, and improve systems. — wikipedia
Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0
Click HERE | Andrew Lamb
Iceberg and Hudi ACID Guarantees┆Tabular
In this post, I make the case that Iceberg is reliable and Apache Hudi is not.
Running Unified PubSub Client in Production at Pinterest
Click HERE | Pinterest Engineering
In a distributed PubSub environment, complexities related to client-server communication can often be hard blockers for application developers, and solving them often requires a joint investigation between the application and platform teams.
How we Built the Ingestion Framework┆OpenMetadata
Click HERE | Pere Miquel Brull
Without metadata, there is no discovery, collaboration, or quality testing. The ingestion process is a requirement that unlocks the rest of the features, and we are constantly pushing for improvements.
Scheduling Jupyter Notebooks at Meta
Click HERE | Steve Dini
At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals.
Data
The one thing that this job has taught me is that truth is stranger than fiction.
Data Driven Management: The Why, Who, What and How?
Click HERE | janmeskens
Going All-In On Data Quality
Click HERE | Matthew Weingarten
A principle that I think is useful to follow when it comes to data quality is the idea of staging tables
Data Quality ≠ Data Trust: Bridging the Data Trust Gap
A broken pipeline. A source system gone down. A change made to a column name. Three unique root causes, but the same end result: broken trust.
The Clash Between Data Quality and AI: Unisphere’s Latest Findings
Click HERE | Sydney Blanchard
Data quality issues have been a looming threat for any and all enterprises, often surfaced by the proliferation of new data analytics and AI projects that, incidentally, rely on good data to succeed.
5 Signs That Your Data is Modeled Poorly
Click HERE | Matthew Gazzano
To be able to model your team's data properly, you need to be able to conceptualize relevant business entities and organize them in a way that is conducive to common questions asked within your organization.
AI┆ML┆Data Science
You know, Burke, I don’t know which species is worse.
The architecture of today’s LLM applications┆GitHub
Click HERE | Nicole Choi
What I’m Reading on the Rise of Artificial Intelligence
Click HERE | Barack Obama
…I wanted to share some of the books, articles, and podcasts that have helped shape my perspective over the past year. This list offers a range of viewpoints on the threats, opportunities, and challenges posed by AI and some thoughtful ideas on how to respond.
AI ‘breakthrough’: neural net has human-like ability to generalize language
Click HERE | Max Kozlov & Celeste Biever
Scientists have created a neural network with the human-like ability to make generalizations about language.
Harvard professor Lawrence Lessig on why AI and social media are causing a free speech crisis for the internet
Click HERE | Nilay Patel
After 30 years teaching law, the internet policy legend is as worried as you’d think about AI and TikTok — and he has surprising thoughts about balancing free speech with protecting democracy.
Catch up
…Next Saturday night, we're sending you back to the future!
Airflow┆Release of Airflow 2.7.3
Click HERE
BigQuery┆Work with text analyzers
Click HERE
Spark┆Arrow-optimized Python UDFs in Apache Spark™ 3.5
Click HERE
Google Cloud┆Cloud Functions now supports the Python 3.12 runtime.
Click HERE
Snowflake┆Search Optimization: Support for Substring Search in Semi-Structured Data
Click HERE
The next section contains my own writing. Don't blame me if you feel distressed after reading this; you chose to read it, although you can skip it without thinking twice.
It will steal 97 seconds from you
Random thoughts, ideas.
The hardest truth I’ve learned as a data engineer is this: No matter how fancy your pipeline or infrastructure is, if your data foundation doesn't have the ability to support the business, everything you do is just garbage.
You put in all your effort to deliver an internal tool to support analytics, but nobody uses it.
Your tool is garbage.
You tune your SQL script to run 2.5x faster, but the data output is “wrong” and leads to “really bad” decisions.
Your SQL script is garbage.
The lesson here: whatever you do, if you want it to bring value (so that you can lead a meaningful life), make sure it helps your “customer” solve problems.
Put yourself in your customer’s shoes.
Before developing an internal tool, sit down and talk to your DAs and DSs.
When developing a data pipeline, talk to the business to help define “constraints” and “rules” to control the quality and correctness of your data.
So, to apply this lesson and save this newsletter from being garbage…
…I need you…
…yes, you, the “customers” of this newsletter.
I need your feedback on which aspects I should improve and what you expect from this newsletter, to help me grow as a DE.
I will adjust my work.
Promise. (Unless your ideas are too “wild”.)
Switching the context from “your DE work is garbage if …” to “I need your feedback” is… weird.
“Hasta la vista, baby”
-T800, Terminator 2: Judgment Day (1991)
Before you leave...
I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.
Every week I compile these resources into a GroupBy newsletter, which I first publish on Substack.
Then, I deliver it again on LinkedIn to make it more accessible to all of you.
So, if you want to learn and grow with me, subscribe to my Substack here:
That will motivate me a lot.