Plus: Uber data analytics side project, dbt learning resources

NOTE

This issue was originally published in the GroupBy newsletter.

GroupBy is the place where I compile valuable data engineering resources for you to learn and grow.

So, if you find my work valuable and want to receive a weekly issue, subscribe here:

vutr.substack.com

Side Project

40+ hours of debugging and you still want some more?

To get your hands dirty (even more), this week I bring you a project:

Uber Data Analytics | End-To-End Data Engineering Project

By Darshil Parmar

The goal of this project is to perform data analytics on Uber data using various tools and technologies, including GCP Storage, Python, Compute Instance, Mage Data Pipeline Tool, BigQuery, and Looker Studio.

My suggestions to make life harder:

Teach yourself data modeling: concepts such as SCD (slowly changing dimension) types, the Kimball data modeling approach, the differences between the Kimball and Inmon approaches,…
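If SCD types are new to you, here is a minimal sketch of Type 2 (keep history by closing the old row and adding a new one). Everything in it is an assumption made for illustration: the `DimRow` shape, the `apply_scd2` helper, and the sample customer data are hypothetical and do not come from the Uber project.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a customer in a hypothetical dimension table."""
    customer_id: int          # natural (business) key
    city: str                 # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]  # None means "this is the current version"


def apply_scd2(history: list[DimRow], customer_id: int, city: str, as_of: date) -> list[DimRow]:
    """Return a new history with an SCD Type 2 change applied."""
    current = next(
        (r for r in history if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return list(history)  # attribute unchanged: no new version needed

    # Close the current version (if any), then append the new current version.
    updated = [replace(r, valid_to=as_of) if r is current else r for r in history]
    updated.append(DimRow(customer_id, city, valid_from=as_of, valid_to=None))
    return updated


# Illustrative data only.
history = apply_scd2([], 1, "Hanoi", date(2023, 1, 1))
history = apply_scd2(history, 1, "Da Nang", date(2023, 6, 1))
for row in history:
    print(row)
```

Type 1 would simply overwrite `city` in place and lose the old value; Type 2 trades extra rows for the ability to answer "where did this customer live on a given date?".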

Learning resources

I love to learn, and I assume you do too.

dbt is a popular tool for transforming and modeling data.

Learning dbt is essential for streamlining data processes, ensuring data quality, and accelerating analytics development, making it a valuable skill for anyone involved in data analysis and management.

Here are some (FREE) learning resources:

dbt Fundamentals

Learn the Fundamentals of dbt including modeling, sources, testing, documentation, and deployment. (approximately 5 hours)

Jinja, Macros, Packages

Extend the functionality of dbt with Jinja/macros and leverage models and macros from packages. (approximately 2 hours)

Advanced Materializations

Learn about the advanced materializations built into dbt Core - ephemeral models, incremental models, and snapshots. (approximately 2 hours)

Refactoring SQL for Modularity

Learn with the analytics engineers of dbt Labs how to migrate legacy transformation code into modular dbt data models. Useful if you're porting stored procedures or SQL scripts into your dbt project. (approximately 3.5 hours)

Advanced Testing

Learn more about the theory of data testing and the practice of creating custom generic tests, leveraging tests in packages, and applying test configurations. (approximately 4 hours)

That is approximately 16.5 hours for you to understand that “dbt is not just a SQL generator”.

Thanks for scrolling this far (not so far)! Subscribe to my weekly newsletter at vutr.substack.com if you want to scroll through it right in your mailbox :D

Engineering

Engineering is the practice of using natural science, mathematics, and the engineering design process to solve technical problems, increase efficiency and productivity, and improve systems.

- Wikipedia

Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0

By Andrew Lamb

Iceberg and Hudi ACID Guarantees┆Tabular

By Ryan Blue

In this post, I make the case that Iceberg is reliable and Apache Hudi is not.

Running Unified PubSub Client in Production at Pinterest

By Pinterest Engineering

In a distributed PubSub environment, complexities related to client-server communication can often be hard blockers for application developers, and solving them often requires a joint investigation between the application and platform teams.

How we Built the Ingestion Framework┆OpenMetadata

By Pere Miquel Brull

Without metadata, there are no discovery, collaboration, or quality tests. The ingestion process is a requirement that unlocks the rest of the features, and we are constantly pushing for improvements.

Scheduling Jupyter Notebooks at Meta

By Steve Dini

At Meta, Bento is our internal Jupyter notebooks platform that is leveraged by many internal users. Notebooks are also being used widely for creating reports and workflows (for example, performing data ETL) that need to be repeated at certain intervals.

Data

The one thing that this job has taught me is that truth is stranger than fiction.

Data Driven Management: The Why, Who, What and How?

By janmeskens

Going All-In On Data Quality

By Matthew Weingarten

A principle that I think is useful to follow when it comes to data quality is the idea of staging tables.

Data Quality ≠ Data Trust: Bridging the Data Trust Gap

By Prukalpa

A broken pipeline. A source system gone down. A change made to a column name. Three unique root causes, but the same end result: broken trust.

The Clash Between Data Quality and AI: Unisphere’s Latest Findings

By Sydney Blanchard

Data quality issues have been a looming threat for any and all enterprises, often surfaced by the proliferation of new data analytics and AI projects that, incidentally, rely on good data to succeed.

5 Signs That Your Data is Modeled Poorly

By Matthew Gazzano

To be able to model your team's data properly, you need to be able to conceptualize relevant business entities and organize them in a way that is conducive to common questions asked within your organization.

AI┆ML┆Data Science

You know, Burke, I don’t know which species is worse.

The architecture of today’s LLM applications┆GitHub

By Nicole Choi

What I’m Reading on the Rise of Artificial Intelligence

By Barack Obama

…I wanted to share some of the books, articles, and podcasts that have helped shape my perspective over the past year. This list offers a range of viewpoints on the threats, opportunities, and challenges posed by AI and some thoughtful ideas on how to respond.

AI ‘breakthrough’: neural net has human-like ability to generalize language

By Max Kozlov & Celeste Biever

Scientists have created a neural network with the human-like ability to make generalizations about language.

Harvard professor Lawrence Lessig on why AI and social media are causing a free speech crisis for the internet

By Nilay Patel

After 30 years teaching law, the internet policy legend is as worried as you’d think about AI and TikTok — and he has surprising thoughts about balancing free speech with protecting democracy.

Catch up

…Next Saturday night, we're sending you back to the future!

Airflow┆Release of Airflow 2.7.3

BigQuery┆Work with text analyzers

Spark┆Arrow-optimized Python UDFs in Apache Spark™ 3.5

Google Cloud┆Cloud Functions now supports the Python 3.12 runtime.

Snowflake┆Search Optimization: Support for Substring Search in Semi-Structured Data

The next section contains my own writing. Don't blame me if you feel distressed after reading it; you chose to read it, although you can skip it without thinking twice.

It will steal 97 seconds from you

Random thoughts, ideas.

The hardest truth I’ve learned as a data engineer is this: No matter how fancy your pipeline or infrastructure is, if your data foundation doesn't have the ability to support the business, everything you do is just 💩.

You put in all your effort to deliver an internal tool to support analytics, but nobody uses it.

Your tool is 💩.

You tune your SQL script to run 2.5x faster, but the data output is “wrong” and leads to “really bad” decisions.

Your SQL script is 💩.

The lesson here: whatever you do, if you want it to bring value (so that you can lead a meaningful life), make sure it helps your “customer” solve problems.

Put yourself in your customer’s shoes.

Before developing an internal tool, sit down and talk to your DAs and DSs.

When developing a data pipeline, talk to the business to help define “constraints” and “rules” to control the quality and correctness of your data.
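To make “constraints” and “rules” concrete, here is a minimal sketch of how rules agreed with the business could be written down and enforced in a pipeline step. The rule names, the column names (`trip_id`, `fare`, `passengers`), and the `validate` helper are all hypothetical; the real rules are whatever your business tells you.

```python
from typing import Any, Callable

Row = dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical rules; in practice each one comes out of a conversation with the business.
RULES: dict[str, Rule] = {
    "trip_id_is_present": lambda r: r.get("trip_id") is not None,
    "fare_is_non_negative": lambda r: isinstance(r.get("fare"), (int, float)) and r["fare"] >= 0,
    "passengers_in_range": lambda r: isinstance(r.get("passengers"), int) and 1 <= r["passengers"] <= 8,
}


def validate(rows: list[Row]) -> tuple[list[Row], list[tuple[Row, list[str]]]]:
    """Split rows into valid rows and rejected rows paired with the rules they broke."""
    valid: list[Row] = []
    rejected: list[tuple[Row, list[str]]] = []
    for row in rows:
        broken = [name for name, rule in RULES.items() if not rule(row)]
        if broken:
            rejected.append((row, broken))
        else:
            valid.append(row)
    return valid, rejected


# Illustrative rows only.
valid, rejected = validate([
    {"trip_id": 1, "fare": 12.5, "passengers": 2},
    {"trip_id": None, "fare": -3.0, "passengers": 2},
])
print(len(valid), "valid;", [broken for _, broken in rejected])
```

Keeping the rejected rows together with the names of the broken rules gives you something to bring back to the business, instead of silently dropping data.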

So, to apply this lesson and save this newsletter from being 💩…

…I need you…

…yes, you, the “customers” of this newsletter.

I need your feedback on what I should improve and what you expect from this newsletter, to help me grow as a DE.

(Leave it in the comment section, or contact me directly by email or LinkedIn.)

I will adjust my work.

Promise. (Unless your ideas are too “wild”.)

Switching context from “your DE work is 💩 if…” to “I need your feedback” is… weird.

“Hasta la vista, baby”

-T800, Terminator 2: Judgment Day (1991)

Before you leave...

I love learning from people who are smarter and more experienced than me by consuming their data engineering resources on the Internet.

I compile these resources every week into the GroupBy newsletter, which I publish first on Substack.

Then, I deliver it again on LinkedIn to make it more accessible to all of you.

So, if you want to learn and grow with me, subscribe to my Substack here:

vutr.substack.com

That will motivate me a lot.
