This Month Data Engineering (September '24)

Nicolai Ernst

I am a Data Engineer by profession & Lufthanseat by passion, who can't live without his camera.

Published: September 26, 2024

This Month's Highlights

Databricks' publish to Power BI feature and its VS Code extension become GA
DuckDB, a column-oriented database used at Facebook, Google and Airbnb, releases v1.1.0
Uber highlights development on QueryGPT to convert natural language to SQL using GenAI

Technical Deep Dive

If you are a software engineer, I am pretty sure you are familiar with the software development lifecycle (SDLC). It is a widely understood and battle-tested framework that is also known as DevOps. In 2016, Tristan Handy wrote a blog post about bringing software engineering best practices into the work of data practitioners. Fast forward to today: the blog post has received an update that introduces the analytics development lifecycle (ADLC).

The framework applies to the entire analytical system, from ingestion through transformation to analysis and ultimately building applications on top. I highly recommend reading the blog post.

Tool of the Month

As a data engineer, encountering unknown datasets is quite common. In those situations, I consider creating a profile a good first step: understanding a dataset's shape, its data types, each column's minimum and maximum values, a correlation matrix and other key characteristics. If you're already using pandas for data manipulation, you're actually just one line of code away from creating a comprehensive report of your dataset, thanks to ydata-profiling from YData.
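To make this a bit more tangible, here is a minimal sketch. The sample data and the function name write_profile_report are made up for illustration. The first part is the manual first look with plain pandas, the function wraps the report call and assumes the ydata-profiling package is installed.

```python
import pandas as pd

# Hypothetical sample data standing in for an unknown dataset
df = pd.DataFrame(
    {
        "product": ["striped socks", "dotted socks", "celebrity tee", "celebrity tee"],
        "quantity": [3, 2, 1, 4],
        "price": [4.99, 4.99, 19.90, 19.90],
    }
)

# Manual first look with plain pandas
shape = df.shape                                   # rows and columns
dtypes = df.dtypes                                 # data type per column
min_max = df[["quantity", "price"]].agg(["min", "max"])
correlations = df[["quantity", "price"]].corr()    # correlation matrix

print(shape)
print(dtypes)
print(min_max)
print(correlations)


def write_profile_report(frame: pd.DataFrame, path: str = "profile.html") -> None:
    """Write an HTML profile report (assumes ydata-profiling is installed)."""
    # Imported lazily so the pandas-only part above runs without the package
    from ydata_profiling import ProfileReport

    # This is the "one line" the text refers to
    ProfileReport(frame, title="Dataset profile").to_file(path)
```

Calling write_profile_report(df) should produce an HTML file that covers the same characteristics as the manual checks, plus quite a few more.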

Also, you may want to use the gained knowledge as input to great_expectations, a tool to write unit tests on data. I covered it in last month's issue already.

Learning Corner

Amidst the numerous technologies that come and go, being proficient in SQL remains a crucial and enduring skill for a data engineer. As a newbie, I’d start with the basics such as

understanding databases
core SQL commands (select, from, where, insert, update, delete) and
simple operations (distinct, join),

progress to intermediate techniques like

aggregate functions (count, sum, avg),

and finally dive into advanced topics like

stored procedures
common table expressions (CTEs), and
window functions.

I must admit, I am a fan of row_number(), a window function that numbers the rows within each partition and helps me, for example, eliminate duplicates in a dataset. Unfortunately, like all window functions, it can't be used directly in the where clause, because window functions are evaluated after the where clause has been applied. That's why it's quite common to end up with a CTE or subquery.

Recently, I stumbled upon the qualify clause in Databricks, which filters directly on the result of a window function, much like a where clause that runs after the window functions have been evaluated.

Let’s imagine I am running a (not so) successful online shop and selling (not so) fancy socks and (ugly) t-shirts with celebrity faces on them. In a database table, I track all the quantities sold per category per product per day.

As I have little to no storage at home, I want to identify the poor-selling products per category by seeing at a glance when each product was last sold. Using the row_number() window function, I can approach this by:

Approach using Common-Table-Expression (CTE)
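The original query is not reproduced here, so the following is a sketch of what the CTE approach could look like. The table name sales, its columns and the sample rows are assumptions for illustration. It runs against an in-memory SQLite database (version 3.25 or newer for window functions), simply because that ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category TEXT, product TEXT, sale_date TEXT, quantity INTEGER)"
)
# Hypothetical sample data: quantities sold per category per product per day
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("socks", "striped socks", "2024-09-01", 3),
        ("socks", "striped socks", "2024-09-20", 1),
        ("socks", "dotted socks", "2024-06-15", 2),
        ("t-shirts", "celebrity tee", "2024-03-02", 1),
        ("t-shirts", "celebrity tee", "2024-01-10", 4),
    ],
)

LAST_SALE_CTE = """
WITH ranked AS (
    SELECT
        category,
        product,
        sale_date,
        quantity,
        -- number the rows per category and product, newest sale first
        ROW_NUMBER() OVER (
            PARTITION BY category, product
            ORDER BY sale_date DESC
        ) AS rn
    FROM sales
)
SELECT category, product, sale_date, quantity
FROM ranked
WHERE rn = 1            -- keep only the most recent sale
ORDER BY category, product
"""

last_sales = conn.execute(LAST_SALE_CTE).fetchall()
for row in last_sales:
    print(row)
```

The window function is computed inside the CTE, and only the outer query can filter on rn. That extra level is what the qualify clause removes.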

Leveraging the qualify clause, the whole query can be simplified to:
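The original query is not reproduced here, so this is a sketch of the simplified version, again assuming a hypothetical sales table with the columns category, product, sale_date and quantity. The query is only kept as a string, because SQLite, which ships with Python, does not support qualify.

```python
# Sketch of the same "last sale per product" query using QUALIFY.
# Table and column names are assumptions for illustration.
QUALIFY_QUERY = """
SELECT category, product, sale_date, quantity
FROM sales
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY category, product
    ORDER BY sale_date DESC
) = 1
"""

# In a Databricks notebook this could be run with: spark.sql(QUALIFY_QUERY)
print(QUALIFY_QUERY)
```

Neither a CTE nor a helper column is needed anymore: the window function moves into the qualify clause, and the filter sits right next to it.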

Industry Insights & Trends

Klarna is a Swedish fintech company that started as a payment solution for online shopping and is often associated with buy now, pay later. It has since evolved into a neobank of sorts, offering banking services like card management and balances and serving over 100 million consumers globally. In November 2022, Klarna's CEO Sebastian Siemiatkowski discovered ChatGPT on Twitter and resolved to pursue this technology further by encouraging people internally to lean in and try it. As employees raised a lot of concerns regarding the usage of internal data in the early days, focusing on a GDPR-compliant solution turned out to be key to success.

Also, deprecating Salesforce in favor of an internal knowledge graph based on Neo4j allowed Klarna to come up with a chatbot that is capable of seamlessly accessing all data. Klarna's example shows that access to data is crucial for success: without comprehensive and well-integrated data, GenAI models can't achieve their full potential, which makes it essential to break down silos and ensure data is available across the organization.

I highly recommend listening to the podcast or at least reading the summary.

While Klarna's success story highlights the potential of GenAI in transforming business operations, it's important to note that not all companies have achieved such smooth integration. The transition from traditional systems to AI-powered solutions often comes with its own set of challenges and considerations.

According to the 2024 CDOIQ conference report, for example, only a handful of applications have made it into production. Many companies are still grappling with essential challenges related to GenAI: successful applications require a robust foundation that ensures high data quality and strong governance. In a recent article, McKinsey covered seven actions a company should pursue in the realm of data to scale its GenAI ambitions. Key takeaways: focus on data quality and build data engineering talent. Want to learn more about the other five actions? I recommend reading the full article, The data dividend: Fueling generative AI.

Community Spotlight

Apache SeaTunnel is an open-source, distributed data integration platform designed for efficient data ingestion and real-time data processing. Recently, the framework incorporated support for LLMs to process data.

When pursuing data engineering projects, I typically start by evaluating the project's scope, complexity, and scalability requirements. For small-scale projects or proofs of concept where the data volume is manageable and the requirements are straightforward, I favor starting from scratch with essential libraries like pandas over extensive frameworks.

For larger-scale projects with high data volume and complexity, Apache Spark, Airflow or Apache SeaTunnel can save development time and improve productivity. And with the integration of LLMs, enriching data with the help of GenAI is fairly simple now. For example, incoming customer feedback could automatically be linked to the department it concerns in a data pipeline thanks to OpenAI's structured outputs.
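To illustrate the idea, here is a minimal sketch of such an enrichment step. It does not call any real LLM API: fake_llm is a hypothetical stand-in for the model call, and the department names, the schema and the sample feedback are assumptions as well. The point is the shape of the step, where a JSON Schema with an enum restricts the answer to known departments so that the result can be joined downstream.

```python
import json

# Hypothetical departments; a real list would come from the organization
DEPARTMENTS = ["billing", "shipping", "product", "other"]

# JSON Schema that a structured answer would have to satisfy
FEEDBACK_SCHEMA = {
    "type": "object",
    "properties": {"department": {"type": "string", "enum": DEPARTMENTS}},
    "required": ["department"],
    "additionalProperties": False,
}


def fake_llm(feedback: str) -> str:
    """Hypothetical stand-in for an LLM call, returns a JSON string."""
    text = feedback.lower()
    if "invoice" in text or "charged" in text:
        department = "billing"
    elif "delivery" in text or "parcel" in text:
        department = "shipping"
    else:
        department = "other"
    return json.dumps({"department": department})


def enrich(record: dict, llm=fake_llm) -> dict:
    """Attach a department to a feedback record, falling back to 'other'."""
    answer = json.loads(llm(record["feedback"]))
    department = answer.get("department")
    allowed = FEEDBACK_SCHEMA["properties"]["department"]["enum"]
    if department not in allowed:
        department = "other"
    return {**record, "department": department}


incoming = [
    {"id": 1, "feedback": "I was charged twice for my socks."},
    {"id": 2, "feedback": "My parcel arrived three weeks late."},
    {"id": 3, "feedback": "The celebrity on my t-shirt looks odd."},
]
enriched = [enrich(record) for record in incoming]
for record in enriched:
    print(record)
```

In a real pipeline, fake_llm would be replaced by the actual model call, with the schema passed along so that the model can only answer with one of the listed departments.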

Upcoming Events

coalesce — built by data people, for data people

October 7 - 10, Las Vegas, USA
Agenda at a glance

Microsoft Ignite

November 19 - 22, Chicago, USA
Agenda at a glance

AWS re:Invent

December 2 - 6, 2024, Las Vegas, USA
Areas of focus

Stay tuned for the next issue. Happy engineering!
