This Month Data Engineering (September '24)
Nicolai Ernst
I am a Data Engineer by profession & Lufthanseat by passion, who can’t live without his camera.
This Month's Highlights
Technical Deep Dive
As a software engineer, you are probably familiar with the software development lifecycle (SDLC), a widely understood and battle-tested framework closely associated with DevOps. In 2016, Tristan Handy wrote a blog post about bringing software engineering best practices into the work of data practitioners. Fast forward to today: the post has been updated and now introduces the analytics development lifecycle (ADLC).
The framework applies to the entire analytical system, from ingestion through transformation to analysis and ultimately building applications on top. I highly recommend reading the blog post.
Tool of the Month
As a data engineer, I frequently encounter unknown datasets. In those situations, I consider creating a profile a good first step: understanding the dataset’s shape, its data types, each column’s minimum and maximum values, a correlation matrix and other key characteristics. If you’re already using pandas for data manipulation, you’re just one line of code away from a comprehensive report of your dataset, thanks to ydata-profiling.
You may also want to use the knowledge gained as input to great_expectations, a tool for writing unit tests on data, which I already covered in last month’s issue.
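To make the "one line" concrete, here is a minimal sketch. The sample DataFrame and the helper names `quick_profile` and `write_full_report` are made up for illustration. The first helper gathers a few of the characteristics mentioned above with plain pandas. The second wraps the ydata-profiling call and assumes the package is installed (`pip install ydata-profiling`).

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Collect a few basic characteristics using plain pandas."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "min": numeric.min().to_dict(),
        "max": numeric.max().to_dict(),
        "correlation": numeric.corr(),
    }


def write_full_report(df: pd.DataFrame, path: str = "profile.html") -> None:
    """Write the full HTML report; needs the optional ydata-profiling package."""
    from ydata_profiling import ProfileReport  # imported lazily on purpose

    # This is the "one line" that produces the comprehensive report.
    ProfileReport(df, title="Dataset profile").to_file(path)


# Hypothetical sample data, only for illustration.
sales = pd.DataFrame(
    {
        "product": ["socks", "socks", "t-shirt"],
        "quantity": [3, 5, 1],
        "price": [4.99, 4.99, 19.99],
    }
)

profile = quick_profile(sales)
print(profile["shape"], profile["min"], profile["max"])
```

Calling `write_full_report(sales)` would write `profile.html` to the working directory. The plain-pandas helper is only a quick fallback and covers far less than the generated report.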
Learning Corner
Amidst the numerous technologies that come and go, being proficient in SQL remains a crucial and enduring skill for a data engineer. As a newbie, I’d start with the basics such as
progress to intermediate techniques like
and finally dive into advanced topics like
I must admit, I am a fan of row_number(), a window function that helps me, for example, partition a dataset and effectively eliminate duplicates. Unfortunately, like all window functions, it can’t be used directly in the where clause, so it’s quite common to end up with a CTE or subquery.
Recently, I stumbled upon the qualify clause in Databricks, which filters on the result of a window function directly, much as where filters on regular columns.
Let’s imagine I am running a (not so) successful online shop and selling (not so) fancy socks and (ugly) t-shirts with celebrity faces on them. In a database table, I track all the quantities sold per category per product per day.
As I have little to no storage at home, I want to identify the poor-selling products in each category by seeing at a glance when each product was last sold. Using the row_number() window function, I can approach this as follows:
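A minimal sketch of such a query, run from Python against an in-memory SQLite database so it is easy to try (window functions need SQLite 3.25 or newer). The table name `sales`, its columns and the sample rows are assumptions for illustration, not the shop's real schema.

```python
import sqlite3

# Hypothetical schema and rows, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table sales (sale_date text, category text, product text, quantity integer)"
)
conn.executemany(
    "insert into sales values (?, ?, ?, ?)",
    [
        ("2024-09-01", "socks", "socks_a", 2),
        ("2024-09-03", "socks", "socks_a", 1),
        ("2024-08-15", "socks", "socks_b", 4),
        ("2024-07-20", "t-shirts", "shirt_a", 1),
        ("2024-09-02", "t-shirts", "shirt_a", 3),
    ],
)

# Rank each product's sales from newest to oldest inside a CTE,
# then keep only the newest row (rn = 1) in the outer query.
LAST_SALE_CTE = """
with ranked as (
    select
        category,
        product,
        sale_date,
        row_number() over (
            partition by category, product
            order by sale_date desc
        ) as rn
    from sales
)
select category, product, sale_date
from ranked
where rn = 1
order by category, sale_date
"""

last_sales = conn.execute(LAST_SALE_CTE).fetchall()
for row in last_sales:
    print(row)
```

Sorting by `sale_date` within each category puts the products that have gone longest without a sale at the top.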
Using the qualify clause, the whole query can be simplified to:
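A sketch of what the simplified query might look like, again assuming a hypothetical `sales` table with `sale_date`, `category` and `product` columns. It is kept as a Python string and only printed: it is meant for Databricks SQL, and SQLite, for instance, has no qualify clause.

```python
# Intended for Databricks SQL; not executed here.
# Table and column names are hypothetical.
LAST_SALE_QUALIFY = """
select category, product, sale_date
from sales
qualify row_number() over (
    partition by category, product
    order by sale_date desc
) = 1
order by category, sale_date
"""

print(LAST_SALE_QUALIFY.strip())
```

The window function moves into the qualify clause, so neither the CTE nor the helper column `rn` is needed.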
Industry Insights & Trends
Klarna is a Swedish fintech company that started as a payment solution for online shopping and is often associated with buy now, pay later. It has since evolved into something like a neobank, offering banking services such as card management and balances and serving over 100 million consumers globally. In November 2022, Klarna’s CEO Sebastian Siemiatkowski discovered ChatGPT on Twitter and resolved to pursue the technology by encouraging people internally to lean in and try it. Because employees raised many concerns about the use of internal data in the early days, focusing on a GDPR-compliant solution turned out to be key to success.
Also, deprecating Salesforce in favor of an internal knowledge graph based on Neo4j allowed Klarna to build a chatbot that can seamlessly access all data. Klarna’s example shows that access to data is crucial for success: without comprehensive and well-integrated data, GenAI models can’t achieve their full potential, which makes it essential to break down silos and make data available across the organization.
I highly recommend listening to the podcast or at least reading the summary.
While Klarna's success story highlights the potential of GenAI in transforming business operations, it's important to note that not all companies have achieved such smooth integration. The transition from traditional systems to AI-powered solutions often comes with its own set of challenges and considerations.
For example, according to the 2024 CDOIQ conference report, only a handful of applications have made it into production. Many companies are still grappling with essential GenAI challenges: successful applications require a robust foundation that ensures high data quality and strong governance. In a recent article, McKinsey covered seven actions a company should pursue in the realm of data to scale its GenAI ambitions. Key takeaways: focus on data quality and build data engineering talent. Want to learn more about the other five actions? I recommend reading the full article, The data dividend: Fueling generative AI.
Community Spotlight
Apache SeaTunnel is an open-source, distributed data integration platform designed for efficient data ingestion and real-time data processing. Recently, the framework incorporated support for LLMs to process data.
When pursuing data engineering projects, I typically start by evaluating the project’s scope, complexity, and scalability requirements. For small-scale projects or proofs of concept where the data volume is manageable and the requirements are straightforward, I favor starting from scratch with essential libraries like pandas over extensive frameworks.
For larger-scale projects with high data volume and complexity, Apache Spark, Airflow or Apache SeaTunnel can save development time and improve productivity. With LLM integration, enriching data with the help of GenAI has become fairly simple. For example, a data pipeline could automatically link incoming customer feedback to the department it concerns, thanks to OpenAI’s structured outputs.
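The feedback-routing idea could be sketched roughly like this. The department names, the schema and the helpers `build_response_format` and `route_feedback` are all hypothetical. The `response_format` layout reflects my understanding of OpenAI's structured outputs and should be checked against the current API reference. The model call itself is left out, and a canned reply stands in for the model's answer so the sketch runs offline.

```python
import json

# Hypothetical department names, only for illustration.
DEPARTMENTS = ["billing", "shipping", "product", "other"]

# JSON Schema describing the reply the model is asked to produce.
FEEDBACK_SCHEMA = {
    "type": "object",
    "properties": {
        "department": {"type": "string", "enum": DEPARTMENTS},
        "reason": {"type": "string"},
    },
    "required": ["department", "reason"],
    "additionalProperties": False,
}


def build_response_format() -> dict:
    """Build the response_format argument (assumed layout, verify against the docs)."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "feedback_routing",
            "strict": True,
            "schema": FEEDBACK_SCHEMA,
        },
    }


def route_feedback(feedback: dict, model_reply: str) -> dict:
    """Attach the department from the model's JSON reply to a feedback record."""
    parsed = json.loads(model_reply)
    department = parsed.get("department")
    if department not in DEPARTMENTS:
        # Defensive fallback in case the reply does not match the schema.
        department = "other"
    return {**feedback, "department": department}


# Canned reply standing in for a real model response.
reply = '{"department": "shipping", "reason": "Mentions a late parcel."}'
record = route_feedback({"id": 1, "text": "My socks arrived two weeks late."}, reply)
print(record)
```

Restricting `department` to an enum is what makes the enrichment safe to use downstream: the pipeline can join on the value without cleaning up free-text answers first.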
Upcoming Events
coalesce — built by data people, for data people
Microsoft Ignite
AWS re:Invent
Stay tuned for the next issue. Happy engineering!