Software Engineering vs Data Engineering, parallels and differences - Part 1


Coming from a software background, and having since built data warehouses, I see parallels to all the familiar software development topics in data work, but with a completely different methodology. Some of this may seem obvious to some readers, but it comes up often enough that I'll share here why not to use some common software practices for data warehouse pipelines, and what to use instead. This first part lays out the concept along with a firsthand story.


More and more software engineers are building data pipelines. Just as you can tell from reading code whether someone's strongest language is Java or Python purely from the coding patterns, there are even clearer traces of prior experience when someone switches to building a data warehouse. I think of it as set-based combinations versus classes, libraries, and iterative code.

There isn't necessarily a right or wrong way without considering the team and business objectives. When I was learning object-oriented development, it was open ended how to build the relationships between objects, so I read Design Patterns to see what had worked in the industry. Similarly, I find it valuable to share data warehouse techniques that are established and known to work, for those coming from a software background.


Data Modeling is System Design.

Data modeling is a form of system design. When interviewing for senior and staff level software engineering roles, greater importance is placed on having the experience to design a cohesive system that addresses the business goals and to decompose it into something that can be delivered. Data engineers carry the same expectation. You don't have to be senior to data model; anyone can learn how to do it, yet senior roles come with the expectation of experience in how to data model.


My first SQL Optimization - Aha moment (8 hrs -> 12 min)

At the beginning of my data journey, I had to fix a slow stored procedure calculation that took over 8 hours. It calculated all of the commissions for the company, with lots of logic for the various things someone can get credit for. This was before distributed databases. The general structure was:

1. a while loop called a stored procedure for each store, passing the store_id as a parameter
2. inside that stored procedure, another loop called a second stored procedure for each commission type, passing the store and commission ids as parameters
3. in that second, inner stored procedure, the data was deleted, inserted, and selected

From a software perspective, these stored procedures are equivalent to functions. Functions are abstractions commonly used to modularize code into something specific and reusable. The time complexity evaluation for this in a system design interview would be O(n*m). Or in plain English: those SQL statements will run as many times as the number of stores multiplied by the number of commission types. Not only that, each iteration typically re-reads the same data, wastefully.
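To make the shape of the problem concrete, here is a rough sketch of that looped structure translated into PySpark terms. The original was stored procedures on a single-node database, and the table and column names below are made up for illustration, but the n*m passes over the data are the same idea.

# Looped anti-pattern sketch: one filtered pass over the data per
# (store, commission type) pair, so the same rows are read again and again.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_sales = spark.table("sales")  # assumed source table
stores = [r.store_id for r in spark.table("stores").select("store_id").collect()]
commission_types = [r.commission_id for r in
                    spark.table("commission_types").select("commission_id").collect()]

for store_id in stores:                      # n iterations
    for commission_id in commission_types:   # m iterations
        df_slice = df_sales.filter(
            (df_sales.store_id == store_id) &
            (df_sales.commission_id == commission_id)
        )
        # commission logic for this one slice, then write its result
        df_slice.write.mode("append").saveAsTable("commission_results")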

The Fix

I removed the loops and replaced the passed parameters with joins on those values. The code was just as modular, without repeating itself, while omitting the wrapping in functions (stored procedures). Each stored procedure call has the overhead of starting a new SQL statement and closing a transaction at the end. The final result was less code, fewer places to look for it, and a 12 minute execution time instead of 8 hours.

This is when I learned about set-based thinking, and about doing all the work with fewer reads of the data.
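Roughly, the set-based version of the same sketch (again with made-up names) lets the joins supply what the parameters used to, so the data is read once instead of n*m times.

# Set-based sketch: joins replace the passed parameters and the work
# happens in a single pass over the data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df_commissions = (
    spark.table("sales")
         .join(spark.table("stores"), "store_id")
         .join(spark.table("commission_types"), "commission_id")
         .groupBy("store_id", "commission_id")
         .agg(F.sum("sale_amount").alias("commission_base"))
)
df_commissions.write.mode("overwrite").saveAsTable("commission_results")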

The takeaway is not to avoid functions or code libraries. It is to understand how they work and what the implications are. A user defined function (UDF) in Spark, for example, is notoriously slow, and I often look into them only to replace them. A UDF might be the only solution for something like encryption, which you wouldn't hand-write in SQL, but even then a native built-in library function would be preferred.
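As a small, assumed example of that trade-off (hashing a column, not the exact encryption case), here is a Python UDF next to the built-in function I would replace it with:

import hashlib

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a@example.com",)], ["email"])

# Python UDF: every row is serialized out to a Python worker and back.
sha_udf = F.udf(lambda s: hashlib.sha256(s.encode()).hexdigest(), T.StringType())
df_udf = df.withColumn("email_hash", sha_udf("email"))

# Built-in function: stays inside the engine where it can be optimized.
df_builtin = df.withColumn("email_hash", F.sha2(F.col("email"), 256))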

What About a CTE?

I almost didn't include this section, because I know people now default to using CTEs and love them. I like CTEs and prefer to use them, but the answer is not straightforward; it depends on your execution engine and how well it can optimize.

Since a chain of CTEs is all together a single statement, the engine may be able to understand all of the layers and optimize across them. But you don't know unless you check. If it does exactly what the SQL statement literally says, then every reference to a CTE takes another pass to read all of the data.
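For a concrete, hypothetical example, here is a CTE referenced twice in Spark SQL. Whether the sales table ends up being scanned once or twice for it is up to the optimizer, which is exactly why you check.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# store_totals is referenced twice; a naive plan would read sales twice.
query = """
WITH store_totals AS (
    SELECT store_id, SUM(sale_amount) AS total
    FROM sales
    GROUP BY store_id
)
SELECT t.store_id, t.total
FROM store_totals t
CROSS JOIN (SELECT AVG(total) AS avg_total FROM store_totals) a
WHERE t.total > a.avg_total
"""
df_above_average = spark.sql(query)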

For performance, it usually depends on the SQL Engine

Things change all of the time with SQL engines. If you are learning the patterns and are faced with two options, don't just ask an expert or look online; you will get the most information from running an explain plan for each of the two options and learning the true answer for your situation. Look at the differences in clock time, CPU used, memory, and I/O.
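In PySpark, for example, that check is one call per option; other engines have their own EXPLAIN syntax. The data and the two variants below are stand-ins; the point is comparing the plans of two queries that return the same result.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_sales = spark.range(100).withColumnRenamed("id", "store_id")  # stand-in data

# Two ways to get the distinct list of stores
df_option_1 = df_sales.select("store_id").distinct()
df_option_2 = df_sales.dropDuplicates(["store_id"]).select("store_id")

# Compare the plans (and then the actual run times and resource usage) before choosing.
df_option_1.explain(True)
df_option_2.explain(True)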

What About DRY (Don't Repeat Yourself)?

When I took Programming 101, the rule was: if you are writing something at least three times, consider a function. With the ubiquitous DRY rule, I too often see a desire to eliminate even a second copy with an abstraction. It may work, but each abstraction is something someone on the team will have to learn and someone will have to review, and it will not be the same abstraction at another company. Layers and layers of abstractions optimize specific things at the expense of others. All system design involves identifying and understanding the trade-offs, because design decisions are not free. Make it easier on your future self who will return to this code, and save the team time and development cost, by limiting unnecessary "just in case" abstractions until it is absolutely clear they are needed.

def get_df_sales_stores_calendar(df_sales, df_stores, df_calendar):
    # function to return a data frame, joining the 3 tables
    return (df_sales
            .join(df_stores, ...)
            .join(df_calendar, ...)
            .select(...))

For example, let's say you are creating the same data frame in PySpark in several modules, joining the same 3 tables in the same way throughout the code base, as in the snippet above. As a software engineer, I would think it makes sense to put that into a function. That is one way to modularize it, but the engine will perform the same calculation every time it is called. Consider instead creating a physical table that loads this frequently used combination, and reading from the already joined data, as in the sketch below. An important part of the data modeling exercise is identifying the common components you will reuse often. These could be dimensions, but they can also be configuration tables, intermediate staged tables, or metadata tables. There are many ways reuse can be built into how you structure your tables.
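A minimal sketch of that alternative, with hypothetical table names and join keys: materialize the combination once in the pipeline, then every module reads the finished table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the frequently used combination once, in one place in the pipeline...
(spark.table("sales")
      .join(spark.table("stores"), "store_id")
      .join(spark.table("calendar"), "date_id")
      .write.mode("overwrite")
      .saveAsTable("sales_stores_calendar"))

# ...and everywhere else, read the already joined data instead of re-deriving it.
df_sales_stores_calendar = spark.table("sales_stores_calendar")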

Besides not repeating the compute, which is exchanged for inexpensive storage, you get the added benefit of being able to more easily inspect and debug the business logic results in the reusable table, to see whether they look the way you expected.


Summary

To summarize, understand the trade-offs of representing code reuse through data modeling rather than through more code abstractions. When in doubt or uncertain about what performs best, check the execution plan for a more detailed understanding. Optimization is not only about system resources, but also about human resources, since developer time is very expensive. Make it easy for other developers to use and code review.



#dataengineer #datamodel #datamodelingrants

Gabriella Alumbugu

Analyst @ Bancorp Consulting || Data Analyst || Excel || SQL || PowerBI || Chemical Engineer

1 week

Great insights on balancing code reuse with simplicity Lenny A..! The trade-off between abstraction and readability is crucial, especially when optimizing for both system and human resources. I completely agree that checking the execution plan offers a better understanding of performance, allowing for informed decisions. Prioritizing clarity not only improves code maintainability but also fosters efficient collaboration among developers. Streamlined code reviews and ease of use are key in maximizing developer productivity and long-term scalability.
