Why You Can't Do All Your Data Engineering with SQL

There is a common misunderstanding in data engineering that you can do everything you need to create a big data pipeline with SQL. This notion is being promoted by some vendors and companies, but they're wrong: you can't do all of your data engineering with SQL. You will eventually need a programming language to fill in the gaps, and that goes well beyond adding UDFs (user-defined functions) to accommodate custom logic.
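To be concrete about what UDFs do cover, here is a minimal sketch of registering a custom function with a SQL engine. It assumes PySpark with an existing SparkSession named spark, and the cleaning rule itself is hypothetical; the rest of this article is about everything that does not fit into a small function like this.

    # Minimal sketch: registering a custom function in PySpark so it is callable
    # from SQL. Assumes an existing SparkSession named `spark`; the cleaning rule
    # below is hypothetical.
    from pyspark.sql.types import StringType

    def clean_device_id(raw):
        # Hypothetical rule: drop a vendor prefix and normalize the case.
        if raw is None:
            return None
        value = raw.strip().lower()
        return value[4:] if value.startswith("dev-") else value

    spark.udf.register("clean_device_id", clean_device_id, StringType())
    # Now callable from SQL: SELECT clean_device_id(device_id) FROM raw_events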


Customization

This need for programming comes from companies expecting to do custom things. They've always had programmers on staff to write custom code for custom use cases. Given time and the addition of even more custom cases, you end up with something that can't be handled by anything other than custom code. The older the company, the messier the integration problems and the more data engineering is needed.

Integrating Systems

The most common objective for big data is to bring data from all over the company into a single place. Integrating all of these systems is a difficult part of the data engineering process.

A common integration workflow is to start out with binary data. This binary data will come from some sort of legacy system or embedded device sending out data. How do you decode binary data with a SQL query? From there, the data needs to be processed, enriched, and smoothed. I’ve seen companies try to do these steps in SQL. It leads to incomprehensible SQL queries that would have been better expressed with code. More advanced smoothing or normalizing algorithms can’t be expressed with SQL.
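To make the binary decoding point concrete, here is a minimal sketch in Python. The record layout is hypothetical; a real one would come from the legacy system's or device's documentation. The point is that this kind of byte-level work is straightforward in code and has no natural expression in SQL.

    import struct

    # Hypothetical layout for a legacy device record: a 4-byte unsigned device id,
    # an 8-byte unsigned epoch-millisecond timestamp, and a 4-byte float reading,
    # all big-endian.
    RECORD_FORMAT = ">IQf"
    RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

    def decode_records(payload: bytes):
        """Yield (device_id, timestamp_ms, reading) tuples from a raw byte payload."""
        for offset in range(0, len(payload) - RECORD_SIZE + 1, RECORD_SIZE):
            yield struct.unpack_from(RECORD_FORMAT, payload, offset)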

You might think this is just an enterprise or larger organization issue. If you’re a startup, you’ll likely have to bring in data from other sources. You’ll be lucky if these outside data sources are usable in their current format. This is because most companies fail at exposing data as a product.

A common workflow for outside data sources is to take the data and normalize it to your system. The incoming data is often in a binary, custom, or quasi-standards-compliant format, while the SQL tools expect standards-compliant JSON, XML, or another standard format. The incoming data doesn't conform to the standard 100 percent; there are small but important differences. You are faced with how to change the data into something that works. You're back to a task that could potentially be expressed with SQL, but would be so complex and incomprehensible that you're better served with code. Once again, this assumes the problem is solvable with SQL at all. In my experience, the majority of these problems can't even be expressed in SQL.
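Here is a minimal sketch of that normalization step, assuming a hypothetical vendor feed whose field names, number formats, and timestamps all deviate slightly from what the downstream SQL tools expect. The specific quirks are invented for illustration, but they are typical of the small-but-important differences described above.

    import json
    from datetime import datetime

    def normalize_record(raw_line: str) -> dict:
        """Normalize a hypothetical vendor record into the shape our SQL tools expect."""
        record = json.loads(raw_line)
        return {
            # The vendor flips between two field names for the same value.
            "customer_id": record.get("customer_id") or record.get("custId"),
            # Amounts arrive as strings with a currency symbol attached.
            "amount": float(record["amount"].lstrip("$")),
            # Timestamps arrive as "MM/DD/YYYY HH:MM" instead of ISO 8601.
            "event_time": datetime.strptime(
                record["event_time"], "%m/%d/%Y %H:%M"
            ).isoformat(),
        }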

If your company isn’t dealing with the many issues of integrating systems, count your lucky stars. You may hit it eventually.

Where Does SQL Go?

Basically, SQL goes at the end of the pipeline, once the data is close to the end product. You might even say it goes at the very end, where you use SQL to create new and derivative data products. SQL comes in only at the tail end of a great deal of data engineering work that has already been coded.

Don't get me wrong: SQL really has its place. But if a team is limited to SQL as its only programming language, it will be severely constrained or completely unable to create data pipelines. It's imperative that data engineers be able to program. A data engineer must be able to choose the right language for the job – be that SQL, Java, Scala, or Python.

SQL augments a data engineer's abilities. Sometimes SQL is just plain better at expressing a problem; common examples are JOINs, GROUP BYs, and simple WHERE clauses. Some of the better SQL integrations in big data frameworks let data engineers move in and out of SQL and code. This gives the best of both worlds and makes data engineers even more productive.
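A rough sketch of moving between code and SQL in one of these frameworks, assuming PySpark; the paths, table name, and columns are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mixed-pipeline").getOrCreate()

    # Code side: read the already-normalized feed (path is hypothetical).
    events = spark.read.json("s3://example-bucket/normalized-events/")
    events.createOrReplaceTempView("events")

    # SQL side: the aggregation is simply clearer as a GROUP BY.
    daily_totals = spark.sql("""
        SELECT customer_id, date(event_time) AS day, sum(amount) AS total
        FROM events
        GROUP BY customer_id, date(event_time)
    """)

    # Back to code for anything SQL expresses poorly.
    daily_totals.write.mode("overwrite").parquet("s3://example-bucket/daily-totals/")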

SQL is one of the best ways to expose a data pipeline for consumption by other, less technical groups for ad hoc queries and report creation. When I'm evaluating technologies for a team, I'm always looking for SQL support. However, these systems are receiving and serving up work that was already done earlier in the data pipeline.

Testing

How do you test SQL pipelines? These tests can be unit, integration, or system tests. I've had several clients who were transitioning off RDBMS systems with heavy SQL and stored procedure usage. Their main difficulty was replicating the stored procedures and verifying them, which made it really hard to confirm that the new code was working as expected. Some of the real-time SQL processing technologies have the same issue: there is no way to unit test the SQL that is being run.
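One way to unit test a SQL transformation is to run it against an in-memory database with known inputs and assert on the output. This is a minimal sketch using Python's built-in sqlite3 module; the query under test and the table shape are hypothetical, and a real pipeline would run the same SQL against its actual engine in an integration test.

    import sqlite3

    # Hypothetical transformation under test.
    DAILY_TOTALS_SQL = """
        SELECT customer_id, substr(event_time, 1, 10) AS day, sum(amount) AS total
        FROM events
        GROUP BY customer_id, substr(event_time, 1, 10)
    """

    def test_daily_totals_sums_per_customer_per_day():
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (customer_id TEXT, event_time TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [
                ("c1", "2024-01-01T09:00:00", 10.0),
                ("c1", "2024-01-01T17:00:00", 5.0),
                ("c2", "2024-01-02T08:00:00", 7.5),
            ],
        )
        rows = conn.execute(DAILY_TOTALS_SQL).fetchall()
        assert sorted(rows) == [("c1", "2024-01-01", 15.0), ("c2", "2024-01-02", 7.5)]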

Some data quality issues will remain the same whether you're coding or using SQL. I advocate a fail-fast policy for data quality: you fail the bad record, log it, and move on. Make sure you run negative tests to verify you're handling bad data correctly.
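A minimal sketch of that policy in code; the error types and record shape are assumptions for illustration:

    import logging

    logger = logging.getLogger("pipeline.quality")

    def process_with_fail_fast(records, parse):
        """Apply `parse` to each record; log and skip anything that fails."""
        for record in records:
            try:
                yield parse(record)
            except (ValueError, KeyError) as err:
                # Fail fast on the bad record: log it and move on to the next one.
                logger.warning("dropping bad record %r: %s", record, err)

    # A negative test would feed deliberately malformed records through this
    # function and assert that they are skipped and logged, not silently loaded.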
