SQL
SQL is an amazing language. So many other languages come and go, but SQL is here to stay. Stands for Structured Query Language, it was invented in 1970s. So it’s 50 years old! Because most databases are relational, SQL becomes the de-facto standard for querying databases. And data warehouse. It becomes the standard for updating relational tables too.
In the era of Big Data, when Hadoop was first invented, it was MapReduce that was used to process data. But not long after that, guess what people used to query Hadoop? SQL language! They put something called Pig and Hive, which enabled us to query Hadoop using SQL language. Yes it does sound like a zoo.
When Spark came along, people query it using Scala. It gave you access to the Resilient Distributed Dataset (RDD). In Spark, each RDD is represented by a Scala object (link). But today almost everyone query Spark using SQL. Including on Databricks. You can use Scala and Java on Databricks, but most people use PySpark and SparkSQL. PySpark is Python language, but the Python code is not running on a traditional Python environment. The PySpark code is converted to (guess what?) SQL and run on JVM cluster.
Sometimes people asked me whether they should learn Python or SQL for data jobs. Well, both! Both Python and SQL are used in data jobs. For querying database, SQL is the best, by far. But there are tons of things that SQL can’t do that Python can, like building applications. So you need to learn both. Whether you do data analysis or data science, you need to use both Python and SQL.
Those of us who have been using SQL for years would think that it is easy to use. But those who are just about to start their career in the data industry might have a big fear about SQL. For those people let me give you a piece of advice: it is just about “SELECT FROM TABLE1 WHERE Column1 = ABC”. It is just about selecting some rows from a table. That’s it. So you will be able to do it within half an hour. Don’t let other things scare you. Forget about INSERT, DELETE, UPDATE and everything else. Just do “SELECT FROM TABLE1” on a few tables and look at the content of those tables. Apply WHERE clause to pick certain rows. And that’s it.
Whether you are on SQL Server, Oracle, Redshift or Snowflake, you will be doing the same thing: SELECT * FROM TABLE1 WHERE Column1 = 'ABC'. You will be able to query data from tables. Just make it easy for yourself: focus on SELECT from one table. Then after you are comfortable with it, use JOIN to do select from two tables. Just with SELECT, WHERE and JOIN, you will be able to query most of the data. And then you can do your analysis with that data.
So SQL in many ways is just a tool to grab the data out of the database. That is what SQL is for many business users. They want to analyse the data in Excel. The data scientists might prefer to analyse the data using Python, the data analysts might prefer to analyse the data using SQL, but most of the business users prefer to analyse the data in Excel. They ask IT departments to get the data for them using SQL, and put it on Excel. Can’t they do it in Qlik? Of course they can. Can’t they do it in Power BI? Of course they can. Can’t they do it in SQL? Of course they can. But they are more comfortable using Excel.
That’s why, the BI tools like Looker end up being used as a query tool. Just to grab the data. And put it on Excel. Not just looker, but Cognos, Sisense, Power BI, Tableau, Zoho, they all are used as a tool to grab the data. To you and me who are from IT department, they are tools to build BI applications. To the business users, the BI tools are just replacement for SQL. Just a tool to bring the data out from the database, into Excel.
Whether it is sales manager, finance analysts, purchasing or customer relations, they all use Excel. Can they learn SQL? Of course they can. Is it going to happen? Of course it won’t. A few people talked to me about AI driving the SQL. “Soon”, so they say, “we no longer need to write SQL. AI will be understand plain English and grab the data for you.” Who need to write SQL anyway? Look at it from the business users point of view. They never need to write SQL. They get their data from SAS. Or Qlik. Or Tableau. Or whatever tool you gave them with. They can see the data on the screen with that tool, and they can select the data they need. That AI talk is like inventing a problem which doesn’t exist. Who needs to write SQL? Not the business users.
It is the IT department who write SQL. And does those IT developer wants AI to write SQL for them? Well, yes! It is already happening. Whether you code using Python, SQL or Javascript, you can ask ChatGPT or Copilot or Gemini to write the code for you. Then you need to fix it or modify it to do what the code needs to do. But at least you don’t start from scratch.
But yes, because the data is stored in databases, you will need to use SQL to query the data. You can use AI to write the SQL for you, but you will need to use SQL to query the data. And as those of us who has written SQL for years can testify, it is not difficult to write SQL. In the core of it, it’s just SELECT, WHERE and JOIN. Right? Well you can do fancy stuff like Regex, Cursor, Rank, etc. but the single statement that we use most is the plain old SELECT.
Would appreciate your comments and opinions. And please correct me if I’m wrong.
Business lntelligence Developer | Database Administrator
2 周???
Senior Business Intelligence Consultant with focus on the MS BI platform. Fully booked until May 2025.
3 周SQL was build for end users in the first place but became something very complicated during the recent years. The relational model with 3NF is still the most productive way to handle data and more scenarios are being added continuously to get data into tables, SQL is strong and I agree about what you write here.
Consultant database, data warehouse, BI, data mart, cube, ETL, SQL, analysis, design, development, documentation, test, management, SQL Server, Access, ADP+, Kimball practitioner. JOIN people ON data.
3 周SQL is a declarative language that describes what should be done, not how it should be done. I often wonder how extended the language would have been today, if we not have lost one of the two SQL-father Raymond F. Boyce in June 1974. RIP. The other SQL-father is Donald D. Chamberlin. https://www.joakimdalby.dk/HTM/sequel.pdf
SQL is the best thing that happened before the Internet. And still is. Thanks for this post