WHY SQL IS YOUR FOUNDATION FOR PRACTICING DATA SCIENCE
Kristen Kehrer
Mavens of Data Podcast Host, [in]structor, Co-Author of Machine Learning Upgrade
A number of times last year I was asked, “What do you think is the most important skill in data science?” I always replied, “SQL.” Although this response was always met with a nod of agreement, I was often told that it wasn’t the typical response. Understandably so: the Python vs. R debate is apparently sexier. But on day one of your first data job, they’re going to introduce you to their data warehouse. That is the data you’ll analyze, and you’ll get to it by writing SQL queries.
I’ve used SQL at multiple jobs throughout my career, but I wanted to make sure that other companies were doing the same. A Google job listing for a Data Scientist, current at the time of writing, asks for experience with SQL.
The major cloud providers now offer relational databases in the cloud, including Google Cloud SQL and Azure Database for PostgreSQL. The data is getting bigger, but SQL is here to stay (and scale).
As I showed in my article on data science FAQs, 51% of job openings titled “Data Scientist” in the US were asking for SQL.
Companies are using it, there is demand for the skill, and it’s here to stay. Even if for some reason you do not need SQL at your first job, I’m sure you’ll need it at some point during your career if you’re in data science.
OBJECTION: YOU WORK WITH UNSTRUCTURED DATA
Yes, I sometimes work with unstructured data in the Big Data environment. But if I’m in there and find a variable that would be relevant for repeated use, we’ll typically have the big data team make it available in the data warehouse.
I pull some grouped data (using Hive, which is quite similar to SQL) onto my local machine and do an analysis.
At some point in the future (length of time depending on the current priorities) I’ll have that data available for me to use in one of the tables in the database. And life keeps moving on.
I could spend more time in the big data environment, but queries run much faster in the relational database.
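A grouped pull like the one described above might look like the following minimal sketch. It uses Python’s built-in sqlite3 module with an in-memory database as a stand-in for Hive or the warehouse, and the `events` table and its columns are hypothetical names chosen for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive or a data warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "click"), (1, "purchase"), (2, "click")],
)

# Aggregate inside the database so only the small grouped result comes back locally.
grouped_query = """
    SELECT customer_id, event_type, COUNT(*) AS n_events
    FROM events
    GROUP BY customer_id, event_type
    ORDER BY customer_id, event_type
"""
grouped_rows = conn.execute(grouped_query).fetchall()
conn.close()
```

The point of grouping before pulling is that the local machine only ever sees one row per customer and event type, not the raw event log.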
OBJECTION: SOMEONE OTHER THAN YOURSELF PULLS YOUR DATA
We all know that it is important to understand how the tables are related and the logic behind the data. I want to write the query behind the model I’m putting my name on, and to understand all of the intricacies and nuances of the fields. Having a full understanding of the potential biases and caveats that go along with my model allows me to communicate them to the business. I also like to think I’m pretty creative when thinking about variables. This is partly due to having a good understanding of the different tables in the relational database.
There are often questions you’ll want to be able to answer by yourself. If something doesn’t seem right with your data, you’ll want to be able to dig into the discrepancy to find out what is going on. You don’t want to be blindly following data that someone else provided, and you don’t want to get held up if the data doesn’t seem quite right. I want to dig in and look for answers immediately.
OBJECTION: MAYBE YOU JUST WANT TO USE PYTHON OR R
Cool, I pull data from the database into Python and R too. However, I start my query in SQL. For complex queries that join multiple tables, it makes sense to write the query in SQL first. Errors from a misspelling are much easier to catch and track down when I’m directly in SQL than when I write the query in Python and find that it doesn’t run for some reason. Python just lets me know that there is an error; it’s not going to give me hints about what the problem was like I’d get in SQL.
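As a rough sketch of that workflow, the join below is the kind of query that would be written and tested in a SQL client first, then pasted unchanged into Python. The `customers` and `orders` tables, their columns, and the `fetch_region_totals` helper are all hypothetical, and an in-memory SQLite database stands in for the real warehouse.

```python
import sqlite3

# Hypothetical schema, built in memory so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Written and debugged directly in SQL first, then kept here as a plain string.
JOIN_QUERY = """
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""

def fetch_region_totals(connection):
    """Run the pre-tested join and return (region, total_amount) rows."""
    return connection.execute(JOIN_QUERY).fetchall()

region_totals = fetch_region_totals(conn)
conn.close()
```

Keeping the SQL as a single string means the exact text that worked in the SQL client is what runs from Python, so any later failure is more likely to be about the connection or the environment than about the query itself.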
Although you can use Python or R syntax that is not SQL to speak to the database, you still have to understand the schema and how relational databases work to query successfully this way. SQL is fairly easy to learn, even for total newbies.
SUMMARY
The learning curve is gentle, so you’ll be writing queries in no time if you decide to learn.
Learn it once, use it again and again in your career. More than 50% of the “Data Scientist” positions in the US are specifically asking for SQL. Do not underestimate the value of learning this skill!
Originally posted on https://datamovesme.com