WHY SQL IS YOUR FOUNDATION FOR PRACTICING DATA SCIENCE

Several times last year I was asked, “What do you think is the most important skill in data science?” I always replied, “SQL.” Although this was usually met with a nod of agreement, I was often told it wasn’t the typical answer. That’s understandable; the Python vs. R debate is apparently sexier. But on day one of your first data job, they’re going to introduce you to their data warehouse. That is the data you’ll be analyzing, and you’ll get at it by writing SQL queries.

I’ve used SQL at multiple jobs throughout my career, but I wanted to make sure that other companies were doing the same. A current Google job listing for a Data Scientist asks for experience with SQL.

The major cloud providers now offer relational databases in the cloud, among them Google Cloud SQL and Azure Database for PostgreSQL. The data is getting bigger, but SQL is here to stay (and scale).

In my article on data science FAQs, we saw that 51% of US job openings titled “Data Scientist” asked for SQL.

Companies are using it, there is demand for the skill, and it’s here to stay. Even if for some reason you do not need SQL at your first job, I’m sure you’ll need it at some point during your career if you’re in data science.


OBJECTION: YOU WORK WITH UNSTRUCTURED DATA

Yes, I sometimes work with unstructured data in the big data environment. But if I’m in there and find a variable that would be relevant for repeated use, we’ll typically have the big data team make it available in the data warehouse.

In the meantime, I pull some grouped data (using Hive, whose query language is quite similar to SQL) onto my local machine and do the analysis there.

At some point in the future (how long depends on current priorities), that data becomes available to me in one of the tables in the database. And life keeps moving on.

I could spend more time in the big data environment, but queries run much faster in the relational database.
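
As a rough sketch of what "pulling grouped data" can look like, the snippet below aggregates inside the database and brings back only the small summary. Everything here is an assumption for illustration: an in-memory SQLite database stands in for Hive or the warehouse, and the `events` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real environment (assumption).
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "email", 20.0), (1, "web", 35.0), (2, "web", 15.0), (3, "email", 50.0)],
)

# Aggregate in the database so only the grouped result comes back locally.
query = """
    SELECT channel,
           COUNT(DISTINCT customer_id) AS customers,
           SUM(amount) AS total_amount
    FROM events
    GROUP BY channel
    ORDER BY channel
"""
grouped = conn.execute(query).fetchall()

for channel, customers, total_amount in grouped:
    print(channel, customers, total_amount)
```

The same GROUP BY pattern carries over to most SQL dialects, which is part of why moving between the big data environment and the relational database is not a big jump.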


OBJECTION: SOMEONE OTHER THAN YOURSELF PULLS YOUR DATA

We all know that it is important to understand how the tables are related and the logic behind the data. I want to write the query that builds the model I’m putting my name on, so that I understand all of the intricacies and nuances of the fields. Having a full understanding of the potential biases and caveats that go along with my model allows me to communicate those caveats to the business. I also like to think I’m pretty creative when thinking about variables, which is partly due to having a good understanding of the different tables in the relational database.

There are often questions you’ll want to be able to answer by yourself. If something doesn’t seem right with your data, you’ll want to be able to dig into the discrepancy to find out what is going on. You don’t want to be blindly following data that someone else provided, and you don’t want to get held up if the data doesn’t seem quite right. I want to dig in and look for answers immediately.
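
Here is one hypothetical example of digging into a discrepancy yourself: a total that looks too high after a join. The sketch assumes an in-memory SQLite database and made-up `customers` and `orders` tables; the point is only the pattern of comparing totals and then looking for duplicate keys.

```python
import sqlite3

# Hypothetical tables in an in-memory SQLite database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 40.0);
""")

# Total straight from the orders table.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Total after joining to customers; a duplicated customer row inflates it.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
""").fetchone()[0]

# Find the keys that appear more than once in the customers table.
duplicate_keys = conn.execute("""
    SELECT customer_id, COUNT(*) AS n_rows
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(true_total, joined_total, duplicate_keys)
```

If the two totals disagree, the duplicate-key query usually points at the table that is fanning out the join.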



OBJECTION: MAYBE YOU JUST WANT TO USE PYTHON OR R

Cool, I pull data from the database into Python and R too. However, I start my query in SQL. For complex queries that join multiple tables, I find it makes sense to write the query in SQL first. When I misspell something, the error is much easier to catch and track down directly in SQL than when I write the query in Python and then find that it doesn’t run for some reason. Python just lets me know that there is an error; it’s not going to give me the hints about the problem that I’d get in SQL.
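
A minimal sketch of that workflow, under stated assumptions: the multi-table join is written and debugged as plain SQL first, then pasted into Python as a string and run with a parameter. SQLite from the standard library stands in for the real database, and the `customers`, `products`, and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         product_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'consumer'), (2, 'business');
    INSERT INTO products VALUES (100, 'books'), (200, 'games');
    INSERT INTO orders VALUES (1, 1, 100, 12.5), (2, 1, 200, 30.0), (3, 2, 100, 7.5);
""")

# The query is developed in a SQL client first, then reused here unchanged.
# The "?" placeholder is filled in by the driver at execution time.
SPEND_BY_SEGMENT = """
    SELECT c.segment, p.category, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products AS p ON p.product_id = o.product_id
    WHERE p.category = ?
    GROUP BY c.segment, p.category
    ORDER BY c.segment
"""

rows = [dict(r) for r in conn.execute(SPEND_BY_SEGMENT, ("books",))]
print(rows)
```

Keeping the SQL as a single named string makes it easy to copy back into a SQL client when something needs debugging.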

Although you can use Python or R syntax that is not SQL to speak to the database, you still have to understand the schema and how relational databases work to query successfully this way. SQL is fairly easy to learn, even for total newbies.


SUMMARY

The learning curve is gentle, so you’ll be writing queries in no time if you decide to learn.

Learn it once, use it again and again in your career. More than 50% of the “Data Scientist” positions in the US are specifically asking for SQL. Do not underestimate the value of learning this skill!


Originally posted on https://datamovesme.com


