Data Science Curriculum for Total Beginners
A 3-6 month blitz program

Hi everyone,

I hope you all are having a great Sunday! In this week's edition, I will show you a very straightforward curriculum for learning data science, aimed at people without a technical background.

6-step top-down roadmap

It's a condensed version of a pragmatic machine learning roadmap I've written, aimed at busy professionals (instead of university students).

Disclaimer: This isn’t the type of roadmap that should take you 4 years to complete. It's a roadmap designed to get you to a point where you can do any kind of data science project fast.

There are 2-3 learning steps that, once you finally get through them, take you from not knowing how to do data science to being able to do it. I suggest that anyone serious about learning this skill set aside some focused time to blitz through the material.

Bias: This curriculum is heavily biased by my personal experience. I'm pushing Python instead of R or MATLAB, and I highlight specific libraries and ways of working that might not align 100% with your reality. Still, I believe this roadmap is the simplest and most straightforward way to acquire this professional skill.

Table of Contents

  1. Step 1: Start at the Top and Study a Complete Data Science Analysis.
  2. Step 2: Take a Beginner Python Course.
  3. Step 3: Learn the Basic Data Science Tech Libraries.
  4. Step 4: Take a Dataset and do Exploratory Data Analysis.
  5. Step 5: Learn Pragmatic Statistical Learning Skills.
  6. Step 6: Take a Dataset and Start Modeling End-to-End.
  7. Extra Step: Dive Deeper Into Statistical Learning.
  8. Conclusion

Step 1: Start at the Top and Study a Complete Data Science Analysis.

Goal: Understand what making a complete data science analysis looks like.

The very first step in the roadmap is to look at the whole game. When technical topics are involved, there is a tendency to start by learning the fundamentals. At face value this makes sense: you don't want someone who doesn't know what they are doing running a nonsensical analysis.

However, pushed to an extreme, this type of first-principles learning is tedious, and beginners quickly get discouraged. Here's a good analogy from Dr. Perkins, who popularized the top-down learning method we are using in this roadmap:

You don’t learn to play baseball by a year of batting practice, but in learning math, for instance, students are all too often presented with prescribed problems with only one right solution and no clear indication how they connect with the real world. (Perkins, 2009)

In our context, a full data science game is a complete notebook that goes from raw data to modelization.

The best place to check out how accomplished data scientists organize their work is Kaggle. It's like a giant repository of analyses written in what are called notebooks.

Great website for data science practitioners.
Notebooks are a type of program that lets you mix programmatic analysis and visualizations in the same format. A bit like a Word document with sections that can run code.

What I like about Kaggle is that it abstracts away a lot of the adjacent skills you would otherwise need to build to be a complete data scientist:

  1. raw data management
  2. setting up a coding environment
  3. notebook versioning

This allows you to quickly pick up the essential data science skills instead of spending a full week setting up your GPUs or your Python environment.

What I suggest for this step is to go to the Kaggle website, head to the competitions section, and look at the winners of various data science challenges:

The competitions are a great place to learn how others are running data science projects.

Try to filter for competitions with data you would use in your day-to-day. Open up a few of the winning notebooks and look at the code, visualizations, and results.

You will most likely not understand much of what is going on, but by doing so you get to see a complete end-to-end analysis. The interesting thing is that the code is usually not that verbose:

Check out this one, which analyzes obesity risk in some medical data.

Well structured, not that verbose.

It’s the thought process behind the analysis that is the most important part of most data science projects and that's what this roadmap is trying to teach you, fast.

I suggest bookmarking the notebooks you find the most interesting so that you can revisit them periodically. As you learn more machine learning topics, you will quickly get a good grasp of what is going on in these analyses.

Step 2: Take a Beginner Python Course.

Goal: Get basic programming knowledge to write scripts and snippets of code.

The basic programming course material I recommend usually comes from freeCodeCamp. It's a non-profit organization that creates great, complete long-form tutorials for learning how to program.

This 4-hour course should be enough to get a handle on the basics:

If you prefer a more interactive learning experience, you can also take this 5-hour course on Kaggle.

The cool thing is that you can code straight into the course.

Don't spend too much time trying to understand all the intricacies of the language; you will have plenty of time to delve deeper into how to use Python efficiently later on.
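For calibration, here's a hypothetical snippet at roughly the level a beginner course should get you to: variables, loops, conditionals, functions, and lists.

```python
def count_long_words(words, min_length=5):
    """Count how many words are at least min_length characters long."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count

sentence = "data science is easier to learn top down".split()
print(count_long_words(sentence))  # -> 3 ("science", "easier", "learn")
```

If you can read and write something like this without help, you are ready for the next step.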

PS: There are other languages than Python you can use, but data science is much easier to learn in Python, in my humble opinion.

Step 3: Learn the Basic Data Science Tech Libraries.

Goal: Learn the programming tools to create robust analysis.

The main data science libraries I recommend learning are these:

  1. Pandas: it's like a programmatic Excel sheet.
  2. NumPy: it contains everything you need to compute, do stats, and play with your data.
  3. Matplotlib: it's a graphics library for creating visualizations.
  4. Sklearn (scikit-learn): it's a library that contains all the standard machine learning techniques, already implemented for you.

From the notebook mentioned above, we already see 3 of the 4 usual suspects in the import section.

There are tons of other libraries that might fit your use case, but 95% of all analyses can be run from start to finish with these four.
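To make that concrete, here's a minimal sketch of the four working together. The file name and column names ("data.csv", "age", "target") are hypothetical placeholders; swap in your own dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")   # pandas: load tabular data (hypothetical file)
print(df.describe())           # quick summary statistics

# numpy: compute a standardized version of a column
df["age_z"] = (df["age"] - np.mean(df["age"])) / np.std(df["age"])

# matplotlib: visualize a distribution
df["age"].hist(bins=30)
plt.xlabel("age")
plt.show()

# sklearn: fit and evaluate a first model
X, y = df[["age", "age_z"]], df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```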

These libraries are also heavily used by the community, meaning the odds are very high that any question or error you run into has already been answered somewhere.

I also heavily suggest using a cloud notebook platform like Google Colab to go through tutorials or analyses.

Very fast to get started, and you get access to a GPU for beefier modelization.

These platforms are usually free and have everything pre-installed so that you don’t have to worry about configuring low-level things like the Python version or the various packages.

PS: You can also use Kaggle's cloud notebooks directly; they work like a charm.

Step 4: Take a Dataset and do Exploratory Data Analysis.

Goal: Get comfortable manipulating data.

Exploratory data analysis is the first step in any data science or machine learning experiment. It means exploring your data and understanding how it is structured.

Example analysis from the notebook mentioned earlier.

This is where you create visualizations, get statistical information, and start to generate value.
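As an illustration, here's a minimal first-pass EDA sketch, assuming a hypothetical "train.csv"; use whatever dataset you picked.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")   # hypothetical file name

print(df.shape)                 # how many rows and columns?
print(df.dtypes)                # what type is each column?
print(df.isna().sum())          # where is data missing?
print(df.describe())            # summary statistics for numeric columns

# Distribution of every numeric column at a glance
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()

# Pairwise correlations between numeric features
print(df.corr(numeric_only=True))
```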

I highly recommend checking the first sections of some analyses on Kaggle that fit the sort of data you are working with.

This way you can get inspiration for what you could be calculating and find snippets of code you can use directly in your analysis (use the notebooks you bookmarked in the first step).

There is an infinite number of analyses you can do on any dataset, so be creative and explore.

You can also start with this Kaggle mini-course to learn some easy-to-use plotting techniques. It will at least get you started with a few tools in your toolbox!

PS: I've elaborated some more in this tutorial on the project structure I like to adopt for long-term data science projects:

For a shorter type of project, keep it simple and do everything in one notebook.

Step 5: Learn Pragmatic Statistical Learning Skills.

Goal: Learn how to make robust machine learning models.

Modelization is sometimes left out of data science learning roadmaps. However, my view is that the tooling for machine learning these days is so user-friendly that it doesn't make sense to leave it out.

Continuing with our top-down approach to learning these skills, I suggest not starting from foundational machine learning topics like linear algebra or statistics.

The reasoning behind not starting from these is that:

  1. You don't need to understand them to create a good machine learning system. By analogy, you don't necessarily need to understand mechanical engineering to know how to drive your car.
  2. It's going to take you way too long, and you will get discouraged. I have tutored lots of university students, and starting from the basics weeds out too many people compared to a top-down approach.

I elaborated some more in the following tutorial:

It doesn’t mean that you should never learn the basics. I recommend learning that in the last step of this roadmap.

It means that when you do dive into the basics, you should already know where that information fits in the greater scheme of things.

A bit like learning a sport: no athlete does targeted drills before they have ever seen what their sport looks like.

The book I always recommend for beginners to learn pragmatic statistical learning skills is this one:

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Either version is good!

The book has two sections, one for shallow learning systems and one for deep learning.

I would suggest not going through the deep learning section, because the technology it uses is harder to get into, and the machine learning community these days usually prefers the PyTorch deep learning framework.

Use that book to get a good understanding of everything needed to build a true machine learning system that is properly cross-validated.
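To give a flavor of what "properly cross-validated" means, here's a minimal sketch using a built-in toy dataset, so it runs as-is; the model choice is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: each fold's score is measured on data
# the model never saw during training.
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```

The mean across folds is a much more honest estimate of performance than a single train/test score.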

Once you are done, I highly recommend getting a grasp of deep learning systems using this course from fast.ai:

Practical Deep Learning for Coders

Very solid material in there.


The course is extremely well thought out and enforces a very pragmatic approach to training neural networks.

It uses the fastai library, which is a wrapper over the PyTorch framework, meaning that it is extremely easy to train a deep learning system in a few lines of code.
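As a sketch from memory of the course's first lesson (the exact API may shift between fastai versions), this is roughly all it takes to fine-tune an image classifier:

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS) / "images"

def is_cat(filename):
    # In this dataset, cat breeds are capitalized and dog breeds are not.
    return filename[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=is_cat, item_tfms=Resize(224),
)
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)  # transfer learning: one epoch of fine-tuning
```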

With these two resources, there is nothing you cannot start building.

Step 6: Take a Dataset and Start Modeling End-to-End.

After running through the exercises in step 5 a few times, you are more than ready to do a full end-to-end analysis.

The general process is simple enough:

  1. Take a dataset that you are interested in (it helps when it's a topic you love).
  2. Pre-process it to put it into a clean data frame format.
  3. Do some exploratory data analysis to get a feel for what is in the data.
  4. Explore and document how the data was generated and what the meta-context of the dataset is (this is a very important step).
  5. Define your project objective.
  6. Separate your data into training, validation, and test sets for modelization, if necessary (see the sketch after this list).
  7. Train a naive machine learning system and optimize iteratively.
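Here's a minimal sketch of step 6 plus a first naive model (step 7), using a built-in toy dataset so it runs as-is; substitute your own dataframe.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # toy stand-in for your own data

# Hold out 20% as a test set, touched only once at the very end...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then split the rest into training (60%) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
print("validation R^2:", baseline.score(X_val, y_val))  # iterate from here
```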

While you are going through this project, I would suggest topping up your learning with intermediate-level mini-courses on Kaggle on specific topics:

They are free and very fast to do. They cover topics such as:

  • Data visualization
  • Feature Engineering
  • Time series

And voilà!

At this point, you have officially learned how to do data science, taking raw data all the way to the modelization stage.

It's a good idea to switch to an iterative type of learning at this point and improve your weakest points.

Extra Step: Dive Deeper Into Statistical Learning.

Once you get the hang of the whole flow from data to modeling, it's a good investment of time to read a complete piece of work about statistical learning.

The resource that I found the most useful, and that I come back to again and again, is The Elements of Statistical Learning.

You can find the full book for free on the author's website here:

https://hastie.su.domains/publications.html

One of the best books about statistical learning

It's a very useful book, but you don’t need to dive into it linearly.

I would suggest that a motivated learner only dive into the specific sections that pertain to a technique they are using in their current analysis (or that they saw in a bookmarked notebook).

Check out the table of contents and jump to the right section straight away.

The book is very visual, the explanations are robust, and it's authoritative in the field. A very good investment of your time!

Conclusion

I hope this is useful for people getting started. If you have any questions, don't hesitate to reach out via DM so I can point you in the right direction.

The hardest part is committing for more than 3 months. You've got this!

You can also check out the full blog post I mentioned at the beginning for extra resources but don't get buried in too many courses. Remember that the goal is to be able to go from raw data to a full data science analysis as quickly as possible.

PS: Another skill that might be very useful to you is SQL, for extracting data from your various databases, but in my experience you don't need to learn much of it to be proficient in data science.

A nice resource that I enjoy for SQL-related topics is Alex Freberg's YouTube channel. It's beginner-friendly and teaches you good skills for getting data and doing all sorts of manipulation.
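For a flavor of how SQL plugs into the Python workflow, here's a tiny sketch using the standard-library sqlite3 driver; the table and values are made up so the snippet runs as-is, but in practice you would connect to your real database.

```python
import sqlite3
import pandas as pd

# In-memory demo database with hypothetical data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 120.0), ('north', 80.0), ('south', 200.0);
""")

# Pull an aggregated query straight into a pandas dataframe
df = pd.read_sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn
)
print(df)
```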

Good luck!
