Data Science Curriculum for Total Beginners
Hi everyone,
I hope you all are having a great Sunday! In this week's edition, I will show you a very straightforward curriculum for learning data science for people without a technical background.
It's a condensed version of a pragmatic machine learning roadmap I've written which is aimed at busy professionals (instead of university students).
Disclaimer: This isn’t the type of roadmap that should take you 4 years to complete. It's a roadmap designed to get you to a point where you can do any kind of data science project fast.
There are two or three learning steps that, once you go through them, take you from not knowing how to do data science to being able to do it. I suggest that anyone serious about learning this skill set aside some focused time to blitz through the material.
Bias: This curriculum is heavily biased by my personal experience. I'm pushing Python instead of R or MATLAB, and I highlight specific libraries and ways of working that might not align 100% with your reality. Still, I believe that this roadmap is the simplest and most straightforward way to get you this professional skill.
Step 1: Start at the Top and Study a Complete Data Science Analysis.
Goal: Understand what making a complete data science analysis looks like.
The very first step in the roadmap is to look at the whole game. There is a tendency when technical topics are involved to start by learning the fundamentals. At face value this makes sense: you don't want someone who doesn't know what they are doing running a nonsensical analysis.
However, pushed to an extreme, this type of first-principles learning is tedious for beginners, who quickly get discouraged. Here's a good analogy from Dr. Perkins, who popularized the top-down learning method we are using in this roadmap:
You don’t learn to play baseball by a year of batting practice, but in learning math, for instance, students are all too often presented with prescribed problems with only one right solution and no clear indication how they connect with the real world. (Perkins, 2009)
In our context, a full data science game is a complete notebook that goes from raw data to modeling.
The best place to see how accomplished data scientists organize their work is Kaggle. It's like a giant repository of analyses written in what are called notebooks.
Notebooks are a type of document that lets you combine code, analysis, and plotted visualizations in the same place, a bit like a Word document with sections that can run code.
What I like about Kaggle is that it abstracts away a lot of the adjacent skills you need to build to be a complete data scientist:
Yet, this allows you to quickly get the essential data science skills and not spend a full week setting up your GPUs or your Python environment.
What I suggest for this step is to go to the Competitions section of the Kaggle website and look at the winners of various data science challenges:
Try to filter for competitions with data you would use in your day-to-day work. Open up a few of the winning notebooks and look at the code, visualizations, and results.
You will most likely not understand much of what is going on, but by doing so you get to see a complete end-to-end analysis. The interesting thing is that the code is usually not that verbose:
It’s the thought process behind the analysis that is the most important part of most data science projects and that's what this roadmap is trying to teach you, fast.
I suggest that you bookmark the notebooks you find the most interesting so that you can revisit them periodically. As you learn more machine learning topics, you will quickly get a good grasp of what is going on in these analyses.
Step 2: Take a Beginner Python Course.
Goal: Get basic programming knowledge to write scripts and snippets of code.
The basic programming course material I usually recommend is from freeCodeCamp. It's a non-profit organization that creates great, complete long-form tutorials for learning how to program.
This 4-hour course should be enough for you to get a handle on the basics:
If you prefer a more interactive learning experience, you can also take this 5-hour course on Kaggle.
Don't spend too much time trying to understand all the intricacies of the language; you will have plenty of time to delve deeper into how to use Python efficiently later on.
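To give a sense of how little Python is needed at this stage, here is a minimal sketch of the kind of snippet a beginner course should leave you comfortable writing: a dictionary, a small function, and a loop. The sales figures and the `average` helper are made up for illustration and are not taken from the courses mentioned above.

```python
# Hypothetical sample data: monthly sales figures invented for illustration.
monthly_sales = {"Jan": 1200, "Feb": 950, "Mar": 1430}


def average(values):
    """Return the arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Compute the average, then keep the months that beat it.
avg = average(monthly_sales.values())
above_average = [month for month, amount in monthly_sales.items() if amount > avg]

print(f"Average sales: {avg:.1f}")
print(f"Months above average: {above_average}")
```

If you can read this and predict roughly what it prints, you likely know enough Python to move on to the next step.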
PS: There are other languages than Python you can use, but data science is much easier to learn in Python in my humble opinion.
Step 3: Learn the Basic Data Science Tech Libraries.
Goal: Learn the programming tools to create robust analyses.
The main data science libraries I recommend learning are these:
There are tons of others you can use that would fit your use cases, but 95% of all analyses can be run from start to finish with these four.
These are also heavily used by the community, so any question or error you run into has very likely already been answered.
I also strongly suggest using a cloud notebook platform like Google Colab to go through tutorials or analyses.
These platforms are usually free and have everything pre-installed so that you don’t have to worry about configuring low-level things like the Python version or the various packages.
PS: You can also use Kaggle's cloud notebooks directly; they work like a charm.
Step 4: Take a Dataset and Do Exploratory Data Analysis.
Goal: Get comfortable manipulating data.
Exploratory data analysis is the first step in any data science or machine learning experiment. It means exploring your data and understanding how it is structured.
This is where you create visualizations, get statistical information, and start to generate value.
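As a rough illustration of what a first exploratory pass can look like, the sketch below checks the structure of a tiny table, counts missing values, and computes a few summary statistics. The dataset is invented for illustration, and the code deliberately sticks to Python's standard library so that it runs anywhere; in a real notebook you would more likely use a dataframe library for the same steps.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical raw data; in practice this would be read from a CSV file.
RAW_CSV = """age,income,city
34,52000,Montreal
45,61000,Toronto
29,,Montreal
52,75000,Vancouver
41,58000,Toronto
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# 1. Structure: how many rows are there, and which columns?
print("rows:", len(rows), "columns:", list(rows[0].keys()))

# 2. Missing values per column (empty strings count as missing here).
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
print("missing:", missing)

# 3. Summary statistics for a numeric column, skipping missing entries.
incomes = [float(r["income"]) for r in rows if r["income"] != ""]
summary = {
    "count": len(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "min": min(incomes),
    "max": max(incomes),
}
print("income summary:", summary)

# 4. Frequency table for a categorical column.
city_counts = Counter(r["city"] for r in rows)
print("cities:", dict(city_counts))
```

Even on five rows, this surfaces the questions that matter early on: which values are missing, how spread out the numbers are, and which categories dominate.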
I highly recommend checking the first sections of a few analyses on Kaggle that fit the sort of data you are working with.
This way you can get inspiration for what to calculate and pick up snippets of code you can use directly in your own analysis (use the notebooks you bookmarked in the first step).
There is an infinite number of analyses you can do on any dataset, so be creative and explore.
You can also start with this Kaggle mini-course to learn some easy-to-use plotting techniques. This will at least get you started with a few tools in your toolbox!
PS: I've elaborated in this tutorial on the project structure I like to adopt for long-term data science projects:
For shorter projects, keep it simple and do everything in one notebook.
Step 5: Learn Pragmatic Statistical Learning Skills.
Goal: Learn how to make robust machine learning models.
Modeling is sometimes left out of data science learning roadmaps. However, my view is that today's machine learning tooling is so user-friendly that it doesn't make sense to leave it out.
Continuing with our top-down approach to learning these skills, I suggest not starting from foundational machine learning topics like linear algebra or statistics.
The reasoning behind not starting from these is that:
I elaborated some more in the following tutorial:
It doesn't mean that you should never learn the basics. I recommend learning them in the last step of this roadmap.
It means that whenever you dive into the basics, you should already know where they fit in the greater scheme of things.
It's a bit like learning a sport: no athlete does targeted drills without ever having seen what their sport looks like.
The book I always recommend for beginners to learn pragmatic statistical learning skills is this one:
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
The book has two sections, one for shallow learning systems and one for deep learning.
I would suggest that you skip the deep learning section because it uses technology (TensorFlow and Keras) that is difficult to understand. The machine learning community usually prefers to go with the PyTorch deep learning framework.
Use that book to get a good understanding of everything needed to build a real machine learning system that is properly cross-validated.
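To make "properly cross-validated" concrete, here is a minimal sketch of k-fold cross-validation written from scratch. It is not taken from the book: the toy data, the one-feature linear model, and the helper names (`fit_line`, `mean_absolute_error`, `k_fold_scores`) are all assumptions chosen to keep the example self-contained. scikit-learn ships ready-made tools for this, such as `cross_val_score`, which you would normally use instead.

```python
import random
import statistics


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares on one feature."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and true values."""
    slope, intercept = model
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


def k_fold_scores(xs, ys, k=5, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, repeat for every fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            mean_absolute_error(
                model, [xs[i] for i in held_out], [ys[i] for i in held_out]
            )
        )
    return scores


# Hypothetical data: y is roughly 2 * x + 1 plus a little uniform noise.
rng = random.Random(42)
xs = [float(x) for x in range(30)]
ys = [2 * x + 1 + rng.uniform(-1, 1) for x in xs]

scores = k_fold_scores(xs, ys, k=5)
print("per-fold MAE:", [round(s, 2) for s in scores])
print("mean MAE:", round(statistics.mean(scores), 2))
```

The point of the design is that every row is used for evaluation exactly once and never by a model that saw it during training, so the averaged score is a more honest estimate than a single train/test split.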
Once you are done, I highly recommend getting a grasp of deep learning systems using this course from fast.ai:
Practical Deep Learning for Coders
The course is extremely well thought out and enforces a very pragmatic approach to training neural networks.
It uses the fastai library, a wrapper over the PyTorch framework, which makes it extremely easy to train a deep learning system in a few lines of code.
With these two resources, there is nothing you cannot start building.
Step 6: Take a Dataset and Start Modeling End-to-End.
After running through the exercises in step 5 a few times, you are more than ready to do a full end-to-end analysis.
The general process is simple enough:
While you are going through this project, I would suggest topping up your learning with intermediate-level mini-courses on Kaggle on specific topics:
They are free and very fast to do. They cover topics such as:
And voilà!
At this point, you have officially learned how to do data science, taking raw data all the way to the modeling stage.
From here, it's a good idea to switch to an iterative style of learning and work on your weakest points.
Extra Step: Dive Deeper Into Statistical Learning.
Once you get the hang of the whole flow from data to modeling, it's a good investment of time to read a complete piece of work about statistical learning.
The resource that I found the most useful and that I come back to again and again is The Elements of Statistical Learning.
You can find the full book for free on the author's website here:
It's a very useful book, but you don’t need to dive into it linearly.
I would suggest that a motivated learner dive only into the specific sections that pertain to a technique they are using in their current analysis (or that they saw in a bookmarked notebook).
Check out the table of contents and jump to the right section straight away.
The book is very visual, the explanations are robust, and it's authoritative in the field. A very good investment of your time!
Conclusion
I hope this is useful for people getting started. If you have any questions, don't hesitate to reach out by DM so I can point you in the right direction.
You can also check out the full blog post I mentioned at the beginning for extra resources, but don't get buried in too many courses. Remember that the goal is to be able to go from raw data to a full data science analysis as quickly as possible.
PS: Another skill that might be very useful is SQL, which lets you extract data from your various databases. In my experience, though, you don't need to learn a lot of it to be proficient in data science.
A nice resource that I enjoy for SQL-related topics is the YouTube channel of Alex Freberg. It's beginner-friendly and teaches you good skills to get data and do all sorts of manipulation.
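To show roughly how much SQL goes a long way, here is a small sketch that runs a typical "filter, aggregate, sort" query from Python using the built-in `sqlite3` module. The `orders` table, its columns, and the values are hypothetical and exist only for this example; your own databases will look different.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 60.0), ("carol", 210.0)],
)

# Total spend per customer, keeping only totals above a threshold, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > ?
    ORDER BY total DESC
"""
totals = conn.execute(query, (50,)).fetchall()
conn.close()

print(totals)
```

`SELECT`, `WHERE`, `GROUP BY`, `ORDER BY`, and a basic `JOIN` cover most of the extraction work you will need before handing the data over to Python.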
Good luck!