Getting Started Guide for Aspiring Data Scientist/Data Engineer/GCP Data Engineer
Srivatsan Srinivasan
Chief Data Scientist | Gen AI | AI Advocate | YouTuber (bit.ly/AIEngineering)
Let me start with the reason for this article. I keep getting quite a good number of messages from aspiring Data Scientists and Data Engineers asking how to get started in their respective fields and what they should focus on.
This article provides a one-stop reference on how to approach this field. Some of it is already available in my old posts and video interviews; I am consolidating it here to help newcomers
This post also automates things for me, as I sometimes get overwhelmed by the number of messages to respond to individually. If my response to your message is this link, please do not consider me rude. Remember, I have a day job to do as well
When I say aspiring data scientist, below are some of the categories I have noticed
- College graduates aspiring to enter this field, either with a relevant data science background or from a background other than Data Engineering or Data Science
- ETL Developers or Database Experts who want to graduate to perform Data Engineering with new age tools or get into Machine Learning
- Software Engineers with strong programming skills who want to get into this field
One thing to understand: all of the above individuals have a role to play in Data Science, but each has to customize the learning path based on their comfort zone and need. What I cover in response to the commonly asked questions below is the overall breadth of Data Science. Pick and choose what fits your need
Also take my recommendations as suggestions and not guidelines. This is my side of the story and I might be completely wrong in some cases
You are the best judge of what works for you
Let us first quickly understand the end-to-end data science cycle in the real world, within enterprises implementing AI/ML solutions
The various roles required to successfully deliver an AI project, mapped to the data science cycle, are below
Note: Depending on the size of the ML project, one person might play multiple roles, or
multiple people might be required for a single role. Some roles might also
be part time, and some components can be built as a capability that can be
leveraged across projects
The point of the above representation is to help you map your current skillset to data science roles and see how you can graduate to, or get comfortable with, other areas. Data Science is very broad and being an expert in all segments of it is difficult. That said, it is good to focus and develop depth in a couple of the segments above and have working knowledge of a few others
Let me get into some of the questions frequently asked by aspiring data scientists, along with my responses and references. Below are the top 5 questions that I frequently come across
Question 1 - I am new to data science. How can I get started with learning data science?
There are 6 key learning areas to cover to become a data scientist. You can skip steps you are already comfortable with
- Programming Skills (Python if you are starting to learn programming. If you are comfortable with R, Scala, or any other language, stick with it while learning and slowly graduate to Python)
- Maths (Statistics, Linear Algebra and Calculus)
- Understand End to End Data Science Pipeline (Answered in question 2 below)
- Learn the internals of how algorithms work and get intuition on how they can be applied
- Learn Data Engineering (Answered in Question 3). Learn it only to the extent required if you are looking purely from an ML development perspective. Data Engineers should go into the detail described in the answer to Question 3
- Practice what you learn in the above steps and build a portfolio to stand out
Before getting into courses, where to start learning?
Start the way you feel comfortable. Check my short post on this topic
Learning path for Math
The best places to learn Math are the 2 YouTube channels below
- Khan Academy
- 3Blue1Brown
The deeper you go, the better. If you feel lost learning Math, try at least to reach an advanced-to-expert level in Statistics and an intermediate level in Linear Algebra. If calculus is getting too complex, just try to understand the chain rule and the intuition behind backpropagation
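To make the chain rule and backpropagation intuition concrete, here is a minimal sketch in plain Python. It assumes a single sigmoid neuron with a squared-error loss, which is my own illustrative choice and not something taken from the channels above; the function names are hypothetical. It computes the gradient by multiplying local derivatives together and then checks the result against a finite-difference estimate.

```python
import math


def sigmoid(z):
    """Logistic function used as the neuron's activation."""
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of one sigmoid neuron: L = (sigmoid(w*x + b) - y)^2."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Backpropagate through L = (a - y)^2, a = sigmoid(z), z = w*x + b."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)      # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)      # derivative of sigmoid w.r.t. its input
    dL_dz = dL_da * da_dz      # chain rule: multiply the local derivatives
    dL_dw = dL_dz * x          # dz/dw = x
    dL_db = dL_dz              # dz/db = 1
    return dL_dw, dL_db


def numerical_gradient_w(w, b, x, y, eps=1e-6):
    """Central finite difference, used only to sanity-check the chain rule."""
    return (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)


if __name__ == "__main__":
    w, b, x, y = 0.5, -0.2, 1.5, 1.0
    dw, db = gradients(w, b, x, y)
    print("chain rule  dL/dw:", dw)
    print("finite diff dL/dw:", numerical_gradient_w(w, b, x, y))
```

The two printed values should agree to several decimal places. Backpropagation in a deep network repeats this same multiplication of local derivatives layer by layer.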
What to learn is also explained in my video below (just refer to the first 4 to 5 minutes)
Learning path for Machine Learning
Once you get the hang of basic Math, I would recommend taking the Machine Learning course by Andrew Ng on Coursera. The course is free to audit. It uses Octave, but I have attached a link to a Python repository you can follow for this course
Python repository for the course is below
When starting, I would advise focusing on Machine Learning; later, if needed, you can move on to Deep Learning. Andrew Ng also has a Deep Learning course, which is a good one. The best thing is to complete the Machine Learning course and practice, practice, practice
The main purpose of the Andrew Ng course is to get the intuition and math behind Machine Learning algorithms. Even though it is in Octave, the concepts he explains are second to none. Once you are done with the ML course above, follow the 2-part videos below to get into the details of scikit-learn based practice and play around with different datasets
Again, there are endless resources you can follow or learn from, but nothing matches doing and practicing data science along with understanding the why and how of it
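As one small way to start that practice, the sketch below trains and evaluates a classifier with scikit-learn. The dataset (the built-in iris data), the model (logistic regression), and the split settings are my own illustrative assumptions, not something prescribed by the course or the videos.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple baseline model; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out rows only.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in a different dataset or estimator and asking why the score changes is a good way to practice.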
Practice what you learn and build a portfolio
There are plenty of resources and datasets that you can tap into to practice ML, and you can learn from how experts solve the same problem on Kaggle or elsewhere online. Check my post below on building a portfolio to stand out
And a video on this topic
Before I get to the next question, just remember there is a lot of noise in this space. Cheat sheets, interview questions, free webinars, and the list goes on
There is no shortcut to success. When in the learning phase, stay away from the noise, put your head down, focus, and learn. Unplug yourself from not-so-important things
Question 2 - In academics and courses we study machine learning algorithms and apply them to various datasets. One thing we miss is what real world machine learning looks like. Can you help me understand that?
Learning Data Science and doing Data Science are two different worlds
I have put together a detailed presentation on the focus of courses/academics vs enterprise reality. It calls out the differences and also covers what an end-to-end machine learning pipeline in an enterprise looks like
The presentation describes in detail what the real world machine learning process looks like.
- Use this information to fill in the skills that can get you closer to industry needs.
- Use this content to define a strategy for yourself to land a job in the enterprise world.
DevOps is another component that plays a key role in enterprises in creating reproducible machine learning deployments. It is good to have knowledge of Git, at least to begin with.
You might find a good number of articles on Medium or other channels covering individual sections of the ML pipeline mentioned in the presentation above (Slide 11). If you are looking to learn data engineering in particular, check my answer to the next question
Question 3 - How can I get started in Data Engineering? Is it important for a Data Scientist to know Data Engineering?
Why it is important for a Data Scientist to know Data Engineering is highlighted in my post below. If your focus is more on the Machine Learning side, learn data engineering to the extent that allows you to create complex feature engineering pipelines.
If your focus is on Data Engineering, then try to learn the end-to-end Data Engineering flow, from Data Collection through Data/Feature transformation
The study path for Data Engineering is at the link below
You can also check my YouTube video that talks about the need for Data Engineering and DevOps in the Data Science world and why it is important to learn them
Question 4 - What books do you recommend for learning and understanding Machine Learning algorithms?
There are 2 books I have read and recommend if you are planning to buy one
Data Smart by John W. Foreman. This is a must-read for business leaders looking to understand the workings behind the algorithms. The author does an excellent job of creating algorithms from scratch using Excel while staying away from complex algorithmic terms
For advanced users, my favorite is "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron. This book balances the right amount of theory and implementation code to understand the algorithms in detail
The GitHub repo for the second book is below and contains good notebooks to understand and play with
Question 5 - What is the learning path for Google Cloud Data Engineer Certification?
Below is the learning path I took to get the Google Data Engineer certification
The sections below were key for the Data Engineering exam
https://cloud.google.com/docs/#data-and-analytics
https://cloud.google.com/docs/#databases
https://cloud.google.com/docs/#ai-and-ml
You can get $300 of cloud credit on signing up. Follow the quick start guide
in each section to get hands-on. Additionally, play around with the BigQuery,
CMLE, and Dataproc services
Google has published 2 case studies that are referred to during the exam.
One is the basis for a lift-and-shift migration and the other for re-engineering.
IAM and security for each of the services were key as well
https://cloud.google.com/iam/docs/overview
https://cloud.google.com/iam/docs/understanding-roles
Based on my exam, the distribution of questions was roughly as below, but this may vary.
The majority of questions were from BigQuery, IAM, Pub/Sub, and Dataflow
- BigQuery – 25%
- Dataflow – 15%
- IAM + Security – 20%
- Cloud SQL + Spanner – 10%
- Bigtable – 8%
- Datastore – 2%
- Pub/Sub – 10%
- ML API + ML – 10%
Knowing data modeling best practices for Big Data, BigQuery, Spanner, and Datastore is a must.
Also, understand the various data and AI components in the GCP stack. I have written 2 blogs that you can refer to
Bonus Questions
Question: How long will it take for me to become a data scientist?
Your passion and learning determine that
Question: You are asking me to build a portfolio, but are there any datasets you recommend?
Rather than thinking of a single readily available dataset, think about how you can combine multiple datasets, perform EDA, build a model, and tell a story out of it
Ex: If you are using the Lending Club dataset, you can combine it with micro and macro economic datasets or a job claims dataset and see if you get a better interpretation and model of loan defaults
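A minimal sketch of that combining step with pandas is below. The column names and values are toy data I made up for illustration; they are not the real Lending Club schema or real economic figures. The idea is to left-join loan records to a macro-economic table on shared keys and then check which loans found no match before moving on to EDA.

```python
import pandas as pd

# Toy stand-in for loan-level data (hypothetical columns and values).
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "issue_month": ["2015-01", "2015-01", "2015-02", "2015-03"],
    "state": ["CA", "TX", "CA", "NY"],
    "defaulted": [0, 1, 0, 1],
})

# Toy stand-in for a macro-economic dataset keyed by month and state.
unemployment = pd.DataFrame({
    "month": ["2015-01", "2015-01", "2015-02"],
    "state": ["CA", "TX", "CA"],
    "unemployment_rate": [6.8, 4.5, 6.7],
})

# Align the key names, then left-join so every loan is kept.
# validate= raises if the macro table has duplicate keys;
# indicator= adds a "_merge" column showing which rows found a match.
combined = loans.merge(
    unemployment.rename(columns={"month": "issue_month"}),
    how="left",
    on=["issue_month", "state"],
    validate="many_to_one",
    indicator=True,
)

# Loans with no macro data need attention before modelling.
unmatched = combined[combined["_merge"] == "left_only"]
print(combined)
print("loans without macro data:", len(unmatched))
```

Checking the unmatched rows first is a deliberate choice: a silent gap in the joined data can distort both the EDA and the model.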
Also keep writing articles or posts on LinkedIn. Share your experience and not cheat sheets. Remember, cheat sheets and copying other people's content will get you engagement but will not create a brand for yourself
Share your thoughts and experience. There will always be someone with a contradicting view. Learn if they are right, defend if you feel you are right, and ignore when there is no agreement
Question: Can you endorse my skills?
No. Not unless you provide value, differentiation, and expertise in those skills
Question: I am having issues with my code. Can you review and help me debug the issue?
No.