Data Science for beginners
Shailaja Gupta
Sr. Manager - Data & AI @EY Malaysia || 100+ Analytics, AI & Product Talks @IIMs, DU, Josh Talks, InsideIIM || IIMCom Founder
What do we need to enter Data Science?
- Query language like SQL
- Programming language like R/Python
- Visualisation tool like PowerBI/Qliksense/Qlikview/Tableau etc.
- Basic Statistics for Machine Learning
- Machine learning algorithms (make sure you try out use cases in the domain you wish to build expertise in: sales, finance, HR, ops etc. Use cases will differ across domains)
- Practice and implementation
a) Query Language
Types of query languages you can learn: SQL is hands down the best in the market, and it is not going anywhere.
One more query language you can learn is Elasticsearch's query DSL. It is very much in use nowadays; I learnt it via a Udemy course. The difference between SQL and Elasticsearch is that Elasticsearch stores documents where not every record has the same fields. For example, a car database: for some cars we may have only model name, price and colour; for others we may have colour, model, price, number of models till date, parent company etc.; for yet another we may have info on only model and name. In SQL this is captured by putting NULL values where fields are missing. In Elasticsearch, there is no such placeholder; the missing fields are simply absent from the document.
Sources of learning: You can learn from W3Schools/Tutorialspoint or anywhere, really, since it will hardly take one week to learn SQL. If you don't have access to databases to practise on, you can load any CSV file as a database and practise.
Elasticsearch/Kibana can be learnt via the same websites. You can also take a Udemy course on it, since it is harder to learn compared to SQL.
Environment: If you are learning SQL, you can install MySQL/PostgreSQL and start practising. For Elasticsearch, install it along with Kibana and practise.
Why is Query language important for Data Science?
In Data Science we work with huge datasets, often with millions of rows and hundreds of columns. For a given analysis you don't need all the data; you extract the relevant subset using a query language and then proceed with the analysis.
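As a minimal sketch of this (with a made-up table and column names), Python's built-in sqlite3 module lets you practise SQL without installing a database server:

```python
import sqlite3

# In-memory database: handy for practising SQL without a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 200.0)],
)

# Extract only the relevant, aggregated subset instead of pulling every row.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 320.0), ('South', 80.0)]
```

The same queries work unchanged against MySQL/PostgreSQL once you move to a real server.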
Big Data: You can also learn big data languages and techniques such as Scala and Hadoop/MapReduce. However, that is a good-to-have for most D.S. jobs, not a must-have; it is more like icing on the cake. Big data is part of data engineering and mostly involves coding, unlike Data Science, which is a blend of statistics, maths, coding and domain knowledge. Big data should be learnt after you are thorough with Data Science.
Time taken to learn: SQL can be learnt in 1 week at max. Practise can be alongside during projects / hackathons. Elasticsearch can take 2 weeks.
b) Programming Language
Languages in vogue: There are 2 main languages, R and Python. R is a language designed by statisticians and mathematicians; Python is a language of the coders. Both are good. However, nowadays Python is more in vogue due to its large support ecosystem from the programming world, better scalability and better integration with APIs and other code for the complete product.
Learning Source: I learnt Python via Coursera; the course name was Python for Data Science. You can also learn via Tutorialspoint and W3Schools. More than the course itself, I learnt via the major and minor projects we were supposed to do as part of the certificate.
I learnt R via Udemy (Advanced Course on R for Data Science). It is good to learn one language in depth and have an overview of another along with it, because in a team you will be working with many data scientists: some will be comfortable with R, and some with Python.
Environment: Anaconda Navigator, Pycharm, Spyder , RStudio or Jupyter notebooks.
An environment is basically the place where you write and run your code.
My personal favourite environment is Jupyter notebooks. Jupyter lets you work in R/Python and render HTML etc. via different kernels, so you can use the best libraries or techniques of a particular language within Jupyter. Also, if two people are using different languages, they can easily collaborate using Jupyter notebooks.
Time taken to learn: This can take close to one month if you wish to learn all the nuances of the language; the rest you will pick up with each project. Stack Overflow and Stack Exchange are great places to ask queries about language issues, and people usually reply within minutes.
c) Visualisation tools
Tools you can learn: Tableau/PowerBI/Qliksense/Qlikview are the most common tools. PowerBI and Tableau are most widely used.
PowerBI is almost like an advanced version of Excel and very easy to learn. The issues with Excel are that it crashes and becomes slow with huge data, and not many options are available for impressive visualisations; all this is possible in PowerBI. Another good thing is that PowerBI is free: for Tableau you can download a 30-day trial version, but PowerBI you can use as long as you want (most features are available in the free version). PowerBI/Qliksense/Qlikview are not supported on Apple machines, though, since they are Windows-based products.
Sources of learning: I learnt PowerBI through Udemy; I selected the course based on ratings (any course rated above 4.3 is good). I learnt Tableau via the Tableau website.
Time taken to learn: Tableau took me 1 week and PowerBI took the same time.
d) Statistics for Data Science:
Basic terms you should know: standard deviation, mean, median, mode, skewness, hypothesis testing, central limit theorem, population versus sample, z-score, confidence interval, p-value, statistical significance, critical value, proportion testing, two-tailed and one-tailed tests, Pareto principle, chi-square test, z-test, t-test, normal (Gaussian) distribution etc.
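A quick sketch of a few of these terms using Python's built-in statistics module (the sample data is made up for illustration):

```python
import statistics

data = [12, 15, 14, 10, 18, 15, 30]  # made-up sample

mean = statistics.mean(data)      # average value
median = statistics.median(data)  # middle value, robust to outliers
mode = statistics.mode(data)      # most frequent value
stdev = statistics.stdev(data)    # sample standard deviation

# z-score: how many standard deviations a point lies from the mean.
# A point beyond roughly +/-2 is often treated as unusual.
z_30 = (30 - mean) / stdev
print(mean, median, mode, round(z_30, 2))
```

Notice how the single outlier (30) pulls the mean above the median; that gap is one simple sign of skewness.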
Sources of learning: the Udemy course Statistics for Data Science by Kirill Eremenko, An Introduction to Statistical Learning (with applications in R), or any basic book on stats.
Time taken to learn: 2 days
e) Machine learning algorithms:
Types of algos mostly used: XGBoost, Random Forest, deep learning/neural networks, time series models, decision trees, clustering and classification algos
Sources of learning: Kirill Eremenko's machine learning course on Udemy is amazing.
The OTexts book is great for forecasting and visualisation in ML. It is freely available and one of the best resources online. https://otexts.com/fpp2/graphics-exercises.html
An Introduction to Statistical Learning in R is good for the basics of stats and their applications in ML.
Time taken to learn : 1 month
How to decide which algos to implement:
First, know the use case of your problem. What do you want to do? Is your output a continuous variable (taking values like 1, 3, 10, basically any value)? Is it a binary yes/no decision? Or do you want to group similar people together and take decisions for each group?
For example, in one project I had to work on credit risk analysis in finance. You have to decide whether a person will be able to pay back a loan or not, so it is a binary decision. In this case you need a classification algo: classify the person as a defaulter or a non-defaulter. So I used Logistic Regression (used when the output is binary, yes/no etc.), XGBoost or decision trees. You can use basically any classification algo; others that can be used here are SVM, Naive Bayes etc. (you can google classification algos for the complete list).
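A minimal classification sketch with scikit-learn; the features (income, existing debt) and labels are made up for illustration, not from the actual project:

```python
from sklearn.linear_model import LogisticRegression

# Made-up toy data: [income in lakhs, existing debt in lakhs]
X = [[9, 1], [8, 2], [7, 1], [2, 5], [3, 6], [1, 4]]
y = [0, 0, 0, 1, 1, 1]  # 1 = defaulter, 0 = non-defaulter

clf = LogisticRegression().fit(X, y)

# Score a new applicant: high income, low existing debt
pred = clf.predict([[8, 1]])[0]
print("defaulter" if pred == 1 else "non-defaulter")
```

Swapping in a different classifier (decision tree, XGBoost, SVM) keeps the same fit/predict pattern, which is what makes experimenting across algos quick.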
If in a project I need to predict the value of a stock or a house, it can take any value from Rs.10 to Rs.1,000,000 or more (a continuous variable). Here I will use regression. Regression can be of many types: simple linear, multiple linear, polynomial, SVR etc. The right regression technique can be found by checking R-squared (how much of the variation in the actual values the model explains).
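A minimal regression sketch with scikit-learn, again on made-up data (house area vs price), showing how R-squared summarises the fit:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up toy data: house area (sq ft) vs price (lakhs)
X = [[500], [750], [1000], [1250], [1500]]
y = [25, 37, 52, 61, 78]

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print(round(r2, 3))  # close to 1.0 means the model explains most of the variation
```

Comparing this R-squared across candidate techniques (linear, polynomial, SVR) on held-out data is one simple way to pick between them.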
Similarly, if I need to find target segments, I will use a clustering algo. K-means and hierarchical clustering are two types of clustering algos.
So first learn about the algorithms (at least one classification, one clustering and one regression algo to begin with), then see which algo you need for the problem at hand, and then start the analysis.
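A minimal clustering sketch with scikit-learn's K-means on made-up customer data; the two segments it finds here correspond to low spenders and high spenders:

```python
from sklearn.cluster import KMeans

# Made-up customer data: [annual spend, visits per month]
X = [[100, 1], [120, 2], [110, 1], [900, 10], [950, 12], [880, 9]]

# n_clusters is a choice you make, often guided by the elbow method
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per customer
```

Unlike classification, there are no labels `y` here; the algorithm groups customers purely by similarity, and you then design a strategy per segment.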
What types of projects and algos should I learn:
First, choose your domain. Mine, for instance, was finance and marketing, so I can talk about those here. Though the list is endless, I will write only the most important algos I learnt.
Marketing: Time series modeling (mainly ARIMA) to predict future sales, volume and value. Time series modeling means you have data for one thing, let's say sales for a company over a period of time (months, years, days etc.), and you predict sales for the coming years/months/days.
Clustering algos to cluster target groups and design separate categories for each group.
Churn modeling to predict how many of the field staff will stay and how many will leave, in order to manage workflow.
Finance: Classification algos like Logistic Regression (Logistic Regression and linear or multiple regression are completely different; only the name is similar), SVM, Naive Bayes, XGBoost etc. for credit risk analysis, to find out who will default.
Regression for asset valuation (value of stock, asset for mortgage etc.)
f) Implementation
How to implement your knowledge:
- Projects in courses on Coursera, Udemy etc.
- Hackathons on Analytics Vidhya, Kaggle etc. to see where you stand among the crowd.
- Live projects
- In the Company and on the job
Hope this helps :)