
Data Science for beginners

Shailaja Gupta

Sr. Manager - Data & AI @EY Malaysia || 100+ Analytics, AI & Product Talks @IIMs, DU, Josh Talks, InsideIIM || IIMCom Founder

Published: January 28, 2020


What do we need to enter Data Science?

A query language like SQL
A programming language like R/Python
A visualisation tool like PowerBI/Qliksense/Qlikview/Tableau etc.
Basic statistics for Machine Learning
Machine learning algorithms (make sure you try out use cases in the domain where you wish to specialise: sales, finance, HR, ops etc. Use cases will be different for each.)
Practice and implementation

a) Query Language

Types of query languages you can learn: SQL is hands down the best on the market and it is not going anywhere.

One more you can learn is Elasticsearch, which is very much in use nowadays. Strictly speaking it is a search engine with its own JSON-based query DSL rather than a query language. I learnt it via a Udemy course. The difference between SQL and Elasticsearch is that an Elasticsearch index stores documents, and not every document needs to have the same fields. For example, take a database of cars. Let's say for some cars we have only model name, price and color info. For some we may have color, model, price, number of models till date, parent company etc. For another type of car we may have info on only model and name. In a SQL table this is captured by putting NULL values where fields are missing. In Elasticsearch, a missing field is simply absent from the document, so no placeholder is needed.
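To make the car example concrete, here is a minimal sketch in plain Python. It does not talk to Elasticsearch or to a SQL database; the car records, field names and values are all made up for illustration. It only shows the two ways of representing records whose fields differ.

```python
# Hypothetical car records: each document carries only the fields it has.
cars = [
    {"model": "A1", "price": 500000, "color": "red"},
    {"model": "B2", "price": 750000, "color": "blue", "parent_company": "Acme Motors"},
    {"model": "C3", "name": "City Hatch"},
]

# Table-style view: one fixed set of columns, gaps filled with None
# (the Python counterpart of a SQL NULL).
columns = sorted({field for car in cars for field in car})
rows = [tuple(car.get(col) for col in columns) for car in cars]

# Document-style view: a filter on a field simply skips documents lacking it.
with_price = [car["model"] for car in cars if "price" in car]

print(columns)
print(rows[2])
print(with_price)
```

The third row ends up mostly `None` in the table-style view, while the document-style view never needs to mention the missing fields at all.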

Sources of learning: You can learn from W3Schools/Tutorialspoint or anywhere really, since it hardly takes a week to learn basic SQL. If you don't have access to databases to practise on, you can load any CSV file into a database and practise on that.
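One possible way to practise SQL on a CSV file without installing a database server is Python's built-in sqlite3 module. The sketch below is only an illustration: the table name `cars`, the column names and the data are assumptions, and the CSV text is inlined so the example runs as is (with a real file you would open it with `open(path, newline="")` instead).

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk; columns and values are made up.
csv_text = """model,color,price
A1,red,500000
B2,blue,750000
C3,red,620000
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE cars (model TEXT, color TEXT, price INTEGER)")

# Read the CSV rows and convert the price column to an integer.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(r["model"], r["color"], int(r["price"])) for r in reader]
conn.executemany("INSERT INTO cars (model, color, price) VALUES (?, ?, ?)", rows)

# Pull out only the rows and columns needed for the analysis.
red_cars = conn.execute(
    "SELECT model, price FROM cars WHERE color = ? ORDER BY price DESC",
    ("red",),
).fetchall()
print(red_cars)
```

The same `SELECT ... WHERE ...` pattern is what you would later run against MySQL or PostgreSQL; only the connection line changes.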

Elasticsearch / Kibana can be learnt via the same websites. You can also take a Udemy course on it, since it is harder to learn compared to SQL.

Environment: In case you are learning SQL, you can install MySQL/PostgreSQL and start practising. For Elasticsearch you can install Elasticsearch together with Kibana and practise there.

Why is a query language important for Data Science?

Datasets in Data Science can be huge, with millions of rows and a large number of columns. For an analysis you don't need all the data. You extract the relevant data using a query language and then proceed with the analysis.

Big Data: You can also learn big data languages or frameworks such as Scala and Hadoop/MapReduce. However, that is a good-to-have for most D.S. jobs and not a must-have. It is more like the icing on the cake. Big data is a part of Data Engineering and mostly involves coding, unlike Data Science, which is a blend of statistics, maths, coding and domain knowledge. Big data should be learnt after you are thorough with Data Science.

Time taken to learn: SQL can be learnt in 1 week at most. Practice can continue alongside projects / hackathons. Elasticsearch can take 2 weeks.

b) Programming Language

Languages in vogue: There are 2 main languages, R and Python. R is a language designed by statisticians and mathematicians. Python is a general-purpose language favoured by programmers. Both are good. However, nowadays Python is more in vogue due to its large support ecosystem from the programming world, better scalability and better integration with APIs and other code for the complete product.

Learning Source: I learnt Python via Coursera. The course name was Python for Data Science. You can also learn via Tutorialspoint and W3Schools. More than the course itself, I learnt via the major and minor projects that we were supposed to do as part of our certificate.

I learnt R via Udemy, through an advanced course on R for data science. It is good to learn one language in depth and have an overview of another language along with it. That is because in a team you will be working with many data scientists. Some will be comfortable with R, and some with Python.

Environment: Anaconda Navigator, PyCharm, Spyder, RStudio or Jupyter notebooks.

An environment is basically the place where you write and run your code.

My personal favorite environment is Jupyter notebooks. Jupyter notebooks let you write R, Python, Markdown/HTML etc. (with the matching kernel installed), so you can use the best libraries or techniques of a particular language within Jupyter. Also, if two people are using different languages, they can easily collaborate using Jupyter notebooks.

Time taken to learn: This can take close to one month if you wish to learn all the nuances of the language. The rest you will learn with each project. Stack Overflow and Stack Exchange are great places to ask questions about language issues. People often reply quickly, sometimes within minutes.

c) Visualisation tools

Tools you can learn: Tableau/PowerBI/Qliksense/Qlikview are the most common tools. PowerBI and Tableau are the most widely used.

PowerBI is almost like an advanced version of Excel and very easy to learn. The issue with Excel is that it slows down and can crash with huge data, and it offers limited options for rich visualisations. PowerBI handles both better. Another good thing is that PowerBI is free. For Tableau you can download a 30-day trial version, but PowerBI you can use as long as you want (most of the features are available in the free version). PowerBI/Qliksense/Qlikview are not supported on Apple computers though, since they are Windows-based products.

Sources of learning: I learnt PowerBI through Udemy. I selected the course based on ratings. Any course with a rating above 4.3 is fine. I learnt Tableau via the Tableau website.

Time taken to learn: Tableau took me 1 week and PowerBI took the same time.

d) Statistics for Data Science

Basic terms you should know: Standard deviation, mean, median, mode, skewness, hypothesis testing, central limit theorem, population versus sample, z score, confidence interval, p value, statistical significance, critical value, proportion testing, two tailed, one tailed, Pareto principle, chi square test, z test, t test, normal (Gaussian) distribution etc.
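A few of these terms can be tried out directly with Python's standard `statistics` module. This is a small sketch on a made-up sample (the numbers are illustrative, not real data), covering mean, median, mode, standard deviation, a z score and an approximate 95% confidence interval.

```python
from statistics import NormalDist, mean, median, mode, stdev

# Made-up sample of daily sales figures.
sample = [12, 15, 11, 15, 18, 14, 15, 13, 16, 21]

avg = mean(sample)
mid = median(sample)
common = mode(sample)
spread = stdev(sample)  # sample standard deviation (n - 1 in the denominator)

# z score: how many standard deviations the largest value lies above the mean.
z_of_max = (max(sample) - avg) / spread

# Approximate 95% confidence interval for the mean using the normal
# distribution. With a sample this small, a t critical value would be the
# more appropriate choice; the normal value keeps the sketch dependency-free.
z_crit = NormalDist().inv_cdf(0.975)  # roughly 1.96
margin = z_crit * spread / len(sample) ** 0.5
ci = (avg - margin, avg + margin)

print(avg, mid, common, round(spread, 3), round(z_of_max, 2), ci)
```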

Sources of learning: The Udemy course on Statistics for Data Science by Kirill Eremenko, An Introduction to Statistical Learning (with Applications in R), or any basic book on stats.

Time taken to learn: 2 days

e) Machine learning algorithms

Types of Algos mostly used: XGBoost, RandomForest, Deep Learning, Neural networks, Time series, Decision Trees, Clustering and classification algos

Sources of learning: Kirill Eremenko's course on Udemy for machine learning is amazing.

The OTexts book Forecasting: Principles and Practice is great for visualising and forecasting time series. It is free to read online and one of the best resources available. https://otexts.com/fpp2/graphics-exercises.html

An Introduction to Statistical Learning (with Applications in R) is good for the basics of stats and their applications in ML.

Time taken to learn: 1 month

How to decide which algos to implement:

First know the use case of your problem. What do you want to do? Is your output a continuous variable (taking values such as 1, 3, 10.5 etc., basically any value)? Is it a binary kind of decision? Or do you want to group people together and take decisions for each group as a whole?

For example, in one project I had to work on credit risk analysis in finance. You have to decide whether a person will be able to pay back a loan or not, so it is a binary decision. In this case you need a classification algo that classifies the person as a defaulter or a non-defaulter. So I use Logistic Regression (used when the output is binary, yes/no etc.), XGBoost or Decision Trees. You can basically use any classification algo. Other classification algos which can be used here are SVM, Naive Bayes etc. (you can google classification algos for the complete list).
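As a rough illustration of the binary default / no-default decision, here is a tiny logistic regression trained with plain gradient descent in pure Python. Everything in it is an assumption made for the sketch: the single feature (a debt-to-income ratio), the eight data points, the learning rate and the iteration count. In a real project you would more likely reach for a library such as scikit-learn and use many features.

```python
import math

# Made-up training data: (debt-to-income ratio, 1 if the person defaulted).
data = [(0.10, 0), (0.15, 0), (0.20, 0), (0.30, 0),
        (0.45, 1), (0.50, 1), (0.60, 1), (0.70, 1)]

def sigmoid(t):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

# Fit weight w and bias b by batch gradient descent on the log loss.
w, b = 0.0, 0.0
learning_rate = 1.0
for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in data:
        error = sigmoid(w * x + b) - y  # predicted probability minus label
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(data)
    b -= learning_rate * grad_b / len(data)

def predict_default(ratio, threshold=0.5):
    """Return True if the predicted default probability reaches the threshold."""
    return sigmoid(w * ratio + b) >= threshold

print(predict_default(0.10), predict_default(0.70))
```

The output of the model is a probability; the yes/no decision comes from comparing it with a threshold, which you can move depending on how costly a wrong approval is.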

If in a project I need to predict the value of a stock or a house, then it can take any value from Rs.10 to Rs.1,000,000 or more (a continuous variable). Here I will use regression. Regression can be of many types: simple linear, multiple linear, polynomial, SVR etc. The right kind of regression technique can be found by comparing metrics such as R squared (the share of the variation in the actual values that the model explains) or an error measure such as RMSE (how far the predicted values lie from the actual ones).
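Here is a minimal sketch of simple linear regression with R squared and RMSE, written in plain Python so that each formula is visible. The house areas and prices are made-up numbers used only for illustration.

```python
# Made-up data: house area (square metres) and price (Rs. lakh).
area = [50, 60, 80, 100, 120]
price = [30, 35, 48, 61, 70]

n = len(area)
mean_x = sum(area) / n
mean_y = sum(price) / n

# Ordinary least squares for one feature: slope = cov(x, y) / var(x).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(area, price))
sxx = sum((x - mean_x) ** 2 for x in area)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in area]

# R squared: 1 minus (unexplained variation / total variation).
ss_res = sum((y - p) ** 2 for y, p in zip(price, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in price)
r_squared = 1 - ss_res / ss_tot

# RMSE: typical distance between predicted and actual values.
rmse = (ss_res / n) ** 0.5

print(round(slope, 3), round(intercept, 3), round(r_squared, 4), round(rmse, 3))
```

Note that the metrics here are computed on the same data the line was fitted to; when comparing regression techniques it is safer to compute them on held-out data.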

Similarly, if I need to find the target segments, I will need to use a clustering algo. K-means and hierarchical clustering are a few types of clustering algos.
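The idea behind K-means can be sketched in a few lines of plain Python on one-dimensional data. The spend figures and the hand-picked starting centroids are assumptions for illustration; real implementations usually pick starting centroids randomly (or with k-means++) and work on several features at once.

```python
# Made-up annual spend per customer (Rs. thousand).
spend = [12, 15, 14, 50, 55, 52, 90, 95, 98]

def kmeans_1d(values, centroids, iterations=20):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans_1d(spend, centroids=[10.0, 50.0, 100.0])
print(centroids)
print(clusters)
```

Each resulting cluster can then be treated as one target segment, for example low, medium and high spenders.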

So first learn about the algorithms (at least 1 classification algo, 1 clustering algo and 1 regression algo to begin with), then see what algo you need for the problem at hand, and then start the analysis.

What types of projects and algos should I learn?

First choose your domain. My domain, for instance, was Finance and Marketing, so I can talk about those here. Though the list is endless, I will just write the most important algos I learnt.

Marketing: Time series modeling (mainly ARIMA modeling) to predict future sales, volume and value. Time series modeling means that you have data for one thing, let's say sales data for a company, over a period of time (could be days, months, years etc.) and you predict sales for the company for the coming days/months/years.
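A full ARIMA model needs a statistics library, so the sketch below shows only its simplest special case in plain Python: a random walk with drift, which corresponds to ARIMA(0,1,0) with a constant. It differences the series once and projects the average change forward. The monthly sales figures are made up for illustration.

```python
# Made-up monthly sales figures.
sales = [100, 104, 109, 113, 118, 124, 129, 135]

def forecast_with_drift(series, steps):
    """Random walk with drift: the last value plus the average
    period-to-period change, repeated for each future step."""
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]
    drift = sum(diffs) / len(diffs)
    return [series[-1] + drift * h for h in range(1, steps + 1)]

next_three = forecast_with_drift(sales, 3)
print(next_three)  # [140.0, 145.0, 150.0]
```

This is a useful baseline: a richer ARIMA model is worth the effort only if it beats this kind of simple forecast on held-out months.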

Clustering algos to cluster target groups and design separate categories for each group.

Churn modeling to estimate how many of the field staff will stay and how many will leave, in order to manage workflow.

Finance: Classification algos like Logistic Regression (Logistic Regression and linear or multiple regression are completely different; only the name is similar), SVM, Naive Bayes, XGBoost etc. for credit risk analysis to find out who will default.

Regression for asset valuation (value of stock, asset for mortgage etc.)

f) Implementation

How to implement your knowledge:

Projects in courses on Coursera, Udemy etc.
Hackathons on Analytics Vidhya, Kaggle etc. to see where you stand among the crowd.
Live projects
In the company and on the job

Hope this helps :)

Yatendra M.

I help individuals to reduce Stress, find Happiness and achieve overall Well-being in 60 days through Holistic Healing without any Medication or Quick fixes

3 years ago

Simple and crisp insight. It's really helpful for newcomers in the DS field.

CMA PANKAJ JAIN

3 years ago

Very good learning insights

Jaya Bharath P

Senior AI & ML Developer @ Wipro

4 years ago

Yes. Thanks for sharing it.
