Self-Study Data Science

Joseph Sefara

Senior Data Scientist Specialist

Published: March 9, 2020

As a data science consultant, I have been contacted by many people who want guidance on how to get into the field. This article discusses the recommended topics to study in order to build essential skills in data science.

The topics presented here, if studied thoroughly, will provide the minimum background needed to start doing data science. This curriculum could also be used for designing an introductory college-level course in data science.

Keep in mind that knowledge acquired from courses alone will not make you a data scientist. Course work has to be accompanied by a capstone project or an internship. Kaggle competitions can be used for capstones, as they provide an opportunity to work on real-world data science projects.

1. Math Basics

(I) Multivariable Calculus

Most machine learning models are built with a data set having several features or predictors. Hence familiarity with multivariable calculus is extremely important for building a machine learning model. Here are the topics you need to be familiar with:

Functions of several variables
Derivatives and gradients
Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function
Cost function
Plotting of functions
Minimum and Maximum values of a function
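
To make the topics above concrete, here is a minimal sketch in plain Python. The function names and the two-variable cost function are illustrative assumptions, not taken from any particular library. It defines the activation functions listed above and approximates the gradient of a function of several variables by central differences.

```python
import math


def step(z):
    """Step function: 1 for non-negative input, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Sigmoid function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Logit function: the inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def relu(z):
    """ReLU: passes positive values through and clips negatives to 0."""
    return max(0.0, z)


def cost(w):
    """Hypothetical cost function of two variables, minimised at (1, -2)."""
    w1, w2 = w
    return (w1 - 1.0) ** 2 + (w2 + 2.0) ** 2


def numerical_gradient(f, w, h=1e-6):
    """Approximate the gradient of f at point w using central differences."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# The gradient vanishes (approximately) at the minimum of the cost function.
gradient_at_minimum = numerical_gradient(cost, [1.0, -2.0])
```

The gradient points in the direction of steepest increase, so a point where every component is close to zero is a candidate minimum or maximum.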

(II) Linear Algebra

Linear algebra is the most important math skill in machine learning. A dataset is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with:

Vectors
Matrices
Transpose of a matrix
The inverse of a matrix
The determinant of a matrix
Dot product
Eigenvalues
Eigenvectors
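
The operations listed above can be sketched for small matrices in plain Python. This is only an illustration under simplifying assumptions: the helper names are made up for this example, and the determinant, inverse, and eigenvalue routines handle 2x2 matrices only. In practice a library such as NumPy would be used.

```python
import math


def transpose(m):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def det2(m):
    """Determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    return a * d - b * c


def inverse2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


def eigenvalues2(m):
    """Real eigenvalues of a 2x2 matrix, from the characteristic
    polynomial lambda^2 - trace * lambda + det = 0."""
    (a, _), (_, d) = m
    trace = a + d
    discriminant = trace ** 2 - 4 * det2(m)
    if discriminant < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(discriminant)
    return (trace + root) / 2, (trace - root) / 2


def matvec(m, v):
    """Multiply a matrix by a vector using row-wise dot products."""
    return [dot(row, v) for row in m]


A = [[2.0, 1.0], [1.0, 2.0]]
largest, smallest = eigenvalues2(A)
# [1, 1] is an eigenvector of A: multiplying by A only rescales it.
scaled = matvec(A, [1.0, 1.0])
```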

(III) Optimization Methods

Most machine learning algorithms perform predictive modelling by minimising an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Here are the topics you need to be familiar with:

Cost function/Objective function
Likelihood function
Error function
Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
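
As a rough illustration of how minimising a cost function yields model weights, the sketch below runs batch gradient descent on a mean squared error cost for a one-feature linear model. The data, learning rate, and step count are assumptions chosen for the example. A stochastic variant would update the weights from one randomly chosen sample at a time instead of the full data set.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimise the MSE cost by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

If the learning rate is too large the updates diverge, and if it is too small convergence is slow, so it usually has to be tuned for the problem at hand.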

2. Programming Basics

Python and R are considered the top programming languages for data science. Python is widely adopted in industry and in academic training programs. As a beginner, it is recommended that you focus on one language only.

Here are some Python and R basics topics to master:

Basic R syntax
Foundational R programming concepts such as data types, vector arithmetic, indexing, and data frames
How to perform operations in R, including sorting, data wrangling using dplyr, and data visualisation with ggplot2
RStudio
Object-oriented programming aspects of Python
Jupyter notebooks
Be able to work with Python libraries such as NumPy, pylab, seaborn, matplotlib, pandas, scikit-learn, TensorFlow, PyTorch

3. Data Basics

Learn how to manipulate data in various formats, for example, CSV files, PDF files, text files, etc. Learn how to clean data, impute data, scale data, import and export data, and scrape data from the internet. Some packages of interest are pandas, NumPy, pdftools, stringr, etc. Additionally, R and Python contain several inbuilt data sets that can be used for practice. Learn data transformation and dimensionality reduction techniques such as the covariance matrix plot, principal component analysis (PCA), and linear discriminant analysis (LDA).
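
The import, clean, impute, and scale steps can be sketched with Python's standard library alone. The column names and values below are made up for illustration, and mean imputation with min-max scaling is just one possible choice. With pandas the same steps would be shorter.

```python
import csv
import io
import statistics

# Hypothetical CSV content with two missing values.
RAW_CSV = """age,income
25,30000
,42000
35,
45,60000
"""


def load_rows(text):
    """Parse CSV text into dicts of floats, using None for empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (float(value) if value.strip() else None)
         for key, value in record.items()}
        for record in reader
    ]


def impute_mean(rows, column):
    """Replace missing values in a column with the mean of observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    mean = statistics.mean(observed)
    for row in rows:
        if row[column] is None:
            row[column] = mean


def min_max_scale(rows, column):
    """Rescale a column linearly so its values lie in [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    for row in rows:
        # A constant column carries no spread, so map it to 0.
        row[column] = (row[column] - low) / span if span else 0.0


def prepare(text):
    """Load, impute, and scale every column of the CSV text."""
    rows = load_rows(text)
    for column in ("age", "income"):
        impute_mean(rows, column)
        min_max_scale(rows, column)
    return rows


clean_rows = prepare(RAW_CSV)
```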

4. Probability and Statistics Basics

Statistics and probability are used for visualisation of features, data pre-processing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:

Mean
Median
Mode
Standard deviation/variance
Correlation coefficient and the covariance matrix
Probability distributions (Binomial, Poisson, Normal)
p-value
Bayes' Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
A/B Testing
Monte Carlo Simulation
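
To show how Bayes' Theorem connects to the classification metrics listed with it, the sketch below computes the positive predictive value from sensitivity, specificity, and prevalence, and computes precision and recall from confusion matrix counts. The function names and the numbers are illustrative assumptions.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive result) by Bayes' Theorem."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


def confusion_matrix(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn)


# A test with 90% sensitivity and 80% specificity on a condition
# with 10% prevalence (hypothetical numbers).
ppv = positive_predictive_value(0.9, 0.8, 0.1)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
```

Precision is the sample estimate of the positive predictive value, which is why a seemingly accurate test can still produce mostly false alarms when the condition is rare.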

5. Data Visualisation Basics

Learn the essential components of a good data visualisation. A good data visualisation is made up of several components that have to be pieced together to produce an end product:

a) Data Component: An important first step in deciding how to visualise data is to know what type of data it is, e.g., categorical data, discrete data, continuous data, time-series data, etc.

b) Geometric Component: Here is where you decide what kind of visualisation is suitable for your data, e.g., scatter plot, line graphs, bar plots, histograms, Q-Q plots, smooth densities, boxplots, pair plots, heatmaps, etc.

c) Mapping Component: Here, you need to decide what variable to use as your x-variable and what to use as your y-variable. This is important, especially when your data set is multi-dimensional with several features.

d) Scale Component: Here, you decide what kind of scales to use, e.g., linear scale, log scale, etc.

e) Labels Component: This includes things like axes labels, titles, legends, font size to use, etc.

f) Ethical Component: Here, you want to make sure your visualisation tells the true story. You need to be aware of your actions when cleaning, summarising, manipulating, and producing a data visualisation and ensure you aren’t using your visualisation to mislead or manipulate your audience.

Important data visualisation tools include Python’s matplotlib and seaborn packages, and R’s ggplot2 package.

6. Linear Regression Basics

Learn the fundamentals of simple and multiple linear regression analysis. Linear regression is used for supervised learning with continuous outcomes. Some tools for performing linear regression are given below:

Python: NumPy, pylab, scikit-learn

R: caret package
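
Before reaching for those packages, it can help to see simple linear regression written out. The sketch below uses the closed-form least squares formulas for one predictor. The data points are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical continuous outcome that grows roughly as y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)
```

Multiple linear regression follows the same idea with several predictors, where the fit is usually solved with matrix operations rather than by hand.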

7. Machine Learning Basics

a) Supervised Learning (Continuous Variable Prediction)

Basic regression
Multiple regression analysis
Regularised regression

b) Supervised Learning (Discrete Variable Prediction)

Logistic Regression Classifier
Support Vector Machine (SVM) Classifier
K-nearest neighbor (KNN) Classifier
Decision Tree Classifier
Random Forest Classifier
Naive Bayes

c) Unsupervised Learning

K-means clustering algorithm

Python tools for machine learning: scikit-learn, PyTorch, TensorFlow.
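
As one small example of discrete variable prediction, here is a sketch of a K-nearest neighbour classifier in plain Python. The training points, labels, and the choice of k = 3 are assumptions for illustration. In practice scikit-learn's implementation would normally be used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two hypothetical clusters: class "a" near the origin, class "b" near (5, 5).
train_points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0),
                (5.0, 5.0), (4.5, 5.5), (5.5, 4.5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(train_points, train_labels, (0.8, 0.8))
near_corner = knn_predict(train_points, train_labels, (4.8, 5.1))
```

Because KNN relies on distances, features on very different scales should be scaled first, or the largest feature will dominate the vote.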

8. Time Series Analysis Basics

Time series analysis is used for predictive modelling in cases where the outcome is time-dependent, e.g., predicting stock prices. There are 3 basic methods for analysing time-series data:

Exponential Smoothing
ARIMA (Autoregressive Integrated Moving Average), which is a generalisation of exponential smoothing
GARCH (Generalised Autoregressive Conditional Heteroskedasticity), which is an ARIMA-like model for analysing variance

These 3 techniques can be implemented in Python and R.
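
The simplest of these methods can be sketched directly. The code below implements simple exponential smoothing, assuming the first smoothed value is initialised to the first observation, and uses the last smoothed value as a one-step-ahead forecast. The series and the smoothing factor are made up for illustration. ARIMA and GARCH are normally fitted with a dedicated library.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    Each smoothed value is alpha * observation + (1 - alpha) * previous
    smoothed value, starting from the first observation.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


def forecast_next(series, alpha):
    """One-step-ahead forecast: the most recent smoothed value."""
    return exponential_smoothing(series, alpha)[-1]


# Hypothetical observations of a time-dependent outcome.
observations = [10.0, 12.0, 14.0]
smoothed = exponential_smoothing(observations, alpha=0.5)
next_value = forecast_next(observations, alpha=0.5)
```

A larger alpha makes the forecast react faster to recent observations, while a smaller alpha gives a smoother, slower-moving estimate.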

9. Productivity Tools Basics

Knowing how to use basic productivity tools such as RStudio, Jupyter Notebook, and GitHub is essential. For Python, the Anaconda distribution is the recommended productivity tool to install. More advanced tools such as AWS and Azure are also important to learn.

10. Data Science Project Planning Basics

Learn the basics of how to plan a project. Before building any machine learning model, it is important to sit down and plan carefully what you want your model to accomplish. Before writing any code, make sure you understand the problem to be solved, the nature of the data set, the type of model to build, and how the model will be trained, tested, and evaluated. Project planning and project organisation are essential for increasing productivity when working on a data science project.

I hope you found this useful.

I write about machine learning and data science. If any of those topics interest you, read more here and follow me on LinkedIn and Twitter.

Malome Tebatso Khomo

Everywhere, knowingly with the bG-Hum; Crusties!

5 years ago

What about text analysis, say for instance XSLT?

Ramaabele Johanna Sefala

Employee | Student | Mother

5 years ago

Great insight.
