Introduction to Machine Learning using Python

Gayatri P.

AI/ML Senior Analyst

Published: June 18, 2018

As the title suggests, this article is aimed at newbie developers like me who want to be part of the digital revolution that is data science and who have minimal knowledge of machine learning and Python.

What is Machine Learning?

Machine learning is a field of computer science and mathematics that often uses statistical techniques to give computers the ability to "learn" from data without being explicitly programmed. It is a subfield of artificial intelligence (AI). In practice, this means we feed data into an algorithm and use the result to make predictions about what might happen in the future.

The name 'machine learning' was coined in 1959 by Arthur Samuel.

In 1997, Tom Mitchell gave a definition that has proven more useful to engineering types:

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P). [source : Toptal]

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then help it reach an accurate conclusion when given new data.

Unsupervised machine learning: The program is given a collection of data and must find patterns and relationships among the features/attributes in it on its own.

The range of applications is vast and spans domains such as:

Healthcare (e.g., personalized treatments and medications, drug manufacturing)
Finance (e.g., fraud detection)
Retail (e.g., product recommendations, improved customer service)
Travel (e.g., dynamic pricing, such as how Uber determines the price of your ride, and sentiment analysis, such as TripAdvisor collecting information from the photos and reviews travellers share on social media and using it to improve its service)
Media (e.g., Facebook: from personalizing the news feed to serving targeted ads, machine learning is at the heart of social media platforms, for their own benefit and that of their users)

Unlike R, Python is a general-purpose language and platform that you can use both for research and development and for building production systems. That breadth has a cost: it can feel overwhelming to choose among its many libraries and modules.

So, let's go through the step-by-step procedure beginners can follow to get started with machine learning using Python.

Our first step shall be to learn Python.

Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language created by Guido van Rossum, who began work on it in the late 1980s and first released it in 1991. Python source code is available under the Python Software Foundation License, an open-source license that is compatible with the GNU General Public License (GPL).

You can use the following resources to build your Python skills:

Google's Python Class

Google Developer Python Course

Python has an amazing ecosystem of libraries that make it easy to get started with machine learning. It is one of the most popular and in-demand languages in the job market today, which is why plenty of learning resources are available online. Learners should have little difficulty finding help.

The next step is to install Anaconda from https://docs.anaconda.com/anaconda/install/

Follow the installation instructions given on the site. The Anaconda distribution bundles the packages required to explore machine learning.

You also need to learn the basics of machine learning.

If you want an overall idea of machine learning from scratch, you might want to follow this crash course by Google:

Machine Learning Crash Course

Machine Learning by Andrew Ng is a great source to learn from.

Once we are comfortable with Python and machine learning, we can move on to the Python libraries.

a. Pandas :

Our first step is to read in the data and produce some quick, relevant summary statistics, for which we shall use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective.

We'll read our data from a CSV file into a Pandas dataframe using the read_csv function.
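As a minimal sketch of that step, the snippet below reads a small CSV into a dataframe. The CSV content and column names are made up for illustration, and an in-memory buffer stands in for a file so the example runs as-is; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path
# (for example "measurements.csv") to read_csv instead of a buffer.
csv_text = """name,height_cm,weight_kg
Asha,162,55.5
Ben,178,72.0
Chen,170,64.2
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # (rows, columns)
print(df.dtypes)      # one datatype per column
print(df.describe())  # summary statistics for the numeric columns
```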

b. NumPy :

A matrix is a two-dimensional data structure with rows and columns. In Python, matrices are available through the NumPy library as arrays. With a NumPy array we can't easily access rows and columns by name, and every element must share the same datatype. That is why we use dataframes, the most common data structure in Pandas: a dataframe extends the idea of a matrix with named columns that can each have a different datatype, and it has a lot of built-in features for analyzing data.
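The difference can be seen in a short sketch. The values and column names here are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy matrix: every element shares one datatype, and rows and
# columns are addressed by integer position.
matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
second_column = matrix[:, 1]

# A dataframe: columns have names and may hold different datatypes.
df = pd.DataFrame({
    "label": ["a", "b", "c"],
    "value": [1.5, 2.5, 3.5],
    "flag": [True, False, True],
})
values = df["value"]  # access a column by name

print(matrix.dtype)
print(second_column)
print(df.dtypes)
```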

c. Matplotlib :

Matplotlib is the main plotting infrastructure in Python, and many other plotting libraries, such as seaborn and plotnine (a Python port of R's ggplot2), are built on top of it. We import Matplotlib's plotting functions with import matplotlib.pyplot as plt. We can then draw and show plots.
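Here is a minimal sketch of drawing a plot. It assumes a non-interactive environment, so it selects the "Agg" backend and renders the figure to an in-memory PNG; in a notebook or desktop session you would normally drop those lines and call plt.show() instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A first Matplotlib plot")

# Render to an in-memory PNG; interactively, call plt.show() instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```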

d. Scikit-learn :

Scikit-learn is built on top of SciPy, which must be installed before you can use scikit-learn. The SciPy stack includes:

NumPy : Base n-dimensional array package
SciPy : Fundamental library for scientific computing
Matplotlib : Comprehensive 2D/3D plotting
IPython : Enhanced interactive console
Pandas : Data structures and analysis

Extensions or modules for SciPy are conventionally named SciKits. Because this module provides learning algorithms, it is named scikit-learn.

Now that you have a grip on the basics of Python, its libraries, and machine learning algorithms, it's best to start with a small end-to-end project. Here are the steps of such a project:

Define a Problem
Prepare the Data
Evaluate the Algorithms
Improve the Results
Present the Results

To start with machine learning in Python, after installing Anaconda as described above, first check the version of Python you are using, then:

1. Import the libraries, such as sklearn, pandas, matplotlib, scipy, and numpy.
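A quick way to do both at once is to import each library and print its version, roughly as follows. If an import fails, that package is missing from your environment.

```python
import sys

import matplotlib
import numpy
import pandas
import scipy
import sklearn

# Print the Python version, then the version of each library.
print("Python:", sys.version.split()[0])
for module in (scipy, numpy, matplotlib, pandas, sklearn):
    print(f"{module.__name__}: {module.__version__}")
```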

2. Load the dataset : We load the data using pandas.

3. Summarize the dataset : This includes :

Dimensions of the dataset :

print(dataset.shape)

A first look at the rows and a statistical summary :

print(dataset.head(20))
print(dataset.describe())
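The article does not name a dataset, so as an assumption the sketch below uses the Iris dataset that ships with scikit-learn as a stand-in for your own data, and runs the same summary calls on it.

```python
from sklearn.datasets import load_iris

# Assumption: the Iris dataset stands in for your own data.
# as_frame=True returns the data as a Pandas dataframe.
dataset = load_iris(as_frame=True).frame

print(dataset.shape)                     # dimensions: (rows, columns)
print(dataset.head(20))                  # the first 20 rows
print(dataset.describe())                # count, mean, std, min, quartiles, max
print(dataset.groupby("target").size())  # number of rows in each class
```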

4. Visualization of the dataset :

Data visualization comprises two kinds of plots: univariate and multivariate.

Univariate plots are used to understand each attribute on its own; here we can create box plots and histograms. Multivariate plots are used to understand the relationships between attributes; here a scatter plot can reveal the correlation between two attributes.
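Both kinds of plot can be sketched with Pandas' plotting helpers, which draw through Matplotlib. As before, the Iris dataset is an assumed stand-in, and the "Agg" backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Assumption: the four numeric Iris attributes stand in for your data.
features = load_iris(as_frame=True).data

# Univariate: one box plot and one histogram per attribute.
box_axes = features.plot(kind="box", subplots=True, layout=(2, 2),
                         sharex=False, sharey=False, figsize=(8, 6))
hist_axes = features.hist(figsize=(8, 6))

# Multivariate: a grid of scatter plots, one for every pair of attributes.
scatter_axes = scatter_matrix(features, figsize=(8, 8))

# In an interactive session, call plt.show() here instead of closing.
plt.close("all")
```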

5. Evaluation of some algorithms :

First, separate a validation set from the dataset, say 20% of it, which the algorithms will not see or access during training.

Next, split the remaining data into two parts, training (80%) and test (20%). Then choose a scoring metric on which the models will be evaluated, say accuracy.

Accuracy is the ratio of the number of correctly predicted instances to the total number of instances evaluated; multiplied by 100, it gives a percentage (e.g., 95% accuracy).
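The two splits and the accuracy metric can be sketched with scikit-learn as follows. The Iris dataset, the random_state value, and the stratified splitting are assumptions made for the example; the toy labels at the end exist only to show how accuracy is computed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as a validation set that the models never see during training.
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

# Split what remains into training (80%) and test (20%) parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=7, stratify=y_rest)

print(len(X_train), len(X_test), len(X_validation))

# Accuracy = correct predictions / total predictions.
toy_true = [0, 1, 1, 2, 2]
toy_pred = [0, 1, 2, 2, 2]  # four of the five labels match
toy_accuracy = accuracy_score(toy_true, toy_pred)
print(toy_accuracy)
```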

With everything set up, we can build the models.

To get good accuracy, we train several different models on the training set and measure the accuracy of each on the test set. The model with the highest accuracy is then considered the best fit for the given problem.
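One possible way to compare models is sketched below. Note the swap: instead of a single training/test split, it uses 10-fold cross-validation on the non-validation data, which averages accuracy over several splits. The three candidate models and the Iris dataset are assumptions chosen for illustration, not a recommendation from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

# Candidate models; this selection is an assumption for the example.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
}

# Score every model with the same folds and the same metric.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f} (+/- {fold_scores.std():.3f})")

best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```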

6. Make Predictions : After finding the best model, we want to know how it performs on the validation set. We run the best model directly on the validation set and summarize the results as a final accuracy score. It's always good practice to keep a validation set, as it helps us find out whether the model has overfitted the training data and is giving us overly optimistic results.
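A sketch of this final check follows. It assumes, purely for illustration, that k-nearest neighbours came out best in the comparison and that the Iris dataset is the data; substitute your own best model and data.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y)

# Assumption: k-nearest neighbours was the best model in the comparison.
best_model = KNeighborsClassifier()
best_model.fit(X_train, y_train)

# Predict on data the model has never seen and summarize the results.
predictions = best_model.predict(X_validation)
validation_accuracy = accuracy_score(y_validation, predictions)

print(f"Validation accuracy: {validation_accuracy:.3f}")
print(confusion_matrix(y_validation, predictions))
print(classification_report(y_validation, predictions))
```

If the validation accuracy is much lower than the accuracy seen during model selection, that gap is a sign of overfitting.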
