Introduction to Machine Learning using Python

As the title suggests, this article is aimed at newcomer developers like me who have minimal knowledge of machine learning and Python and want to be part of the digital revolution that is data science.

What is Machine Learning?

Machine learning is a field at the intersection of computer science and mathematics that often uses statistical techniques to give computers the ability to "learn" from data without being explicitly programmed. It is a branch of Artificial Intelligence (AI). In practice, it means feeding data into an algorithm and using the resulting model to make predictions about new or future data.

The name 'machine learning' was coined in 1959 by Arthur Samuel.

In 1997, Tom Mitchell gave a definition that has proven more useful to engineering types:

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” 

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P). [source : Toptal]

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

Supervised machine learning: The program is “trained” on a pre-defined set of labelled “training examples”, which then help it reach an accurate conclusion when given new data.

Unsupervised machine learning: The program is given unlabelled data and must find patterns and relationships among the features/attributes on its own.
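To make the distinction concrete, here is a minimal sketch using scikit-learn. The Iris dataset, the k-nearest-neighbours classifier and the k-means clusterer are assumptions chosen for illustration; the article itself does not prescribe them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only as a convenient built-in example dataset (assumption).
X, y = load_iris(return_X_y=True)

# Supervised: the algorithm sees both the features X and the known labels y.
classifier = KNeighborsClassifier().fit(X, y)
predicted_labels = classifier.predict(X[:3])
print("Predicted labels:", predicted_labels)

# Unsupervised: the algorithm sees only X and has to find structure itself.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_
print("Cluster ids of the first rows:", cluster_ids[:3])
```

Note that the cluster ids are arbitrary group numbers, not class labels: the unsupervised model was never told what the classes are.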


Machine learning has a vast range of applications, in domains such as:

  • Healthcare (e.g., personalized treatments and medications, drug manufacturing)
  • Finance (e.g., fraud detection)
  • Retail (e.g., product recommendations, improved customer service)
  • Travel (e.g., dynamic pricing, such as how Uber determines the price of your ride, and sentiment analysis, such as TripAdvisor collecting the photos and reviews travellers share on social media and trying to improve its service based on them)
  • Media (e.g., Facebook: from personalizing the news feed to serving targeted ads, machine learning is at the heart of social media platforms, for their own benefit and that of their users)


Why Python? Unlike R, which is geared mainly toward statistical computing, Python is a general-purpose language and platform that you can use both for research and development and for building production systems. The flip side is that choosing among its many libraries and modules can feel overwhelming.

So, let's go through a step-by-step procedure that beginners can follow to get started with machine learning using Python.

  • Our first step is to learn Python.

Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. It was created by Guido van Rossum, who began work on it in the late 1980s and first released it in 1991. Python source code is available under the Python Software Foundation License, an open-source license that is compatible with the GNU General Public License (GPL).

You can use the following resources to build your Python skills:

Google's Python Class

Google Developer Python Course

Python has an amazing ecosystem of libraries that make it easy to get started with machine learning. It is one of the most popular and in-demand languages in the job market today, so there are plenty of resources online and learners should have little difficulty finding help.

Next, install Anaconda by following the installation instructions on its site. The Anaconda distribution contains the packages required to explore machine learning.

  • Next, learn the basic machine learning skills.

If you want an overall idea of machine learning from scratch, you might want to follow this crash course by Google:

Machine Learning Crash Course

Machine Learning by Andrew Ng is also a great course to learn from.

  • Once you are comfortable with Python and machine learning, move on to the Python libraries.

a. Pandas : 

Our first step is to read in the data and produce some quick, relevant summary statistics, for which we shall use the Pandas library. Pandas provides data structures and data analysis tools that make manipulating data in Python much quicker and more effective.

We'll read our data from a CSV file into a Pandas dataframe using the read_csv function.
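As a minimal sketch of reading a CSV, the snippet below feeds read_csv a small in-memory text buffer instead of a file on disk so that it runs anywhere. The column names and values are made up for illustration; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
6.3,3.3,virginica
5.8,2.7,virginica
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())      # the first rows of the dataframe
print(dataset.describe())  # quick summary statistics for the numeric columns
```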

b. NumPy :

The most common data structure in Pandas is called a dataframe. A dataframe is an extension of a matrix.

A matrix is a two-dimensional data structure with rows and columns. Matrices in Python can be used via the NumPy library. With a matrix, however, we can't easily access columns and rows by name, and every element has to have the same datatype. Hence we use dataframes, which can have a different datatype in each column and have a lot of built-in features for analyzing data.
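The difference can be seen in a short sketch. The city names and values below are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy matrix: a single datatype for every element, access by position only.
matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
second_column = matrix[:, 1]
print(matrix.dtype)

# A dataframe: named columns, each with its own datatype (hypothetical data).
frame = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai"],
    "temperature": [31.5, 35.0, 29.0],
    "rainy": [False, False, True],
})
temperatures = frame["temperature"]  # access a column by name
print(frame.dtypes)
```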

c. Matplotlib : 

Matplotlib is the main plotting infrastructure in Python, and many other plotting libraries, like seaborn and plotnine (a Python implementation of the grammar of graphics used by R's ggplot2), are built on top of Matplotlib. We import Matplotlib's plotting functions with import matplotlib.pyplot as plt. We can then draw and show plots.
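A minimal plotting sketch follows. The data points are arbitrary, and the non-interactive "Agg" backend is an assumption made so that the script also runs on a machine without a display; in a notebook or desktop session you would skip that line and call plt.show() instead of saving.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in interactive use
import matplotlib.pyplot as plt

# Arbitrary example data.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A simple line plot")

# Render the figure as PNG into memory; pass a filename to write it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```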

d. Scikit-learn : 

scikit-learn is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:

  • NumPy : Base n-dimensional array package
  • SciPy : Fundamental library for scientific computing
  • Matplotlib : Comprehensive 2D/3D plotting
  • IPython : Enhanced interactive console
  • Pandas : Data structures and analysis

Extensions or modules for SciPy are conventionally named SciKits. As such, the module provides learning algorithms and is named scikit-learn.

Now that you have a grip on the basics of Python, its libraries, and machine learning algorithms, it's best to start with a small end-to-end project. Here are the steps of such a project:

  1. Define a Problem
  2. Prepare the Data
  3. Evaluate the Algorithms
  4. Improve the Results
  5. Present the Results

To start with machine learning using Python, after installing Anaconda as described above, first check the version of Python you are using. Then:

1. Import the libraries: sklearn, pandas, matplotlib, scipy and numpy.
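One way to check the Python version and confirm that the libraries import correctly is sketched below. It assumes all five libraries are already installed, as they are in a standard Anaconda setup.

```python
import sys

import matplotlib
import numpy
import pandas
import scipy
import sklearn

print("Python:", sys.version.split()[0])

# Each of these libraries exposes its version as __version__.
versions = {
    module.__name__: module.__version__
    for module in (numpy, scipy, matplotlib, pandas, sklearn)
}
for name, version in versions.items():
    print(f"{name}: {version}")
```

If any import fails, that library is missing from the environment and needs to be installed before continuing.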

2. Load the dataset : We load the data using pandas.

3. Summarize the dataset : This includes :

Dimensions of the dataset :

  • print(dataset.shape)  

A peek at the data and a statistical summary :

  • print(dataset.head(20))
  • print(dataset.describe())
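Steps 2 and 3 can be sketched together as follows. The article does not name a dataset, so this example assumes the Iris dataset bundled with scikit-learn and loads it into a Pandas dataframe; with your own data you would use read_csv instead.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled Iris dataset stands in for "the dataset".
iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)
dataset["species"] = iris.target_names[iris.target]

print(dataset.shape)       # (150, 5): 150 rows, 4 measurements plus the label
print(dataset.head(20))    # the first 20 rows
print(dataset.describe())  # count, mean, std, min, quartiles, max per column

# How many rows belong to each class.
print(dataset.groupby("species").size())
```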

4. Visualization of the dataset :

Data visualization comprises two kinds of plots: univariate and multivariate.

Univariate plots are used to understand each attribute better; here we can create box plots and histograms. Multivariate plots are used to understand the relationships between attributes; here a scatter plot can reveal the correlation between attributes.
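Both kinds of plot can be produced straight from a dataframe, as in this sketch. The Iris dataset and the "Agg" backend are assumptions made so the example is self-contained and runs without a display; in an interactive session, drop the backend line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in interactive use
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Assumption: Iris stands in for the dataset being explored.
iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)

# Univariate: one box plot per attribute.
dataset.plot(kind="box", subplots=True, layout=(2, 2), sharex=False, sharey=False)

# Univariate: one histogram per attribute.
hist_axes = dataset.hist()

# Multivariate: a scatter plot for every pair of attributes.
scatter_axes = scatter_matrix(dataset)
```

Roughly diagonal groupings in the scatter plots suggest that two attributes are correlated.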

5. Evaluation of some algorithms :

First, separate out a validation set from the dataset, say 20% of it, which the algorithms won't be able to see or access during training.

Next, split the remaining data into two parts, training (80%) and test (20%). Then set a scoring metric on which the models will be evaluated, say accuracy.

Accuracy is the ratio of the number of correctly predicted instances to the total number of instances in the dataset; multiplied by 100, it gives a percentage (e.g., 95% accuracy).
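The two splits and the accuracy calculation might look like the sketch below. The Iris dataset, the random_state value and the stratify option are assumptions added for a reproducible, class-balanced example, and accuracy_percent is a hypothetical helper written out only to show the arithmetic.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # assumption: Iris as the example dataset

# Hold out 20% as the validation set, untouched until the very end.
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y
)

# Split what remains into training (80%) and test (20%).
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=7, stratify=y_rest
)

print(len(X_train), len(X_test), len(X_validation))


def accuracy_percent(y_true, y_pred):
    """Correct predictions divided by total predictions, times 100."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Three of the four predictions match the true labels.
print(accuracy_percent([0, 1, 1, 0], [0, 1, 0, 0]))
```

In practice, scikit-learn's accuracy_score gives the same ratio as a fraction between 0 and 1.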

With everything set up, we can build the models.

To get good accuracy, we train several different models on the training dataset and measure the accuracy of each on the test dataset. The model with the highest accuracy is then considered the best fit for the given problem.
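A sketch of such a comparison is shown below. The three candidate models and the Iris dataset are assumptions chosen for illustration; any scikit-learn classifiers could be compared the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # assumption: Iris as the example dataset

# Validation set first, then training and test sets from the remainder.
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=7, stratify=y_rest
)

# Candidate models (an arbitrary selection for illustration).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=7),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn from the training set
    scores[name] = model.score(X_test, y_test)   # accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")

best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```

With a test set this small, differences of a few percentage points between models may be due to chance, so treat the ranking as a rough guide.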

6. Make Predictions : After choosing the best model, we want to see how it performs on the validation set. We run the best model directly on the validation set and summarize the result as a final accuracy score. It's always good practice to keep a validation set, as it helps us find out whether the model has overfitted the training data and is giving us overly optimistic results.
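The final check could be sketched as follows. It assumes, purely for illustration, that k-nearest neighbours came out as the best model and that Iris is the dataset; for brevity the model is trained on everything outside the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # assumption: Iris as the example dataset
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7, stratify=y
)

# Assumption: this is the model that won the comparison step.
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# Predict on data the model has never seen.
predictions = model.predict(X_validation)
validation_accuracy = accuracy_score(y_validation, predictions)
train_accuracy = model.score(X_train, y_train)

print(f"Training accuracy:   {train_accuracy:.3f}")
print(f"Validation accuracy: {validation_accuracy:.3f}")
```

A validation accuracy far below the training accuracy is the sign of overfitting that the validation set is there to reveal.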
