Supervise The UnSupervised Learning (Part 2)

Hello everyone, and welcome to the continuation of Supervise The UnSupervised Learning. If you haven't read the first part yet, please go through it before continuing.

Let's recap what we have studied so far.

  • Unsupervised learning is a class of machine learning techniques for finding patterns in data.
  • In clustering, the data is divided into several groups, i.e. the aim is to segregate observations with similar traits and assign them to clusters.
  • K-means is an iterative clustering algorithm that converges to a local optimum of the within-cluster sum of squares.
  • PCA is a feature extraction technique used primarily for dimensionality reduction.

In this article we will explore K-means and PCA using Python.

Dataset:

In this article we use the Iris dataset for making predictions. The dataset contains 150 records under five attributes: petal length, petal width, sepal length, sepal width and class. Iris setosa, Iris virginica and Iris versicolor are the three classes.
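A quick way to confirm these numbers yourself is to inspect the copy of the dataset that scikit-learn ships; this is a small sanity-check sketch, not part of the main walkthrough:

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4) -> 150 records, 4 numeric features (class is kept separately)
print(iris.feature_names)   # sepal length/width and petal length/width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']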

Preparing data:

We will use the sklearn library in Python to load the Iris dataset, and matplotlib for data visualization.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Import some data to play with
iris = datasets.load_iris()

# Features of the Iris data
x = iris.data

# Target of the Iris data
y = iris.target

# Dataset slicing
x_axis = iris.data[:, 0]  # Sepal length
y_axis = iris.data[:, 2]  # Petal length

# Plotting
plt.scatter(x_axis, y_axis, c=iris.target)
plt.show()

K-Means Clustering in Python:

We will go through two different approaches to applying K-means:

  1. K-means without the elbow method (used when the number of clusters is already known, as it is for the Iris dataset).
  2. K-means with the elbow method (used when the number of clusters is unknown and must be estimated).

K-Means without Elbow method

from sklearn.cluster import KMeans
#Applying kmeans to the dataset 
kmeans = KMeans(n_clusters = 3)
y_kmeans = kmeans.fit_predict(x)

Parameters of K-means:

n_clusters : int, optional, default: 8

The number of clusters to form as well as the number of centroids to generate.

init : {‘k-means++’, ‘random’ or an ndarray}

Method for initialization, defaults to ‘k-means++’:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence.

‘random’ : chooses k observations (rows) at random from the data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int, default: 10

The number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default: 300

Maximum number of iterations of the k-means algorithm for a single run.

random_state : int, RandomState instance or None (default)

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
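Putting these parameters together, here is an illustrative sketch of the same KMeans call with everything spelled out; the values mirror the defaults documented above, apart from n_clusters and random_state:

from sklearn.cluster import KMeans

# Illustrative sketch: the KMeans call with the parameters discussed above made explicit
kmeans_explicit = KMeans(n_clusters=3,     # three Iris species, so three clusters
                         init='k-means++', # smart seeding to speed up convergence
                         n_init=10,        # keep the best of 10 runs (lowest inertia)
                         max_iter=300,     # cap on iterations per run
                         random_state=0)   # reproducible centroid initialization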

#Visualising the clusters
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')


#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')


plt.legend()
plt.show()
# Prediction on the entire data
predictions = kmeans.predict(iris.data)


# Printing Predictions
print(predictions)
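One caveat when reading these predictions: K-means numbers its clusters 0, 1 and 2 in an arbitrary order, so cluster 0 is not guaranteed to correspond to Iris-setosa as the plot labels above assume. A small sketch, not part of the original walkthrough, to cross-tabulate the cluster assignments against the true classes:

import numpy as np

# Contingency table of true Iris classes (rows) vs. cluster ids (columns).
# Rows: 0 = setosa, 1 = versicolor, 2 = virginica.
table = np.zeros((3, 3), dtype=int)
for true_label, cluster in zip(iris.target, predictions):
    table[true_label, cluster] += 1
print(table)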

K-Means with Elbow method

In the program above we set n_clusters to 3 (the default is 8) because the Iris dataset is well known and we already know it has three classes. However, if we are not familiar with the dataset, we need a method to estimate n_clusters. A popular and simple choice is the elbow method.

The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data.

Most of the time, the elbow method is used with either the sum of squared errors (SSE) or the within-cluster sum of squares (WCSS). The within-cluster sum of squares measures the variability of the observations within each cluster: for every cluster, we sum the squared distances between each of its points and the cluster centroid, and then sum these values over all clusters. A cluster with a small sum of squares is more compact than one with a large sum of squares, and clusters with higher values exhibit greater variability of the observations within them.
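To make the quantity concrete, here is a minimal sketch, assuming a fitted KMeans model, that computes WCSS by hand; it should agree with the inertia_ attribute that scikit-learn reports:

import numpy as np

def wcss_by_hand(X, labels, centers):
    """Sum of squared distances of each point to its own cluster centroid."""
    total = 0.0
    for k, center in enumerate(centers):
        members = X[labels == k]                  # points assigned to cluster k
        total += ((members - center) ** 2).sum()  # squared distances to the centroid
    return total

# After fitting: wcss_by_hand(x, kmeans.labels_, kmeans.cluster_centers_) ~= kmeans.inertia_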

As we have already imported all the required libraries and loaded the dataset, we can apply the elbow method directly.

wcss = []


for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()

As the elbow in the graph above occurs at 3, the value of n_clusters should be 3. Now that we have the optimal number of clusters, we can apply K-means clustering to the Iris dataset.

#Applying kmeans to the dataset 
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
#Visualising the clusters
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')


#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')


plt.legend()
plt.show()

PCA in Python:

The libraries for PCA are the same as those mentioned in the Preparing data section.

from sklearn.decomposition import PCA as sklearnPCA
pca = sklearnPCA(n_components=2)
y_pca = pca.fit_transform(x)

Parameters of PCA:

n_components : int, float, None or string

The number of components to keep. If n_components is not set, all components are kept:

n_components == min(n_samples, n_features)

iterated_power : int >= 0, or ‘auto’, (default ‘auto’)

Number of iterations for the power method computed by svd_solver == ‘randomized’.

random_state : int, RandomState instance or None, optional (default None)

If int, random_state is the seed used by the random number generator;

If RandomState instance, random_state is the random number generator;

If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
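Before plotting the two components, it is worth checking how much of the original variance they retain. A short sketch using the pca object fitted above:

# How much variance do the two principal components keep?
print(pca.explained_variance_ratio_)        # for Iris, roughly [0.92, 0.05]
print(pca.explained_variance_ratio_.sum())  # roughly 0.98 of the total variance retained

With that sanity check in place, we can plot the data in the space of the first two principal components.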

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    # iris.target encodes the classes as 0, 1 and 2, so we match on the integer labels
    for lab, name, col in zip((0, 1, 2),
                              ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                              ('blue', 'red', 'green')):
        plt.scatter(y_pca[y == lab, 0],
                    y_pca[y == lab, 1],
                    label=name,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()
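The two techniques also combine naturally: a common follow-up, shown here as a sketch rather than part of the original walkthrough, is to run K-means on the PCA-reduced data and plot the clusters in the same two-dimensional space:

from sklearn.cluster import KMeans

# Sketch: clustering the 2-D PCA projection instead of the raw 4-D features
kmeans_pca = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels_pca = kmeans_pca.fit_predict(y_pca)

plt.scatter(y_pca[:, 0], y_pca[:, 1], c=labels_pca)
plt.scatter(kmeans_pca.cluster_centers_[:, 0], kmeans_pca.cluster_centers_[:, 1],
            c='yellow', label='Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()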

End Notes

I hope you are now comfortable writing K-means and PCA programs for any dataset in Python. Usually, data is first explored by placing it in a visual context, also known as data visualization. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization.

We have not covered the visualization part in this article. In my next article I will cover data visualization for the same dataset and discuss when, how and why to use it.

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.

Regards!!!
