Supervise The UnSupervised Learning (Part 2)
Sunakshi Mamgain
Senior Manager Content Strategy at Great Learning | Ex-upGrad | Data Science | NLP
Hello everyone, welcome to the continuation of Supervise The UnSupervised Learning. If you haven't gone through the first part of the article, kindly read it before continuing.
Let's recap what we have studied so far:
- Unsupervised Learning is a class of Machine Learning techniques used to find patterns in data.
- In clustering, the data is divided into several groups, i.e. the aim is to segregate groups with similar traits and assign them to clusters.
- K-means is an iterative clustering algorithm that refines the cluster assignments in each iteration and converges to a local optimum of the within-cluster sum of squares.
- PCA is a feature extraction technique used primarily for dimensionality reduction.
In this article we will explore K-means and PCA using Python.
Dataset:
In this article we use the Iris dataset for making predictions. The dataset contains 150 records under 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width and Class. Iris Setosa, Iris Virginica and Iris Versicolor are the three classes.
Preparing data:
We will use the scikit-learn (sklearn) library in Python to load the Iris dataset, and matplotlib for data visualization.
from sklearn import datasets
from sklearn.decomposition import PCA
# import some data to play with
iris = datasets.load_iris()
# Features of Iris data
x = iris.data
# Target for Iris data (0 = setosa, 1 = versicolor, 2 = virginica)
y = iris.target
print(y)
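A quick sanity check confirms the shape and classes described above (these are standard attributes of the loaded dataset object, not anything specific to this tutorial):
print(x.shape)             # (150, 4): 150 records, 4 numeric features
print(iris.feature_names)  # sepal length/width and petal length/width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']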
import matplotlib.pyplot as plt
#Dataset Slicing
x_axis = iris.data[:, 0]  # Sepal Length
y_axis = iris.data[:, 2]  # Petal Length
# Plotting
plt.scatter(x_axis, y_axis, c=iris.target)
plt.show()
K-Means Clustering in Python:
Here we will go through two different approaches to applying K-means:
- K-Means without the Elbow method (used when the number of clusters is already known, as it is for the Iris dataset).
- K-Means with the Elbow method (used when the number of clusters is not known in advance).
K-Means without Elbow method
from sklearn.cluster import KMeans
#Applying kmeans to the dataset
kmeans = KMeans(n_clusters = 3)
y_kmeans = kmeans.fit_predict(x)
Parameters of K-means:
n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
init : {‘k-means++’, ‘random’ or an ndarray}, default: ‘k-means++’
Method for initialization.
‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence.
‘random’ : chooses k observations (rows) at random from the data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of the n_init consecutive runs in terms of inertia.
max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a single run.
random_state : int, RandomState instance or None (default)
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
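For reference, here is the same model with these parameters written out explicitly; the values below simply restate the defaults described above, and the seed is an arbitrary choice added here for reproducibility:
kmeans = KMeans(n_clusters = 3,        # three clusters for the three Iris species
                init = 'k-means++',    # smart initialization to speed up convergence
                n_init = 10,           # run 10 times with different centroid seeds, keep the best
                max_iter = 300,        # cap on iterations for a single run
                random_state = 42)     # arbitrary seed, only for reproducibility
y_kmeans = kmeans.fit_predict(x)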
#Visualising the clusters
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')
#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
plt.show()
# Prediction on the entire data
predictions = kmeans.predict(iris.data)
# Printing Predictions
print(predictions)
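K-means assigns cluster ids arbitrarily, so cluster 0 is not guaranteed to correspond to Iris-setosa. A quick, permutation-invariant way to check how well the clusters match the true species is scikit-learn's adjusted Rand index; a small sketch (the exact numbers depend on the random initialization):
from sklearn.metrics import adjusted_rand_score, confusion_matrix
# 1.0 means the clusters match the species exactly, 0.0 means no better than chance
print(adjusted_rand_score(y, predictions))
# Rows are true species (0 = setosa, 1 = versicolor, 2 = virginica), columns are cluster ids
print(confusion_matrix(y, predictions))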
K-Means with Elbow method
In the program above we set n_clusters to 3 (the default is 8) because the Iris dataset is very famous and we already know it has three classes. When we are not familiar with a dataset, however, we need a method to estimate n_clusters. A very popular and easy method is the Elbow method.
The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data.
Most of the time, the Elbow method is used with either the sum of squared errors (SSE) or the within-cluster sum of squares (WCSS). The within-cluster sum of squares is a measure of the variability of the observations within each cluster: it is the sum, over all clusters, of the squared distances between each point in a cluster and that cluster's centroid. A cluster with a small sum of squares is more compact than one with a large sum of squares, and clusters with higher values exhibit greater variability of the observations within them.
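To make the definition concrete, WCSS can also be computed by hand and compared with the inertia_ attribute of a fitted KMeans model; a minimal sketch reusing the kmeans model fitted earlier:
import numpy as np
# Centroid assigned to each point, then the sum of squared distances to those centroids
assigned_centers = kmeans.cluster_centers_[y_kmeans]
wcss_manual = np.sum((x - assigned_centers) ** 2)
print(wcss_manual, kmeans.inertia_)  # the two values should agree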
As we have already imported all the required libraries and loaded the dataset, we can directly apply the Elbow method.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') #within cluster sum of squares
plt.show()
As the elbow in the graph above appears at 3, it is clear that the right value for n_clusters is 3. Now that we have the optimum number of clusters, we can move on to applying K-means clustering to the Iris dataset.
#Applying kmeans to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(x)
#Visualising the clusters
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')
#Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
plt.show()
PCA in Python:
The libraries for PCA are the same as those imported in the Preparing data section.
from sklearn.decomposition import PCA as sklearnPCA
pca = sklearnPCA(n_components=2)
y_pca = pca.fit_transform(x)
Parameters of PCA-
n_components : int, float, None or string
Number of components to keep. If n_components is not set, all components are kept:
n_components == min(n_samples, n_features)
iterated_power : int >= 0, or ‘auto’, (default ‘auto’)
Number of iterations for the power method computed by svd_solver == ‘randomized’.
random_state : int, RandomState instance or None, optional (default None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
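Before plotting, it is worth checking how much of the original variance the two components retain. scikit-learn exposes this through the explained_variance_ratio_ attribute of a fitted PCA model; a quick check (for the unscaled Iris data the first component alone carries roughly 92% of the variance):
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# Total variance retained by the two components (roughly 0.97 for Iris)
print(pca.explained_variance_ratio_.sum())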
with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, name, col in zip((0, 1, 2),
                              ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                              ('blue', 'red', 'green')):
        # y holds integer labels, so we filter on the integer and show the species name
        plt.scatter(y_pca[y == lab, 0],
                    y_pca[y == lab, 1],
                    label=name,
                    c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()
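With both techniques in place they can also be combined: cluster in the full four-dimensional feature space (as done above) and use the two principal components only for display. A minimal sketch of that idea, colouring the PCA projection by the K-means assignments instead of the true species:
# Colour the PCA projection by cluster membership rather than by true class
plt.figure(figsize=(6, 4))
plt.scatter(y_pca[:, 0], y_pca[:, 1], c=y_kmeans)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-means clusters on the first two principal components')
plt.show()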
End Notes
I hope you are now comfortable writing K-means and PCA programs in Python for any dataset. Usually data is first explored by placing it in a visual context, also known as Data Visualization. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization.
We have not covered Data Visualization in depth in this article. In my next article I will cover Data Visualization for the same dataset and discuss when, how and why to use it.
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.
Regards!!!