Reducing Dimensionality: PCA
Reeshabh Choudhary
Student. Software Architect. Author : Objects, Data & AI; Concurrency & Parallelism. Interested in History & Sports.
Software Architecture Series — Part 34.
Earlier we learned how to capture dimensionality in real-life datasets to make our neural networks effective. However, as datasets become more complex, it becomes increasingly difficult to analyse them using traditional methods. In Machine Learning, as in real life, we try to prune away unnecessary information and retain the important bits to focus on the bigger picture. Not all the features in a dataset are informative or useful.
For example, consider image processing, where pixel measurements form a high-dimensional vector space. Yet most images compress very well, which means the relevant information is captured in a much lower-dimensional space. If a dataset has 'n' features, we call it n-dimensional. Data is easy to visualize in 2-D or 3-D, but beyond that visualization becomes difficult. In real life, data is multi-dimensional, and features of the data can be correlated with each other.
In order to make sense of such data, we try to reduce its dimensions using model-based techniques which reference the data and try to gain insight from it. Principal Component Analysis (PCA) is one such technique. It applies Linear Algebra to reduce the dimensionality of a dataset containing a large number of interrelated variables, while retaining as much of the variation present in the dataset as possible.
PCA aims to capture and extract the most significant patterns and variations present in the original data. By transforming the data into a new set of orthogonal features (principal components), PCA identifies the directions along which the data exhibits the most variability. These principal components represent the essential information that contributes to the observed variations in the dataset. These combinations are chosen in a way that the first principal component explains the maximum possible variance, the second principal component explains the maximum variance remaining after the first component, and so on. Each principal component is orthogonal to the others, ensuring that they are uncorrelated.
By reducing the number of features, less relevant and noisy features are discarded. This reduction in noise can mitigate the risk of overfitting, as the model becomes less likely to learn from irrelevant variations and focuses on capturing the most significant patterns in the data. It also helps speed up the training of an ML algorithm. High-dimensional data can lead to increased computational complexity during model training. By reducing the number of features through PCA, the training process becomes more efficient and faster. Fewer features require fewer computations and less memory, resulting in quicker training times and more efficient use of computing resources. Moreover, this reduction in dimensions simplifies data visualization, enabling the creation of scatter plots, heatmaps, and other visualizations that are easier to interpret and communicate. Such simplified visualizations can aid in understanding patterns, clusters, and relationships in the data.
Mathematical explanation of PCA
Principal Component Analysis (PCA) can be interpreted as a statistical application of the Singular Value Decomposition (SVD) on the covariance matrix of the data. SVD is a matrix factorization technique that decomposes a matrix into three component matrices, which can be leveraged to reveal the underlying structure of the data. It provides a systematic way to capture a low-dimensional approximation of high-dimensional data in terms of the relevant or dominant patterns present in the dataset.
Suppose we want to analyse a dataset X, which can be represented in a matrix form, as below.
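The original figure is not reproduced here, but a generic sketch of that matrix form, assuming n observations as rows and p features as columns, looks as follows:

X = | x₁₁  x₁₂  …  x₁ₚ |
    | x₂₁  x₂₂  …  x₂ₚ |
    |  ⋮    ⋮   ⋱   ⋮  |
    | xₙ₁  xₙ₂  …  xₙₚ |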
Here, different columns represent features or measurements from datasets or experiments.
The SVD is a unique matrix decomposition that exists for every complex-valued matrix X:
X = UΣVᵀ
Here, X is the original matrix that we want to decompose. It could represent data, a transformation, or any other type of matrix.
U is an orthogonal matrix (i.e., its columns are mutually perpendicular unit vectors). It represents the left singular vectors of X and captures the relationships between the rows of X.
Σ is a diagonal matrix containing the singular values of X. Singular values are non-negative numbers that indicate the importance of each singular vector in U and the corresponding singular vector in Vᵀ. They are usually arranged in decreasing order along the diagonal of Σ.
Vᵀ is the transpose of an orthogonal matrix V. It represents the right singular vectors of X and captures the relationships between the columns of X.
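As a quick illustration (not from the article itself), the three factors can be computed with NumPy's np.linalg.svd on an arbitrary example matrix:

import numpy as np

# An arbitrary 5x3 matrix: 5 observations (rows), 3 features (columns).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.6]])

# Economy-size SVD: U is 5x3, S holds the 3 singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print("Singular values (decreasing):", S)

# Sanity check: U diag(S) Vt reconstructs X up to floating-point error.
print("Reconstruction error:", np.max(np.abs(X - U @ np.diag(S) @ Vt)))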
PCA can be seen as a special case of SVD, where the data matrix X is centred (mean-subtracted) and standardized before applying SVD. Standardization is done to ensure that all features have the same scale and prevents any feature from dominating the analysis. The resulting orthogonal matrix V in SVD corresponds to the principal components in PCA. The matrix U contains the left singular vectors of X, which capture the relative alignment of the original data points with the principal components. The projected data in the principal component space is given by the matrix UΣ; the rows of UΣ represent each data point in the new coordinate system defined by the principal components. Each principal component is a linear combination of the original features and is orthogonal (uncorrelated) to the other components. Ultimately, we get a hierarchical representation of the data in terms of a new coordinate system defined by the dominant correlations within the data.
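A minimal sketch of this correspondence (not the article's own code; it assumes a generic NumPy array X with observations as rows):

import numpy as np

def pca_via_svd(X, n_components=2):
    # Centre and standardize so that no feature dominates because of its scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data matrix.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    # Rows of V^T (columns of V) are the principal component directions.
    components = Vt[:n_components]
    # U * Sigma gives the coordinates of each observation in principal component space.
    projected = U[:, :n_components] * S[:n_components]
    return components, projected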
Although it provides a robust and stable decomposition, even for matrices that are not symmetric, SVD can be computationally expensive, especially for large datasets, as it involves calculating all singular values and singular vectors. This is related to the fact that the eigenvalues of a matrix are the roots of its characteristic polynomial, which can be a very difficult problem to solve.
Hence, a more computationally efficient approach is to perform PCA by computing the covariance matrix and its eigenvectors directly. Eigenvector-based PCA is often cheaper than SVD, especially when the number of features is much smaller than the number of observations, since the covariance matrix is only p × p. It may require less memory and fewer computational resources than SVD. However, we must always keep the limitation of this approach in mind: it works only through the covariance (or correlation) matrix, and the SVD-based approach is generally more numerically stable and less prone to precision errors.
PCA Implementation
Let us now implement PCA on a dataset using the eigenvector-based approach. Although libraries like scikit-learn have built-in models to perform PCA on a given dataset, we will take the route of explicit programming to understand the internal details of the implementation. The following is an implementation of PCA on the Iris dataset, which is widely available on the internet and contains information about various attributes of different species of iris flowers.
The Iris dataset was introduced by the British biologist and statistician Ronald A. Fisher in 1936. It consists of measurements of four features (attributes) of iris flowers from three different species: Iris setosa, Iris versicolor, and Iris virginica. The four features are:
· Sepal Length: The length of the sepal (a leaf-like structure at the base of the flower).
· Sepal Width: The width of the sepal.
· Petal Length: The length of the petal (a colourful structure that attracts pollinators).
· Petal Width: The width of the petal.
Each species of iris has its own distinct characteristics, and the goal is often to use the feature measurements to classify the flowers into their respective species. By applying PCA to the Iris dataset, we can gain a better understanding of the data’s structure, relationships, and patterns. Visualizing the data in a reduced-dimensional space can reveal clusters, separations, and potential overlaps between different species of iris flowers. By projecting the data onto the principal components, one can compare the distribution of different iris species in the reduced-dimensional space. This might reveal how well-separated the species are or if there’s overlap between them.
Let us have a glance at the dataset.
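A small snippet to load the dataset and take that glance (a sketch, assuming scikit-learn and pandas are installed):

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())       # first few rows: four measurements plus the species label
print(df.describe())   # summary statistics of the four features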
Steps to perform PCA
The assumption of linear correlations between features implies that PCA is most effective when relationships between features are linear or can be approximated as such. PCA aims to identify common factors in the input dataset, seeking to capture the underlying structure or patterns in a more compact representation.
PCA does not directly interpret correlations between features. It identifies the directions along which the data varies the most. When performing PCA, a correlation or covariance matrix is used as input; from it, the eigenvectors and eigenvalues are computed.
Using the correlation matrix can be advantageous when the scales of the features differ significantly, or when the emphasis is on relationships between features rather than their individual variances. If the scales are similar and the focus is on variability, the covariance matrix is used. The correlation matrix is a standardized covariance matrix, which means that the variances of the variables have been normalized.
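A quick check of that last statement (a sketch using the Iris features): the correlation matrix of the raw data equals the covariance matrix of the standardized data.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Correlation matrix of the raw features.
corr = np.corrcoef(X, rowvar=False)

# Covariance matrix of the standardized features reproduces the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(corr, np.cov(Z, rowvar=False)))   # True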
Standardize the data: Given a dataset with n observations and p features, the first step is to standardize the data by subtracting the mean and dividing by the standard deviation for each feature. This ensures that all features have the same scale and prevents any feature from dominating the analysis.
Standardized feature: zᵢⱼ = (xᵢⱼ − μⱼ) / σⱼ
Where xᵢⱼ is the jth feature value of the ith observation, μⱼ is the mean of the jth feature, and σⱼ is the standard deviation of the jth feature.
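In code, this step might look as follows (a sketch applied to the Iris feature matrix):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data               # shape (150, 4): n observations, p features
mu = X.mean(axis=0)                # per-feature means
sigma = X.std(axis=0, ddof=1)      # per-feature sample standard deviations

# z_ij = (x_ij - mu_j) / sigma_j
Z = (X - mu) / sigma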
Calculate the Covariance Matrix: The covariance matrix is computed from the standardized data to capture the relationships between features.
Covariance between features j and k: cov(Xⱼ, Xₖ) = (1 / (n − 1)) Σᵢ₌₁ⁿ (zᵢⱼ − z̄ⱼ)(zᵢₖ − z̄ₖ)
Where zᵢⱼ and zᵢₖ are the standardized values of features j and k for observation i, and z̄ⱼ, z̄ₖ are the respective means.
Here, we try to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them.
Since the covariance of a variable with itself is its variance (cov(xⱼ, xⱼ) = var(xⱼ)), the main diagonal (top left to bottom right) actually contains the variances of each initial variable. And since covariance is commutative (cov(xⱼ, xₖ) = cov(xₖ, xⱼ)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and lower triangular portions are equal.
The covariance values that appear as entries of the matrix tell us about the correlation between two variables. If the value is positive, the two variables are positively correlated, meaning they increase or decrease together. If the value is negative, one increases when the other decreases (inverse correlation).
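Continuing the sketch, the covariance matrix of the standardized data Z from the previous step:

import numpy as np

# C has shape (4, 4); np.cov applies the 1/(n - 1) normalisation by default.
C = np.cov(Z, rowvar=False)

# The diagonal holds the (unit) variances of the standardized features;
# the off-diagonal entries are symmetric, as described above.
print(np.round(C, 3))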
Compute Eigenvectors and Eigenvalues: The next step is to compute the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions (principal components) along which the data has the most variance, and eigenvalues quantify the amount of variance along each eigenvector.
Cv = λv
Where C is the covariance matrix, v is the eigenvector, and λ is the eigenvalue.
Eigenvectors represent the principal components and eigenvalues indicate the amount of variance explained by each principal component. The eigenvector with the highest eigenvalue captures the direction of maximum variability in the data.
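Continuing the sketch with NumPy (np.linalg.eigh is suitable here because the covariance matrix C is symmetric):

import numpy as np

# Eigen decomposition of the symmetric covariance matrix C from the previous step.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns eigenvalues in ascending order; reorder them in decreasing order.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the eigenvector for eigenvalues[i]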
Principal components are new variables that we create by combining the original variables in a linear manner. Imagine we have several attributes (variables) that describe an object, like its height, weight, age, and so on. PCA allows us to construct new attributes (principal components) that are mixtures of these original attributes. PCA is designed to create these new principal components in a way that they are uncorrelated with each other. In other words, they are independent and don’t carry redundant information. This is valuable because it simplifies the data representation and makes it easier to analyse.
If a principal component has high loadings (absolute value) for a set of features, it suggests that these features are strongly correlated or have similar patterns of variation.
Selection of Principal Components: PCA aims to maximize the amount of information packed into the first principal component. This component captures the most significant patterns or trends in the data. The second principal component captures the remaining patterns that are uncorrelated with the first component, and so on. This hierarchical arrangement ensures that we don’t lose much information as we move to subsequent components. We achieve dimensionality reduction by discarding the components with low information and considering the remaining components as our new variables.
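One common way to decide how many components to keep is to look at the fraction of total variance each eigenvalue explains. A sketch continuing the example (the 95% threshold is an arbitrary choice for illustration):

import numpy as np

# Fraction of the total variance explained by each principal component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_variance_ratio, 3))

# Keep the first k components, e.g. enough to explain roughly 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained_variance_ratio), 0.95)) + 1
V_k = eigenvectors[:, :k]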
Transform the Data: The final step is to project the standardized data onto the chosen principal components to obtain the reduced-dimensional representation.
Y = XV
Where Y is the transformed data matrix, X is the standardized data matrix, and V contains the eigenvectors (principal components) as columns.
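And the projection itself, continuing the sketch:

# Project the standardized data Z onto the selected principal components.
Y = Z @ V_k            # shape (150, k): each row is one observation in PC coordinates
print(Y[:5])           # first few observations in the reduced space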
In summary, PCA involves the computation of eigenvectors and eigenvalues of the covariance matrix to identify the principal components, which represent the most significant directions of variance in the data. These principal components can then be used to transform the data into a lower-dimensional space, achieving dimensionality reduction while retaining the essential information in the dataset.
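As a cross-check, scikit-learn's built-in PCA should produce essentially the same result (component signs may flip, which is expected since eigenvectors are only defined up to sign; StandardScaler also uses the population standard deviation, so values can differ by a tiny scale factor):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Z_sk = StandardScaler().fit_transform(load_iris().data)
Y_sk = PCA(n_components=2).fit_transform(Z_sk)
print(Y_sk[:5])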
A sample program to perform and visualize PCA is available at the following repo link:
Data_Cluster_Tutorial/Program/PCA at main · reeshabh90/Data_Cluster_Tutorial (github.com)
If you are a programmer and want to cover the basics of Machine Learning and AI to progress in your career, you can refer to my book "Objects, Data & AI". The print version is available on Amazon.