Concise Basic Stats Series - Part XII: Introduction to Multivariate Analysis
Hello everyone, welcome back to the Concise Basic Stats series. I hope this article finds you well. It has been an incredible journey so far. In this episode, we are going to explore yet another topic in stats, one that I am certain contains relevant information for anyone dealing with data science or data analysis in the real world.
It is no surprise that modeling reality is not a simple task. In fact, it is inherently complex: things happen simultaneously, as a result of the interaction of many different variables. Most social or physical phenomena are multivariate problems. Whether it is conducting market research to identify the key variables that impact the launch of a new product, studying how multiple factors affect crop growth, predicting the value of a variable based on several other covariates, or even segmenting your audience into clusters that can benefit marketing campaigns, these are all opportunities to employ multivariate techniques. Often, the human mind is overwhelmed by the sheer bulk of the data available. The need to understand the relationships between many variables is the driver behind a body of methodologies called multivariate analysis. In this chapter, we are going to introduce some statistical techniques that are used to make sense of multivariate datasets.
Thankfully for us, we live in the age of computers, which is very convenient since multivariate techniques must, invariably, be implemented on a computer. The objectives of multivariate methods usually include the following:
- Data reduction or structural simplification: representing the phenomenon with fewer variables without sacrificing valuable information.
- Sorting and grouping: creating groups of "similar" objects or variables, such as clusters or market segments.
- Investigating the dependence among variables: understanding how the variables relate to one another.
- Prediction: using the relationships between variables to predict the values of one or more of them.
- Hypothesis construction and testing: validating assumptions or prior convictions about the phenomenon.
Descriptive Statistics
Now that we understand the purpose of some of these multivariate techniques, let's start with the basics and see how to appropriately perform Exploratory Data Analysis on multivariate data.
As we know, much of the information contained in the data can be assessed by calculating certain summary numbers; we've seen that many times before throughout this series. But how do things change in a multivariate scenario? Well, the only difference is that, instead of dealing with scalar values, we will be dealing with vectors (or arrays) and matrices. See the image below:
This image shows us the representation of the descriptive statistics in vector form. Let's say we have a dataset with n rows and p columns. When calculating the mean, you will end up with a mean value for each of the variables of your dataset, so you'll end up with p means. These mean values are then combined into a sample mean vector X_bar, which you see as the first item in the image.
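In symbols, with x_{jk} denoting the j-th observation on variable k, the sample mean vector simply collects the p column means:

\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}, \qquad \bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \quad k = 1, \ldots, p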
Another important descriptive statistic that we want to obtain, in order to get a better understanding of our data, is the variance. Here again you would obtain p different values for the variance, one for each variable in your data. These values are denoted S_11, ..., S_pp and sit on the diagonal of our sample variance-covariance matrix (see image above). However, we don't stop there. In a multivariate analysis, we are also interested in the covariance of each pair of variables. These values make up the rest of the entries of the aforementioned matrix. The covariance is a measure of the linear association between the measurements of variables i and j. As an example calculation, for variables 1 and 2 it is:
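s_{12} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)

(using the common 1/(n-1) convention; some texts divide by n instead).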
Think of it as the average product of the deviations from their respective means. If large values of one variable tend to occur together with large values of the other variable, and small values also occur together, S_12 will be positive. If large values of one variable (above its mean) occur with small values of the other variable, S_12 will be negative. If there is no particular association between the values of the two variables, S_12 will be approximately zero. Note that the population notation for the variance-covariance matrix is Σ.
The final descriptive statistic considered here is Pearson's correlation coefficient, which we should all be familiar with by now. The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. For variables i and k:
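r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}}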
If you go back to the "Arrays of Descriptive Statistics" image we've shown previously, it should be easy to understand why the diagonal of the correlation matrix is all equal to 1: the diagonal represents the correlation of each variable with itself, which is a perfect correlation and therefore equal to 1. All the values of the correlation matrix follow the known properties of Pearson's r. That is, r is bounded by -1 and 1, it measures the strength of the linear association, r = 0 implies a lack of linear association, and so on.
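To make this concrete, here is a minimal sketch of how these three arrays can be computed with NumPy; the data are simulated, so the numbers themselves are purely illustrative.

```python
import numpy as np

# Simulated data: n = 100 observations on p = 3 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))

x_bar = X.mean(axis=0)            # sample mean vector (length p)
S = np.cov(X, rowvar=False)       # p x p sample variance-covariance matrix
R = np.corrcoef(X, rowvar=False)  # p x p sample correlation matrix

print("Mean vector:\n", x_bar)
print("Covariance matrix:\n", S)
print("Correlation matrix (diagonal of 1s):\n", R)
```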
Random Vectors and Matrices
We are now going to explore a few concepts that emerge from extending the probability theory we've seen so far to the multivariate scenario. The first idea we want to take a look at is the concept of a random vector. Briefly, a random vector is an array whose elements are random variables. Similarly, a random matrix is a matrix composed of random vectors.
Now let's discuss mean vectors and covariance matrices. We know that for a random vector, each element is a random variable, and therefore each of those variables has its own probability distribution. In particular, each has its own marginal probability distribution, pertaining only to itself, without the effect of the other variables. Suppose X = [X_1, X_2, ..., X_p] is a random vector. The marginal means and variances of each element of X are defined as μ_i = E(X_i) and σ_i^2 = E(X_i - μ_i)^2, for i = 1, 2, ..., p, respectively. Note that the behavior of any pair of random variables, such as X_i and X_k, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance, expressed by σ_{i,k} = E[(X_i - μ_i)(X_k - μ_k)]. When i = k, the covariance becomes the marginal variance.
More generally, the collective behavior of the p random variables X_1, X_2, ..., X_p (or, equivalently, of the random vector X) is described by a joint probability distribution f(x_1, x_2, ..., x_p) = f(x). One important property of joint probability distributions is the notion of independence. The p continuous random variables X_1, X_2, ..., X_p are statistically independent if their joint density can be factored as
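f(x_1, x_2, \ldots, x_p) = f_1(x_1)\, f_2(x_2) \cdots f_p(x_p)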
meaning that the joint distribution can be expressed as the product of the marginal probability distributions. Statistical independence has an important implication for covariance: the factorization above implies that Cov(X_i, X_k) = 0 for every pair i ≠ k.
The converse is not true in general: Cov(X_i, X_k) = 0 does not imply that X_i and X_k are independent.
The Multivariate Normal Distribution
Most of the techniques encountered in multivariate analysis are based on the assumption that the data were generated from a multivariate normal distribution. Much like its univariate counterpart, the normal density is often a useful approximation to the true population distribution, although real world data are never exactly multivariate normal.
Recall the (univariate) normal density:
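f(x) = \frac{1}{\sqrt{2\pi\,\sigma^2}} \exp\!\left[ -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^{2} \right], \qquad -\infty < x < \infty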
The multivariate generalization for a p-dimensional normal density for the random vector X = [X_1, X_2, ..., X_p] has the form:
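f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\prime} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right]

where μ is the p×1 mean vector and Σ is the p×p variance-covariance matrix,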
and we denote this p-dimensional normal density by N(μ, Σ), which is analogous to the normal density in the univariate case. See below an example of a bivariate normal distribution.
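To give a feel for what such a figure is built from, here is a minimal sketch using SciPy; the mean vector and covariance matrix below are made-up illustrative values, not the ones behind the original plot.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for a bivariate (p = 2) normal distribution
mu = np.array([0.0, 0.0])        # mean vector
Sigma = np.array([[1.0, 0.6],    # covariance matrix with a
                  [0.6, 2.0]])   # positive covariance term

dist = multivariate_normal(mean=mu, cov=Sigma)

samples = dist.rvs(size=500, random_state=0)  # 500 draws, shape (500, 2)
density = dist.pdf(np.array([0.0, 0.0]))      # density evaluated at the origin

print(samples[:5])
print(f"Density at the origin: {density:.4f}")
```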
Multivariate Hypothesis Testing: Hotelling's T squared
This method is the multivariate counterpart of Student's t-test. Recall that in the t-test we obtain the t statistic via the following formula:
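t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}

where X_bar is the sample mean, s the sample standard deviation, μ_0 the hypothesized mean, and n the sample size.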
This statistic follows a t distribution provided that X is (at least approximately) normally distributed. When we want to generalize this test to p variables, we use the T^2 statistic, given by the formula below:
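T^2 = n\, (\bar{\mathbf{X}} - \boldsymbol{\mu}_0)^{\prime}\, \mathbf{S}^{-1}\, (\bar{\mathbf{X}} - \boldsymbol{\mu}_0)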
with X_bar being our sample mean vector, μ_0 the hypothesized mean vector, S^-1 the inverse of the sample variance-covariance matrix S, and n the sample size upon which our sample mean vector X_bar is based.
From there, we want to test whether the true mean equals our hypothesized mean, i.e. μ = μ_0. It is well known that when μ = μ_0, Hotelling's T squared follows the distribution below:
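T^2 \sim \frac{(n-1)\,p}{n-p}\, F_{p,\, n-p}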
in which F(p, n-p) represents the F-distribution with p degrees of freedom for the numerator and n-p for the denominator. Thus, if our hypothesized value μ_0 differs too much from the true mean μ, then for a particular significance level (α) our T^2 will fall in the upper tail of the above distribution. In other words, we obtain our result by comparing the observed T^2 score with the critical value [(n-1)p/(n-p)]·F_{p, n-p}(α), rejecting the null hypothesis when T^2 exceeds it.
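As a sketch of how this test might be carried out in practice with NumPy and SciPy (the data and the hypothesized mean vector below are simulated, purely for illustration):

```python
import numpy as np
from scipy.stats import f

# Simulated sample: n = 30 observations on p = 3 variables
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, 2.0, 3.0], scale=1.0, size=(30, 3))
mu_0 = np.array([1.0, 2.0, 3.0])     # hypothesized mean vector

n, p = X.shape
x_bar = X.mean(axis=0)               # sample mean vector
S = np.cov(X, rowvar=False)          # sample variance-covariance matrix

diff = x_bar - mu_0
T2 = n * diff @ np.linalg.inv(S) @ diff   # Hotelling's T^2 statistic

# Rescale T^2 to an F statistic and compare against F(p, n - p)
F_stat = (n - p) / (p * (n - 1)) * T2
p_value = f.sf(F_stat, p, n - p)

print(f"T^2 = {T2:.3f}, F = {F_stat:.3f}, p-value = {p_value:.3f}")
```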
Principal Component Analysis (PCA) & Dimensionality Reduction
Principal Component Analysis (PCA) is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Remember how, at the beginning of this text, we saw that one of the objectives of multivariate analysis is to reduce high-dimensional data into fewer components without losing crucial information? That is what we are looking for here, with the help of PCA. Doing this helps tremendously not only with the interpretation of the data (as we have fewer variables to worry about at the end of the process) but also with avoiding unnecessary noise and complexity in our predictive models. Note that the key here is preserving variability.
But what do we mean by components? Let's imagine a scenario where we have a dataset with p variables; in multivariate analysis terms, we have p components. In order to understand the total system variability, we would need to consider all p components of our dataset. However, often much of this variability can be accounted for by a small number k of principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original dataset can be reduced from n measurements on p variables to n measurements on k principal components (p > k).

In more technical terms, what we are doing is selecting the hyperplane that lies closest to the data and then projecting the data onto it. The unit vector that defines the i-th best lower-dimensional hyperplane is called the i-th principal component. PCA is an algorithm that helps us choose the best hyperplane to project our data onto, the one that preserves the largest amount of variance. It does that by using a matrix factorization technique known as Singular Value Decomposition (SVD), which decomposes the original dataset X into the matrix product of three matrices, U Σ V^t, where V contains the principal components we are looking for.
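As a quick sketch of that SVD view, using NumPy on simulated data (the shapes here are arbitrary), the rows of V^t obtained from the centered data are the principal component directions:

```python
import numpy as np

# Simulated data: n = 200 observations on p = 4 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X_centered = X - X.mean(axis=0)     # PCA is performed on centered data

# X_centered = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                               # each row is a principal component
explained_variance = s**2 / (X.shape[0] - 1)  # variance captured by each component

print(components)
print(explained_variance)
```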
Eigenvalues and Eigenvectors for PCA
The core of PCA is built on the concepts of eigenvectors and eigenvalues.
To help us understand these two very important concepts, let's take a lower-dimensional example. Suppose we have a (2-dimensional) scatter plot of two random variables, and a line of best fit is drawn through these points. This line of best fit shows the direction of maximum variance in the dataset. The eigenvector gives the direction of that line, while the eigenvalue is a number that tells us how spread out the data are along that direction. Through these two objects, we can describe the hyperplane (in this case, a one-dimensional axis) onto which we can project our data. That would be the black line in the image below, and it is our first principal component.
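In symbols, an eigenvector e and its eigenvalue λ of the sample covariance matrix S satisfy

S\,\mathbf{e} = \lambda\,\mathbf{e}

and, for a unit-length e, the variance of the data projected onto e is exactly λ. This is why the eigenvector with the largest eigenvalue defines the first principal component.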
To get the subsequent principal components, we take the eigenvectors that are perpendicular (orthogonal) to the previous ones. This ensures that the eigenvectors together span the whole x-y plane.
At the end of this, we end up with a new coordinate system, since the second principal component (the line perpendicular to that first black line) becomes our new y-axis. We can now rotate our data to fit these new axes and compute its new coordinates. We do that by multiplying the original (x, y) data by the eigenvectors. The reoriented data are known as scores. Once we have transformed the variables, we can drop the ones that account for less of the data's variability, based on the eigenvalues. Congratulations! We have successfully performed PCA.
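Putting the whole recipe together, here is a minimal sketch with NumPy on simulated data (the shapes and the choice of k are arbitrary): eigendecompose the covariance matrix, sort by eigenvalue, rotate the centered data into the new axes to get the scores, and keep only the top k components.

```python
import numpy as np

# Simulated data: n = 150 observations on p = 5 variables
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5))
X_centered = X - X.mean(axis=0)

S = np.cov(X_centered, rowvar=False)           # p x p covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigh handles symmetric matrices

# Sort components from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # columns are principal components

scores = X_centered @ eigenvectors             # data expressed in the new axes

k = 2                                          # keep only the first k components
reduced = scores[:, :k]
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(f"The first {k} components explain {explained:.1%} of the total variance")
```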
Final Comments
I hope this was a useful, concise introduction to the multivariate analysis domain. As with most advanced topics, this is a huge area with lots more to cover, and it might be the subject of a dedicated series of articles in the future. If this topic piqued your interest and you wish to learn more, I would recommend continuing your journey by looking into topics like MANOVA, the multivariate analog of the Analysis of Variance (ANOVA), as well as Discriminant Analysis and Factor Analysis. For now, this is what I've got for you. As always, it was a pleasure to sit here reviewing and writing about these subjects. I am not kidding when I say that this helps me more than anything when consolidating previously acquired knowledge. I hope it somewhat helped you, too. Thank you very much for your patience and for reading all the way through. Please share if you liked this content, and have a wonderful day!
Best,
Luiz.