Linear Discriminant Analysis


Linear discriminant analysis (LDA) assigns observations to predefined categories, and it is used for both dimensionality reduction and classification. An LDA model consists of a discriminant function (or, for more than two groups, a set of discriminant functions). Each function is a linear combination of the independent variables, much like a multiple regression equation, chosen to separate the categories as well as possible. The resulting model can be used for prediction, that is, for assigning new cases to the defined groups.
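As a sketch of what such a function looks like, a single discriminant function can be written in the same form as a regression equation. The symbols are illustrative assumptions, not notation from this article: $D$ is the discriminant score, $X_1, \dots, X_p$ are the independent variables, and $b_0, \dots, b_p$ are the estimated coefficients.

```latex
\[
D = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p
\]
```

A new case is then assigned to a group according to its score $D$, for example by comparing it with a cutoff value in the two-group case.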

Suppose we have data in a 2D space. The data consist of two classes, shown here without any discrimination between them.

Data in two dimensions

Now the researcher can find a vector direction that best discriminates between these two classes.

Data in one dimension
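For two classes, one standard way to find such a direction is Fisher's solution, which takes the direction proportional to the inverse within-class scatter matrix times the difference of the class means. The following is a minimal sketch under stated assumptions: the data are synthetic, and the names `class_a`, `class_b`, and `fisher_direction` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 2D classes sharing a covariance but with different means.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
class_a = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200)
class_b = rng.multivariate_normal(mean=[2.0, 0.0], cov=cov, size=200)


def fisher_direction(a, b):
    """Return the unit vector proportional to S_W^{-1} (mean_b - mean_a)."""
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    s_w = (a - mean_a).T @ (a - mean_a) + (b - mean_b).T @ (b - mean_b)
    w = np.linalg.solve(s_w, mean_b - mean_a)
    return w / np.linalg.norm(w)


w = fisher_direction(class_a, class_b)

# Projecting each 2D point onto w gives its coordinate in one dimension.
proj_a = class_a @ w
proj_b = class_b @ w

print("direction:", w)
print("projected class means:", proj_a.mean(), proj_b.mean())
```

The projected values `proj_a` and `proj_b` are the one-dimensional representation of the original two-dimensional points.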

PCA aims to find the most accurate representation of the data in a lower-dimensional space spanned by the directions of maximum variance, but those directions do not necessarily separate the classes. Discriminant analysis, on the other hand, represents the data in a lower dimension while preserving the information that discriminates between the classes of the dataset.

In short, PCA looks for the directions of greatest variation in the data, while LDA tries to maximize the separation between known categories.
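The difference can be illustrated with a small experiment. This is a sketch on synthetic data, constructed on purpose so that the direction of greatest variance is not the direction that separates the classes; the variable names and the `fisher_ratio` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Both classes are elongated along (1, 1), but their means differ along (1, -1).
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
x1 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
x2 = rng.multivariate_normal([1.5, -1.5], cov, size=300)
X = np.vstack([x1, x2])

# PCA direction: leading eigenvector of the covariance of all points.
# Class labels are ignored. eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
w_pca = eigvecs[:, -1]

# LDA direction (two-class Fisher solution): uses the class labels.
m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
s_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
w_lda = np.linalg.solve(s_w, m2 - m1)
w_lda = w_lda / np.linalg.norm(w_lda)


def fisher_ratio(w):
    """Squared distance between projected means over projected within-class scatter."""
    between = (w @ (m2 - m1)) ** 2
    within = w @ s_w @ w
    return between / within


print("separation along PCA direction:", fisher_ratio(w_pca))
print("separation along LDA direction:", fisher_ratio(w_lda))
```

On data like this, the PCA direction follows the long axis of the point cloud, while the LDA direction follows the axis along which the class means differ.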

Means and Variances of two classes
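The class means and variances enter LDA through a ratio. For two classes, one common formalization is Fisher's criterion. The notation is an assumption introduced here for illustration: $\tilde{\mu}_1$ and $\tilde{\mu}_2$ are the class means after projection onto a direction $w$, and $\tilde{s}_1^2$ and $\tilde{s}_2^2$ are the corresponding within-class scatters of the projected data.

```latex
\[
J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
\]
```

A direction $w$ with a large $J(w)$ pushes the projected class means far apart while keeping each class tightly clustered.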

To make this concrete, consider an example: a researcher needs to classify students based on their achievement. For each enrolled student, information such as test score is collected. The researcher can combine this information into a discriminant function and determine how well the students can be discriminated between groups. In the end, the students are assigned to groups, and the percentage of students correctly classified can be reported. A new student can then be classified using the resulting model.
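The student example might look like the following sketch using scikit-learn's `LinearDiscriminantAnalysis`. Everything about the data is an assumption made for illustration: the two features (test score and attendance percentage), the two achievement groups, and all the numbers are synthetic, not results from a real study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)

# Hypothetical data: [test score, attendance %] for two achievement groups.
shared_cov = [[60.0, 10.0], [10.0, 40.0]]
low = rng.multivariate_normal([55.0, 70.0], shared_cov, size=100)
high = rng.multivariate_normal([80.0, 88.0], shared_cov, size=100)

X = np.vstack([low, high])
y = np.array([0] * 100 + [1] * 100)  # 0 = low achievement, 1 = high achievement

# Fit the discriminant model on students whose group is known.
model = LinearDiscriminantAnalysis()
model.fit(X, y)

# Percentage of students correctly classified (on the training data).
accuracy = model.score(X, y)
print(f"correctly classified: {accuracy:.1%}")

# Classify a newly enrolled (hypothetical) student.
new_student = np.array([[78.0, 85.0]])
predicted_group = model.predict(new_student)[0]
print("predicted group for new student:", predicted_group)
```

Accuracy measured on the same data used for fitting tends to be optimistic, so a held-out set or cross-validation gives a fairer estimate.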

Assumptions

The assumptions are the same as those for MANOVA. LDA is quite sensitive to outliers. The independent variables must be normally distributed within each group. The variances of the predictors must be the same across groups (homoscedasticity). LDA assumes that the group covariance matrices are equal; quadratic discriminant analysis (QDA) may be used when they are not. The sample is randomly selected, and each observation's score on a variable is assumed to be independent of the scores of all other observations. Group membership must be mutually exclusive (a case cannot belong to more than one group).

LDA may still be reliable when using dichotomous variables (where multivariate normality is often violated).
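The equal-covariance assumption can be inspected informally by comparing the per-group covariance matrices. The sketch below uses synthetic data and hypothetical names (`group_a`, `group_b`); it is an eyeball check only, not a formal test of homogeneity such as Box's M.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two hypothetical groups drawn with the same covariance matrix.
true_cov = [[1.0, 0.3], [0.3, 1.0]]
group_a = rng.multivariate_normal([0.0, 0.0], true_cov, size=150)
group_b = rng.multivariate_normal([2.0, 1.0], true_cov, size=150)

# Sample covariance matrix of each group (rows are observations).
cov_a = np.cov(group_a, rowvar=False)
cov_b = np.cov(group_b, rowvar=False)

# Pooled within-group covariance, weighted by degrees of freedom.
# This is the single covariance matrix that LDA assumes all groups share.
n_a, n_b = len(group_a), len(group_b)
pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)

# Largest element-wise difference between the two group covariances.
gap = np.abs(cov_a - cov_b).max()

print("covariance, group A:\n", cov_a)
print("covariance, group B:\n", cov_b)
print("largest element-wise gap:", gap)
```

If the group covariance matrices differ substantially, that points toward QDA rather than LDA.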

The steps in LDA

· Formulate the problem before the analysis.

· Estimate the discriminant function coefficients.

· Determine the significance of the discriminant functions.

· Interpret the results obtained.

· Assess the validity of the results.
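The steps above can be sketched end to end. This is one possible workflow under stated assumptions, not a prescribed procedure: it uses scikit-learn and its bundled iris dataset as stand-in data, takes Wilks' lambda (the determinant of the within-group scatter divided by the determinant of the total scatter) as the significance statistic, and uses 5-fold cross-validation as the validity check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Step 1. Formulate the problem: predictors X and known group labels y.
X, y = load_iris(return_X_y=True)

# Step 2. Estimate the discriminant function coefficients.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("coefficients (one row per class):\n", lda.coef_)
print("intercepts:", lda.intercept_)

# Step 3. Significance: Wilks' lambda = det(W) / det(T).
# Values near 0 mean the groups are well separated; near 1, barely separated.
grand_mean = X.mean(axis=0)
T = (X - grand_mean).T @ (X - grand_mean)  # total scatter
W = np.zeros_like(T)                       # within-group scatter
for label in np.unique(y):
    xc = X[y == label]
    W += (xc - xc.mean(axis=0)).T @ (xc - xc.mean(axis=0))
wilks_lambda = np.linalg.det(W) / np.linalg.det(T)
print("Wilks' lambda:", wilks_lambda)

# Step 4. Interpretation: share of between-group variance
# captured by each discriminant function.
print("explained variance ratio:", lda.explained_variance_ratio_)

# Step 5. Validity: accuracy on held-out folds.
cv_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print("cross-validated accuracy:", cv_scores.mean())
```

Turning Wilks' lambda into a p-value requires a further approximation (for example a chi-square or F transformation), which is omitted here.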

When to use LDA

· When the classes are well separated, LDA is preferable to logistic regression, because the logistic regression parameter estimates become unstable.

· When n is small and the distribution of the predictors is approximately normal in each class, in which case LDA is again more stable than logistic regression.

· When there are more than two response classes, because LDA also provides low-dimensional views of the data.
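The last point, low-dimensional views with more than two classes, can be sketched as follows. The example assumes scikit-learn and uses its bundled wine dataset (three classes, thirteen features) purely as stand-in data.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 178 samples, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)

# With K classes, LDA yields at most K - 1 discriminant axes (here, 2).
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)

print("original shape:", X.shape)
print("projected shape:", X_2d.shape)

# The same fitted model also works as a three-class classifier.
training_accuracy = lda.score(X, y)
print(f"training accuracy: {training_accuracy:.1%}")
```

The two columns of `X_2d` can be plotted against each other, colored by `y`, to view the three classes in a plane.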
