Statistics with Excel, the Correlation
Image credits by Clark Tibbs on Unsplash

Statistics with Excel, the Correlation

Knowing whether a particular agent can influence the diffusion of a virus or a macro-political event can affect the trend of a financial asset, are problems that a priori can not have immediate and predictable response but whose solution makes difference for the common interest.

In general, the analysis of data that describe a phenomenon and their realistic interpretation are the basis of scientific research that has as its goal the understanding of the phenomenon itself and its most likely future development. The scientific approach generally tends to find what and how many may be the causes that lead to the manifestation of the event. However, the cause-effect relationship is not always evident and, therefore, it is not so immediate to find an optimal solution to a contingent problem.

This article talks about a mathematical method to identify whether between two quantities, of which any kind of association is a priori ignored, there can be a relationship such that the variation of one involves the variation of the other and, if so, by what measure. That is the correlation. We will use a powerful and simple tool as Excel.

The correlation … what's it for

It often happens that describing the behavior of a variable, perhaps dependent on others, is not sufficient for the analysis of a much more complex phenomenon, in which there are other parameters whose mutual relationship with the studied variable is not yet known.

In this case we have to verify if there is a relationship between the characteristics of the phenomenon (the variables) and, if it is so, of what intensity. It is necessary to study the correlation between the variables.

In the correlation analysis, the variables are indicated with X1 and X2 and not X and Y, as in the case of cause-effect relation that we all know, just to highlight the absence of the concept of functional dependence.

We don't yet know if and what kind of relationship exists between them. We use the correlation to assess if there is a linear relationship between them.

If the analysis is negative, i.e. there is no linear relationship, this doesn't mean that there can be no other relationship. For example, there could be a polynomial relationship of a degree > 1. We should therefore use other, more sophisticated techniques to verify this.

Chart Analysis with Excel

In general the chart is the first and most important analysis tool that allows us to make the first considerations about the trend of a phenomenon. Also in this case the first thing to do is to analyze the chart of distributions of the values of the two quantities and thus visually verify if there can be a relation between the two variables.

We normally use the Scatter Chart or Bubble Chart present in the Excel charts collection.

After obtaining the data of the two quantities, using the appropriate data mining techniques and after verifying the number of values to be represented, creating a scatter chart is very simple:

1.        Select the data you want to plot in the scatter chart

2.        Click the Insert tab, and then click Insert Scatter (X, Y) or Bubble Chart.

3.        Click Scatter

Insert Scatter Chart

Figure 1 - Create a Scatter chart

Let's take an example. We want to compare the pairs of price values of two different stocks to determine whether there is a correlation between the two stocks. For educational purposes only, we try to compare the closing prices of Intel and AMD. We create a first scatter by putting the Intel values on the X-axis and the AMD values on the Y-axis. Then a second one doing the opposite, i.e. putting the AMD values on the X-axis and the Intel values on the Y-axis.

In both cases we add a trendline to scatter chart and display R^2 value, that measures the soundness of the linear model obtained, and Equation.

No alt text provided for this image

Figure 2 – Scatter chart of INTC (X-axis) and AMD (Y-axis)

No alt text provided for this image

Figure 3 – Scatter chart of AMD (X-axis) and INTC (Y-axis)

What we get seems impressive. Although the equations are different, the value of R^2 is the same in both cases. This means that the correlation makes no distinction on which quantity is reported in the horizontal axis and which in the vertical one. The correlation studies the relationship between the quantities in an absolute sense, which is in the absence of an a priori hypothesis of a certain functional dependence between the represented variables.

Generally its use is appropriate when the scatter has an oval shape as in the examples shown in the two previous figures.

Correlation coefficient

The explanation for the above fact lies in the very nature of the correlation as the measure of the intensity of the associations and not the dependency of the relationship. That is, it is said that between the two variables there is a correlation when the tendency of a variable to vary with a more or less high level of intensity as a function of another one occurs, but no diagnosis of the type of relationship is made.

Sometimes the variations of one variable derive from those of the other (e.g. the relation between heritable somatic characters), others are common (relation between stature and individual weight), sometimes mutually dependent (relation between price and demand: the price influences to modify the demand, the demand influences to modify the price).

The correlation coefficient measures the intensity of these relationships. There are different types, corresponding to different calculation methods and, therefore, there are different formulas for this. We consider the Pearson correlation coefficient with the following formula:

No alt text provided for this image

The following can be deduced from the analysis of the formula:

  • rxy > 0 means that X and Y are directly correlated
  • rxy = 0 means that X and Y are not correlated
  • rxy < 0 means that X and Y are inversely correlated

The absolute value (|0,X|) indicates the degree of interdependence:

  • rxy < 0.3 weak correlation
  • rxy >= 0.3 And rxy<= 0.7 moderate correlation
  • rxy > 0.7 strong correlation

Correlation with Excel

From Pearson's formula it can be seen that the correlation is based on the concept of covariance between two variables. Therefore you can manually construct this coefficient step-by-step using the Excel Statistical Functions:

STDEV.P(number1,[number2],...)

where:

Number1: (Required) The first number argument corresponding to a population.

Number2 ... : (Optional) Number arguments 2 to 254 corresponding to a population. You can also use a single array or a reference to an array instead of arguments separated by commas.

and

COVARIANCE.P(array1,array2)

where:

Array1: (Required) The first cell range

Array2: (Required) The second cell range

Otherwise you can directly use the CORREL function with the following syntax:

CORREL(array1, array2)

where

Array1: (Required) A range of cell values

Array2 : (Required) A second range of cell values

Conclusion

In any field of application, medical, economic, financial, political, etc., correlation analysis is the first step in the analysis of dependence between variables. In this short article, we have seen how to do it via Excel, a powerful and, at the same time, popular tool.

From here we can then deepen the relationship through more specific analysis methodologies, such as, for example, regression (simple, linear, multiple). These are tools of mathematics and statistics and require the user's knowledge of their meaning.

Today we also have innovative Artificial Intelligence techniques that allow us to delegate to the machine the duty of finding meaningful relationships between data. But we will talk about this, if you want, in another article

What we’ve seen

  • Data analysis
  • Correlation analysis
  • Data processing
  • Scatter charts
  • Correlation coefficient
  • Excel statistical functions

If you found this article interesting and useful I will be happy if you want to share it ??

要查看或添加评论,请登录

Donata Petrelli的更多文章

  • Il Quantum Computing

    Il Quantum Computing

    Un possibile percorso di studio per conoscere l’argomento Dall’epoca dei computer grandi quanto una stanza agli attuali…

    9 条评论
  • Intelligenza Artificiale e Esports

    Intelligenza Artificiale e Esports

    L’evoluzione delle tecniche di analisi attraverso lo Sport Osservare l’evoluzione dello Sport è una diversa prospettiva…

    2 条评论
  • "The Black Swan" prediction

    "The Black Swan" prediction

    Predictive mathematical models in "particular" contexts We are living one of the most complex moments in history that…

    4 条评论
  • Bayes with Excel

    Bayes with Excel

    From company manager to family man, from politician to school director, in critical situations we all have to make…

    2 条评论
  • Correlation with Bayes

    Correlation with Bayes

    Never before must we be able to read and interpret data for our own good. From the biological sector to the medical…

    5 条评论
  • Chi ha spostato la maionese dal frigo?

    Chi ha spostato la maionese dal frigo?

    Archiviare i dati conviene sempre Quanto tempo perdiamo nel cercare oggetti che non ricordiamo dove li abbiamo riposti…

    1 条评论
  • Morphological optimization of Neural Networks

    Morphological optimization of Neural Networks

    How to pick the optimal model for the efficiency of training algorithm Among the Machine Learning models, the one of…

    3 条评论
  • Classic Math Vs Artificial Intelligence

    Classic Math Vs Artificial Intelligence

    The transition from a “function-centric world” to a “data-centric world” From the primitive shepherds, through the…

  • Intelligenza Artificiale per il Trading

    Intelligenza Artificiale per il Trading

    Il modello Petrelli-Cesarini, un metodo per la previsione di prezzi nei mercati finanziari “La gioia nell’osservare e…

  • La correlazione con Excel

    La correlazione con Excel

    Individuare relazioni tra variabili In ogni attività, l’analisi dei dati caratteristici di un qualche fenomeno ed una…

其他会员也浏览了