Concise Basic Stats - Part X: Distribution-free tests (Nonparametric Statistics)
Hello fellow readers, and welcome back to another article of the concise basic stats series. This time around we will be learning about nonparametric tests: tests and methodologies that are not based on parametrized families of probability distributions (families described by parameters like the mean and variance, for example). They give us more flexibility with the data we have, and allow us to apply powerful tests even when our data does not satisfy all the assumptions required for a "conventional" parametric test (like normality). Let's dive in.
During this series of articles we came across a multitude of different tests. Tests about means, proportions, variances, categorical variables, normality, etc. You may have noticed that when describing each test I made sure to include a small section on its assumptions. Assumptions are important for parametric tests (aka most of the tests we have seen so far) because they help us frame our problem and check whether the particular test is in fact appropriate for the kind of data we have. However, sometimes the assumptions do not hold. For example, we need to assume normality in order to perform a one-way ANOVA. But what if that criterion does not hold for our data? In that case we could try some transformations, like log-transformations or Box-Cox transforms, but sometimes it is not possible to achieve the level of normality required for the test (transformations are generally less effective for reducing kurtosis than for reducing skew). In such cases, the best option is to look for a distribution-free alternative to the kind of test we are trying to perform. Here is a little cheat-sheet of some of the parametric tests we have seen before, next to their respective non-parametric counterparts:
- One-way ANOVA → Kruskal-Wallis test
- Independent two-sample t-test → Mann-Whitney U test (Wilcoxon rank-sum)
- Chi-Square test of independence → Fisher's exact test
- Pearson's correlation → Spearman's rank correlation
- Two-way (randomized block) ANOVA → Friedman test
In this article, we are going to explore some of the most common non-parametric tests and give some context as to when we could use each of them.
About Ranks
We will see that ranking the data is a recurrent theme in this topic. Since we are talking about distribution-free tests, the data is not mapped to a particular known distribution; instead, we work with empirical distributions, which are obtained from the observed data.
But we must still have an idea of where each observation lies within the overall range of values observed. That is where ranking comes into play. When we say ranking, we mean assigning a value on an ascending scale related to the magnitude of each value within a sample. For example, if we have an array of values like [10, 14, 11, 7], we could rank them, respectively, like this: [2, 4, 3, 1]. See how I did this? I assigned a rank to each value in ascending fashion. Thus, the lowest value gets rank 1, the second lowest value gets rank 2, and so on until we get to the highest value in the sample, which gets assigned the highest rank. In this way we get an idea of where each value is located in the sample, and these ranks will be used by the non-parametric tests below to compute our final result.
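A minimal sketch of this ranking step, using NumPy (a double argsort yields each value's 0-based position in the sorted order):

```python
import numpy as np

values = np.array([10, 14, 11, 7])

# argsort of argsort gives each value's 0-based position in the sorted
# order; adding 1 turns it into 1-based ranks.
ranks = values.argsort().argsort() + 1
print(ranks)  # [2 4 3 1]
```

This matches the hand-assigned ranks above: 7 is the smallest value and gets rank 1, while 14 is the largest and gets rank 4.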
Kruskal-Wallis
Since we mentioned it above, let's start our non-parametric tests journey with the Kruskal-Wallis test. In short, the Kruskal-Wallis test can be understood as the distribution-free version of a completely randomized one-way ANOVA. We already know that one of the most popular statistical tests for analyzing differences among group means is ANOVA. And we also know that, although it is a great tool, ANOVA assumes that the data in question follows a normal distribution. When that assumption doesn't hold, that's when the Kruskal-Wallis test comes in handy.
Analogous to the ANOVA one-way test, our hypothesis states that:
- H0: The medians (equivalently, the mean ranks) are equal across the samples
- H1: At least one median is different
Next, we prepare and rank our data in the way we previously described: by arranging the data from all groups in ascending order and assigning a rank to each data entry. Then we sum the ranks for each group.
Test statistic
The test statistic for the Kruskal-Wallis test is known as H, and it is calculated following the equation below:

H = [12 / (n(n + 1))] × Σ (Ri² / ni) − 3(n + 1)

Where:
- Ri is the rank sum of group i.
- ni is the sample size of group i
- n is the total sample size across all groups, n = n1 + ... + nk (for k groups).
We still have the assumption that the samples are independent of each other and come from independent groups: one subject cannot be in more than one group.
Finally, we compare our test statistic with a critical cutoff point, which in turn is determined by the Chi-Square distribution with k − 1 degrees of freedom (k being the number of groups). The Chi-Square distribution is used here because it is a good approximation of the distribution of H, especially when each group's sample size satisfies ni >= 5.
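As a quick sketch with scipy, the cutoff for k = 3 groups at α = 0.05 comes from the Chi-Square distribution with k - 1 = 2 degrees of freedom:

```python
from scipy import stats

k = 3          # number of groups
alpha = 0.05   # significance level

# Critical value of the Chi-Square distribution with k - 1 degrees of
# freedom: we reject H0 when H exceeds this cutoff.
critical_value = stats.chi2.ppf(1 - alpha, df=k - 1)
print(round(critical_value, 3))  # 5.991
```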
Example
Let's illustrate what we've read about with an example. Fifteen different patients, chosen at random, were subjected to one of three drugs (five patients per drug). We want to test whether at least one of the three median patient responses to the drugs is different at α = 0.05.
Let's fill in the missing ranks above together. In drug 2, since 5.50 is the next value in our sample greater than 5.49, we assign the next rank, 7. The other missing ranks also belong to values of 5.50. Since in this simple scheme we do not reuse rank values, we assign ranks 8 and 9 to the other two 5.50 entries (the standard treatment for ties is to give all three the average rank, 8; we will come back to this). In total we get a rank sum of 40 for drug 2.
Ok, now let's obtain our H test statistic. Let's use Python to run these calculations for us. I've created a custom class for this test, which you can refer to below.
import numpy as np


def get_split_points(array_lengths):
    # Cumulative sums of the group sizes: the indices at which the
    # combined rank array should be split back into groups.
    split_points = np.array([])
    prev_number = 0
    for number in array_lengths:
        number += prev_number
        split_points = np.append(split_points, int(number))
        prev_number = number
    return split_points


class KruskalWallisTest():
    def __init__(self, input_arrays):
        self.input_arrays = input_arrays
        self.combined_arrays = np.concatenate(input_arrays)
        self.array_lengths = [len(array) for array in input_arrays]

    def __repr__(self):
        return (
            f'Kruskal-Wallis '
            f'test with {len(self.input_arrays)} groups'
        )

    def get_ranks(self):
        # Rank all observations of the combined sample in ascending
        # order (ties get distinct, consecutive ranks).
        temp = self.combined_arrays.argsort()
        ranks = np.empty_like(temp)
        ranks[temp] = np.arange(len(self.combined_arrays))
        ranks += 1
        return ranks

    def get_kruskal_wallis_H_statistic(self):
        ranks = self.get_ranks()
        split_points = get_split_points(self.array_lengths)
        ranks_by_group = np.split(ranks, split_points[:-1].astype(int))
        sum_ranks_by_group = np.array(ranks_by_group).sum(axis=1)
        summation_part = np.sum(
            np.power(sum_ranks_by_group, 2) / self.array_lengths
        )
        N = len(self.combined_arrays)  # total sample size
        H = (12 / (N * (N + 1))) * summation_part - (3 * (N + 1))
        return H
By calling the test statistic method we get:
drug1 = np.array([5.9, 5.92, 5.91, 5.89, 5.88])
drug2 = np.array([5.51, 5.50, 5.50, 5.49, 5.50])
drug3 = np.array([5.01, 5.00, 4.99, 4.98, 5.02])
test = KruskalWallisTest([drug1, drug2, drug3])
test.get_kruskal_wallis_H_statistic()
Output:
12.5
Let's confirm this by using the official scipy function for the Kruskal-Wallis test.
from scipy import stats
H = stats.kruskal(drug1, drug2, drug3)
print(H.statistic)
Output:
12.589928057553957
Ok, looks like we did a good job. The small difference comes from scipy assigning tied values the average of their ranks and applying a tie correction to H, while our simple implementation gives ties consecutive ranks. Now let's conclude on the test based on the p-value.
print(H.pvalue)
# output was 0.0018455756794483428
Therefore, since p-value ≈ 0.002 < α = 0.05, we can safely reject the null hypothesis and conclude that at least one of the average drug responses is different. But even knowing that not all groups are equal, we don't know which ones differ, hence we run a multiple-comparisons post-hoc test to compare all the pairs. The most common post-hoc tests after a significant Kruskal-Wallis result are the Dunn test, the pairwise Wilcoxon test, and the Conover test.
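As a sketch of the pairwise Wilcoxon (Mann-Whitney) approach, using scipy's mannwhitneyu with a Bonferroni correction on the drug data from this example (the Dunn and Conover tests live in third-party packages such as scikit-posthocs):

```python
from itertools import combinations

import numpy as np
from scipy import stats

drug1 = np.array([5.9, 5.92, 5.91, 5.89, 5.88])
drug2 = np.array([5.51, 5.50, 5.50, 5.49, 5.50])
drug3 = np.array([5.01, 5.00, 4.99, 4.98, 5.02])

groups = {'drug1': drug1, 'drug2': drug2, 'drug3': drug3}
pairs = list(combinations(groups, 2))

results = {}
for name_a, name_b in pairs:
    _, p = stats.mannwhitneyu(groups[name_a], groups[name_b])
    # Bonferroni: multiply each raw p-value by the number of comparisons,
    # capping at 1.
    results[(name_a, name_b)] = min(p * len(pairs), 1.0)
    print(f'{name_a} vs {name_b}: adjusted p = {results[(name_a, name_b)]:.4f}')
```

Here every pairwise comparison stays significant after the correction, so all three drugs differ from one another.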
Mann-Whitney U test (Wilcoxon Rank Sum test)
We now look at another nonparametric test, this one analogous to the independent two-sample t-test that we've seen previously (one of the first hypothesis tests we learned about). The Mann-Whitney U test is a statistical test to determine whether two groups are significantly different from each other on a variable of interest. In this scenario, the two groups should be independent. In general, we also want roughly the same number of observations in each group. Let's understand this method with the help of an example.
A pharmaceutical company created a new drug to treat sleepwalking and observed the results on a group of 5 patients over a month. Another group of 5 patients took the old drug over the same month (the control group). The company recorded the number of sleepwalking cases in the last month for each patient. The result was:
From purely looking at the raw data, we see that the number of sleepwalking cases is consistently lower while taking the new drug compared to the cases reported while taking the old drug. But is this difference statistically significant?
The hypotheses are given below.
- H0: The two groups report same number of cases
- H1: The two groups report different number of cases
We select a significance level of 5% (α= 0.05) as usual. Now, let's find our test statistic.
For the Mann-Whitney U test, the test statistic is denoted by U, which is the minimum of U1 and U2 (U = min(U1, U2)), which in turn are given as follows:

U1 = n1·n2 + n1(n1 + 1)/2 − R1
U2 = n1·n2 + n2(n2 + 1)/2 − R2

where R1 and R2 are the rank sums of groups 1 and 2, and n1 and n2 are their sample sizes.
We see that we have once again to do some computation involving ranks. However, we must first understand how to properly assign the ranks for each observation. First, we will combine the two samples and arrange them in ascending order (OD and ND represent Old Drug and New Drug, respectively). Next we assign the ranks. The lowest value is assigned rank 1 and the second lowest value is assigned rank 2 and so on.
However, in this case the values 1, 4, and 8 appear more than once in the combined sample, so we must adjust the rank assignment. In the case of ties we take the mean of the tied positions: the value 1 appears at the 1st and 2nd positions, so both occurrences get rank (1 + 2)/2 = 1.5. We follow the same steps for the values 4 and 8. The value 4 appears at the 5th and 6th positions, so both get rank 5.5, and similarly the value 8 (9th and 10th positions) gets rank 9.5.
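This average-rank convention is exactly what scipy's rankdata implements. A quick sketch, using a hypothetical combined sample consistent with the positions described above (the value 1 at positions 1-2, 4 at positions 5-6, and 8 at positions 9-10; the actual data table is not reproduced here):

```python
from scipy.stats import rankdata

# Hypothetical combined sample matching the tied positions described in
# the text; only the tie pattern matters for this illustration.
combined = [1, 1, 2, 3, 4, 4, 5, 6, 8, 8]

# rankdata assigns tied values the average of their ranks by default.
ranks = rankdata(combined)
print(ranks)  # [1.5 1.5 3.  4.  5.5 5.5 7.  8.  9.5 9.5]
```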
We now compute the sum of ranks for group 1 and 2 (R1 and R2, respectively).
R1 = 15.5;
R2 = 39.5
Using the formula above for the test statistic, we compute U1 and U2:
U1 = 24.5
U2 = 0.5
Now, U = min(U1, U2) = 0.5
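We can double-check this arithmetic with a few lines of Python, plugging n1 = n2 = 5 and the rank sums into the formulas for U1 and U2:

```python
n1, n2 = 5, 5          # observations per group
R1, R2 = 15.5, 39.5    # rank sums computed above

# U_i = n1*n2 + n_i*(n_i + 1)/2 - R_i
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2
U = min(U1, U2)
print(U1, U2, U)  # 24.5 0.5 0.5
```

Note also that U1 + U2 = n1·n2 = 25, a handy sanity check on the rank sums.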
For the Mann-Whitney U test, U1 and U2 each lie in the range 0 to n1·n2 (n1 and n2 being the number of observations in each group). A value of U = min(U1, U2) near 0 indicates that the two groups are almost completely separated, while a value near n1·n2/2 indicates heavy overlap between the two groups.
Finally, we determine a critical value, or extract the p-value using software or a table of critical values, in order to reject or fail to reject the null hypothesis H0. In this case U = 0.5 < critical value. Therefore we reject the null hypothesis and conclude that there is sufficient evidence to say that the groups do NOT present the same number of sleepwalking cases (the new drug seems to work).
Fisher Exact test - Independence of Variables
We've talked about Kruskal-Wallis and the Mann-Whitney U test, which are the non-parametric counterparts of ANOVA and the t-test, respectively: "alternative" versions of these tests with more forgiving assumptions, especially regarding normality. For our next test we are going to explore Fisher's Exact Test. This test can be considered an alternative to the Chi-Square Test of Independence (we just addressed this topic in the last chapter of this series). It is called "exact" in the sense that it is not based on a test statistic that is only approximately distributed as, for example, Chi-Square. But just like the Chi-Square test of independence, we use this test to determine whether or not there is a significant association between two categorical variables.
The example we are going to use for this test could not be more classic. It is a now famous story, set on an otherwise unremarkable summer afternoon in Cambridge in the 1920s. That day, a group of friends were sitting and drinking tea when one of those present made a claim: she could tell whether milk was poured first or last when preparing a cup of tea with milk, just by tasting it! Unfortunately for her chances of getting away with this claim, sitting at that table was none other than Sir Ronald Fisher (by now a close acquaintance of ours). Being a falsifiable claim, it tingled Fisher's experimental zest, and being the godfather of modern statistics, he designed a simple but effective experiment to test the veracity of the claim.
The story, together with a detailed description of how to conduct this test was presented in Chapter 2 of Fisher's 1935 book The Design of Experiments. See below a passage from this chapter, where Fisher describes the scenario and exposes the experiment procedure:
A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. […] [It] consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely, that she will be asked to taste eight cups, that these shall be four of each kind […]. — Fisher, 1935.
Framing this problem as an independence test, we would like to test whether the variables "Lady's guess" and "True tea order" are in fact associated (meaning the lady can properly discern the tea order) or completely unrelated/independent (meaning the lady is simply guessing the cups' "types" at random). An important aspect of this test is that the row totals and the column totals of the contingency table are both fixed by the design of the study: telling the subject in advance that there are four cups of each kind guarantees that her answers will also include four of each. See the results of this test in the contingency table below:

                         True: tea first    True: milk first
Guessed "tea first"            3                   1
Guessed "milk first"           1                   3
The results tell us that the lady in question answered correctly 6 out of 8 trials (sum of the diagonal above). The results seem to corroborate the lady's claim. But the usual questions remain: how can we tell if this result wasn't obtained simply by random guessing? Do these numbers indicate a statistically significant result? Let's formulate and test the hypotheses.
- H0: The lady is purely guessing
- H1: The lady can differentiate between the two types of tea
Under H0, the lady has no discerning ability over the two types of tea. It would be the same as taking a random sample of 4 out of the 8 cups and guessing that they were poured "tea first". Since she has labeled 3 cups correctly as "tea first", the probability of getting this same result (i.e. guessing 3 out of 4 "tea first" cups correctly) purely by chance is given by a hypergeometric distribution:
P = [C(4,3) × C(4,1)] / C(8,4) = (4 × 4) / 70 = 16/70 ≈ 0.229

The denominator, C(8,4) = 70, represents the total number of ways of extracting a sample of size 4 from 8 items. The numerator represents the number of ways to obtain a sample with 3 "tea first" cups and 1 "milk first" cup among the 4 selected. The result gives the exact probability of selecting, at random, 4 cups out of 8 such that 3 are "tea first" and 1 is "milk first". This turns out to be 22.9%. Let's now obtain a p-value for this result. A p-value is the probability of getting a result equal to or more extreme than the one actually observed, assuming H0 is true. Here, that means correctly guessing, at random, 3 or 4 of the 4 "tea first" cups. Making this computation with the probability mass function above, the probability of randomly guessing 3 or 4 out of 4 cups correctly is (16 + 1)/70 ≈ 0.243 = 24.3%. This is only weak evidence against the null. We conclude that there is not enough evidence to reject the null hypothesis that the lady is purely guessing. However, in her defense, experiments with this little data are generally not very powerful to begin with, given the limited information.
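The same one-sided p-value can be obtained with scipy's fisher_exact, using the 2×2 counts described above (3 correct and 1 incorrect in each row):

```python
from scipy import stats

# Rows: lady's guess (tea first / milk first);
# columns: true order. 6 of the 8 cups were classified correctly.
table = [[3, 1],
         [1, 3]]

# One-sided test: probability of an agreement at least this strong
# under random guessing (the hypergeometric tail computed above).
_, p_value = stats.fisher_exact(table, alternative='greater')
print(round(p_value, 3))  # 0.243
```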
Now you understand the basics of non-parametric procedures and know a few alternatives to well-known parametric tests. For now, try to read more about these concepts and look into other distribution-free solutions. One that could be of interest is Spearman's rank correlation, which measures the level of association between two ranked variables (think of it as the regular Pearson correlation coefficient (r), but for ordinal variables). Another non-parametric test to look at is the Friedman test (if you like Kruskal-Wallis, then I assure you that you are going to enjoy this one, as it is the non-parametric counterpart of the two-way randomized block ANOVA). Once again, thanks for sticking around until the end, have a great rest of your day, and hope to see you again soon!