Exploring Two-Sample Kolmogorov-Smirnov Test with Simulations
The Kolmogorov-Smirnov test (abbreviated as K-S test) is widely used in the context of distribution fitting. Today, we will explore one of its applications: testing whether two independent samples come from the same distribution and develop an intuition for it.
Let's begin by generating a normal distribution of continuous variables ranging from 10 to 50.
import random
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import ks_2samp
from matplotlib.gridspec import GridSpec
# Set the parameters for the Gaussian distribution
mean = 30
std_dev = 5
# Set the size of the distribution
size = 1000000
# Generate the Gaussian distribution
distribution = [random.gauss(mean, std_dev) for _ in range(size)]
# Plot the histogram
plt.hist(distribution, bins=50, edgecolor='black')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Gaussian Distribution Histogram')
plt.show()
Now we want to draw a sample the size of 1000 from this distribution and display its CDF (Cumulative Distribution Function plot).
# Set the size of the sample
sample_size = 1000
# Draw a sample from the population
sample = random.sample(distribution, sample_size)
# Compute the CDF of the sample
sorted_sample = np.sort(sample)
cdf = np.arange(1, len(sorted_sample) + 1) / float(len(sorted_sample))
# Plot the CDF of the sample
plt.plot(sorted_sample, cdf)
plt.xlabel('Values')
plt.ylabel('Cumulative Probability')
plt.title('CDF of Sample')
plt.show()
A quick reminder: a CDF plot shows the cumulative probability distribution of a dataset. It provides information about the probability of a random variable taking on a value less than or equal to a given value.
Now, instead of displaying a CDF plot of a single sample, let's draw 1000 samples, each with a size of 1000, and display their CDF curves on a single plot.
# Set the size of the sample
sample_size = 1000
# Set the number of repetitions
num_repetitions = 1000
# Plot each sample's CDF with a unique color
for _ in range(num_repetitions):
? ? # Draw a sample from the population
? ? sample = random.sample(distribution, sample_size)
? ? # Compute the CDF of the sample
? ? sorted_sample = np.sort(sample)
? ? cdf = np.arange(1, len(sorted_sample) + 1) / float(len(sorted_sample))
? ? # Generate a random color for each CDF
? ? color = random.choice(['red', 'blue', 'green', 'orange', 'purple', 'yellow'])
? ? # Plot the CDF of the sample with the assigned color
? ? plt.plot(sorted_sample, cdf, color=color, alpha=0.3)
plt.xlabel('Values')
plt.ylabel('Cumulative Probability')
plt.title('CDF of Sample (Repeated 1000 times)')
plt.show()
What we are seeing here is expected. CDF curves don't perfectly match each other since, as with all things statistics, some variation is inevitable. However, this variation falls within expected boundaries, and the distance between any two associated points on the curves shouldn't exceed an expected value, D (representing "distance"). If, however, we observe that there are two associated points that lie further apart than expected, we can reject the hypothesis that the two samples come from the same distribution.
Let's modify the code a little so that we draw exactly 2 samples and display their CDFs on a single plot. We can visually assess how closely they stick together. We will also quantify the maximum distance between them, known as the K-S statistic. You can rerun the code a few times, and each time the curves will be quite close to each other, but on rare occasions, they may be further apart (if you are lucky).
# Set the size of the sample
sample_size = 1000
# Draw two samples from the population
sample_a = random.sample(distribution, sample_size)
sample_b = random.sample(distribution, sample_size)
# Compute the CDF of sample_a
sorted_sample_a = np.sort(sample_a)
cdf_a = np.arange(1, len(sorted_sample_a) + 1) / float(len(sorted_sample_a))
# Compute the CDF of sample_b
sorted_sample_b = np.sort(sample_b)
cdf_b = np.arange(1, len(sorted_sample_b) + 1) / float(len(sorted_sample_b))
# Calculate the K-S statistic
combined_sorted_samples = np.sort(np.concatenate((sorted_sample_a, sorted_sample_b)))
cdf_a_interp = np.interp(combined_sorted_samples, sorted_sample_a, cdf_a)
cdf_b_interp = np.interp(combined_sorted_samples, sorted_sample_b, cdf_b)
ks_statistic = np.max(np.abs(cdf_a_interp - cdf_b_interp))
print("K-S statistic:", ks_statistic)
# Plot the CDF of sample_a
plt.plot(sorted_sample_a, cdf_a, label='Sample A', color='blue', alpha=0.7)
# Plot the CDF of sample_b
plt.plot(sorted_sample_b, cdf_b, label='Sample B', color='red', alpha=0.7)
plt.xlabel('Values')
plt.ylabel('Cumulative Probability')
plt.title('CDF of Sample A and Sample B')
plt.legend()
plt.show()
Now let's modify the code so that we:
# Set the sample size
sample_size = 1000
# Set the number of tests
tests = 10000
# Create a list to store the K-S statistics
ks_stats_manual = []
# Create a list to store the K-S statistics calculated by scipy
ks_stats_scipy = []
# Counter to keep track of tests with p-value <= 0.05
pvalue_counter = 0
# Perform KS two-sample test with critical values
def ks_test(sample_a, sample_b):
? ? ks_stat_sp, p_value = ks_2samp(sample_a, sample_b, method="exact")
? ? n1, n2 = len(sample_a), len(sample_b)
? ? critical = 1.36 * np.sqrt((n1 + n2) / (n1 * n2))
? ? return ks_stat_sp, p_value, critical
for i in range(tests):
? ? # Draw two samples from the population
? ? sample_a = random.sample(distribution, sample_size)
? ? sample_b = random.sample(distribution, sample_size)
? ? # Compute the CDF of sample_a
? ? sorted_a = np.sort(sample_a)
? ? cdf_a = np.arange(1, len(sorted_a) + 1) / float(len(sorted_a))
? ? # Compute the CDF of sample_b
? ? sorted_b = np.sort(sample_b)
? ? cdf_b = np.arange(1, len(sorted_b) + 1) / float(len(sorted_b))
? ? # Calculate the K-S statistic
? ? combined_sorted_samples = np.sort(np.concatenate((sorted_a, sorted_b)))
? ? cdf_a_interp = np.interp(combined_sorted_samples, sorted_a, cdf_a)
? ? cdf_b_interp = np.interp(combined_sorted_samples, sorted_b, cdf_b)
? ? ks_stat_manual = np.max(np.abs(cdf_a_interp - cdf_b_interp))
? ? # Store the K-S statistic
? ? ks_stats_manual.append(ks_stat_manual)
? ? # Calculate and store the K-S statistic using scipy
? ? ks_stat_sp, p_value, critical = ks_test(sample_a, sample_b)
? ? ks_stats_scipy.append(ks_stat_sp)
? ? # Determine if line should be highlighted in green
? ? if p_value <= 0.05:
? ? ? ? pvalue_counter += 1
# Calculate the proportion of tests with significant p_value
p_significant = pvalue_counter / tests
# Print summary
significant_prop = sum(ks > critical for ks in ks_stats_scipy) / tests
print("\nSummary:")
print(f"Proportion of tests with p-value <= 0.05: {p_significant}")
print(f"Proportion of tests where K-S statistic exceeds critical value: {significant_prop}")
# Plot the KDE of K-S statistics
sns.kdeplot(ks_stats_manual, fill=True, label='Manual Calculation')
sns.kdeplot(ks_stats_scipy, fill=True, label='Scipy Calculation')
plt.xlabel('K-S Statistic')
plt.ylabel('Density')
plt.title('KDE of K-S Statistics over Tests')
# Add vertical line at the critical value
plt.axvline(critical, color='red', linestyle='--', label='Critical Value')
# Add text for the critical value
plt.text(critical, 0.2, f'Critical: {critical:.4f}', color='red', ha='right')
plt.legend()
plt.show()
One important thing to keep in mind here is that the definition of a "Two-Tailed" test in the context of the K-S test is somewhat different. In a Two-Tailed K-S test, we are only interested in the ABSOLUTE maximum distance between two CDF curves. This means that the distance can be negative or positive. These represent the tails, as opposed to the two tails of the distribution of the K-S statistic itself, where we are still only interested in the right tail.
领英推荐
After conducting 10,000 simulations, we obtained a typical distribution of the K-S statistic. By visually examining it, we can observe that only a small fraction of tests resulted in a K-S statistic larger than the calculated critical value. Specifically, only 480 tests yielded an exceptionally high K-S statistic, which accounts for 4.80% of the total tests. This percentage is close enough to our false positive rate.
Drawing from Different Distributions
Now let's move on to something a little bit more interesting and simulate drawing sample sizes from different distributions. The following code builds upon the previous one. We basically do the same, but instead of drawing Sample A and Sample B from the same distribution, we create two separate population distributions. In the code below we only slightly altered the mean of the second distribution (from 30 to 30.2) to see what kind of results we'll get.
# Set the population size
population_size = 100000
# Set the sample size
sample_size = 1000
# Set the number of tests
tests = 1000
# Generate the Gaussian distributions
distribution_a = np.random.normal(loc=30, scale=1, size=population_size)
distribution_b = np.random.normal(loc=30.1, scale=1, size=population_size)
# Create a list to store the K-S statistics
ks_stats_manual = []
# Create a list to store the K-S statistics calculated by scipy
ks_stats_scipy = []
# Counter to keep track of tests with p-value <= 0.05
pvalue_counter = 0
# Perform KS two-sample test with critical values
def ks_test(sample_a, sample_b):
? ? ks_stat_sp, p_value = ks_2samp(sample_a, sample_b, method="exact")
? ? n1, n2 = len(sample_a), len(sample_b)
? ? critical = 1.36 * np.sqrt((n1 + n2) / (n1 * n2))
? ? return ks_stat_sp, p_value, critical
for i in range(tests):
? ? # Draw samples from the generated distributions
? ? sample_a = random.sample(list(distribution_a), sample_size)
? ? sample_b = random.sample(list(distribution_b), sample_size)
? ? # Compute the CDF of sample_a
? ? sorted_a = np.sort(sample_a)
? ? cdf_a = np.arange(1, len(sorted_a) + 1) / float(len(sorted_a))
? ? # Compute the CDF of sample_b
? ? sorted_b = np.sort(sample_b)
? ? cdf_b = np.arange(1, len(sorted_b) + 1) / float(len(sorted_b))
? ? # Calculate the K-S statistic
? ? combined_sorted_samples = np.sort(np.concatenate((sorted_a, sorted_b)))
? ? cdf_a_interp = np.interp(combined_sorted_samples, sorted_a, cdf_a)
? ? cdf_b_interp = np.interp(combined_sorted_samples, sorted_b, cdf_b)
? ? ks_stat_manual = np.max(np.abs(cdf_a_interp - cdf_b_interp))
? ? # Store the K-S statistic
? ? ks_stats_manual.append(ks_stat_manual)
? ? # Calculate and store the K-S statistic using scipy
? ? ks_stat_sp, p_value, critical = ks_test(sample_a, sample_b)
? ? ks_stats_scipy.append(ks_stat_sp)
? ? # Determine if line should be highlighted in green
? ? if p_value <= 0.05:
? ? ? ? pvalue_counter += 1
# Calculate the proportion of tests with significant p_value
p_significant = pvalue_counter / tests
# Print summary
significant_prop = sum(ks > critical for ks in ks_stats_scipy) / tests
summary_text = f"Proportion of p-value <= 0.05: {p_significant:.4f}\n" \
? ? ? ? ? ? ? ?f"Proportion of K-S > critical: {significant_prop:.4f}"
# Create a grid layout for the plots
fig = plt.figure(figsize=(12, 12))
gs = GridSpec(3, 2)
# Plot histogram for Distribution A
ax1 = fig.add_subplot(gs[0, 0])
sns.histplot(distribution_a, bins=50, kde=True, color='blue', ax=ax1)
ax1.axvline(np.mean(distribution_a), color='red', linestyle='--', label='Mean')
ax1.axvline(np.mean(distribution_a) + np.std(distribution_a), color='green', linestyle='--', label='Mean + Std')
ax1.axvline(np.mean(distribution_a) - np.std(distribution_a), color='green', linestyle='--', label='Mean - Std')
ax1.text(0.02, 0.95, f"Mean: {np.mean(distribution_a):.4f}\nStd: {np.std(distribution_a):.4f}",
? ? ? ? ?transform=ax1.transAxes, ha='left', va='top', bbox=dict(facecolor='white', alpha=0.8))
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution A')
ax1.legend()
# Plot histogram for Distribution B
ax2 = fig.add_subplot(gs[0, 1])
sns.histplot(distribution_b, bins=50, kde=True, color='orange', ax=ax2)
ax2.axvline(np.mean(distribution_b), color='red', linestyle='--', label='Mean')
ax2.axvline(np.mean(distribution_b) + np.std(distribution_b), color='green', linestyle='--', label='Mean + Std')
ax2.axvline(np.mean(distribution_b) - np.std(distribution_b), color='green', linestyle='--', label='Mean - Std')
ax2.text(0.02, 0.95, f"Mean: {np.mean(distribution_b):.4f}\nStd: {np.std(distribution_b):.4f}",
? ? ? ? ?transform=ax2.transAxes, ha='left', va='top', bbox=dict(facecolor='white', alpha=0.8))
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution B')
ax2.legend()
# Plot the KDE of K-S statistics
ax3 = fig.add_subplot(gs[1, :])
sns.kdeplot(ks_stats_manual, fill=True, label='Manual Calculation', ax=ax3)
sns.kdeplot(ks_stats_scipy, fill=True, label='Scipy Calculation', ax=ax3)
ax3.set_xlabel('K-S Statistic')
ax3.set_ylabel('Density')
ax3.set_title('KDE of K-S Statistics over Tests')
ax3.axvline(critical, color='red', linestyle='--', label='Critical Value')
ax3.text(0.98, 0.02, summary_text, transform=ax3.transAxes, ha='right', va='bottom',
? ? ? ? ?bbox=dict(facecolor='white', alpha=0.8, boxstyle='round,pad=0.3'))
ax3.legend()
# Plot the CDF curves of all samples
ax4 = fig.add_subplot(gs[2, :])
for i in range(tests):
? ? sample_a = random.sample(list(distribution_a), sample_size)
? ? sample_b = random.sample(list(distribution_b), sample_size)
? ? sorted_a = np.sort(sample_a)
? ? sorted_b = np.sort(sample_b)
? ? cdf_a = np.arange(1, len(sorted_a) + 1) / float(len(sorted_a))
? ? cdf_b = np.arange(1, len(sorted_b) + 1) / float(len(sorted_b))
? ? ax4.plot(sorted_a, cdf_a, color='green', alpha=0.1)
? ? ax4.plot(sorted_b, cdf_b, color='red', alpha=0.1)
ax4.set_xlabel('Value')
ax4.set_ylabel('CDF')
ax4.set_title('CDF Curves of Samples')
ax4.legend(['Sample A', 'Sample B'])
plt.tight_layout()
plt.show()
The code printout a bunch of plots but let's take a look at the one we are already familiar with first - KDE of K-S statistics.
We can clearly see that roughly half of all K-S statistics exceeds the critical value that we calculated based on the significance level of 5%. Since we know that under null hypothesis we should only get only roughly 5% of values that exceed the critical one, it's safe to assume that we are dealing with different distributions. Note, that the shapes of two distributions are identical, we only changed the location of the second one (it's mean)
# Generate the Gaussian distribution
distribution_a = np.random.normal(loc=30, scale=1, size=population_size)
distribution_b = np.random.normal(loc=30.1, scale=1, size=population_size)
Kolmogorov-Smirnov test is sensitive to both changes in shape of the distribution and changes in its central tendency, spread etc.)
Now let's take a look at the CDF plot that our code spits out.
We can clearly see that there is some variation in CDF curves of the Sample A (in green) and Sample B (in red), but this time CDF curves of the Sample A stick together and the curves of the Sample B do the same - both groups don't mingle with each other much. There's clearly some separation between the two groups. But to make it more prominent, let's shift our second distribution just a little bit more.
# Generate the Gaussian distribution
distribution_a = np.random.normal(loc=30, scale=1, size=population_size)
distribution_b = np.random.normal(loc=30.2, scale=1, size=population_size)
There appears to be a "dead" region between the two groups where none of the curves fall into. Meaning, there's no overlap between the variations at all. This is as clear as it gets - two samples are drawn from different distributions.
And, to quantify the difference, we may find that 100% of the simulations results in the K-S statistic that exceeds the critical value of 0.0608 for the given sample sizes.
If you'd like to experiment with different sample sizes, number of iterations or different distribution shapes etc. you can do so by following this link to my GitHub repository.
That would be it for the moment. Thanks for sticking by!
Civic Technology | Statistics | Data | Science
10 个月Great article! I have a minor quibble. The vertical distance should be calculated between the step functions (Empirical CDFs). With 1000 sample size it would not matter much, but you can write that as a caveat.