GPT-Python Pulse: Multiclass Cohen's Kappa
Asad Kazmi
AI Educator • Simplifying AI • I Help You Win with AI • AI won’t steal your job, but someone who masters it might. Master AI. Stay Unstoppable.
As AI continues to reshape industries, understanding its practical applications can significantly enhance your data analysis skills.
In today’s issue of GPT-Python Pulse, we’ll dive into a critical concept for evaluating model performance in multiclass settings: Multiclass Cohen’s Kappa.
This extension of the well-known Cohen’s Kappa metric is designed to measure the agreement between two raters or classifiers when there are more than two categories involved.
Let's explore the ins and outs of this powerful statistical tool and walk through how to compute it using Python.
What is Multiclass Cohen’s Kappa?
Multiclass Cohen's Kappa extends the binary Cohen’s Kappa to problems with more than two categories. While the binary version compares two raters' agreement on a classification with two possible outcomes (e.g., Yes/No), the multiclass version handles any number of categories or labels.
It measures the agreement between two raters (or classifiers) while adjusting for the agreement expected by chance alone. This adjustment makes it a reliable measure of how much better the raters agree than they would by random guessing.
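Before the worked example, it's worth noting that scikit-learn exposes this metric directly, and it handles any number of categories out of the box. A minimal sketch, assuming scikit-learn is installed and using two short, hypothetical annotation lists (not data from this issue):

from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators labeling the same eight tweets
annotator_1 = ["Pos", "Neg", "Neu", "Pos", "Neg", "Pos", "Neu", "Neg"]
annotator_2 = ["Pos", "Neg", "Neu", "Neg", "Neg", "Pos", "Pos", "Neg"]

# Chance-corrected agreement; works for any number of categories
print(cohen_kappa_score(annotator_1, annotator_2))

Below, we reproduce what this one call does by hand, step by step.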
Key Points of Multiclass Cohen’s Kappa
- κ ranges from −1 to 1: 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.
- It is symmetric: swapping the two raters leaves the score unchanged.
- The binary case is just the special case with two categories.
- A widely used rule of thumb (Landis & Koch, 1977): 0.21–0.40 is "fair", 0.41–0.60 "moderate", 0.61–0.80 "substantial", and 0.81–1.00 "almost perfect" agreement.
Example: Calculating Multiclass Cohen’s Kappa
Let's walk through an example where we calculate Cohen's Kappa for a multiclass classification problem using a confusion matrix. Suppose two annotators are classifying tweets into three categories: Positive, Negative, and Neutral.
Given Confusion Matrix (rows are Annotator 1's labels, columns are Annotator 2's):

            Positive  Negative  Neutral
Positive        8         4        2
Negative        5         7        1
Neutral         1         2        6
Step 1: Breakdown of the Confusion Matrix
Row totals (Annotator 1): Positive = 14, Negative = 13, Neutral = 9
Column totals (Annotator 2): Positive = 14, Negative = 13, Neutral = 9
Total samples: 36 (sum of all elements in the confusion matrix)
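In practice you often start from two lists of raw annotations rather than a ready-made matrix. One way to build it is scikit-learn's confusion_matrix; here is a sketch with hypothetical label lists (not the actual data behind the table above):

from sklearn.metrics import confusion_matrix

# Hypothetical raw labels from the two annotators
annotator_1 = ["Positive", "Negative", "Neutral", "Positive", "Neutral"]
annotator_2 = ["Positive", "Negative", "Negative", "Positive", "Neutral"]

# Pin the label order so rows (Annotator 1) and columns (Annotator 2) match the table
labels = ["Positive", "Negative", "Neutral"]
print(confusion_matrix(annotator_1, annotator_2, labels=labels))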
Step 2: Calculating Observed Agreement (P_o)
Observed agreement is the proportion of times both annotators agree on the same class. We compute this by summing the diagonal elements and dividing by the total number of samples:
P_o = (8 + 7 + 6) / 36 = 21/36 ≈ 0.583
Step 3: Calculating Expected Agreement (P_e)
To calculate expected agreement, we use the marginal probabilities for each class. First, we calculate the marginal probabilities for each annotator:
Annotator 1 (row totals): P_1(Positive) = 14/36 ≈ 0.389, P_1(Negative) = 13/36 ≈ 0.361, P_1(Neutral) = 9/36 = 0.250
Annotator 2 (column totals): P_2(Positive) = 14/36 ≈ 0.389, P_2(Negative) = 13/36 ≈ 0.361, P_2(Neutral) = 9/36 = 0.250
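As a quick NumPy aside, each set of marginals is a single axis-sum; a minimal sketch using the same conf_matrix array that appears in the full script further below:

import numpy as np

conf_matrix = np.array([
    [8, 4, 2],
    [5, 7, 1],
    [1, 2, 6]
])

total = conf_matrix.sum()
row_marginals = conf_matrix.sum(axis=1) / total  # Annotator 1: [0.389, 0.361, 0.250]
col_marginals = conf_matrix.sum(axis=0) / total  # Annotator 2: [0.389, 0.361, 0.250]
print(row_marginals, col_marginals)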
Now, we calculate the expected agreement:
P_e = P_1(Positive)×P_2(Positive) + P_1(Negative)×P_2(Negative) + P_1(Neutral)×P_2(Neutral)
P_e = (0.389 × 0.389) + (0.361 × 0.361) + (0.250 × 0.250) ≈ 0.344
Step 4: Calculate Cohen’s Kappa (κ)
Finally, we compute Cohen’s Kappa using the formula:
κ = (P_o − P_e) / (1 − P_e)
κ = (0.583 − 0.344) / (1 − 0.344)
κ ≈ 0.365
Here’s the Python code to calculate Cohen’s Kappa for the given confusion matrix:
import numpy as np
# Given confusion matrix (rows: Annotator 1, columns: Annotator 2)
conf_matrix = np.array([
[8, 4, 2],
[5, 7, 1],
[1, 2, 6]
])
# Total number of samples
total_samples = conf_matrix.sum()
# Step 1: Observed Agreement (P_o)
observed_agreement = np.trace(conf_matrix) / total_samples
print(f"Observed Agreement (P_o): {observed_agreement:.3f}")
# Step 2: Marginal probabilities for Annotator 1 (row sums)
P_A1_positive = conf_matrix[0].sum() / total_samples
P_A1_negative = conf_matrix[1].sum() / total_samples
P_A1_neutral = conf_matrix[2].sum() / total_samples
# Marginal probabilities for Annotator 2 (column sums)
P_A2_positive = conf_matrix[:, 0].sum() / total_samples
P_A2_negative = conf_matrix[:, 1].sum() / total_samples
P_A2_neutral = conf_matrix[:, 2].sum() / total_samples
# Step 3: Expected Agreement (P_e)
P_e = (P_A1_positive * P_A2_positive) + (P_A1_negative * P_A2_negative) + (P_A1_neutral * P_A2_neutral)
print(f"Expected Agreement (P_e): {P_e:.3f}")
# Step 4: Calculate Cohen's Kappa
kappa = (observed_agreement - P_e) / (1 - P_e)
print(f"Cohen's Kappa: {kappa:.3f}")
Output:
Observed Agreement (P_o): 0.583
Expected Agreement (P_e): 0.344
Cohen's Kappa: 0.365
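The three steps above generalize cleanly to any number of categories. As a closing sketch, here is a small helper that computes κ straight from any K×K confusion matrix (the name kappa_from_confusion is ours, not a library API):

import numpy as np

def kappa_from_confusion(cm):
    """Cohen's Kappa from a K x K confusion matrix, for any number of classes."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    p_o = np.trace(cm) / total                            # observed agreement
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / total**2    # expected agreement
    return (p_o - p_e) / (1 - p_e)

print(f"Cohen's Kappa: {kappa_from_confusion([[8, 4, 2], [5, 7, 1], [1, 2, 6]]):.3f}")

Running it on the matrix from this issue reproduces the 0.365 above; given raw label lists instead of a matrix, sklearn.metrics.cohen_kappa_score returns the same value directly.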
In this issue, we’ve learned how to calculate Multiclass Cohen’s Kappa using a confusion matrix and Python. A Kappa value of 0.365 indicates a fair level of agreement between the two annotators, which can help us assess inter-rater reliability in various fields, from medical image classification to sentiment analysis.
Stay tuned for more insights into practical applications of Python and AI in upcoming issues of GPT-Python Pulse!