Meet Entropy
Saifeldin Elghetany
Senior Data Analyst @ MNT-Halan | Teaching Assistant @ FEPS
Imagine that you are working on a classification problem using a dataset that is diverse, with instances spread across different classes without a clear pattern, and you want to make accurate predictions about the class or outcome of a data point.
Faced with such a challenge, and being a resourceful problem-solver, you know that you will have to measure this “uncertainty” before you can find a way to reduce it, right?
Enter “Entropy”.
In Data Science, Entropy is a metric that measures the degree of uncertainty in a dataset.
Given this definition, a reduction in entropy is an “Information Gain”.
Great, now how to actually measure it?
You check your dataset: for each class, count what proportion of the instances belongs to it. Then plug those proportions into the formula for Entropy:
Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the proportion of instances belonging to class i.
Now you evaluate the number. For a binary classification problem, entropy ranges from 0 to 1: the closer it is to 0, the more homogeneous the classes are, and the more certain your predictions become.
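To make this concrete, here is a minimal pure-Python sketch of the calculation (the function name `entropy` and the toy label lists are my own, not from the article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# A 50/50 class split is maximally uncertain:
print(entropy(["spam", "spam", "ham", "ham"]))  # 1.0
```

A perfectly homogeneous sample (all labels identical) gives 0, since log₂(1) = 0.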
Now that you've measured uncertainty, you can reduce it by using “Decision Trees”.
A decision tree is a supervised learning algorithm that divides a dataset into subsets. At each “node” (a point where the tree makes a decision or performs a split based on a certain feature of the input data), the algorithm selects the feature that best separates the data, building a tree-like structure of decision rules.
The idea is to choose splits that minimize the entropy of the resulting subsets, thereby creating more homogeneous groups with respect to the target variable that you are trying to predict.
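As a sketch of that idea (the function names and toy dataset below are mine, not from the article), information gain is just the parent node's entropy minus the size-weighted average entropy of the subsets a split produces; the split with the highest gain wins:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# A split that yields two pure subsets removes all the uncertainty:
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A useless split (each subset as mixed as the parent) has a gain of 0, which is why the algorithm would never choose it.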
Once your decision tree is in place in your preferred analytics tool, you should evaluate its performance with metrics such as accuracy, precision, recall, F1 score, the confusion matrix, and the ROC curve, to ensure your decision tree classifies effectively.
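As an illustration (the label vectors below are invented, and this is a hand-rolled sketch rather than a library call), the first four of those metrics all derive from the confusion-matrix counts:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical predictions from a fitted tree vs. the true labels:
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

In practice you would use a library implementation, but seeing the formulas spelled out makes it clear what each metric trades off.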
Thanks for reading!