Meet Entropy
Saifeldin Elghetany
Senior Data Analyst @ MNT-Halan | Teaching Assistant @ FEPS
Imagine that you are working on a classification problem using a dataset that is diverse, with instances spread across different classes without a clear pattern, and you want to make accurate predictions about the class or outcome of a data point.
Faced with such a challenge, and being a resourceful problem-solver, you know that you will have to measure this “uncertainty” before you can find a way to reduce it, right?
Enter “Entropy”.
In Data Science, Entropy is a metric that measures the degree of uncertainty in a dataset.
Given this definition, a reduction in entropy is an “Information Gain”.
Great, now how to actually measure it?
You check your dataset: for each class, count what proportion of the instances belongs to it. Then plug those proportions into the formula for Entropy:
Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ), where pᵢ is the proportion of instances belonging to class i.
Now you evaluate the number. For a binary classification problem, entropy ranges from 0 to 1: the closer it is to 0, the more homogeneous the classes are, and the more certain your predictions become.
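To make this concrete, here is a minimal pure-Python sketch of the calculation (the function name `entropy` and the toy label lists are my own, not from the article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# A 50/50 class split is maximally uncertain:
print(entropy(["spam", "spam", "ham", "ham"]))  # 1.0
```

A perfectly homogeneous sample (all labels identical) gives 0, since log₂(1) = 0.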
Now that you've measured uncertainty, you can reduce it by using “Decision Trees”.
A decision tree is a supervised learning algorithm that divides a dataset into subsets. At each “node” (a point where the tree makes a decision or performs a split based on a certain feature of the input data), the algorithm selects the feature that best separates the data, building a tree-like structure of decision rules.
The idea is to choose splits that minimize the entropy of the resulting subsets, thereby creating more homogeneous groups with respect to the target variable that you are trying to predict.
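As a sketch of that idea (the function names and toy dataset below are mine, not from the article), information gain is just the parent node's entropy minus the size-weighted average entropy of the subsets a split produces; the split with the highest gain wins:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# A split that yields two pure subsets removes all the uncertainty:
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A useless split (each subset as mixed as the parent) has a gain of 0, which is why the algorithm would never choose it.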
Once your decision tree is in place in your preferred analytics tool, you should evaluate its performance with metrics such as accuracy, precision, recall, F1 score, the confusion matrix, and the ROC curve, to ensure your decision tree classifies effectively.
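As an illustration (the label vectors below are invented, and this is a hand-rolled sketch rather than a library call), the first four of those metrics all derive from the confusion-matrix counts:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical predictions from a fitted tree vs. the true labels:
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

In practice you would use a library implementation, but seeing the formulas spelled out makes it clear what each metric trades off.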
Thanks for reading!