Secrets of Decision Trees: A Guide to Entropy, Gini, and Information Gain

Application: Decision trees are supervised learning algorithms used for classification and regression tasks.

Focus: Classification with decision trees

Basic Concepts:

  • Root Node: The top of the tree; it holds the most informative attribute (e.g., Outlook in the classic play-tennis example).
  • Split: The division of a node into two or more sub-nodes.
  • Leaf Node: A node that does not split further is called a leaf or terminal node (e.g., the pure Overcast branch in the play-tennis example).
  • CART produces binary splits at each node, whereas ID3 allows multi-way splits, one branch per attribute value (a minimal fitting sketch follows this list).
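
To make these terms concrete, the sketch below fits scikit-learn's DecisionTreeClassifier (which implements CART) on a toy play-tennis-style table. The rows are illustrative, not the data from the original screenshot, and the categorical feature is one-hot encoded because CART needs numeric inputs.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy play-tennis-style data (illustrative values only).
    data = pd.DataFrame({
        "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
        "Windy":   [False, True, False, False, True, True],
        "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
    })

    # CART works on numeric inputs, so one-hot encode the categorical feature.
    X = pd.get_dummies(data[["Outlook", "Windy"]])
    y = data["Play"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))

The printed tree shows the root node's attribute at the top, each split as a branch, and the terminal class predictions as leaves.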

Measuring Purity and Impurity:

  • Entropy: This measure quantifies the level of uncertainty, or randomness, in a dataset's class labels. Impurity: a dataset with equal numbers of positive and negative examples carries maximum uncertainty, giving an entropy of 1. Purity: a perfectly homogeneous dataset, with every instance in a single class, has an entropy of 0, signifying complete certainty. Algorithms that use it: ID3 and its descendants, which handle multi-class classification and are well suited to small datasets.

Calculation:
Entropy = - Σ(p(i) * log2(p(i)))        
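
A short sketch of this formula using NumPy (the entropy helper below is hypothetical, not from the article):

    import numpy as np

    def entropy(labels):
        # Shannon entropy of class labels, in bits (log base 2).
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    print(entropy(["yes", "yes", "no", "no"]))    # 1.0 (maximum uncertainty)
    print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node; NumPy prints -0.0)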


Gini Index

This metric estimates the likelihood of randomly misclassifying an instance drawn from the dataset. Impurity: a perfectly balanced binary dataset (equal class distribution) has the maximum Gini impurity of 0.5, indicating a 50% chance of misclassification. Purity: as the data becomes more homogeneous, the Gini index approaches 0, signifying a lower probability of misclassification.

Algorithms that use it: CART. The Gini index is computationally cheaper than entropy (it avoids logarithms) and is well suited to large datasets.

Gini Impurity = 1 - Σ(p(i))^2        
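
The same idea for the Gini index, as a minimal sketch (the gini_impurity helper is hypothetical):

    import numpy as np

    def gini_impurity(labels):
        # 1 minus the sum of squared class probabilities.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    print(gini_impurity(["yes", "yes", "no", "no"]))  # 0.5 (maximum for two classes)
    print(gini_impurity(["no", "no", "no", "no"]))    # 0.0 (pure node)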

Choosing the split:

  • Information Gain: This concept builds upon the notion of entropy and measures the reduction in uncertainty brought about by splitting the data based on a specific feature. The feature leading to the highest information gain is chosen for the split at a particular node, as it promotes the most significant reduction in randomness and aids in clearer class separation.

Information Gain = Entropy(parent) - Σ [ (weight of child) * Entropy(child) ]        
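
A worked sketch of this formula on the classic 14-day play-tennis data split by Outlook (the helper functions are illustrative, assuming the standard 9-yes/5-no label counts):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, children):
        # Parent entropy minus the size-weighted entropy of the child nodes.
        n = len(parent)
        weighted = sum(len(c) / n * entropy(c) for c in children)
        return entropy(parent) - weighted

    parent   = ["yes"] * 9 + ["no"] * 5   # 14 days: 9 play, 5 don't
    sunny    = ["yes"] * 2 + ["no"] * 3
    overcast = ["yes"] * 4                # pure leaf
    rain     = ["yes"] * 3 + ["no"] * 2
    print(information_gain(parent, [sunny, overcast, rain]))  # ~0.247

Repeating this calculation for every candidate feature and picking the largest gain is how ID3 ends up selecting Outlook as the root node.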

Conclusion:

Decision trees are a simple, interpretable choice for classification problems, especially when the output is a discrete set of categorical values.

