Exploring Decision Trees: The Branching Paths of Data


A decision tree is a non-parametric (it doesn't assume that your data follows a specific shape or pattern), supervised machine learning algorithm used for classification and regression tasks. It features a tree-like structure comprising a root node, branches, internal nodes, and leaf nodes. As the name suggests, a decision tree uses a flowchart-like structure to arrive at its predictions.

Think of a decision tree as a visual flowchart for decision-making. Similar to a real tree, it consists of branches and leaves. At the top, you have the 'root' node, which represents the starting point. As you move down the tree, you encounter 'internal' nodes, which serve as decision points, and finally, 'leaf' nodes, which provide the answers or predictions.


Let's say you want to decide whether to go for a picnic. Your decision tree might start with the question, "Is it sunny?" If it's sunny, you might go, but if it's not, you'd consider another factor like, "Is it windy?" If it's windy, you might change your mind, but if it's not, you decide to go on a picnic. Each question and answer guides you to the next step until you reach your final decision.


Decision trees, in layman's terms, can be thought of as a series of if-else statements. The tree checks a condition at each node and, depending on whether the condition is true or false, moves to the corresponding child node to make the next decision, as illustrated in the sketch below.
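To make the if-else analogy concrete, here is a minimal sketch of the picnic decision written as nested conditionals. The feature names (sunny, windy) are illustrative assumptions, not a trained model:

```python
def go_for_picnic(is_sunny: bool, is_windy: bool) -> bool:
    """A hand-written 'decision tree' for the picnic example."""
    if is_sunny:          # root node: check the weather
        return True       # leaf: sunny -> go for the picnic
    if is_windy:          # internal node: check the wind
        return False      # leaf: windy -> stay home
    return True           # leaf: not sunny but calm -> go for the picnic

print(go_for_picnic(is_sunny=False, is_windy=False))  # True
```

A real decision tree algorithm learns which conditions to check (and in what order) from data; this sketch only hard-codes the flowchart by hand.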

In this way, decision trees systematically break down complex decisions into a sequence of simpler choices, making them a powerful tool for problem-solving tasks.

Let's first understand some basic terminology used in decision trees.

  • Root Node - Where It All Begins: Think of the root node as the starting point of your decision tree. It's where you begin considering your entire dataset.
  • Decision Nodes - Checkpoints in Your Journey: Decision nodes are like checkpoints in your decision-making journey. They help divide your data into different groups based on specific conditions.
  • Terminal / Leaf Nodes - Endpoints of Your Decision: The leaf nodes are where your decision tree ends. They represent the final outcome or classification, and this is where we stop making further decisions.
  • Branch / Sub-Tree - Focusing on the Details: A sub-tree is a smaller part of your decision tree that concentrates on specific conditions, each with its own set of decisions and results.
  • Pruning - Trimming the Excess: Pruning is like trimming the unnecessary branches/parts of a tree. It simplifies the decision tree and prevents it from overfitting (memorizing the training data rather than learning general patterns); a brief pruning sketch follows this list.
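As a hedged illustration of pruning in practice, scikit-learn's DecisionTreeClassifier supports cost-complexity pruning through its ccp_alpha parameter; the dataset and alpha value below are assumptions chosen only for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unpruned tree keeps splitting until its leaves are pure (risk of overfitting).
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Cost-complexity pruning: a larger ccp_alpha trims more branches.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_train, y_train)

print("full tree leaves:  ", full_tree.get_n_leaves())
print("pruned tree leaves:", pruned_tree.get_n_leaves())
print("pruned test score: ", pruned_tree.score(X_test, y_test))
```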

Example of a decision tree

In the given diagram, the decision tree begins by asking about the weather: is it sunny, cloudy, or rainy? Depending on the answer, it goes on to consider factors like humidity and wind. For instance, it checks whether the wind is strong or weak, and if it's rainy with weak wind, it recommends going out to play.



Now, you might have noticed something interesting in this flowchart. When the weather is cloudy, the decision tree doesn't ask any further questions. You might wonder why it doesn't split more. The answer lies in more advanced concepts like entropy, information gain, and Gini index used in decision tree construction.

In simpler terms, the reason the decision tree stops at "cloudy" is that for the training dataset, the answer to whether you should play is always "yes" when it's cloudy, so there's no need to ask additional questions. The decision is straightforward, and that's why the tree stops at that point.
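To see this behaviour in code, here is a small, hedged sketch that trains a tree on a hand-made "play outside" dataset. The rows are illustrative assumptions, not the diagram's exact data, but they share the key property that every cloudy day is a "yes": once a node contains only those rows, the tree has nothing left to split.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative dataset (assumed values, in the spirit of the diagram).
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "cloudy", "cloudy", "rainy", "rainy", "rainy", "cloudy"],
    "wind":    ["weak",  "strong", "weak",  "strong", "weak",  "strong", "weak",  "weak"],
    "play":    ["no",    "no",     "yes",   "yes",    "yes",   "no",     "yes",   "yes"],
})

# One-hot encode the categorical features so the tree can use them.
X = pd.get_dummies(data[["outlook", "wind"]])
y = data["play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned rules; branches that become pure stop splitting immediately.
print(export_text(tree, feature_names=list(X.columns)))
```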

Now, let's understand what entropy is.

Entropy is a measure of the impurity or disorder in a dataset. In a decision tree, it is used to determine how to split the dataset at each node of the tree.

In a decision tree, the goal is to create splits (branches) that result in subsets of data that are as homogeneous as possible, meaning all elements in a subset belong to the same class.


The entropy of a subset S is computed from the class proportions within it:

Entropy(S) = - Σ (pi * log2(pi)), summed over the C classes

  • S is the subset of training examples being evaluated
  • C is the number of categories/classes
  • pi is the proportion of examples in S that belong to class i

For a two-class problem, the entropy value ranges from 0 to 1 (with more classes, the maximum is log2(C)).

  • If the dataset is perfectly homogeneous (all elements belong to the same class), the entropy is 0.
  • If the dataset is equally divided among the classes, the entropy is at its maximum (1 for two equally likely classes), i.e. maximum disorder.

In decision tree algorithms, the goal is to minimize entropy at each split, resulting in subsets that are more pure, making the classification task easier and more accurate.
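A minimal sketch of the entropy computation (plain Python, no special libraries assumed) makes these two extremes concrete:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 -> perfectly pure
print(entropy(["yes", "yes", "no", "no"]))    # 1.0 -> 50/50 split, maximum disorder
```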

Now, let's understand entropy with the help of an example:

Imagine you're planning a picnic, and you want to check the weather forecast to decide whether to go or stay home. You look at the forecast, and it says one of three things: "Sunny," "Cloudy," or "Rainy."

Now, imagine you've been keeping track of how many times each of these forecasts turned out to be true. Here's what you find:

  • "Sunny" was true 20 times.
  • "Cloudy" was true 10 times.
  • "Rainy" was true 5 times.

The total number of observations is 35.
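As a hedged stand-in for the article's worked figure, the entropy of this forecast distribution can be computed directly from the counts; with three unevenly split outcomes it comes out to roughly 1.38 bits:

```python
from math import log2

counts = {"Sunny": 20, "Cloudy": 10, "Rainy": 5}
total = sum(counts.values())  # 35

entropy = -sum((c / total) * log2(c / total) for c in counts.values())
print(round(entropy, 3))  # ~1.379 (the maximum for 3 classes would be log2(3) ~ 1.585)
```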

Always remember: the higher the entropy, the lower the purity (and the higher the impurity) of the node.

As mentioned earlier, the goal is to decrease the uncertainty or impurity in the dataset. Entropy tells us the impurity of a particular node, but on its own it doesn't tell us whether a split has actually reduced that impurity relative to the parent node.

For this, we use a second metric called "information gain", which tells us how much the parent's entropy has decreased after splitting on a particular feature.

Information Gain:

Information Gain is a metric that quantifies how much information a feature provides about the class labels, and it is used to select the best feature for a split. It is computed as the parent node's entropy minus the weighted average entropy of the child nodes produced by splitting on that feature:

Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)

where Sv is the subset of S for which feature A takes value v.

Let's understand with an example:
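The original worked example referred to a figure; as a hedged stand-in, here is a small sketch that computes the information gain of a "wind" feature on an assumed toy dataset, reusing the same entropy idea as above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# Assumed toy data: (wind, play) pairs.
rows = [("weak", "yes"), ("weak", "yes"), ("weak", "no"), ("weak", "yes"),
        ("strong", "no"), ("strong", "no"), ("strong", "yes"), ("weak", "yes")]

labels = [play for _, play in rows]
parent_entropy = entropy(labels)

# Subtract the weighted average entropy of the children after splitting on "wind".
gain = parent_entropy
for value in {"weak", "strong"}:
    subset = [play for wind, play in rows if wind == value]
    gain -= (len(subset) / len(rows)) * entropy(subset)

print(round(parent_entropy, 3), round(gain, 3))  # roughly 0.95 and 0.16 for this assumed data
```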


The higher the Information Gain for a feature, the more it reduces the entropy of the parent node, and therefore the better it is as a splitting choice. Features with higher Information Gain are typically selected for node splitting during the construction of the tree.

Understanding Gini Impurity:

Gini impurity, sometimes referred to as the Gini index, is another metric used in decision tree algorithms to measure impurity or disorder within a dataset, and it serves as an alternative to entropy for assessing the quality of splits. Intuitively, it quantifies how mixed up your data is: it ranges from 0 (perfectly pure, where all data points belong to a single class) to 0.5 (completely impure for a two-class problem, where data points are evenly split between the classes).

Gini Impurity Formula:

Gini = 1 - Σ (pi)^2, where pi is the proportion of instances in the node that belong to class i. (A short code sketch follows the two bullets below.)

  • Low Gini Impurity: When Gini Impurity is low, it implies that the data is "pure." In other words, the majority of data points belong to the same category or class, making classification straightforward.
  • High Gini Impurity: Conversely, when Gini Impurity is high, the data is "impure" or mixed. Data points are spread across different categories or classes, making accurate classification more challenging.
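A small hedged sketch (plain Python, with assumed labels) shows these two regimes side by side:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in the node."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node
print(gini(["yes", "yes", "no", "no"]))    # 0.5 -> maximally mixed (two classes)
```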

Difference between Entropy and Gini Impurity

Gini Impurity:

  • It is defined as the probability of misclassifying a randomly chosen element if it were labelled at random according to the distribution of classes in the node.
  • Gini impurity is calculated as follows for a node: Gini(p) = 1 - Σ (pi)^2 where pi is the proportion of instances in the node belonging to class i.
  • Gini impurity values range from 0 to 0.5 for a two-class problem, where 0 indicates perfect purity (all instances in the node belong to one class), and 0.5 indicates maximum impurity (an equal split between the two classes).
  • In decision tree algorithms, lower Gini impurity values are preferred for splitting nodes.
  • Gini impurity involves simpler arithmetic operations, such as squaring and summing the proportions of classes in a node, which are computationally less expensive.

Entropy:

  • Entropy measures the level of disorder or uncertainty in a dataset.
  • It quantifies the average amount of information needed to classify an element in the dataset.
  • The formula for entropy is given as: Entropy(p) = - Σ (pi * log2(pi)) where pi is the proportion of instances in the node belonging to class i.
  • Entropy values range from 0 to 1 for a two-class problem, where 0 indicates perfect purity (all instances in the node belong to one class), and 1 indicates maximum impurity (an equal split between the two classes).
  • In decision tree algorithms, lower entropy values are preferred for splitting nodes, as they represent less uncertainty.
  • Entropy involves more complex mathematical operations, such as logarithms and multiplication, making it computationally more expensive.

Gini impurity is often preferred when efficiency is a concern, as it is quicker to compute in the context of decision tree algorithms.
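In scikit-learn, this choice is simply the criterion parameter of DecisionTreeClassifier. As a brief hedged sketch (iris data used purely as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare the two splitting criteria with 5-fold cross-validation.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{criterion:8s} mean CV accuracy: {score:.3f}")
```

In practice the two criteria usually produce very similar trees; the difference is mainly computational.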



In conclusion, decision trees are a powerful tool in the world of data science and machine learning. By understanding how they work and when to use them, you can make more informed decisions and create accurate predictive models. So, are you ready to start building your decision trees and unlocking the potential of your data?




