R Decision Trees

What are Decision Trees in R?

The decision tree is one of the most intuitive and popular data mining methods: it provides explicit rules for classification and copes well with heterogeneous data, missing data, and nonlinear effects. It predicts the target value of an item by mapping the item's observed attributes through a sequence of simple rules.

You can perform either classification or regression tasks with decision trees. For example, identifying fraudulent credit card transactions would be a classification task, while forecasting stock prices would be a regression task.
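As a brief, hedged illustration (assuming the rpart package and R's built-in iris and mtcars data sets), the two kinds of task differ only in the method argument:

library(rpart)

# Classification: predict a categorical target (flower species)
class_tree <- rpart(Species ~ ., data = iris, method = "class")
predict(class_tree, iris[1:3, ], type = "class")

# Regression: predict a numeric target (fuel consumption)
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
predict(reg_tree, mtcars[1:3, ])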


Applications of Decision Trees

Following are the common areas for applying decision trees:

  • Direct Marketing – When marketing products and services, a business should track the products and services offered by its competitors, as this identifies the best combination of products and marketing channels to target specific sets of consumers.
  • Customer Retention – Decision trees help organizations keep their valuable customers and win new ones by informing decisions about product quality, discounts, and gift vouchers. They can also analyze customers' buying behavior and reveal their satisfaction levels.
  • Fraud Detection – Fraud is a major problem for many industries. Using a classification tree, a business can detect fraud beforehand and screen out fraudulent customers.
  • Diagnosis of Medical Problems – Classification trees can identify patients who are at risk of serious diseases such as cancer and diabetes.


Principle of Decision Trees

You can use the decision tree technique to detect the criteria for dividing the individuals of a group into n predetermined classes. (Often the splits are binary, meaning each parent node has at most two child nodes.)

First, a variable is taken as the root node; this should be the variable that best separates the classes. The population is then divided on this variable so that subpopulations, called nodes, are generated. The same operation is repeated on each new node until no further separation of individuals is possible. The tree is constructed so that each terminal node consists of individuals of a single class.

A tree in which each node has no more than two child nodes is called a binary tree. The first node is called the root, and the terminal nodes are known as leaves.

To create a decision tree, you need to follow certain steps:

i) Choosing a Variable – The choice of the variable that best separates the individuals of each class, and the precise separation condition on that variable, both depend on the type of decision tree.

A binary variable allows a single separation condition, while a continuous variable with n distinct values allows n − 1 possible separation conditions.

For two consecutive sorted values x_k and x_{k+1} of a continuous variable X, the condition for separation is the midpoint split:

X <= (x_k + x_{k+1}) / 2
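As a small worked example (toy values, assumed only for illustration), the n − 1 candidate split points of a continuous variable are the midpoints of its consecutive distinct sorted values:

# Distinct sorted values of a continuous variable (toy data)
x <- sort(unique(c(2.3, 5.1, 5.1, 7.8, 9.0)))   # 4 distinct values
# Midpoints of consecutive values give the 3 candidate separations
splits <- (head(x, -1) + tail(x, -1)) / 2
splits
# [1] 3.70 6.45 8.40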

When the best separation is found, it is applied, and the operation is then repeated on each node to increase the discrimination.

The density of a node is the ratio of the number of individuals in the node to the size of the total population.

ii) Splitting the Node – Once the best separation is found, the classes are split to create child nodes; a splitting variable is selected at this step as well. The criteria used to choose the best separation are the following:

a) The χ² Test – The χ² test is used to test the independence of two variables X and Y. Let O_ij be the observed count in cell (i, j) of the contingency table of X and Y, and let T_ij be the theoretical count expected if X and Y were independent (the product of the row and column totals divided by the overall total). The test statistic is calculated with the following formula:

χ² = Σ_i Σ_j (O_ij − T_ij)² / T_ij
The number of degrees of freedom can be calculated as follows:

p = (number of rows − 1) × (number of columns − 1)
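In R, chisq.test() performs exactly this computation on a contingency table; the counts below are made up for illustration:

# 2 x 3 contingency table of observed counts O_ij (made-up data)
O <- matrix(c(20, 30, 25,
              15, 40, 20), nrow = 2, byrow = TRUE)
test <- chisq.test(O)
test$expected   # theoretical counts T_ij under independence
test$statistic  # the chi-squared statistic
test$parameter  # degrees of freedom: (2 - 1) x (3 - 1) = 2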

b) The Gini Index – The Gini index is a measure of the purity of a node. It can be used for all types of dependent variables and is calculated with the following formula:

Gini = 1 − Σ_i f_i²

where f_i, i = 1, …, p, are the relative frequencies in the node of the p classes to be predicted.

The more evenly distributed the classes are in the node, the higher the Gini index; as the purity of the node increases, the Gini index decreases. The Gini index measures the probability that two individuals picked at random with replacement from a node belong to two different classes.
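A one-line R function (the name gini is ours) makes this behaviour concrete:

# Gini index from the relative class frequencies f_i of a node
gini <- function(f) 1 - sum(f^2)

gini(c(0.5, 0.5))   # evenly mixed node: 0.50 (maximum for 2 classes)
gini(c(0.9, 0.1))   # purer node: 0.18
gini(c(1, 0))       # pure node: 0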

iii) Assigning Data to Nodes – Once the tree is constructed and the division criteria have been established, each individual can be assigned to exactly one leaf, determined by the values of the individual's independent variables.

A leaf is assigned to a class if the cost of assigning the leaf to any other class would be higher than the cost of assigning it to the current class.

The cost is calculated from the error rates of the leaves: starting from the error rate of each leaf, the error rate of the tree, also called the total cost or the risk of the tree, is obtained by weighting each leaf's error rate by that leaf's density and summing over the leaves.
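A minimal sketch of that computation, with made-up leaf densities and error rates:

# Risk of the tree: each leaf's error rate weighted by its density
leaf_density <- c(0.5, 0.3, 0.2)    # proportions of the population
leaf_error   <- c(0.10, 0.25, 0.05) # misclassification rate per leaf
tree_risk <- sum(leaf_density * leaf_error)
tree_risk
# [1] 0.135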

iv) Pruning the Tree – Deep decision trees may need pruning because their leaves can contain irrelevant nodes. An algorithm is considered good if it grows a maximal tree and then automatically prunes it back after detecting the optimal pruning threshold. To choose the best subtree, it is necessary to use cross-validation and compare the error rates found for all possible subtrees.

It is important to shorten the branches of very deep trees to avoid creating very small nodes with no real statistical significance.
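In R's rpart package this workflow is built in: the tree is grown large, the cross-validated error of every subtree is recorded in cptable, and prune() cuts the tree back at the chosen complexity threshold. A sketch on the built-in iris data:

library(rpart)

# Grow a deliberately large tree (cp = 0 disables early stopping)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))
printcp(fit)   # cross-validated error (xerror) for each subtree

# Keep the subtree whose cross-validated error is lowest
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)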


Building Decision Trees

Decision trees belong to a class of recursive partitioning algorithms that are simple to describe and implement. For the decision tree algorithms described earlier, the steps are as follows:

  • You should assess the best way to split data into subgroups for each candidate input variable.
  • Then select the best split and divide data into the subgroups defined by the split.
  • Now you pick a subgroup and repeat Step 1 for every subgroup.
  • You should continue splitting until all the records after a split belong to the same target variable value, or until another stop condition occurs.

The stop condition may be as complicated as a statistical significance test or as simple as a minimum record count.
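Assuming the rpart package, this entire recursive loop is what a single call performs; the stop conditions correspond to control parameters such as minsplit (the least record count) and cp (the minimum improvement a split must achieve):

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20,  # least record count
                                     cp = 0.01))     # minimum split gain
print(fit)      # the chosen splits and resulting subgroups
summary(fit)    # also lists competing splits at each node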

Decision trees are nonlinear predictors, meaning that the decision boundary between target variable classes is nonlinear. The extent of the nonlinearity depends on the number of splits in the tree.

As the tree becomes more complex through increasing depth, more piecewise-constant separators are built into the decision boundary, providing nonlinear separation.
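One way to see this in R (a sketch using rpart's frame component, which marks terminal nodes with "<leaf>"): each extra leaf is one more piecewise-constant region of the decision boundary.

library(rpart)

n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

shallow <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 1))
deep    <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 30, cp = 0, minsplit = 2))

n_leaves(shallow)  # few leaves: a coarse, almost linear boundary
n_leaves(deep)     # many leaves: a finer, more nonlinear boundary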

Decision trees are based on a forward selection mechanism, so a split cannot be revisited once it is created.


Certain guidelines should be followed for creating decision trees:

  • Decision trees incorporate only one variable per split, so if no single variable separates the individuals on its own, the tree may not start well. Trees need attributes that provide some lift right away; try including multivariate features as candidates if the modeler is aware of them.
  • Decision trees are considered weak learners, or unstable models, because small changes in the data can produce significant changes in how the tree looks and behaves. Examining the competitors to the winning split can be very helpful in understanding how valuable the winning split is and whether other variables would do as well.


  • Decision trees are biased toward selecting categorical variables with large numbers of levels. If a categorical variable has a high number of levels, turn on cardinality penalties or recode the variable to have fewer levels.
  • Trees can run out of data before they find good models. Because each split reduces the number of remaining records, later splits are based on fewer and fewer records and hence have less statistical power.
  • Single trees are often not as accurate as other algorithms in predictive accuracy, because of forward variable selection and the piecewise-constant splitting of nodes.
