R Decision Trees

What are Decision Trees in R?

The decision tree is one of the most intuitive and popular data mining methods: it provides explicit rules for classification and copes well with heterogeneous data, missing data, and nonlinear effects. It predicts the target value of an item by mapping the item's observed attributes through a sequence of simple rules.

You can perform either classification or regression tasks with decision trees. For example, identifying fraudulent credit card transactions would be a classification task, while forecasting stock prices would be a regression task.
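As a brief, hedged illustration (assuming the rpart package and R's built-in iris and mtcars data sets), the two kinds of task differ only in the method argument:

library(rpart)

# Classification: predict a categorical target (flower species)
class_tree <- rpart(Species ~ ., data = iris, method = "class")
predict(class_tree, iris[1:3, ], type = "class")

# Regression: predict a numeric target (fuel consumption)
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")
predict(reg_tree, mtcars[1:3, ])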


Applications of Decision Trees

Following are the common areas for applying decision trees:

  • Direct Marketing – When marketing products and services, a business should track the products and services offered by its competitors, as this identifies the best combination of products and marketing channels to target specific sets of consumers.
  • Customer Retention – Decision trees help organizations keep their valuable customers and win new ones by informing decisions about product quality, discounts, and gift vouchers. They can also analyze customers' buying behavior and reveal their satisfaction levels.
  • Fraud Detection – Fraud is a major problem for many industries. Using a classification tree, a business can detect fraud beforehand and screen out fraudulent customers.
  • Diagnosis of Medical Problems – Classification trees can identify patients who are at risk of serious diseases such as cancer and diabetes.


Principle of Decision Trees

You can use the decision tree technique to detect the criteria for dividing the individuals of a group into n predetermined classes. (Often the splits are binary, meaning each parent node has at most two child nodes.)

First, a variable is taken as the root node; this should be the variable that best separates the classes. The population is then divided on this variable so that subpopulations, called nodes, are generated. The same operation is repeated on each new node until no further separation of individuals is possible. The tree is constructed so that each terminal node consists of individuals of a single class.

A tree in which each node has no more than two child nodes is called a binary tree. The first node is called the root, and the terminal nodes are known as leaves.

To create a decision tree, you need to follow certain steps:

i) Choosing a Variable – The choice of the variable that best separates the individuals of each class, and the precise separation condition on that variable, both depend on the type of decision tree.

A binary variable allows a single separation condition, while a continuous variable with n distinct values allows n − 1 possible separation conditions.

For two consecutive sorted values x_k and x_{k+1} of a continuous variable X, the condition for separation is the midpoint split:

X <= (x_k + x_{k+1}) / 2
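As a small worked example (toy values, assumed only for illustration), the n − 1 candidate split points of a continuous variable are the midpoints of its consecutive distinct sorted values:

# Distinct sorted values of a continuous variable (toy data)
x <- sort(unique(c(2.3, 5.1, 5.1, 7.8, 9.0)))   # 4 distinct values
# Midpoints of consecutive values give the 3 candidate separations
splits <- (head(x, -1) + tail(x, -1)) / 2
splits
# [1] 3.70 6.45 8.40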

When the best separation is found, it is applied, and the operation is then repeated on each node to increase the discrimination.

The density of a node is the ratio of the number of individuals in the node to the size of the total population.

ii) Splitting the Node – Once the best separation is found, the classes are split to create child nodes; a splitting variable is selected at this step as well. The criteria used to choose the best separation are the following:

a) The χ² Test – The χ² test is used to test the independence of two variables X and Y. Let O_ij be the observed count in cell (i, j) of the contingency table of X and Y, and let T_ij be the theoretical count expected if X and Y were independent (the product of the row and column totals divided by the overall total). The test statistic is calculated with the following formula:

χ² = Σ_i Σ_j (O_ij − T_ij)² / T_ij
The number of degrees of freedom can be calculated as follows:

p = (number of rows − 1) × (number of columns − 1)
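In R, chisq.test() performs exactly this computation on a contingency table; the counts below are made up for illustration:

# 2 x 3 contingency table of observed counts O_ij (made-up data)
O <- matrix(c(20, 30, 25,
              15, 40, 20), nrow = 2, byrow = TRUE)
test <- chisq.test(O)
test$expected   # theoretical counts T_ij under independence
test$statistic  # the chi-squared statistic
test$parameter  # degrees of freedom: (2 - 1) x (3 - 1) = 2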

b) The Gini Index – The Gini index is a measure of the purity of a node. It can be used for all types of dependent variables and is calculated with the following formula:

Gini = 1 − Σ_i f_i²

where f_i, i = 1, …, p, are the relative frequencies in the node of the p classes to be predicted.

The more evenly distributed the classes are in the node, the higher the Gini index; as the purity of the node increases, the Gini index decreases. The Gini index measures the probability that two individuals picked at random with replacement from a node belong to two different classes.
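A one-line R function (the name gini is ours) makes this behaviour concrete:

# Gini index from the relative class frequencies f_i of a node
gini <- function(f) 1 - sum(f^2)

gini(c(0.5, 0.5))   # evenly mixed node: 0.50 (maximum for 2 classes)
gini(c(0.9, 0.1))   # purer node: 0.18
gini(c(1, 0))       # pure node: 0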

iii) Assigning Data to Nodes – Once the tree is constructed and the division criteria have been established, each individual can be assigned to exactly one leaf, determined by the values of the individual's independent variables.

A leaf is assigned to a class if the cost of assigning the leaf to any other class would be higher than the cost of assigning it to the current class.

The cost is calculated from the error rates of the leaves: starting from the error rate of each leaf, the error rate of the tree, also called the total cost or the risk of the tree, is obtained by weighting each leaf's error rate by that leaf's density and summing over the leaves.
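A minimal sketch of that computation, with made-up leaf densities and error rates:

# Risk of the tree: each leaf's error rate weighted by its density
leaf_density <- c(0.5, 0.3, 0.2)    # proportions of the population
leaf_error   <- c(0.10, 0.25, 0.05) # misclassification rate per leaf
tree_risk <- sum(leaf_density * leaf_error)
tree_risk
# [1] 0.135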

iv) Pruning the Tree – Deep decision trees may need pruning because their leaves can contain irrelevant nodes. An algorithm is considered good if it grows a maximal tree and then automatically prunes it back after detecting the optimal pruning threshold. To choose the best subtree, it is necessary to use cross-validation and compare the error rates found for all possible subtrees.

It is important to shorten the branches of very deep trees to avoid creating very small nodes with no real statistical significance.
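In R's rpart package this workflow is built in: the tree is grown large, the cross-validated error of every subtree is recorded in cptable, and prune() cuts the tree back at the chosen complexity threshold. A sketch on the built-in iris data:

library(rpart)

# Grow a deliberately large tree (cp = 0 disables early stopping)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))
printcp(fit)   # cross-validated error (xerror) for each subtree

# Keep the subtree whose cross-validated error is lowest
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)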


Building Decision Trees

Decision trees belong to a class of recursive partitioning algorithms that are simple to describe and implement. For the decision tree algorithms described earlier, the steps are as follows:

  • You should assess the best way to split data into subgroups for each candidate input variable.
  • Then select the best split and divide data into the subgroups defined by the split.
  • Now you pick a subgroup and repeat Step 1 for every subgroup.
  • You should continue splitting until all the records after a split belong to the same target variable value, or until another stop condition occurs.

The stop condition may be as complicated as a statistical significance test or as simple as a minimum record count.
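Assuming the rpart package, this entire recursive loop is what a single call performs; the stop conditions correspond to control parameters such as minsplit (the least record count) and cp (the minimum improvement a split must achieve):

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20,  # least record count
                                     cp = 0.01))     # minimum split gain
print(fit)      # the chosen splits and resulting subgroups
summary(fit)    # also lists competing splits at each node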

Decision trees are nonlinear predictors, meaning that the decision boundary between target variable classes is nonlinear. The extent of the nonlinearity depends on the number of splits in the tree.

As the tree becomes more complex through increasing depth, more piecewise-constant separators are built into the decision boundary, providing nonlinear separation.
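One way to see this in R (a sketch using rpart's frame component, which marks terminal nodes with "<leaf>"): each extra leaf is one more piecewise-constant region of the decision boundary.

library(rpart)

n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

shallow <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 1))
deep    <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 30, cp = 0, minsplit = 2))

n_leaves(shallow)  # few leaves: a coarse, almost linear boundary
n_leaves(deep)     # many leaves: a finer, more nonlinear boundary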

Decision trees are based on a forward selection mechanism, so a split cannot be revisited once it is created.


Certain guidelines should be followed for creating decision trees:

  • Decision trees incorporate only one variable per split, so if no single variable separates the individuals on its own, the tree may not start well. Trees need attributes that provide some lift right away; try including multivariate features as candidates if the modeler is aware of them.
  • Decision trees are considered weak learners, or unstable models, because small changes in the data can produce significant changes in how the tree looks and behaves. Examining the competitors to the winning split can be very helpful in understanding how valuable the winning split is and whether other variables would do as well.


  • Decision trees are biased toward selecting categorical variables with large numbers of levels. If a categorical variable has a high number of levels, turn on cardinality penalties or recode the variable to have fewer levels.
  • Trees can run out of data before they find good models. Because each split reduces the number of remaining records, later splits are based on fewer and fewer records and hence have less statistical power.
  • Single trees are often not as accurate as other algorithms in predictive accuracy, because of forward variable selection and the piecewise-constant splitting of nodes.
