A decision tree or why are we talking about horticulture in data science?


A tree is usually a large perennial deciduous or evergreen woody plant. Large areas of trees are called forests. Trees are common on all continents except Antarctica. Although there is no set minimum height, trees are usually classified as plants higher than 6 m with secondary branches branching from the trunk. However, bonsai are also considered trees, although they are less than a meter high.

But today is not about that...

Decision tree learning is one of the most commonly used methods in data mining. The purpose of the method is to create a model that predicts the value of a dependent variable based on several independent variables. Decision trees have the following properties:

  • The result is easy to visualize
  • The result is easy to interpret and can be used even when the variables are strongly interdependent and carry duplicate information
  • Tends to overfit, and it is difficult to choose the optimal pruning

Each leaf represents a value of the dependent variable given the independent variables that appear along the path from the root of the tree to that leaf. Decision trees can be used to solve both classification and regression problems. Consider a small decision tree that predicts whether a buyer will purchase a computer: the data are first divided into three groups according to the person's age, and one of these groups already forms a "pure" class. The other two groups are then subdivided further until only one class of data remains in each leaf (a minimal sketch of this example follows below). Continuous variables also work well in decision trees. To use them, a split point must be chosen, according to which the values of the variable are divided (usually) into two parts.
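To make the computer example concrete, here is a minimal sketch using scikit-learn; the feature values, encodings, and labels below are made up purely for illustration.

```python
# A minimal sketch of the "will the buyer purchase a computer?" idea,
# assuming scikit-learn is installed; the data below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, income_level] -> buys_computer,
# where income_level is encoded as 0 = low, 1 = medium, 2 = high.
X = [[25, 2], [34, 1], [45, 0], [52, 2], [23, 0], [40, 1], [60, 2], [30, 0]]
y = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = buys, 0 = does not buy

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Predict for a new, unseen person (age 28, medium income).
print(clf.predict([[28, 1]]))
```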

Various methods can be used to choose the split point, from the simplest, where the median is used (the sample is then divided into two equal parts), to more complex ones, where several candidate splits are evaluated and the optimal one is chosen according to some criterion.
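As a rough illustration of the more complex option, here is a small sketch (plain Python/NumPy, hypothetical data) that scans candidate split points for a continuous variable and keeps the one with the lowest weighted Gini impurity.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_point(values, labels):
    # Try the midpoints between consecutive sorted values and keep the one
    # that gives the lowest weighted Gini impurity of the two parts.
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2
        left, right = labels[values <= t], labels[values > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical data: ages and whether the person bought a computer.
ages = [23, 25, 30, 34, 40, 45, 52, 60]
buys = [0, 1, 0, 1, 1, 0, 1, 1]
print(best_split_point(ages, buys))
```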

Combining values. Some decision tree algorithms always split the sample into exactly two parts (binary trees). In this case, if a variable is discrete and takes more than two values, its values are grouped into two groups, and the sample is divided according to that grouping.
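For the discrete case, one brute-force possibility is to enumerate the ways the category values can be put into two groups and score each grouping with the same weighted Gini idea; the sketch below does that for a hypothetical income variable.

```python
from itertools import combinations
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_grouping(categories, labels):
    # Enumerate groupings of the distinct category values into two non-empty
    # groups (complements are scored twice, which is redundant but harmless)
    # and keep the grouping with the lowest weighted Gini impurity.
    categories, labels = np.asarray(categories), np.asarray(labels)
    values = sorted(set(categories))
    best_group, best_score = None, np.inf
    for k in range(1, len(values)):
        for group in combinations(values, k):
            mask = np.isin(categories, group)
            left, right = labels[mask], labels[~mask]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_group, best_score = set(group), score
    return best_group

# Hypothetical data: income level and whether the person bought a computer.
income = ["low", "medium", "high", "high", "low", "medium", "high", "low"]
buys = [0, 1, 1, 1, 0, 1, 1, 0]
print(best_binary_grouping(income, buys))
```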

Decision trees, like real trees, are pruned, because decision trees tend to fit the training sample data too well (to overfit). To prevent this, the branches close to the leaves of the tree are removed. Pruning methods are divided into two types (a small sketch of both follows after the list):

  • Pre-pruning - in this case, the tree is not grown until the usual stopping condition is reached; instead, some criterion is used to stop the splitting into branches earlier;
  • Post-pruning - the tree is grown fully, and then some algorithm (criterion) is used to prune the branches back.
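One way to express both styles in scikit-learn (my assumption of tooling, not the only option) is to limit tree growth up front for pre-pruning and to use cost-complexity pruning for post-pruning; the iris dataset is used here only so the sketch runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing branches early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree first, then use the cost-complexity
# pruning path to pick an alpha and cut branches back.
path = DecisionTreeClassifier(random_state=0).fit(X, y).cost_complexity_pruning_path(X, y)
# Larger ccp_alpha values prune more aggressively; a mid-range candidate
# is taken here purely for illustration.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```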

Several different algorithms have been developed for training trees: ID3, C4.5, CART, CHAID, and so on. However, more about these algorithms in detail a little later, so stay tuned!

In practice, however, it is often not individual trees that are used but a derivative of them: random forests. These methods are classified as "ensemble methods".

Not one tree but many decision trees (e.g. 100 or 1000) are built from a single training sample. To keep the trees from being identical, randomization is used during their construction: each tree is built from a random part of the sample rather than the whole, and the variable for splitting the sample into branches is selected not from all variables but from a random subset of them.

With a decision tree forest, each tree classifies a point on its own, and the final class is selected ("by voting") according to the majority principle. At the same time, this makes it possible to estimate the probability with which a point is assigned to a particular class, as the share of trees that voted for it.
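A minimal random forest sketch along these lines, again assuming scikit-learn and using the iris dataset only so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of variables considered at each split
    bootstrap=True,        # each tree sees a random (bootstrap) part of the sample
    random_state=0,
).fit(X, y)

print(forest.predict(X[:1]))        # class chosen by majority vote
print(forest.predict_proba(X[:1]))  # share of trees voting for each class
```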

Properly randomized, random forests are not prone to overfitting, even though they are built from individual trees that are.

Yes, that is the theory, but how do we implement it in practice?

There is no single way to put this into practice, because almost all statistical analysis software has decision tree functionality. Below is one way to implement a decision tree; it is just an example, as there are many different libraries/packages that solve this problem.
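For instance, a short sketch in Python with scikit-learn (other ecosystems, such as R, have equivalent packages) that fits a small tree and prints it as readable rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the fitted tree as readable if/else rules.
print(export_text(tree, feature_names=load_iris().feature_names))
```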


