Decision Tree in Machine Learning
Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
Some advantages of decision trees are:
Simple to understand and to interpret. Trees can be visualized, as the sketch after this list shows.
Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables, and the removal of blank values. Note that some tree and algorithm combinations support missing values.
The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
Able to handle both numerical and categorical data. However, the scikit-learn implementation does not currently support categorical variables. Other techniques are usually specialized in analyzing datasets that have only one type of variable.
Able to handle multi-output problems.
Uses a white box model. If a given situation is observable in a model, the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results may be more difficult to interpret.
Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
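The sketch below, assuming scikit-learn is installed and using its bundled Iris dataset purely for illustration, fits a small classification tree, predicts a few samples, and prints the learned rules, showing the white-box, easily visualized character described above.

# A minimal sketch, assuming scikit-learn is installed; the Iris dataset
# is an illustrative choice.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# max_depth is capped only so the printed tree stays short and readable
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Prediction walks a single root-to-leaf path, so its cost grows roughly
# logarithmically with the size of a balanced tree
print(clf.predict(X[:5]))

# export_text renders the tree as nested if/else rules (boolean logic)
print(export_text(clf, feature_names=iris.feature_names))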
Decision tree learning employs a divide and conquer strategy by conducting a greedy search to identify the optimal split points within a tree.
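As a concrete illustration of that greedy search, the following sketch (a simplified, hypothetical implementation, not how any particular library works internally) scores every candidate threshold on a single numeric feature by weighted Gini impurity and keeps the best one; production algorithms such as CART repeat this over all features at every node.

# Hypothetical helper functions for illustration only
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def best_split(x: np.ndarray, y: np.ndarray):
    """Return (threshold, weighted_gini) of the best binary split on x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, float("inf"))
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no usable threshold between equal values
        threshold = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        # Impurity of each side, weighted by how many samples land there
        score = (left.size * gini(left) + right.size * gini(right)) / y.size
        if score < best[1]:
            best = (threshold, score)
    return best

x = np.array([2.0, 3.0, 10.0, 12.0, 13.0])
y = np.array([0, 0, 1, 1, 1])
print(best_split(x, y))  # -> (6.5, 0.0): a perfect split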
Decision Tree Terminology
Specialized terms describe the components of a decision tree's structure and its decision-making process:
Root Node: The topmost node of the tree, representing the initial feature or decision from which all branches originate.
Internal Nodes (Decision Nodes): Nodes that test the value of a particular attribute; each has branches leading to further nodes.
Leaf Nodes (Terminal Nodes): Nodes at the ends of branches where the final decision or prediction is made. Leaf nodes have no further branches.
Branches (Edges): Links between nodes that represent the possible outcomes of the decision made at a node.
Splitting: The process of dividing a node into two or more sub-nodes based on a decision criterion. It involves selecting a feature and a threshold to create subsets of data.
Parent Node: The node from which a split originates; it is divided into child nodes.
Child Node: A node created as the result of a split from a parent node.
Decision Criterion: The rule or condition used to determine how the data should be split at a decision node, typically by comparing a feature value against a threshold.
Pruning: The process of removing branches or nodes from a decision tree to improve its generalization and prevent overfitting; a sketch of one pruning approach follows this list.
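One concrete way to prune is scikit-learn's minimal cost-complexity pruning, exposed through the ccp_alpha parameter and the cost_complexity_pruning_path helper. The sketch below assumes scikit-learn is installed; the dataset and the sampling of alphas are illustrative choices. Larger alphas remove more branches, trading training accuracy for better generalization.

# A minimal pruning sketch using scikit-learn's cost-complexity pruning
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Enumerate the alpha values at which subtrees collapse
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

for alpha in path.ccp_alphas[::10]:  # sample a few alphas for brevity
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    # Fewer leaves means a more aggressively pruned, simpler tree
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test acc={pruned.score(X_test, y_test):.3f}")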