Tree Pruning and Avoiding Overfitting

The major tasks in data preprocessing are:

  1. Data cleaning: routines that “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
  2. Data integration: combining data from multiple sources for the analysis, which involves integrating multiple databases, data cubes, or files.
  3. Data reduction: obtaining a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

 

Decision Tree - As part of data cleaning, missing values can be filled in using a decision tree as an inference-based tool: the tree predicts the most probable value of the missing attribute from the tuple's other attributes.

Tree Pruning - Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?

Answer: A decision tree, once built, may overfit the training data. It can have too many branches, some of which reflect anomalies in the training data caused by noise or outliers. Tree pruning addresses overfitting by removing the least reliable branches, identified using statistical measures. This generally results in a more compact and reliable decision tree that classifies data faster and more accurately, particularly data it has not seen before.

The drawback of using a separate set of tuples to evaluate pruning is that the set may not be representative of the training tuples used to create the original decision tree. If the separate set of tuples is skewed, then using it to evaluate the pruned tree is not a good indicator of the pruned tree's classification accuracy. Furthermore, setting aside a separate set of tuples for pruning leaves fewer tuples for creating and testing the tree. While this is considered a drawback in machine learning, it may be less of one in data mining, where larger datasets are usually available.
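To make the idea concrete, here is a minimal sketch of one pruning strategy, reduced-error pruning against a separate pruning set. The tree representation (nested dicts with hypothetical "attr", "branches" and "majority" keys), the function names, and the toy weather tuples are all assumptions made for illustration; they do not come from the reference text, which discusses other pruning methods as well.

```python
# A node is either a class label (a leaf, stored as a str) or a dict:
#   {"attr": attribute name, "branches": {value: subtree}, "majority": label}
# "majority" is the most common training class at that node (assumed known).

def classify(node, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(node, dict):
        # Unseen attribute values fall back to the node's majority class.
        node = node["branches"].get(row[node["attr"]], node["majority"])
    return node

def accuracy(tree, rows):
    return sum(classify(tree, r) == r["label"] for r in rows) / len(rows)

def prune(node, rows):
    """Bottom-up reduced-error pruning using the pruning tuples in `rows`."""
    if not isinstance(node, dict) or not rows:
        return node  # a leaf, or no pruning tuples reach this node
    # Prune each child on the pruning tuples that are routed to it.
    for value, child in list(node["branches"].items()):
        subset = [r for r in rows if r[node["attr"]] == value]
        node["branches"][value] = prune(child, subset)
    # Collapse the subtree into a leaf if the leaf is at least as accurate.
    leaf = node["majority"]
    leaf_correct = sum(r["label"] == leaf for r in rows)
    subtree_correct = sum(classify(node, r) == r["label"] for r in rows)
    return leaf if leaf_correct >= subtree_correct else node

# Toy tree in which the "humidity" split under "rain" was learned from noise.
tree = {
    "attr": "outlook", "majority": "yes",
    "branches": {
        "sunny": "no",
        "rain": {"attr": "humidity", "majority": "yes",
                 "branches": {"high": "no", "normal": "yes"}},
    },
}

# Separate set of tuples, held out from training and used only for pruning.
pruning_set = [
    {"outlook": "sunny", "humidity": "high", "label": "no"},
    {"outlook": "sunny", "humidity": "normal", "label": "no"},
    {"outlook": "rain", "humidity": "high", "label": "yes"},
    {"outlook": "rain", "humidity": "high", "label": "yes"},
    {"outlook": "rain", "humidity": "normal", "label": "yes"},
]

before = accuracy(tree, pruning_set)
pruned = prune(tree, pruning_set)   # note: modifies `tree` in place
after = accuracy(pruned, pruning_set)
```

In this toy case the noisy "humidity" subtree is replaced by the leaf "yes", while the root split is kept. The accuracy measured afterwards is computed on the same tuples that guided the pruning, so it is optimistic, and if those tuples are skewed the pruning decisions inherit that skew. This is the drawback described above.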

 

Avoiding Overfitting

One problem that often arises when performing a UV-decomposition is that we arrive at one of the many local minima that conform well to the given data but pick up values in the data that do not reflect the underlying process that gives rise to the data. That is, although the RMSE may be small on the given data, the decomposition does not do well at predicting future data. Statisticians call this problem overfitting, and there are several things that can be done to cope with it:

  1. Avoid favoring the first components to be optimized by only moving the value of a component a fraction of the way, say halfway, from its current value toward its optimized value.
  2. Stop revisiting elements of U and V well before the process has converged.
  3. Take several different UV decompositions, and when predicting a new entry in the matrix M, take the average of the results of using each decomposition.
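The three measures above can be sketched together in plain Python. This is an illustrative sketch, not a reference implementation: the function names, the parameters `d`, `sweeps`, `step` and `seed`, the random initialization range, and the small matrix (with `None` marking blank entries) are all assumptions. Each element of U or V is moved only a fraction `step` of the way toward the value that would minimize the squared error with everything else held fixed.

```python
import random

def predict(U, V, i, j):
    """Entry (i, j) of the product UV."""
    return sum(U[i][k] * V[k][j] for k in range(len(V)))

def rmse(M, U, V):
    """Root-mean-square error over the nonblank entries of M."""
    errs = [(m - predict(U, V, i, j)) ** 2
            for i, row in enumerate(M)
            for j, m in enumerate(row) if m is not None]
    return (sum(errs) / len(errs)) ** 0.5

def uv_decompose(M, d=2, sweeps=5, step=0.5, seed=0):
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    U = [[rng.uniform(0.5, 1.5) for _ in range(d)] for _ in range(n)]
    V = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(d)]
    for _ in range(sweeps):  # measure 2: stop after a few sweeps, before convergence
        for r in range(n):   # visit every element of U
            for s in range(d):
                num = den = 0.0
                for j in range(m):
                    if M[r][j] is None:
                        continue
                    rest = predict(U, V, r, j) - U[r][s] * V[s][j]
                    num += V[s][j] * (M[r][j] - rest)
                    den += V[s][j] ** 2
                if den > 0:
                    # measure 1: move only part of the way to the optimum num/den
                    U[r][s] += step * (num / den - U[r][s])
        for s in range(d):   # visit every element of V
            for c in range(m):
                num = den = 0.0
                for i in range(n):
                    if M[i][c] is None:
                        continue
                    rest = predict(U, V, i, c) - U[i][s] * V[s][c]
                    num += U[i][s] * (M[i][c] - rest)
                    den += U[i][s] ** 2
                if den > 0:
                    V[s][c] += step * (num / den - V[s][c])
    return U, V

def ensemble_predict(models, i, j):
    """Measure 3: average the predictions of several decompositions."""
    return sum(predict(U, V, i, j) for U, V in models) / len(models)

# Made-up 5x5 matrix; None marks a blank entry to be predicted.
M = [[5, 2, 4, 4, 3],
     [3, 1, 2, 4, 1],
     [2, None, 3, 1, 4],
     [2, 5, 4, 3, 5],
     [4, 4, 5, 4, None]]

models = [uv_decompose(M, seed=s) for s in range(3)]
estimate = ensemble_predict(models, 2, 1)  # averaged prediction for a blank entry
```

Different seeds give different starting points and therefore, in general, different local minima, which is why averaging their predictions can smooth out the quirks of any single decomposition. Suitable values for `sweeps` and `step` would normally be chosen by checking the RMSE on held-out entries rather than fixed in advance as they are here.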

Glossary

RMSE: Root-mean-square error, the square root of the average squared difference between the predicted and the actual values.

UV-decomposition: The basic single-matrix factorization model can be written X ≈ f(UV^T); the choices to be made include the prediction link f, the definition of ≈, and the constraints placed on the factors U and V. Different combinations of these choices yield different matrix factorization models. In the plain UV-decomposition discussed above, f is the identity and the matrix M is approximated directly by the product of U and V.

References:

Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.
