Random Forest in Semi-Supervised Learning (Co-Forest)

To put it bluntly, Random Forest builds many decision trees at random, forming what we can picture as a forest, and every tree in that forest contributes to the final result.

To understand the Random Forest algorithm, we first need to look at ensemble methods, of which it is a prominent example.

These methods are built on the same foundations as more elementary algorithms, such as linear regression, decision trees, or KNN. Their distinguishing feature, however, is the ability to combine several models to produce a single result.

This characteristic makes ensemble algorithms more robust and more complex, sometimes at the cost of a higher computational demand, which is usually compensated by more refined results.

When building a conventional model, we typically pick the algorithm that performs best on the data in question. We can experiment with different configurations of that algorithm and generate many candidate models, but at the end of the machine learning process only one of them is kept.

Ensemble methods, on the other hand, transcend this limitation, generating multiple models from a single algorithm. Instead of choosing just one for the final implementation, all of them are retained.

This orchestrated approach produces one result per model: if, for example, 100 models are created, 100 results are obtained, which are then combined into a single answer.

In regression tasks, the average of the predicted values can be used as the final result. In classification problems, the most frequent prediction (the majority vote) is chosen.
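As a minimal sketch of this aggregation step (the prediction arrays below are made-up values, and only NumPy plus the standard library are assumed):

```python
import numpy as np
from collections import Counter

# Made-up predictions from 5 regression models for 4 samples.
reg_preds = np.array([
    [2.1, 3.0, 4.8, 1.2],
    [2.3, 2.9, 5.1, 1.0],
    [1.9, 3.2, 4.9, 1.1],
    [2.2, 3.1, 5.0, 1.3],
    [2.0, 3.0, 4.7, 0.9],
])
final_regression = reg_preds.mean(axis=0)  # average per sample

# Made-up predictions from 3 classification models for 4 samples.
clf_preds = np.array([
    ["cat", "dog", "cat", "dog"],
    ["cat", "cat", "cat", "dog"],
    ["dog", "dog", "cat", "dog"],
])
# Majority vote per sample (each column holds one sample's votes).
final_classification = [Counter(col).most_common(1)[0][0]
                        for col in clf_preds.T]
print(final_regression, final_classification)
```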

In some cases, the output of one model feeds into the construction of the next, creating an interdependence between the models; the final outcome is still a single result, forged from several intermediate ones. This is the idea behind boosting.
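A minimal illustration of this sequential idea, using scikit-learn's GradientBoostingRegressor on synthetic data (the dataset and hyperparameters here are arbitrary choices for the sketch):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (arbitrary sizes for the sketch).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# In boosting, each new tree is fitted to the errors of the ensemble
# built so far, so every model depends on the previous ones.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))  # a single final prediction per sample
```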

Many ensemble methods are based on decision trees, so understanding that paradigm is invaluable before diving into these methods. In fact, readers already familiar with decision trees will pick up ensemble methods with ease and speed.

Figure: Semi-Supervised Random Forest Methodology for Fault Diagnosis in Air-Handling Units (Scientific Figure on ResearchGate).

Decision trees establish decision-making rules in a flowchart-like structure of "nodes" where conditions are checked. If a condition is met, the flow proceeds along one branch; otherwise, it follows another, continuing until a leaf of the tree is reached.
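As a small illustration, a shallow decision tree fitted with scikit-learn can print its learned flowchart (the iris dataset and max_depth=2 are arbitrary choices for this sketch):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Fit a single, shallow decision tree.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the learned flowchart: each node checks a condition
# and the flow follows one branch or the other until a leaf is reached.
print(export_text(tree, feature_names=load_iris().feature_names))
```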

In contrast to the creation of a single decision tree, the Random Forest algorithm starts by randomly selecting samples from the training data rather than using the entire dataset. The bootstrap resampling method is employed at this stage, allowing for the repetition of samples in the selection. The first decision tree is constructed based on these selected samples.
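A minimal sketch of bootstrap resampling with NumPy (the toy size of 10 samples is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 10  # size of the (toy) training set
indices = np.arange(n_samples)

# Bootstrap: draw n_samples indices *with replacement*, so the same
# sample can appear more than once and some samples are left out.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)
print(bootstrap_idx)             # repetitions are allowed
print(np.unique(bootstrap_idx))  # on average ~63% of the original samples
```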

To choose the variable for each node, the algorithm randomly draws a subset of two or more variables and uses a criterion such as entropy or the Gini index to determine the best one, starting with the root node. At every subsequent node, a fresh random subset of variables is drawn and evaluated in the same way.
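A rough sketch of this node-level selection using the Gini index, assuming synthetic data and a single median-based candidate split per feature (a real implementation would evaluate many thresholds):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(20, 6))        # 20 samples, 6 features (synthetic)
y = rng.integers(0, 2, size=20)     # binary labels

# At each node, only a random subset of the features is considered.
candidates = rng.choice(X.shape[1], size=2, replace=False)

best = None
for f in candidates:
    threshold = np.median(X[:, f])  # one simple candidate split per feature
    left, right = y[X[:, f] <= threshold], y[X[:, f] > threshold]
    # Weighted Gini impurity of the resulting split (lower is better).
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    if best is None or score < best[0]:
        best = (score, f, threshold)

print(f"best split: feature {best[1]} at {best[2]:.3f} (impurity {best[0]:.3f})")
```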

When building the next tree, the aforementioned processes are repeated, resulting in a new tree that is likely different from the first due to the randomness in sample and variable selection. Multiple trees can be created, improving the model's results until a point where additional trees do not yield significant performance gains.

It is crucial to note that more trees imply a longer training time. To predict new values, the model runs the data through each of the created trees and collects their individual results. In regression problems, the average of the predicted values is taken as the final result; in classification problems, the most frequent prediction is chosen.
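Putting the whole pipeline together, a minimal usage sketch with scikit-learn's RandomForestClassifier (the iris dataset and n_estimators=100 are arbitrary choices for this example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators controls the number of trees: more trees cost more time,
# and past a certain point they add little accuracy.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# predict() and score() aggregate the individual trees' votes
# into the final class for each sample.
print(forest.score(X_test, y_test))
```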

