Overview:
Random forest regression is an ensemble learning method that constructs multiple decision trees and combines their outputs to improve predictive performance and control overfitting. OtasML provides a range of configuration options to fine-tune random forest models for better performance. Below is a detailed guide to these parameters.
Configurations page:
The Configurations page allows users to adjust various parameters of the random forest regressor model. Here are the details:
N estimators
- Default Value: 100
- Description: It is a parameter specifies the number of individual decision trees (also known as estimators) that will be used to form the random forest ensemble. Each decision tree in the ensemble is trained on a different random subset of the training data.
Criterion
- Default Value: squared_error
- Description: It is a parameter specifies the criterion used to measure the quality of a split when constructing individual decision trees within the Random Forest ensemble for regression tasks.
- Options:
squared_error
(ormse
): This criterion uses the Mean Squared Error (MSE) as a measure of the quality of a split. The MSE quantifies the average squared difference between the predicted values and the actual target values for the samples in a node.absolute_error
(ormae
): This criterion uses the Mean Absolute Error (MAE) as a measure of the quality of a split. The MAE quantifies the average absolute difference between the predicted values and the actual target values for the samples in a node.friedman_mse
: This criterion is an improvement of MSE, specifically designed for the Friedman's gradient boosting algorithm. It is not used as frequently as the other two criteria.poisson
: This criterion uses the Poisson deviance as a measure of the quality of a split. It is primarily used for regression problems where the target variable follows a Poisson distribution, such as count data.
- Options:
Max depth
- Default Value: None
- Description: It is a parameter determines the maximum depth of each individual decision tree in the random forest ensemble.
Min samples split
- Default Value: 2
- Description: Its a parameter specifies the minimum number of samples required to split an internal node of an individual decision tree within the Random Forest ensemble.
Min samples leaf
- Default Value: 1
- Description: It is a parameter specifies the minimum number of samples required to be in a leaf node of an individual decision tree within the Random Forest ensemble.
Min weight fraction leaf
- Default Value: 0.0
- Description: It is a parameter specifies the minimum weighted fraction of the sum total of weights (of all the input samples) required to be present in a leaf node of an individual decision tree within the Random Forest ensemble.
Max features
- Default Value: 1
- Description: It is a parameter that controls the maximum number of features that are considered when looking for the best split at each node during the construction of individual decision trees within the Random Forest ensemble.
Max leaf nodes
- Default Value: None
- Description: It is a parameter that controls the maximum number of leaf nodes (terminal nodes) that each individual decision tree in the Random Forest ensemble is allowed to have.
Min impurity decrease
- Default Value: 0.0
- Description: It is a parameter that controls the minimum amount of impurity decrease that must be achieved to split an internal node during the construction of individual decision trees within the Random Forest ensemble.
Bootstrap
- Default Value: True
- Description: It is a parameter that controls whether or not bootstrap samples are used when building individual decision trees within the Random Forest ensemble.
OOB score
- Default Value: False
- Description:
OOB score
stands forOut-of-Bag score.
It is a measure of a Random Forest model's performance on the samples that were not included in the bootstrap samples used for training individual decision trees. OOB score provides a way to estimate the model's performance without the need for a separate validation set.
N jobs
- Default Value: None
- Description: It is a parameter determines the number of CPU cores or threads to use when fitting (training) the Random Forest model in parallel. It allows you to speed up the training process by utilizing multiple cores or threads on your computer's CPU.
- Warning:
-1
means using all processors.
Random state
- Default Value: None
- Description: It is a parameter used to control the randomization applied during the training and fitting of the Random Forest model. It allows you to specify a seed for the random number generator, ensuring that the results are reproducible.
Verbose
- Default Value: 0
- Description: It is a parameter that controls the level of verbosity or the amount of information that is displayed during the training and fitting of the Random Forest model. It allows you to specify whether you want to see progress updates and information about the training process.
Warm start
- Default Value: False
- Description: It is a parameter is a boolean flag that allows you to control whether to reuse the existing solution (trees) from a previous call when calling it again. It can be useful when you want to incrementally train a Random Forest ensemble or when you want to add more trees to an existing model.
CCP alpha
- Default Value: 0.0
- Description: Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than
ccp_alpha
will be chosen. By default, no pruning is performed.
Max samples
- Default Value: None
- Description: It is a parameter, and its purpose is to determine the number of samples drawn from the training data to train each base estimator (individual decision tree) when bootstrap sampling is enabled.
- Warning: If
bootstrap is True
, the number of samples to draw from input to train each base estimator.
Test size
- Default Value: 0.2
- Description: The
test size
parameter is used when splitting the dataset into these subsets, and it specifies the portion of the data that will be used for testing.
Train size
- Default Value: 0.8
- Description: The
train size
parameter is used when splitting the dataset into these subsets, and it specifies the portion of the data that will be used for model training.
Conclusion
OtasML’s configuration options for random forest regressor models allow users to fine-tune their models to specific datasets and prediction tasks. Adjusting parameters such as the number of estimators, criterion, and maximum depth can significantly impact the model’s performance and behavior. Use this guide to fine-tune your random forest models and achieve superior predictive accuracy with OtasML.