Overview
Use the Two-Class Decision Forest module to create a machine learning model based on the random decision forests algorithm. Decision forests are fast, supervised ensemble models. This module can be used to predict a target that has two values.
If you are not sure of the best parameters, we recommend that you use the Tune Model Hyperparameters module to train and test multiple models and find the optimal parameters.
How to Configure a Two-Class Decision Forest
Step 1
Add the Two-Class Decision Forest module to your experiment.
Step 2
For the Resampling method, choose how the individual trees are created. You can choose between Bagging and Replicate.
1. Bagging: Bagging is also called bootstrap aggregating. In this method, each tree is grown on a new sample, created by randomly sampling the original dataset with replacement until you have a dataset the size of the original.
The outputs of the models are combined by voting, which is a form of aggregation. Each tree in a classification decision forest outputs an un-normalized frequency histogram of labels. The aggregation sums these histograms and normalizes the result to get the "probability" for each label. In this manner, trees that have high prediction confidence carry greater weight in the final decision of the ensemble. (A minimal sketch of this procedure follows this list.)
2. Replicate: In replication, each tree is trained on exactly the same input data. The determination of which split predicate is used for each tree node remains random, so the trees are still diverse.
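To make the bagging-and-voting mechanics concrete, here is a minimal Python sketch. It is not the module's actual implementation: scikit-learn's DecisionTreeClassifier stands in for the individual trees, the inputs are assumed to be NumPy arrays, and each bootstrap sample is assumed to contain both classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict_proba(X_train, y_train, X_test, n_trees=8, seed=0):
    """Minimal bagging sketch: grow each tree on a bootstrap sample, then
    sum the per-tree label histograms and normalize into probabilities."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    summed = np.zeros((len(X_test), 2))  # two-class histogram per test row
    for _ in range(n_trees):
        # Randomly sample the original dataset with replacement until the
        # sample is the size of the original (a bootstrap sample).
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(2**31)))
        tree.fit(X_train[idx], y_train[idx])
        # Each tree votes with a histogram over the two labels (this assumes
        # both classes appear in every bootstrap sample).
        summed += tree.predict_proba(X_test)
    # Normalizing the summed histograms yields the ensemble "probabilities";
    # trees that predict with high confidence pull the result harder than
    # trees whose histograms are close to 50/50.
    return summed / summed.sum(axis=1, keepdims=True)
```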
Step 3
Specify how you want the model to be trained, by setting the Create trainer mode option.
· Single Parameter. If you know how you want to configure the model, you can provide a specific set of values as arguments.
· Parameter Range. If you are not sure of the best parameters, specify multiple values and use the Tune Model Hyperparameters module to find the optimal configuration. The trainer iterates over the combinations of the settings you provide and determines the combination of values that produces the best model (see the sketch after this list).
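Outside of Studio, the Single Parameter versus Parameter Range distinction maps naturally onto fitting one estimator versus running a grid search. A hedged sketch using scikit-learn's RandomForestClassifier and GridSearchCV as stand-ins for the Studio modules, not their actual implementation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Single Parameter: one specific configuration, provided as arguments.
single = RandomForestClassifier(n_estimators=8, max_depth=32)

# Parameter Range: iterate over combinations of the values you provide and
# keep the combination that produces the best (cross-validated) model.
param_grid = {
    "n_estimators": [8, 32, 128],
    "max_depth": [16, 32, 64],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
# search.fit(X, y); search.best_params_ then holds the winning combination.
```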
Step 4
For the Number of decision trees, type the maximum number of decision trees that can be created in the ensemble. By creating more decision trees, you can potentially get better coverage, but training time will increase.
Step 5
For the Maximum depth of the decision trees, type a number to limit the maximum depth of any decision tree. Increasing the depth of the tree might increase precision, at the risk of some overfitting and increased training time.
Step 6
For the Number of random splits per node, type the number of splits to use when building each node of the tree. A split means that features in each level of the tree (node) are randomly divided.
Step 7
For the Minimum number of samples per leaf node, indicate the minimum number of cases required to create any terminal node (leaf) in a tree.
By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data would have to contain at least 5 cases that meet the same conditions.
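The settings in Steps 4 through 7 have close analogues in most decision-forest libraries. The sketch below maps them onto scikit-learn's RandomForestClassifier; this is an analogy, not the Studio module itself, and max_features only loosely corresponds to the random-splits setting (scikit-learn limits the features considered per split rather than counting candidate splits).

```python
from sklearn.ensemble import RandomForestClassifier

# Rough mapping of Steps 4-7 onto scikit-learn (an analogy, not the module):
model = RandomForestClassifier(
    n_estimators=8,       # Step 4: number of decision trees in the ensemble
    max_depth=32,         # Step 5: maximum depth of any tree
    max_features="sqrt",  # Step 6 (loosely): randomness of per-node splits
    min_samples_leaf=1,   # Step 7: minimum cases required to form a leaf
)
```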
Step 8
Select the Allow unknown values for categorical features option to create a group for unknown values in the training or validation sets.
If you select it, the model might be less precise for known values, but it can provide better predictions for new (unknown) values. If you deselect it, the model can accept only the values that are contained in the training data.
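A minimal illustration of what allowing unknown values amounts to: reserve an explicit bucket for category levels never seen during training, so scoring does not fail on them. This is a sketch of the idea, not the module's internal encoding; the function and level names are hypothetical.

```python
def encode_with_unknown(value, known_levels):
    """Map a categorical value to an integer code; anything not seen during
    training falls into a shared 'unknown' bucket instead of failing."""
    UNKNOWN = len(known_levels)  # one extra code reserved for unknown values
    return known_levels.get(value, UNKNOWN)

# Levels observed in the training data:
known_levels = {"red": 0, "green": 1, "blue": 2}
print(encode_with_unknown("green", known_levels))   # 1
print(encode_with_unknown("violet", known_levels))  # 3 -> the unknown bucket
```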
Step 9
Train the model.
If you set Create trainer mode to Single Parameter, connect a tagged dataset and the Train Model module.
If you set Create trainer mode to Parameter Range, connect a tagged dataset and train the model by using Tune Model Hyperparameters.
Step 10
When the model is trained, right-click the output of the Train Model module (or Tune Model Hyperparameters module) and select Visualize to see the tree that was created on each iteration.
Experiment with an Example
1. Dataset - Mercedes-train.csv
2. Model
3. Train Model
4. Evaluate Model
The Mercedes-train dataset is an idealized dataset, and the Two-Class Decision Forest algorithm classifies its two classes with an accuracy of 100%.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 1.000
Type 1 Error = FP / (FP + TN) = 0
Type 2 Error = FN / (FN + TP) = 0
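For reference, all three metrics fall out directly from the confusion-matrix counts. A short sketch; the TP/TN/FP/FN values below are hypothetical placeholders illustrating a perfect classifier, not the actual counts from the Mercedes-train run.

```python
def summarize(tp, tn, fp, fn):
    """Compute accuracy, Type 1 error, and Type 2 error from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    type1 = fp / (fp + tn)  # false positive rate
    type2 = fn / (fn + tp)  # false negative rate
    return accuracy, type1, type2

# Hypothetical counts for a perfect classifier (FP = FN = 0):
print(summarize(tp=50, tn=50, fp=0, fn=0))  # (1.0, 0.0, 0.0)
```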