Tuning the Hyperparameters of the Random Forest Model

In this section, we will explore tuning the hyperparameters of the Random Forest model. Here are the parameters that we will tune in this section.

Note

Note that we are using a random forest model from the sklearn package.

  • criterion: The random forest model in the sklearn package offers several measures of split quality; we compare two of them, “gini” and “entropy”, to see which is a better fit for our data.

  • n_estimators: We usually don’t need to tune this variable since more estimators are generally better for a random forest. However, in the original project, we saw signs of slight over-fitting as the number of estimators grew, so we include it here.

  • max_depth: We want to tune this variable since the depth of the tree can be closely related to over-fitting.

  • max_features: The rule of thumb for max_features is the square root of the number of features, but we still tune this variable since a different value may perform better on our data.

  • class_weight: This is very important because our dataset has highly imbalanced numbers of fraudulent and non-fraudulent postings. We need to tune the weight of each class so that we balance our final result.
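As a rough illustration of the max_features rule of thumb, the sketch below computes the square-root baseline and a few candidate values around it. The feature count of 400 and the step size of 2 are hypothetical, chosen only for illustration; the real feature matrix may differ.

```python
import math


def sqrt_rule(n_features: int) -> int:
    """Rule-of-thumb max_features: floor of the square root, at least 1."""
    return max(1, math.isqrt(n_features))


# Hypothetical feature count, used only for illustration.
n_features = 400
baseline = sqrt_rule(n_features)

# Candidate values centred on the baseline, in steps of 2.
candidates = [baseline + step for step in range(-4, 5, 2)]
print(baseline, candidates)
```

A grid like this gives the search a starting region instead of a blind guess.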

Note

When we adjust class_weight, we need to focus on how well the model detects fraudulent postings (recall), not on overall accuracy. Since we have many more non-fraudulent postings (95%) than fraudulent postings (5%), even a null classifier that always predicts a posting as non-fraudulent will get 95% accuracy. Because detecting fraudulent postings is the primary focus of the project, we will tune class_weight using the AUC of the ROC curve.
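To make the point about the null classifier concrete, here is a minimal sketch in plain Python. The 1,000 labels are made up to match the 95% / 5% split described above; they are not the project's data.

```python
# Hypothetical labels: 0 = non-fraudulent, 1 = fraudulent (95% / 5% split).
y_true = [0] * 950 + [1] * 50

# Null classifier: always predicts "non-fraudulent".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of fraudulent postings that were actually flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy, recall)  # 0.95 0.0
```

The classifier scores 95% accuracy while catching none of the fraudulent postings, which is why accuracy alone is a poor guide here.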

Procedure:

  1. n_estimators, max_depth, max_features, and criterion will be tuned using the Hyperopt package, which uses Bayesian optimization (here, the Tree-structured Parzen Estimator, tpe.suggest) to search the hyperparameter space.

  2. class_weight will be tuned using cross-validation with ROC_AUC as a score.

  3. Since the fraudulent class should have more weight than the non-fraudulent class, we will choose among {0: 1, 1: 1.3}, {0: 1, 1: 1.6}, {0: 1, 1: 1.9}, {0: 1, 1: 2.2}, and {0: 1, 1: 2.5} for class_weight. In other words, we run five cross-validations in total.

  4. For each candidate class_weight, we first tune n_estimators, max_depth, max_features, and criterion with Hyperopt, using cross-validated accuracy as the objective. Hyperopt therefore runs five times as well.
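The procedure above repeats the same tuning step once per class_weight. One way to avoid writing five nearly identical objective functions is a small factory, sketched below. The names make_objective and score_fn are hypothetical, the dummy scorer returns a made-up number, and the "ok" string is assumed to match Hyperopt's STATUS_OK constant.

```python
from typing import Any, Callable, Dict

STATUS_OK = "ok"  # assumed to match hyperopt.STATUS_OK

Params = Dict[str, Any]
ClassWeight = Dict[int, float]


def make_objective(
    class_weight: ClassWeight,
    score_fn: Callable[[Params, ClassWeight], float],
) -> Callable[[Params], Dict[str, Any]]:
    """Build an objective for one class_weight; the score is negated into a loss."""

    def objective(params: Params) -> Dict[str, Any]:
        score = score_fn(params, class_weight)
        return {"loss": -score, "status": STATUS_OK}

    return objective


# The five candidate class weights listed in the procedure.
class_weights = [{0: 1, 1: w} for w in (1.3, 1.6, 1.9, 2.2, 2.5)]


def dummy_score(params: Params, class_weight: ClassWeight) -> float:
    # Stand-in scorer with a made-up value, so the sketch runs without sklearn.
    return 0.5


objectives = [make_objective(cw, dummy_score) for cw in class_weights]
print(objectives[0]({"n_estimators": 200}))
```

In the notebook, score_fn would wrap the mean cross_val_score of a RandomForestClassifier built from params and class_weight, and each objective would be passed to fmin as fn.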

import numpy as np 
import pandas as pd 
from sklearn.ensemble import RandomForestClassifier 
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler 
from hyperopt import tpe, hp, fmin, STATUS_OK, Trials
from hyperopt.pyll.base import scope
from sklearn.utils import shuffle
text_features = pd.read_csv("./data/final_feature_matrix.csv", index_col = 0)
X = text_features.drop("fraudulent", axis = 1).values
y = text_features.fraudulent.values

Tuning The Random Forest Model with class_weight = {0: 1, 1: 1.3}

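space = {
    # hp.choice samples one option; fmin reports the option's index, not its value.
    "n_estimators": hp.choice("n_estimators", [100, 200, 300]),
    # hp.uniform samples a float. Recent scikit-learn releases may reject a
    # non-integer max_depth; scope.int(hp.quniform("max_depth", 30, 50, 1))
    # is an integer-valued alternative.
    "max_depth": hp.uniform("max_depth", 30, 50),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.choice("max_features", [10, 12, 14, 16, 18, 20, 22])
}
def hyperparameter_tuning_first(params):
    # Hyperopt minimizes the loss, so return the negative mean CV accuracy.
    clf = RandomForestClassifier(**params, class_weight = {0:1, 1:1.3})
    acc = cross_val_score(clf, X, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}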
trials = Trials()

best = fmin(
    fn=hyperparameter_tuning_first,
    space = space, 
    algo=tpe.suggest, 
    max_evals=100, 
    trials=trials
)

print("Best: {}".format(best))
100%|███████████████████████████████████████████| 100/100 [2:19:23<00:00, 83.64s/trial, best loss: -0.9792366598797917]
Best: {'criterion': 0, 'max_depth': 34.6262742855114, 'max_features': 4, 'n_estimators': 1}
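The values printed for criterion, max_features, and n_estimators are positions in the hp.choice lists, not the hyperparameter values themselves. The sketch below decodes the result above in plain Python; decode_best is a hypothetical helper written for this illustration. Hyperopt also ships a space_eval(space, best) function that should perform the same translation directly on the search space.

```python
from typing import Any, Dict, List


def decode_best(best: Dict[str, Any], choices: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Replace hp.choice indices with the options they point to; keep other values."""
    return {
        name: choices[name][value] if name in choices else value
        for name, value in best.items()
    }


# Option lists copied from the search space defined above.
choices = {
    "n_estimators": [100, 200, 300],
    "criterion": ["gini", "entropy"],
    "max_features": [10, 12, 14, 16, 18, 20, 22],
}

# Result printed by fmin above.
best = {"criterion": 0, "max_depth": 34.6262742855114, "max_features": 4, "n_estimators": 1}

decoded = decode_best(best, choices)
print(decoded)
```

The decoded values (gini, 18 features, 200 estimators, and a depth that rounds to 35) are the ones used for the cross-validation that follows.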

Cross Validation: ROC_AUC Score of The Random Forest Model With class_weight = {0:1, 1: 1.3} And Tuned Parameters

first_rf = RandomForestClassifier(criterion = "gini", 
                                  max_depth = 35, 
                                  max_features = 18,
                                  n_estimators = 200,
                                  class_weight = {0:1, 1:1.3})
X_s, y_s = shuffle(X, y)
first_cv_result = cross_val_score(first_rf, X_s, y_s, cv=10, scoring="roc_auc").mean()
first_cv_result
0.9639459787898208
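The roc_auc score can be read as the probability that a randomly chosen fraudulent posting receives a higher predicted score than a randomly chosen non-fraudulent one, with ties counting as half. The sketch below computes it that way in plain Python on made-up scores; it illustrates the metric and is not how sklearn implements it.

```python
from typing import List


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted fraud scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.6, 0.9]

auc = roc_auc(y_true, scores)
constant_auc = roc_auc(y_true, [0.0] * len(y_true))
print(auc, constant_auc)  # 0.875 0.5
```

A classifier that gives every posting the same score gets 0.5, so unlike accuracy, this metric does not reward always predicting the majority class.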

Tuning The Random Forest Model with class_weight = {0: 1, 1: 1.6}

space_two = {
    "n_estimators": hp.choice("n_estimators", [200, 300, 400]),
    "max_depth": hp.uniform("max_depth", 30, 40),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.choice("max_features", [20, 22, 24, 26])
}
def hyperparameter_tuning_second(params):
    clf = RandomForestClassifier(**params, class_weight = {0:1, 1:1.6})
    acc = cross_val_score(clf, X, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}
trials_two = Trials()

best_two = fmin(
    fn=hyperparameter_tuning_second,
    space = space_two, 
    algo=tpe.suggest, 
    max_evals=50, 
    trials=trials_two
)

print("Best: {}".format(best_two))
100%|████████████████████████████████████████████| 50/50 [2:18:12<00:00, 165.84s/trial, best loss: -0.9794464012045279]
Best: {'criterion': 0, 'max_depth': 35.18631376404232, 'max_features': 2, 'n_estimators': 1}

Cross Validation: ROC_AUC Score of The Random Forest Model With class_weight = {0:1, 1: 1.6} And Tuned Parameters

second_rf = RandomForestClassifier(criterion = "gini", 
                                   max_depth = 35, 
                                   max_features = 24,
                                   n_estimators = 300,
                                   class_weight = {0:1, 1:1.6})
second_cv_result = cross_val_score(second_rf, X_s, y_s, cv=10, scoring="roc_auc").mean()
second_cv_result
0.964084562106477

Tuning The Random Forest Model with class_weight = {0: 1, 1: 1.9}

space_three = {
    "n_estimators": hp.choice("n_estimators", [200, 300, 400]),
    "max_depth": hp.uniform("max_depth", 30, 40),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.choice("max_features", [24, 26, 28, 30])
}
def hyperparameter_tuning_third(params):
    clf = RandomForestClassifier(**params, class_weight = {0:1, 1:1.9})
    acc = cross_val_score(clf, X, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}
trials_three = Trials()

best_three = fmin(
    fn=hyperparameter_tuning_third,
    space = space_three, 
    algo=tpe.suggest, 
    max_evals=50, 
    trials=trials_three
)

print("Best: {}".format(best_three))
100%|█████████████████████████████████████████████| 50/50 [1:09:40<00:00, 83.60s/trial, best loss: -0.9797260237141397]
Best: {'criterion': 0, 'max_depth': 32.19118649485558, 'max_features': 2, 'n_estimators': 0}

Cross Validation: ROC_AUC Score of The Random Forest Model With class_weight = {0:1, 1: 1.9} And Tuned Parameters

third_rf = RandomForestClassifier(criterion = "gini", 
                                  max_depth = 32, 
                                  max_features = 28,
                                  n_estimators = 200,
                                  class_weight = {0:1, 1:1.9})
third_cv_result = cross_val_score(third_rf, X_s, y_s, cv=10, scoring="roc_auc").mean()
third_cv_result
0.9608796667621868

Tuning The Random Forest Model with class_weight = {0: 1, 1: 2.2}

space_four = {
    "n_estimators": hp.choice("n_estimators", [200, 300, 400]),
    "max_depth": hp.uniform("max_depth", 30, 40),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.choice("max_features", [24, 26, 28, 30, 32])
}
def hyperparameter_tuning_fourth(params):
    clf = RandomForestClassifier(**params, class_weight = {0:1, 1:2.2})
    acc = cross_val_score(clf, X, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}
trials_four = Trials()

best_four = fmin(
    fn=hyperparameter_tuning_fourth,
    space = space_four, 
    algo=tpe.suggest, 
    max_evals=50, 
    trials=trials_four
)

print("Best: {}".format(best_four))
100%|████████████████████████████████████████████| 50/50 [2:40:56<00:00, 193.13s/trial, best loss: -0.9793764466920706]
Best: {'criterion': 0, 'max_depth': 38.834973080192476, 'max_features': 3, 'n_estimators': 2}

Cross Validation: ROC_AUC Score of The Random Forest Model With class_weight = {0:1, 1: 2.2} And Tuned Parameters

fourth_rf = RandomForestClassifier(criterion = "gini", 
                                   max_depth = 39, 
                                   max_features = 30,
                                   n_estimators = 400,
                                   class_weight = {0:1, 1:2.2})
fourth_cv_result = cross_val_score(fourth_rf, X_s, y_s, cv=10, scoring="roc_auc").mean()
fourth_cv_result
0.9607872249839016

Tuning The Random Forest Model with class_weight = {0: 1, 1: 2.5}

space_five = {
    "n_estimators": hp.choice("n_estimators", [200, 300, 400]),
    "max_depth": hp.uniform("max_depth", 30, 40),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    "max_features": hp.choice("max_features", [18, 20, 22, 24, 26])
}
def hyperparameter_tuning_fifth(params):
    clf = RandomForestClassifier(**params, class_weight = {0:1, 1:2.5})
    acc = cross_val_score(clf, X, y, scoring="accuracy").mean()
    return {"loss": -acc, "status": STATUS_OK}
trials_five = Trials()

best_five = fmin(
    fn=hyperparameter_tuning_fifth,
    space = space_five, 
    algo=tpe.suggest, 
    max_evals=50, 
    trials=trials_five
)

print("Best: {}".format(best_five))
100%|████████████████████████████████████████████| 50/50 [1:28:47<00:00, 106.54s/trial, best loss: -0.9791667053673345]
Best: {'criterion': 0, 'max_depth': 38.22266362146147, 'max_features': 3, 'n_estimators': 0}

Cross Validation: ROC_AUC Score of The Random Forest Model With class_weight = {0:1, 1: 2.5} And Tuned Parameters

fifth_rf = RandomForestClassifier(criterion = "gini", 
                                  max_depth = 38, 
                                  max_features = 24,
                                  n_estimators = 200,
                                  class_weight = {0:1, 1:2.5})
fifth_cv_result = cross_val_score(fifth_rf, X_s, y_s, cv=10, scoring="roc_auc").mean()
fifth_cv_result
0.9633466214407141

Result Summary

data = {'class_weight': ['{0:1, 1:1.3}', '{0:1, 1:1.6}', '{0:1, 1:1.9}', '{0:1, 1:2.2}', '{0:1, 1:2.5}'],
        'criterion': ['gini', 'gini', 'gini', 'gini', 'gini'],
        'max_depth': [35, 35, 32, 39, 38],
        'max_features': [18, 24, 28, 30, 24],
        'n_estimators': [200, 300, 200, 400, 200],
        'CV Accuracy (%)': [97.9237, 97.9446, 97.9726, 97.9376, 97.9167],
        'CV ROC AUC': [0.963946, 0.964085, 0.96088, 0.960787, 0.963347]}

df = pd.DataFrame(data)
df
   class_weight  criterion  max_depth  max_features  n_estimators  CV Accuracy (%)  CV ROC AUC
0  {0:1, 1:1.3}  gini       35         18            200           97.9237          0.963946
1  {0:1, 1:1.6}  gini       35         24            300           97.9446          0.964085
2  {0:1, 1:1.9}  gini       32         28            200           97.9726          0.960880
3  {0:1, 1:2.2}  gini       39         30            400           97.9376          0.960787
4  {0:1, 1:2.5}  gini       38         24            200           97.9167          0.963347

The results show that the model with class_weight = {0:1, 1:1.9} has the highest CV accuracy, and the model with class_weight = {0:1, 1:1.6} has the highest cross-validated ROC AUC.

Given these results, we will use the model with class_weight = {0:1, 1:1.6}: it has the highest ROC AUC, the metric that matters most for detecting fraudulent postings, and the second-highest CV accuracy.
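The selection can be checked mechanically against the summary table. The sketch below copies the recorded CV accuracy and ROC AUC values into plain Python, picks the row with the highest ROC AUC, and reports where that row ranks on accuracy; the variable names are illustrative only.

```python
# Values copied from the result summary table.
results = [
    {"class_weight": "{0:1, 1:1.3}", "cv_accuracy": 97.9237, "cv_roc_auc": 0.963946},
    {"class_weight": "{0:1, 1:1.6}", "cv_accuracy": 97.9446, "cv_roc_auc": 0.964085},
    {"class_weight": "{0:1, 1:1.9}", "cv_accuracy": 97.9726, "cv_roc_auc": 0.960880},
    {"class_weight": "{0:1, 1:2.2}", "cv_accuracy": 97.9376, "cv_roc_auc": 0.960787},
    {"class_weight": "{0:1, 1:2.5}", "cv_accuracy": 97.9167, "cv_roc_auc": 0.963347},
]

# Primary criterion: highest cross-validated ROC AUC.
best_by_auc = max(results, key=lambda row: row["cv_roc_auc"])

# Secondary check: where does that model rank on CV accuracy (1 = best)?
by_accuracy = sorted(results, key=lambda row: row["cv_accuracy"], reverse=True)
accuracy_rank = by_accuracy.index(best_by_auc) + 1

print(best_by_auc["class_weight"], accuracy_rank)
```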