Tuning the Hyperparameters of SVM
We have two hyperparameters to tune for the SVM model: cost (C) and gamma.
Cost controls how soft the margin is: a small C tolerates more misclassified training points, while a large C penalizes them heavily.
Gamma controls how curved the decision boundary can be: a small gamma yields a smoother boundary, while a large gamma allows a wigglier, more local one.
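To make gamma's role concrete, here is a minimal NumPy sketch of the RBF kernel, the similarity measure behind the radial kernel; the two points are made up for illustration:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

# Two illustrative points one unit apart
a, b = [0.0, 0.0], [1.0, 0.0]

# Small gamma: distant points still look similar -> smoother boundary
print(rbf_kernel(a, b, gamma=0.01))  # ~0.99

# Large gamma: similarity decays fast -> wigglier, more local boundary
print(rbf_kernel(a, b, gamma=10))    # ~4.5e-05
```

Because the similarity shrinks so quickly at large gamma, each training point only influences the boundary in its immediate neighborhood, which is why large gamma values can overfit.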
We used the radial (RBF) kernel function since it appeared to work best, though we also keep the polynomial kernel in the search as a sanity check. To tune the parameters we used 5-fold cross-validation, and since we needed the best combination of the two parameters, we used a grid search over all candidate pairs.
We will run the 5-fold cross-validation twice, once scored by accuracy and once by area under the ROC curve, so that we can pick the combination that gives the most balanced result.
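Grid search itself is conceptually simple: enumerate every combination of the candidate values and cross-validate each one. A minimal stdlib sketch of the enumeration, using candidate lists that mirror the grids below:

```python
from itertools import product

param_grid = {'C': [0.01, 0.1, 1, 10, 100],
              'gamma': [10, 1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'poly']}

# Every combination of C x gamma x kernel
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

print(len(combos))      # 50 candidate models (5 * 5 * 2)
print(len(combos) * 5)  # 250 model fits with 5-fold cross-validation
```

This is exactly the work GridSearchCV does for us, which is why a coarse logarithmic grid is the usual starting point: each extra candidate value multiplies the number of fits.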
5-Fold Cross-Validation for Accuracy Using Grid Search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np
import pandas as pd
import pickle
feature_matrix_train = pd.read_csv("./data/final_feature_matrix.csv", index_col = 0)
X = feature_matrix_train.drop("fraudulent", axis = 1).values
y = feature_matrix_train.fraudulent.values
param_grid_acc = {'C': [0.01, 0.1, 1, 10, 100],
                  'gamma': [10, 1, 0.1, 0.01, 0.001],
                  'kernel': ['rbf', 'poly']}

# 5-fold cross-validation (also the GridSearchCV default), scored by accuracy
grid_acc = GridSearchCV(SVC(), param_grid_acc, cv=5)
grid_acc.fit(X, y)
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.01, 0.1, 1, 10, 100],
'gamma': [10, 1, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly']})
dictionary_acc = grid_acc.cv_results_
print(grid_acc.best_params_)
{'C': 10, 'gamma': 1, 'kernel': 'rbf'}
5-Fold Cross-Validation for ROC AUC Using Grid Search
param_grid_auc = {'C': [0.01, 0.1, 1, 10, 100],
                  'gamma': [10, 1, 0.1, 0.01, 0.001],
                  'kernel': ['rbf', 'poly']}

# Same grid, but scored by area under the ROC curve
grid_auc = GridSearchCV(SVC(), param_grid_auc, cv=5, scoring="roc_auc")
grid_auc.fit(X, y)
GridSearchCV(estimator=SVC(),
param_grid={'C': [0.01, 0.1, 1, 10, 100],
'gamma': [10, 1, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'poly']},
scoring='roc_auc')
dictionary_auc = grid_auc.cv_results_
print(grid_auc.best_params_)
{'C': 1, 'gamma': 1, 'kernel': 'rbf'}
Result Summary
result_df = pd.DataFrame({"C/gamma/kernel" : dictionary_acc['params'],
"Accuracy" : dictionary_acc['mean_test_score'],
"ROC_AUC": dictionary_auc['mean_test_score']})
result_df.sort_values("Accuracy", ascending = False).reset_index(drop = True)[0:10]
|   | C/gamma/kernel | Accuracy | ROC_AUC |
|---|---|---|---|
| 0 | {'C': 10, 'gamma': 1, 'kernel': 'rbf'} | 0.973434 | 0.932281 |
| 1 | {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'} | 0.973294 | 0.926156 |
| 2 | {'C': 100, 'gamma': 1, 'kernel': 'rbf'} | 0.971966 | 0.925867 |
| 3 | {'C': 1, 'gamma': 1, 'kernel': 'poly'} | 0.971756 | 0.918377 |
| 4 | {'C': 0.1, 'gamma': 1, 'kernel': 'poly'} | 0.971197 | 0.922597 |
| 5 | {'C': 100, 'gamma': 0.1, 'kernel': 'poly'} | 0.971197 | 0.922602 |
| 6 | {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'} | 0.970638 | 0.926423 |
| 7 | {'C': 1, 'gamma': 1, 'kernel': 'rbf'} | 0.969869 | 0.933279 |
| 8 | {'C': 10, 'gamma': 10, 'kernel': 'rbf'} | 0.969729 | 0.924050 |
| 9 | {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'} | 0.969379 | 0.913966 |
The combination {C = 10, gamma = 1, kernel = 'rbf'} gives the highest accuracy, while {C = 1, gamma = 1, kernel = 'rbf'} gives the highest ROC AUC.
We choose {C = 10, gamma = 1, kernel = 'rbf'} as the final parameters for the SVM, since it gives the most balanced result: the top accuracy and an AUC within 0.001 of the best.
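Since pickle was imported above, one way to carry the chosen parameters into the next stage is to persist them to disk; a minimal sketch, where the filename is an assumption and should be adjusted to the project's layout:

```python
import pickle

# Final parameters chosen above (best balance of accuracy and ROC AUC)
best_params = {'C': 10, 'gamma': 1, 'kernel': 'rbf'}

# Hypothetical filename, not from the original notebook
with open("svm_best_params.pkl", "wb") as f:
    pickle.dump(best_params, f)

# Reload to confirm the round trip
with open("svm_best_params.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == best_params)  # True
```

Pickling the small parameter dict (rather than the fitted GridSearchCV object) keeps the artifact tiny and avoids version-sensitivity when refitting the final model elsewhere.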