Automated Feature Selection Using Boruta Algorithm

In this section, we perform automated feature selection with Boruta. The Boruta algorithm builds on random forest variable importance: it adds randomized (shuffled) copies of the features and compares each real feature's importance against them, so that only features whose importance is statistically distinguishable from noise are kept.

Note

Boruta was developed as an improvement over plain random forest variable importance.
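To make the randomization idea concrete, here is a minimal sketch of the "shadow feature" comparison. It is not the BorutaPy implementation: absolute correlation with the target is swapped in for random forest importance so the example runs with the standard library only, and the toy columns `signal` and `noise`, along with the helper names `abs_corr` and `shadow_round`, are assumptions made for illustration.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_round(columns, target, rng):
    """One round: a feature scores a 'hit' if it beats the best shadow."""
    shadow_scores = []
    for values in columns.values():
        shuffled = values[:]      # shadow copy of the column
        rng.shuffle(shuffled)     # shuffling destroys any link to the target
        shadow_scores.append(abs_corr(shuffled, target))
    threshold = max(shadow_scores)
    return {name: abs_corr(values, target) > threshold
            for name, values in columns.items()}


rng = random.Random(1)
target = [rng.randint(0, 1) for _ in range(200)]
columns = {
    "signal": [t + rng.gauss(0, 0.3) for t in target],  # related to the target
    "noise": [rng.gauss(0, 1) for _ in range(200)],     # unrelated to the target
}

# Repeat the comparison and count how often each feature beats the shadows.
hits = {name: 0 for name in columns}
for _ in range(20):
    for name, hit in shadow_round(columns, target, rng).items():
        hits[name] += hit
print(hits)
```

A feature that carries real information should beat the shuffled copies in nearly every round, while an uninformative one should win only occasionally. Boruta turns these hit counts into a statistical decision for each feature.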

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
import joblib
text_features = pd.read_csv("./data/selected_text_features.csv", index_col = 0)
processed_train = joblib.load('./data/processed_train_jlib')
OHE_features = joblib.load('./data/OHE_features_train_jlib')
processed_train = processed_train.iloc[: , 2:8] #Removing two unnamed columns
all_features = pd.concat([text_features, OHE_features, processed_train], axis = 1)
all_features
administr_desc answer_desc asia_desc assist_desc bill_desc call_desc cash_desc desir_desc duti_desc earn_desc ... industry_Warehousing industry_Wholesale industry_Wireless industry_Writing and Editing company_profile telecommuting has_company_logo has_questions required_education fraudulent
0 0.000000 0.0 0.0 0.000000 0.0 0.092456 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
1 0.045662 0.0 0.0 0.034465 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
2 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 1 1 0
3 0.000000 0.0 0.0 0.047975 0.0 0.000000 0.085044 0.0 0.051481 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 0 1 0
4 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.053905 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 1 0 0
14300 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 1 1 0
14301 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
14302 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.120147 0.000000 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
14303 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 1 0 0 0 1 0

14304 rows × 288 columns

X = all_features.loc[:, all_features.columns != "fraudulent"] 
y = all_features.loc[:, all_features.columns == "fraudulent"] 
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X), y.values.ravel())
Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	9 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	10 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	11 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	12 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	13 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	14 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	15 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	16 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	17 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	18 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	19 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	20 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	21 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	22 / 100
Confirmed: 	83
Tentative: 	10
Rejected: 	194
Iteration: 	23 / 100
Confirmed: 	83
Tentative: 	10
Rejected: 	194
Iteration: 	24 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	25 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	26 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	27 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	28 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	29 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	30 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	31 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	32 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	33 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	34 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	35 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	36 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	37 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	38 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	39 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	40 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	41 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	42 / 100
Confirmed: 	83
Tentative: 	4
Rejected: 	200
Iteration: 	43 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	44 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	45 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	46 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	47 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	48 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	49 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	50 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	51 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	52 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	53 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	54 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	55 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	56 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	57 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	58 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	59 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	60 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	61 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	62 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	63 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	64 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	65 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	66 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	67 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	68 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	69 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	70 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	71 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	72 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	73 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	74 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	75 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	76 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	77 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	78 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	79 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	80 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	81 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	82 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	83 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	84 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	85 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	86 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	87 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	88 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	89 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	90 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	91 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	92 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	93 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	94 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	95 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	96 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	97 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	98 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	99 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201


BorutaPy finished running.

Iteration: 	100 / 100
Confirmed: 	85
Tentative: 	0
Rejected: 	201
BorutaPy(estimator=RandomForestClassifier(max_depth=5, n_estimators=262,
                                          random_state=RandomState(MT19937) at 0x21838CE8640),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x21838CE8640, verbose=2)
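The Confirmed, Tentative and Rejected counts in the log come from a statistical test on how often each feature beat the randomized copies. The sketch below illustrates that idea with a plain binomial test at a 50% chance level. It is a simplification under stated assumptions: the function names and the `alpha=0.05` threshold are illustrative, and the multiple-testing correction that real implementations apply is left out.

```python
from math import comb


def binom_tail_ge(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(hits, n + 1))


def binom_tail_le(hits, n, p=0.5):
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(0, hits + 1))


def decide(hits, n_iter, alpha=0.05):
    """Classify a feature from its hit count after n_iter iterations."""
    if binom_tail_ge(hits, n_iter) < alpha:
        return "confirmed"   # beats the shadows more often than chance
    if binom_tail_le(hits, n_iter) < alpha:
        return "rejected"    # beats the shadows less often than chance
    return "tentative"       # not enough evidence either way yet


for hits, n_iter in [(4, 4), (7, 7), (0, 7), (4, 7)]:
    print(hits, n_iter, decide(hits, n_iter))
```

With only a handful of iterations even a perfect hit record is not significant, which is in line with every feature staying tentative during the first iterations of the log above.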
print("Ranking: ",boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_) 
Ranking:  [  1   1  66   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
   1   1  65  34   1   1   1   1   1   1   1   1   1   1   1  70  12   8
   6   1   1  84   1   1   1   1   1  83   1  11 135   1  91   1   1   1
   1  61  75  81   1  32  32  98   1   1  44   1  47  53   1   1   1  59
  14  71   1  19   1   6  45  64   8   1  74   1  12   1   1   1   1   1
   1   1   1   1  37  24   1   1   1   1  23   1  43 123 174  21 116  10
  62 113  63 108   1  45  51 135  95  51  25 174  51 107  35   1   6 135
 100  78 174 135 116 135  15 174 123 174 174 110  27   1   1  36   1  77
   4  58   1  72  57   1   1  16   1 135  40  98 174  89 135  68 100 135
 123 104 174 174 114 174 174  60 174 102  18  27  73 135  94  26 123  78
  92 116  47  98  40 105 123  38  55 174   2 174 135 123 174 174 174  82
 123 174  56 174  17  42  67 135 174 174 145  29  54 174  30 174 111 174
 116  80   1 174  86 174 174  87 174 106  21  95  68 174 135 109  39 174
 174 174 174   1 174  89   1 174 102 174 174 174  87 135 174 174 174 135
 174 174 174 119 174 174  31   3 135 174 174 143  75 127 174 174 174 174
  49  19 174 174  84 112 174 174  93 143 174 174   1   1   1   1   1]
No. of significant features:  85

The Boruta algorithm selected 85 of the 287 candidate features (the 288 columns minus the fraudulent target). Features with rank 1 are the ones the algorithm confirmed as important.
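As a small illustration of how to read such a ranking, the sketch below maps ranks back to feature names. The names and rank values are hypothetical and not taken from our data. It relies on the BorutaPy convention that rank 1 marks confirmed features and rank 2 marks tentative ones, with larger ranks for rejected features.

```python
# Hypothetical feature names and ranks, for illustration only.
feature_names = ["f_a", "f_b", "f_c", "f_d", "f_e"]
ranking = [1, 3, 1, 2, 4]

# Rank 1: confirmed features.
confirmed = [name for name, rank in zip(feature_names, ranking) if rank == 1]

# Rank 2: features left tentative when the run ended.
tentative = [name for name, rank in zip(feature_names, ranking) if rank == 2]

# Rank 3 and above: rejected features, listed from best to worst rank.
rejected = [name for rank, name in sorted(
    (rank, name) for name, rank in zip(feature_names, ranking) if rank > 2
)]

print("Confirmed:", confirmed)
print("Tentative:", tentative)
print("Rejected: ", rejected)
```

The same pattern would apply to `boruta_selector.ranking_` together with `X.columns`, for example to check which single feature received rank 2 in the output above.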

selected_features = X.iloc[:, boruta_selector.ranking_ == 1]
final_feature_matrix = pd.concat([selected_features, y], axis = 1)
final_feature_matrix
administr_desc answer_desc assist_desc bill_desc call_desc cash_desc desir_desc duti_desc earn_desc entri_desc ... industry_Accounting industry_Leisure, Travel & Tourism industry_NAN industry_Oil & Energy company_profile telecommuting has_company_logo has_questions required_education fraudulent
0 0.000000 0.0 0.000000 0.0 0.092456 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
1 0.045662 0.0 0.034465 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
2 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 1 1 0
3 0.000000 0.0 0.047975 0.0 0.000000 0.085044 0.0 0.051481 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 1 0
4 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.053905 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 1 0 0
14300 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 1.0 0.0 0 0 1 1 1 0
14301 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
14302 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.120147 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0 0 1 0 0 0
14303 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.0 0.0 1 0 0 0 1 0

14304 rows × 86 columns

We save this final feature matrix as a CSV file.

final_feature_matrix.to_csv("./data/final_feature_matrix.csv")