Automated Feature Selection Using Boruta Algorithm
In this section, we perform automated feature selection using Boruta. The Boruta algorithm adds randomization on top of random forest variable importance: it compares the importance of each feature against that of randomly shuffled copies of the features, repeats the comparison over many iterations, and keeps only the features whose importance is statistically significant.
Note
Boruta was developed as an improvement over plain random forest variable importance.
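To make the idea of shuffled copies concrete, here is a minimal sketch of a single Boruta-style round. It is not BorutaPy's implementation: the toy data, the variable names and the forest settings are assumptions chosen for illustration, and the real algorithm repeats this comparison many times before deciding anything.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): columns 0 and 1 drive the
# target, columns 2-4 are pure noise.
n_samples = 500
X_toy = rng.normal(size=(n_samples, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Shadow features: every column shuffled independently. This breaks any link
# with the target while keeping each column's distribution.
X_shadow = np.column_stack([rng.permutation(col) for col in X_toy.T])

# Fit one forest on the real and shadow columns together.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

n_real = X_toy.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A "hit" is a real feature that beats the best shadow feature in this round.
hits = real_importance > shadow_max
print("Hits this round:", hits)
```

A noise column can occasionally beat the best shadow column by chance in a single round, which is why the hits are accumulated over many rounds before a feature is confirmed or rejected.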
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
text_features = pd.read_csv("./data/selected_text_features.csv", index_col = 0)
processed_train = joblib.load('./data/processed_train_jlib')
OHE_features = joblib.load('./data/OHE_features_train_jlib')
processed_train = processed_train.iloc[:, 2:8]  # keep columns 2-7, which drops the two unnamed leading columns
all_features = pd.concat([text_features, OHE_features, processed_train], axis = 1)
all_features
index | administr_desc | answer_desc | asia_desc | assist_desc | bill_desc | call_desc | cash_desc | desir_desc | duti_desc | earn_desc | ... | industry_Warehousing | industry_Wholesale | industry_Wireless | industry_Writing and Editing | company_profile | telecommuting | has_company_logo | has_questions | required_education | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.092456 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0.045662 | 0.0 | 0.0 | 0.034465 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 |
3 | 0.000000 | 0.0 | 0.0 | 0.047975 | 0.0 | 0.000000 | 0.085044 | 0.0 | 0.051481 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.053905 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14299 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 |
14300 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 |
14301 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
14302 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.120147 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
14303 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 |
14304 rows × 288 columns
X = all_features.loc[:, all_features.columns != "fraudulent"]
y = all_features.loc[:, all_features.columns == "fraudulent"]
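# n_estimators='auto' lets BorutaPy pick the forest size itself, overriding the 1000 set on rfc
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
# BorutaPy works on NumPy arrays, so convert the DataFrames before fitting
boruta_selector.fit(np.array(X), y.values.ravel())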
Iteration: 1 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 2 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 3 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 4 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 5 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 6 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 7 / 100
Confirmed: 0
Tentative: 287
Rejected: 0
Iteration: 8 / 100
Confirmed: 78
Tentative: 18
Rejected: 191
Iteration: 9 / 100
Confirmed: 78
Tentative: 18
Rejected: 191
Iteration: 10 / 100
Confirmed: 78
Tentative: 18
Rejected: 191
Iteration: 11 / 100
Confirmed: 78
Tentative: 18
Rejected: 191
Iteration: 12 / 100
Confirmed: 81
Tentative: 15
Rejected: 191
Iteration: 13 / 100
Confirmed: 81
Tentative: 15
Rejected: 191
Iteration: 14 / 100
Confirmed: 81
Tentative: 13
Rejected: 193
Iteration: 15 / 100
Confirmed: 81
Tentative: 13
Rejected: 193
Iteration: 16 / 100
Confirmed: 81
Tentative: 13
Rejected: 193
Iteration: 17 / 100
Confirmed: 81
Tentative: 13
Rejected: 193
Iteration: 18 / 100
Confirmed: 81
Tentative: 13
Rejected: 193
Iteration: 19 / 100
Confirmed: 82
Tentative: 11
Rejected: 194
Iteration: 20 / 100
Confirmed: 82
Tentative: 11
Rejected: 194
Iteration: 21 / 100
Confirmed: 82
Tentative: 11
Rejected: 194
Iteration: 22 / 100
Confirmed: 83
Tentative: 10
Rejected: 194
Iteration: 23 / 100
Confirmed: 83
Tentative: 10
Rejected: 194
Iteration: 24 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 25 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 26 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 27 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 28 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 29 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 30 / 100
Confirmed: 83
Tentative: 8
Rejected: 196
Iteration: 31 / 100
Confirmed: 83
Tentative: 7
Rejected: 197
Iteration: 32 / 100
Confirmed: 83
Tentative: 7
Rejected: 197
Iteration: 33 / 100
Confirmed: 83
Tentative: 7
Rejected: 197
Iteration: 34 / 100
Confirmed: 83
Tentative: 6
Rejected: 198
Iteration: 35 / 100
Confirmed: 83
Tentative: 6
Rejected: 198
Iteration: 36 / 100
Confirmed: 83
Tentative: 6
Rejected: 198
Iteration: 37 / 100
Confirmed: 83
Tentative: 6
Rejected: 198
Iteration: 38 / 100
Confirmed: 83
Tentative: 6
Rejected: 198
Iteration: 39 / 100
Confirmed: 83
Tentative: 5
Rejected: 199
Iteration: 40 / 100
Confirmed: 83
Tentative: 5
Rejected: 199
Iteration: 41 / 100
Confirmed: 83
Tentative: 5
Rejected: 199
Iteration: 42 / 100
Confirmed: 83
Tentative: 4
Rejected: 200
Iteration: 43 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 44 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 45 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 46 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 47 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 48 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 49 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 50 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 51 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 52 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 53 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 54 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 55 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 56 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 57 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 58 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 59 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 60 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 61 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 62 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 63 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 64 / 100
Confirmed: 84
Tentative: 3
Rejected: 200
Iteration: 65 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 66 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 67 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 68 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 69 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 70 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 71 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 72 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 73 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 74 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 75 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 76 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 77 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 78 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 79 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 80 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 81 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 82 / 100
Confirmed: 85
Tentative: 2
Rejected: 200
Iteration: 83 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 84 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 85 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 86 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 87 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 88 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 89 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 90 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 91 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 92 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 93 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 94 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 95 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 96 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 97 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 98 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
Iteration: 99 / 100
Confirmed: 85
Tentative: 1
Rejected: 201
BorutaPy finished running.
Iteration: 100 / 100
Confirmed: 85
Tentative: 0
Rejected: 201
BorutaPy(estimator=RandomForestClassifier(max_depth=5, n_estimators=262,
random_state=RandomState(MT19937) at 0x21838CE8640),
n_estimators='auto',
random_state=RandomState(MT19937) at 0x21838CE8640, verbose=2)
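The Confirmed, Tentative and Rejected counts in the log above come from a statistical test on how often each feature scored a hit against the shuffled features. The sketch below shows the underlying idea with a plain binomial test in which a hit has probability 0.5 under the null hypothesis. The function names and the `alpha` threshold are assumptions for illustration, and the multiple-testing correction that BorutaPy applies is left out, so this is a simplified picture rather than the library's exact rule.

```python
from math import comb


def binom_upper_tail(hits, n_rounds, p=0.5):
    """P(X >= hits) for X ~ Binomial(n_rounds, p)."""
    return sum(
        comb(n_rounds, k) * p**k * (1 - p) ** (n_rounds - k)
        for k in range(hits, n_rounds + 1)
    )


def binom_lower_tail(hits, n_rounds, p=0.5):
    """P(X <= hits) for X ~ Binomial(n_rounds, p)."""
    return sum(
        comb(n_rounds, k) * p**k * (1 - p) ** (n_rounds - k)
        for k in range(0, hits + 1)
    )


def decide(hits, n_rounds, alpha=0.05):
    """Classify a feature from its hit count (no multiple-testing correction)."""
    if binom_upper_tail(hits, n_rounds) < alpha:
        return "confirmed"  # hits far more often than chance
    if binom_lower_tail(hits, n_rounds) < alpha:
        return "rejected"  # hits far less often than chance
    return "tentative"  # not enough evidence either way yet


for hit_count in (9, 5, 1):
    print(hit_count, "hits in 10 rounds:", decide(hit_count, 10))
```

With only a handful of rounds, no hit count can be extreme enough to pass a strict threshold, which is consistent with every feature staying tentative during the first iterations of the log.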
print("Ranking: ",boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_)
Ranking: [ 1 1 66 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 65 34 1 1 1 1 1 1 1 1 1 1 1 70 12 8
6 1 1 84 1 1 1 1 1 83 1 11 135 1 91 1 1 1
1 61 75 81 1 32 32 98 1 1 44 1 47 53 1 1 1 59
14 71 1 19 1 6 45 64 8 1 74 1 12 1 1 1 1 1
1 1 1 1 37 24 1 1 1 1 23 1 43 123 174 21 116 10
62 113 63 108 1 45 51 135 95 51 25 174 51 107 35 1 6 135
100 78 174 135 116 135 15 174 123 174 174 110 27 1 1 36 1 77
4 58 1 72 57 1 1 16 1 135 40 98 174 89 135 68 100 135
123 104 174 174 114 174 174 60 174 102 18 27 73 135 94 26 123 78
92 116 47 98 40 105 123 38 55 174 2 174 135 123 174 174 174 82
123 174 56 174 17 42 67 135 174 174 145 29 54 174 30 174 111 174
116 80 1 174 86 174 174 87 174 106 21 95 68 174 135 109 39 174
174 174 174 1 174 89 1 174 102 174 174 174 87 135 174 174 174 135
174 174 174 119 174 174 31 3 135 174 174 143 75 127 174 174 174 174
49 19 174 174 84 112 174 174 93 143 174 174 1 1 1 1 1]
No. of significant features: 85
The Boruta algorithm selected 85 of the 287 candidate features (the 288th column, fraudulent, is the target). Features with rank 1 are the ones the algorithm confirmed as important.
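The ranking array is positional, so each entry lines up with the column of X at the same position. As a small sketch of how ranks map back to column names, the example below uses only the first four columns and their ranks as printed above; the variable names are assumptions for illustration.

```python
import numpy as np

# First four columns of X and their ranks, copied from the output above.
feature_names = np.array(["administr_desc", "answer_desc", "asia_desc", "assist_desc"])
ranking = np.array([1, 1, 66, 1])

# Rank 1 marks a confirmed feature; larger ranks belong to features that were not selected.
selected_names = feature_names[ranking == 1].tolist()
print(selected_names)

# A stable sort by rank lists the features from most to least relevant.
order = np.argsort(ranking, kind="stable")
print(feature_names[order].tolist())
```

On the fitted selector, the same boolean mask should also be available as `boruta_selector.support_`, so `X.columns[boruta_selector.support_]` is one way to list the selected column names.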
selected_features = X.iloc[:, boruta_selector.ranking_ == 1]
final_feature_matrix = pd.concat([selected_features, y], axis = 1)
final_feature_matrix
index | administr_desc | answer_desc | assist_desc | bill_desc | call_desc | cash_desc | desir_desc | duti_desc | earn_desc | entri_desc | ... | industry_Accounting | industry_Leisure, Travel & Tourism | industry_NAN | industry_Oil & Energy | company_profile | telecommuting | has_company_logo | has_questions | required_education | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.092456 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0.045662 | 0.0 | 0.034465 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 |
3 | 0.000000 | 0.0 | 0.047975 | 0.0 | 0.000000 | 0.085044 | 0.0 | 0.051481 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.053905 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14299 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 |
14300 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 |
14301 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
14302 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.120147 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 |
14303 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 |
14304 rows × 86 columns
We save this final feature matrix as a CSV file so that it can be reloaded later.
final_feature_matrix.to_csv("./data/final_feature_matrix.csv")
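Because `to_csv` writes the row index as an unnamed first column by default, reading the file back with `index_col=0` (as done for `selected_text_features.csv` at the start of this section) restores the original layout instead of adding an extra unnamed column. The round trip below is a small sketch on a toy frame written to a temporary directory; the frame contents and file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final feature matrix (values are illustrative only).
toy = pd.DataFrame({"call_desc": [0.092456, 0.0], "fraudulent": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_feature_matrix.csv")
    toy.to_csv(path)  # the index is written as the first, unnamed column
    reloaded = pd.read_csv(path, index_col=0)  # use that column as the index again
    without_index_col = pd.read_csv(path)  # for comparison: index kept as a data column

print(reloaded.columns.tolist())
print(without_index_col.columns.tolist())
```

Passing `index=False` to `to_csv` is an alternative when the index carries no information, in which case the file can be read back without `index_col`.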