Automated Feature Selection Using the Boruta Algorithm

In this section, we perform automated feature selection using Boruta. The Boruta algorithm adds randomization on top of random forest variable importance: it compares the importance of each real feature against that of randomly shuffled copies of the features, and keeps only the features whose importance is statistically significant.

Note

Boruta was developed as an improvement over plain random forest variable importance.
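
To make the idea concrete, here is a minimal sketch of a single Boruta-style round on toy data. It is not BorutaPy's implementation: the data, the names `X_toy`, `y_toy` and `X_shadow`, and the forest settings are all assumptions made for illustration. The full algorithm repeats this comparison over many iterations and applies a statistical test to the accumulated hits before confirming or rejecting a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data (hypothetical): column 0 determines the label, columns 1-3 are noise.
X_toy = rng.normal(size=(400, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Shadow features: each column shuffled independently, which breaks any link
# with the target while keeping the column's distribution.
X_shadow = np.column_stack(
    [rng.permutation(X_toy[:, j]) for j in range(X_toy.shape[1])]
)

# Fit one forest on the real and shadow features together.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=1)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

n_real = X_toy.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" in this round if it beats the best shadow.
hits = real_importance > shadow_max
print("real importances:", real_importance.round(3))
print("best shadow importance:", round(float(shadow_max), 3))
print("hits:", hits)
```

A feature that repeatedly beats the best shadow is a candidate for confirmation, while one that rarely does is a candidate for rejection.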

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
import joblib

text_features = pd.read_csv("./data/selected_text_features.csv", index_col = 0)
processed_train = joblib.load('./data/processed_train_jlib')
OHE_features = joblib.load('./data/OHE_features_train_jlib')
processed_train = processed_train.iloc[:, 2:8]  # Keep columns 2-7, removing the two unnamed columns

all_features = pd.concat([text_features, OHE_features, processed_train], axis = 1)
all_features

	administr_desc	answer_desc	asia_desc	assist_desc	bill_desc	call_desc	cash_desc	desir_desc	duti_desc	earn_desc	...	industry_Warehousing	industry_Wholesale	industry_Wireless	industry_Writing and Editing	company_profile	telecommuting	has_company_logo	has_questions	required_education	fraudulent
0	0.000000	0.0	0.0	0.000000	0.0	0.092456	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
1	0.045662	0.0	0.0	0.034465	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
2	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	1	1	0
3	0.000000	0.0	0.0	0.047975	0.0	0.000000	0.085044	0.0	0.051481	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	0	1	0
4	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.053905	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14299	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	1	0	0
14300	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	1	1	0
14301	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
14302	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.120147	0.000000	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
14303	0.000000	0.0	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	...	0.0	0.0	0.0	0.0	1	0	0	0	1	0

14304 rows × 288 columns

# Split the predictors (X) from the target column (y); y stays a one-column DataFrame
X = all_features.loc[:, all_features.columns != "fraudulent"]
y = all_features.loc[:, all_features.columns == "fraudulent"]

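# Base estimator for Boruta: a shallow random forest
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)

# n_estimators='auto' lets BorutaPy choose the number of trees itself,
# overriding the 1000 set above (the fitted estimator below reports 262)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)

# Pass plain NumPy arrays: the feature matrix and a 1-D label vector
boruta_selector.fit(np.array(X), y.values.ravel())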

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	9 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	10 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	11 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	12 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	13 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	14 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	15 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	16 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	17 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	18 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	19 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	20 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	21 / 100
Confirmed: 	82
Tentative: 	11
Rejected: 	194
Iteration: 	22 / 100
Confirmed: 	83
Tentative: 	10
Rejected: 	194
Iteration: 	23 / 100
Confirmed: 	83
Tentative: 	10
Rejected: 	194
Iteration: 	24 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	25 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	26 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	27 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	28 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	29 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	30 / 100
Confirmed: 	83
Tentative: 	8
Rejected: 	196
Iteration: 	31 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	32 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	33 / 100
Confirmed: 	83
Tentative: 	7
Rejected: 	197
Iteration: 	34 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	35 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	36 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	37 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	38 / 100
Confirmed: 	83
Tentative: 	6
Rejected: 	198
Iteration: 	39 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	40 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	41 / 100
Confirmed: 	83
Tentative: 	5
Rejected: 	199
Iteration: 	42 / 100
Confirmed: 	83
Tentative: 	4
Rejected: 	200
Iteration: 	43 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	44 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	45 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	46 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	47 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	48 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	49 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	50 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	51 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	52 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	53 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	54 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	55 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	56 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	57 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	58 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	59 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	60 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	61 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	62 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	63 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	64 / 100
Confirmed: 	84
Tentative: 	3
Rejected: 	200
Iteration: 	65 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	66 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	67 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	68 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	69 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	70 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	71 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	72 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	73 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	74 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	75 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	76 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	77 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	78 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	79 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	80 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	81 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	82 / 100
Confirmed: 	85
Tentative: 	2
Rejected: 	200
Iteration: 	83 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	84 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	85 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	86 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	87 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	88 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	89 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	90 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	91 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	92 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	93 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	94 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	95 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	96 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	97 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	98 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201
Iteration: 	99 / 100
Confirmed: 	85
Tentative: 	1
Rejected: 	201


BorutaPy finished running.

Iteration: 	100 / 100
Confirmed: 	85
Tentative: 	0
Rejected: 	201

BorutaPy(estimator=RandomForestClassifier(max_depth=5, n_estimators=262,
                                          random_state=RandomState(MT19937) at 0x21838CE8640),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x21838CE8640, verbose=2)

print("Ranking: ", boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_)

Ranking:  [  1   1  66   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
 1  65  34   1   1   1   1   1   1   1   1   1   1   1  70  12   8
 1   1  84   1   1   1   1   1  83   1  11 135   1  91   1   1   1
61  75  81   1  32  32  98   1   1  44   1  47  53   1   1   1  59
71   1  19   1   6  45  64   8   1  74   1  12   1   1   1   1   1
 1   1   1  37  24   1   1   1   1  23   1  43 123 174  21 116  10
113  63 108   1  45  51 135  95  51  25 174  51 107  35   1   6 135
78 174 135 116 135  15 174 123 174 174 110  27   1   1  36   1  77
58   1  72  57   1   1  16   1 135  40  98 174  89 135  68 100 135
104 174 174 114 174 174  60 174 102  18  27  73 135  94  26 123  78
116  47  98  40 105 123  38  55 174   2 174 135 123 174 174 174  82
174  56 174  17  42  67 135 174 174 145  29  54 174  30 174 111 174
80   1 174  86 174 174  87 174 106  21  95  68 174 135 109  39 174
174 174   1 174  89   1 174 102 174 174 174  87 135 174 174 174 135
174 174 119 174 174  31   3 135 174 174 143  75 127 174 174 174 174
19 174 174  84 112 174 174  93 143 174 174   1   1   1   1   1]
No. of significant features:  85

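The Boruta algorithm selected 85 of the 287 candidate features (288 columns minus the target, fraudulent). Features with rank 1 are the ones the algorithm confirmed as important; higher ranks mark rejected features.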

selected_features = X.iloc[:, boruta_selector.ranking_ == 1]
final_feature_matrix = pd.concat([selected_features, y], axis = 1)
final_feature_matrix

	administr_desc	answer_desc	assist_desc	bill_desc	call_desc	cash_desc	desir_desc	duti_desc	earn_desc	entri_desc	...	industry_Accounting	industry_Leisure, Travel & Tourism	industry_NAN	industry_Oil & Energy	company_profile	telecommuting	has_company_logo	has_questions	required_education	fraudulent
0	0.000000	0.0	0.000000	0.0	0.092456	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
1	0.045662	0.0	0.034465	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
2	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	1	1	0
3	0.000000	0.0	0.047975	0.0	0.000000	0.085044	0.0	0.051481	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	1	0
4	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.053905	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14299	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	1	0	0
14300	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	1.0	0.0	0	0	1	1	1	0
14301	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
14302	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.120147	0.000000	0.0	...	0.0	0.0	0.0	0.0	0	0	1	0	0	0
14303	0.000000	0.0	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.000000	0.0	...	0.0	0.0	0.0	0.0	1	0	0	0	1	0

14304 rows × 86 columns
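
The rank-1 selection can also be inspected by column name, which makes it easier to see which features were kept and which came closest to being kept. The sketch below uses a small hypothetical frame and ranking array (`X_demo`, `ranking_demo`) in place of `X` and `boruta_selector.ranking_`; the feature names and ranks are invented for illustration. BorutaPy also exposes a boolean `support_` mask, which should correspond to `ranking_ == 1`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X and boruta_selector.ranking_.
X_demo = pd.DataFrame(
    np.zeros((3, 5)),
    columns=["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"],
)
ranking_demo = np.array([1, 66, 1, 174, 1])

# Rank 1 marks a selected feature; the boolean mask indexes the column labels.
mask = ranking_demo == 1
selected_names = X_demo.columns[mask].tolist()

# Rejected features ordered by rank (lower rank = closer to being selected).
rejected = pd.Series(ranking_demo, index=X_demo.columns)[~mask].sort_values()

print(selected_names)
print(rejected)
```

Keeping the names alongside the ranks is mainly a convenience for reporting; the selection itself is unchanged.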

We save this final feature matrix as a CSV file so that it can be reused later.

final_feature_matrix.to_csv("./data/final_feature_matrix.csv")
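
When the file is read back, the index written by `to_csv` has to be handled explicitly. The sketch below shows the round trip on a tiny hypothetical frame (`demo`, with invented column names and a temporary file path), assuming the default `to_csv` behaviour of writing the index as the first column. Reading without `index_col=0` produces an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the feature matrix.
demo = pd.DataFrame({"feature_a": [0.5, 0.0], "fraudulent": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_feature_matrix_demo.csv")
    demo.to_csv(path)                          # index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # first column becomes the index again
    naive = pd.read_csv(path)                  # index comes back as a data column

print(list(restored.columns))
print(list(naive.columns))
```

Passing `index_col=0` on load, as done for the text features at the start of this section, avoids having to drop the extra column afterwards.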
