# Automated Feature Selection Using Boruta Algorithm

In this section, we perform automated feature selection with Boruta. The Boruta algorithm adds randomization on top of random forest variable importance: it compares the importance of each real feature against the importance of randomly permuted copies of the features, and keeps only the features whose importance is statistically significant.

```{note}
Boruta was developed as an improvement over plain random forest variable importance.
```
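To make the idea concrete, here is a minimal sketch of a single Boruta-style round on toy data. It is not the `BorutaPy` implementation: the toy arrays `X_toy` and `y_toy`, the forest settings, and the variable names are all assumptions made for illustration. Each real column gets a shuffled "shadow" copy, a random forest is fit on both, and a real feature scores a "hit" when its importance exceeds that of the best shadow feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: feature 0 determines the label, features 1-3 are noise.
n_samples, n_features = 500, 4
X_toy = rng.normal(size=(n_samples, n_features))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Shadow features: every column shuffled independently, which breaks
# any relationship between that column and the target.
X_shadow = np.column_stack(
    [rng.permutation(X_toy[:, j]) for j in range(n_features)]
)

# Fit one forest on the real and shadow features side by side.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X_toy, X_shadow]), y_toy)

importances = forest.feature_importances_
real_importance = importances[:n_features]
best_shadow = importances[n_features:].max()

# A real feature scores a "hit" if it beats the best shadow feature.
hits = real_importance > best_shadow
print("Hits in this round:", hits)
```

Boruta repeats this kind of round many times and keeps the features that score hits significantly more often than chance would allow.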

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
import joblib

In [16]:
text_features = pd.read_csv("./data/selected_text_features.csv", index_col = 0)
processed_train = joblib.load('./data/processed_train_jlib')
OHE_features = joblib.load('./data/OHE_features_train_jlib')
processed_train = processed_train.iloc[:, 2:8]  # Keep columns 2-7, dropping the two unnamed leading columns

In [20]:
all_features = pd.concat([text_features, OHE_features, processed_train], axis = 1)
all_features

Unnamed: 0,administr_desc,answer_desc,asia_desc,assist_desc,bill_desc,call_desc,cash_desc,desir_desc,duti_desc,earn_desc,...,industry_Warehousing,industry_Wholesale,industry_Wireless,industry_Writing and Editing,company_profile,telecommuting,has_company_logo,has_questions,required_education,fraudulent
0,0.000000,0.0,0.0,0.000000,0.0,0.092456,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
1,0.045662,0.0,0.0,0.034465,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
2,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,1,1,0
3,0.000000,0.0,0.0,0.047975,0.0,0.000000,0.085044,0.0,0.051481,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,0,1,0
4,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.053905,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14299,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,1,0,0
14300,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,1,1,0
14301,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
14302,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.120147,0.000000,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0


In [23]:
# Split the combined table into predictors (X) and the target column (y)
X = all_features.loc[:, all_features.columns != "fraudulent"]
y = all_features.loc[:, all_features.columns == "fraudulent"]

In [27]:
# Base estimator: Boruta uses this forest's feature importances
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)

In [28]:
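# n_estimators='auto' lets Boruta choose the number of trees itself, overriding the 1000 set on rfc
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)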

In [35]:
boruta_selector.fit(np.array(X), y.values.ravel())

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	287
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	9 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	10 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	11 / 100
Confirmed: 	78
Tentative: 	18
Rejected: 	191
Iteration: 	12 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	13 / 100
Confirmed: 	81
Tentative: 	15
Rejected: 	191
Iteration: 	14 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	15 / 100
Confirmed: 	81
Tentative: 	13
Rejected: 	193
Iteration: 	16 / 100
Confirmed: 	

BorutaPy(estimator=RandomForestClassifier(max_depth=5, n_estimators=262,
                                          random_state=RandomState(MT19937) at 0x21838CE8640),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x21838CE8640, verbose=2)
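In the log above, no feature is confirmed or rejected before iteration 8. One plausible reading is sketched below with simple arithmetic. It assumes that each iteration is treated as a fair coin flip (a feature either beats the best randomized "shadow" copy or it does not), that the significance level is 0.05, and that the threshold is tightened to `ALPHA / n_iter` as iterations accumulate. The helper `upper_tail` and the constant `ALPHA` are hypothetical names, and the sketch ignores any additional correction across features that the library may apply.

```python
from math import comb

ALPHA = 0.05  # assumed significance level


def upper_tail(hits: int, n_iter: int, p: float = 0.5) -> float:
    """P(X >= hits) for X ~ Binomial(n_iter, p)."""
    return sum(
        comb(n_iter, k) * p**k * (1 - p) ** (n_iter - k)
        for k in range(hits, n_iter + 1)
    )


# Best case for a feature: a hit in every single iteration so far.
for n_iter in range(6, 10):
    p_value = upper_tail(n_iter, n_iter)
    threshold = ALPHA / n_iter
    print(f"iter {n_iter}: p={p_value:.5f} threshold={threshold:.5f} "
          f"decidable={p_value <= threshold}")

# First iteration at which even a perfect record clears the threshold.
first_decidable = next(
    n for n in range(1, 101) if upper_tail(n, n) <= ALPHA / n
)
print("First iteration where a decision is possible:", first_decidable)
```

Under these assumptions, a perfect record of 7 hits gives p = 0.5^7 ≈ 0.0078, which is still above 0.05 / 7 ≈ 0.0071, while 8 hits give p ≈ 0.0039, below 0.05 / 8 = 0.00625. By symmetry the same holds for rejecting a feature with zero hits, which is consistent with the first decisions appearing at iteration 8.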

In [43]:
print("Ranking: ", boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_)

Ranking:  [  1   1  66   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
   1   1  65  34   1   1   1   1   1   1   1   1   1   1   1  70  12   8
   6   1   1  84   1   1   1   1   1  83   1  11 135   1  91   1   1   1
   1  61  75  81   1  32  32  98   1   1  44   1  47  53   1   1   1  59
  14  71   1  19   1   6  45  64   8   1  74   1  12   1   1   1   1   1
   1   1   1   1  37  24   1   1   1   1  23   1  43 123 174  21 116  10
  62 113  63 108   1  45  51 135  95  51  25 174  51 107  35   1   6 135
 100  78 174 135 116 135  15 174 123 174 174 110  27   1   1  36   1  77
   4  58   1  72  57   1   1  16   1 135  40  98 174  89 135  68 100 135
 123 104 174 174 114 174 174  60 174 102  18  27  73 135  94  26 123  78
  92 116  47  98  40 105 123  38  55 174   2 174 135 123 174 174 174  82
 123 174  56 174  17  42  67 135 174 174 145  29  54 174  30 174 111 174
 116  80   1 174  86 174 174  87 174 106  21  95  68 174 135 109  39 174
 174 174 174   1 174  89   1 174 102 174 

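The Boruta algorithm selected 85 of the candidate features (the log above starts from 287 tentative features, since the target `fraudulent` is excluded from the 288 columns). Features with rank 1 are the ones the algorithm confirmed as important.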
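As a small illustration of how a ranking array maps back to column names, the sketch below uses a made-up five-element ranking and hypothetical column names rather than the real `boruta_selector.ranking_`. It assumes the `BorutaPy` convention that rank 1 marks confirmed features and rank 2 marks tentative ones.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking for five columns (not the real output above):
# 1 = confirmed, 2 = tentative, larger values = rejected.
ranking = np.array([1, 3, 1, 2, 5])
toy = pd.DataFrame(
    np.zeros((2, 5)),
    columns=["a_desc", "b_desc", "c_desc", "d_desc", "e_desc"],
)

# Boolean masks over the ranking select the matching column names.
confirmed_names = toy.columns[ranking == 1].tolist()
tentative_names = toy.columns[ranking == 2].tolist()

print("Confirmed:", confirmed_names)
print("Tentative:", tentative_names)
```

The same masking pattern applied to `X.columns` would list the selected features by name instead of by position.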

In [45]:
# Keep only the columns Boruta confirmed (rank 1), then re-attach the target
selected_features = X.iloc[:, boruta_selector.ranking_ == 1]
final_feature_matrix = pd.concat([selected_features, y], axis=1)
final_feature_matrix

Unnamed: 0,administr_desc,answer_desc,assist_desc,bill_desc,call_desc,cash_desc,desir_desc,duti_desc,earn_desc,entri_desc,...,industry_Accounting,"industry_Leisure, Travel & Tourism",industry_NAN,industry_Oil & Energy,company_profile,telecommuting,has_company_logo,has_questions,required_education,fraudulent
0,0.000000,0.0,0.000000,0.0,0.092456,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
1,0.045662,0.0,0.034465,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
2,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,1,1,0
3,0.000000,0.0,0.047975,0.0,0.000000,0.085044,0.0,0.051481,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,1,0
4,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.053905,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14299,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,1,0,0
14300,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,1.0,0.0,0,0,1,1,1,0
14301,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0
14302,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.120147,0.000000,0.0,...,0.0,0.0,0.0,0.0,0,0,1,0,0,0


We save this final feature matrix as a CSV file.

In [46]:
final_feature_matrix.to_csv("./data/final_feature_matrix.csv")