Text Feature Selection

As we discussed earlier, we must perform feature selection on the text features first because their massive file size (10 GB) causes a MemoryError. Since we cannot use the fraudulent column yet, we will use column means to select features based on several assumptions. After we obtain a smaller version of text_features_train, we will combine it with the fraudulent column and perform supervised feature selection using Chi-Square statistics.

Feature Selection Using Column Mean

import pandas as pd 
import joblib
text_features_train = joblib.load('./data/text_features_train_jlib')
text_features_train.head(5)
aa_desc aaa_desc aaab_desc aab_desc aabc_desc aabd_desc aabf_desc aac_desc aaccd_desc aachen_desc ... zodat_benefits zollman_benefits zombi_benefits zone_benefits zoo_benefits zowel_benefits zu_benefits zult_benefits zutrifft_benefits zweig_benefits
0 0.165596 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 89527 columns

Let’s carefully observe the dataframe above. Some of these features have unusual names, such as zutrifft and aabd, which are unlikely to be stems of standard English words. There are two main reasons why these unusual names appear as features.

  1. Although we removed URLs and HTML formatting in the data pre-processing step, it is possible that some formatting was not perfectly removed. The data can also include other non-English tokens, such as email addresses or file names.

  2. If we look over the original dataset, we can observe that the text was saved with no space between lines. For example, “I love dog” (line 1) and “Cat ate fish” (line 2) become “I love dogCat ate fish”, creating the abnormal word “dogCat”.

One simple way to remove these unusual words at the lowest computational cost is to use the column mean and filter out features with exceptionally low means. This is based on the assumption that unusual words appear less frequently than normal words. For instance, a word like “havecommunication” will rarely appear across the dataset, and words that appear infrequently will have a low column mean.
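The idea above can be sketched on a toy TF-IDF matrix. The column names and values here are made up for illustration; a junk token that appears in only one row gets a very low column mean and is dropped:

```python
import pandas as pd

# Toy TF-IDF matrix: two common words and one junk token ("dogcat")
# that appears in a single row, giving it a very low column mean.
df = pd.DataFrame({
    "work":   [0.3, 0.2, 0.0, 0.4],
    "team":   [0.0, 0.5, 0.3, 0.1],
    "dogcat": [0.0, 0.0, 0.02, 0.0],
})

keep = df.mean() > 0.05       # boolean mask over the columns
filtered = df.loc[:, keep]    # drop the low-mean columns
print(list(filtered.columns))  # → ['work', 'team']
```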

Warning

This method also rests on the risky assumption that low-frequency words are less important than frequent ones. Feature selection using column means can eliminate words that matter for machine learning. However, since more than 10,000 of our features are unusual words, I am trading some accuracy for efficiency. I am aware that feature selection must be done very carefully, and this is not the ideal approach.

However, we must not perform this feature selection by column means on the entire dataset at once, since text_features_train is a combination of four different source columns: description, title, requirements, and benefits. Because TF-IDF values can vary with the characteristics of each source, we should compute the column means and select features separately for each subset, then combine the results later.

# Count the columns from each source, then slice the dataframe by position.
sum(text_features_train.columns.str.contains('_desc'))      # 40607 columns
text_features_desc = text_features_train.iloc[:, 0:40607]
sum(text_features_train.columns.str.contains('_req'))       # 34256 columns
text_features_req = text_features_train.iloc[:, 40607:74863]
sum(text_features_train.columns.str.contains('_title'))     # 3417 columns
text_features_title = text_features_train.iloc[:, 74863:78280]
sum(text_features_train.columns.str.contains('_benefits'))  # 11247 columns
text_features_benefits = text_features_train.iloc[:, 78280:89527]
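As an aside, the positional slices above depend on the column order staying fixed. Since each column name already carries its source as a suffix, the same split could be done by suffix matching instead. A minimal sketch on a stand-in frame (the real one has 89,527 columns):

```python
import pandas as pd

# Small stand-in for text_features_train, one column per source suffix.
df = pd.DataFrame(
    [[0.1, 0.0, 0.2, 0.0]],
    columns=["work_desc", "skill_req", "manag_title", "vacat_benefits"],
)

# Select each sub-frame by its column-name suffix rather than by position,
# so the split stays correct even if the column order changes.
sub_frames = {
    suffix: df.loc[:, df.columns.str.endswith(suffix)]
    for suffix in ("_desc", "_req", "_title", "_benefits")
}
print({k: list(v.columns) for k, v in sub_frames.items()})
```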

Note

We are separating the dataframe like this to avoid a MemoryError.

# Divide the column sums by the number of rows (14,304) to get the column means.
mean_desc = text_features_desc.sum() / 14304
mean_desc.sort_values(ascending = False)
work_desc       3.387162e-02
develop_desc    3.251693e-02
team_desc       3.203152e-02
manag_desc      3.167333e-02
custom_desc     3.149140e-02
                    ...     
peugeot_desc    9.292221e-07
bencki_desc     9.292221e-07
sanofi_desc     9.292221e-07
qmetric_desc    5.049230e-07
gra_desc        5.049230e-07
Length: 40607, dtype: float64

The column means for the features from the description dataset show that the assumptions we made earlier are somewhat reasonable: the higher the mean, the more normal the word looks, such as “work” and “develop”.

Since the highest mean is about 0.0339, let’s select all features with a mean higher than 0.002.

Note

I tried many different thresholds and found that 0.002 works best here.
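One cheap way to compare candidate thresholds during that trial-and-error is to count how many features each cutoff would keep. A sketch on a toy mean series (the real mean_desc has 40,607 entries):

```python
import pandas as pd

# Toy stand-in for mean_desc; the values are illustrative only.
means = pd.Series([0.034, 0.01, 0.003, 0.0019, 0.0001])

# Count how many features survive each candidate threshold.
for t in (0.001, 0.002, 0.004):
    print(t, (means > t).sum())  # → 0.001 4 / 0.002 3 / 0.004 2
```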

select_desc = mean_desc > 0.002
selected_features_desc = text_features_desc.loc[:, select_desc]
selected_features_desc.head()
abil_desc abl_desc accept_desc access_desc accord_desc account_desc accur_desc achiev_desc acquisit_desc across_desc ... without_desc word_desc work_desc world_desc would_desc write_desc written_desc year_desc york_desc young_desc
0 0.043316 0.04441 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.060186 0.0 0.021943 0.000000 0.0 0.0 0.0 0.038188 0.0 0.0
1 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0
2 0.050161 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.025411 0.000000 0.0 0.0 0.0 0.044223 0.0 0.0
3 0.000000 0.00000 0.0 0.0 0.0 0.052053 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0
4 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.036083 ... 0.000000 0.0 0.000000 0.069462 0.0 0.0 0.0 0.092432 0.0 0.0

5 rows × 812 columns

This looks much better. We will repeat the process for the other dataframes as well.

mean_req = text_features_req.sum() / 14304
mean_req.sort_values(ascending = False)
experi_req       0.052954
work_req         0.034828
skill_req        0.033378
requir_req       0.031967
year_req         0.027195
                   ...   
cano_req         0.000002
mcnz_req         0.000002
orthopaed_req    0.000002
inhabit_req      0.000002
zeta_req         0.000001
Length: 34256, dtype: float64

Since the TF-IDF values are a bit higher for text_features_req, we will raise the threshold slightly to account for the difference.

select_req = mean_req > 0.003
selected_features_req = text_features_req.loc[:, select_req]
selected_features_req.head()
abil_req abl_req account_req across_req activ_req adapt_req addit_req administr_req adob_req advanc_req ... willing_req window_req within_req without_req word_req work_req would_req write_req written_req year_req
0 0.081650 0.105596 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.064795 0.0 0.0 0.000000 0.068928
1 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.061549 0.151344 0.0 0.063024 0.0 0.0 0.000000 0.033522
2 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000
3 0.047567 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.087269 0.000000 0.000000 0.0 0.075495 0.0 0.0 0.054646 0.040155
4 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.072623 0.0 0.0 0.000000 0.038628

5 rows × 350 columns

mean_title = text_features_title.sum() / 14304
mean_title.sort_values(ascending = False)
manag_title           0.048922
develop_title         0.046175
engin_title           0.038562
sale_title            0.029780
servic_title          0.024249
                        ...   
maharashtra_title     0.000024
barri_title           0.000024
peterborough_title    0.000024
haliburton_title      0.000024
elgin_title           0.000021
Length: 3417, dtype: float64
select_title = mean_title > 0.004
selected_features_title = text_features_title.loc[:, select_title]
selected_features_title.head()
abroad_title account_title admin_title administr_title agent_title analyst_title android_title applic_title apprenticeship_title architect_title ... system_title teacher_title team_title technic_title technician_title time_title ui_title ux_title web_title year_title
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.440388 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 86 columns

mean_benefits = text_features_benefits.sum() / 14304
mean_benefits.sort_values(ascending = False)
job_benefits         0.026567
descript_benefits    0.026070
see_benefits         0.025900
benefit_benefits     0.025386
work_benefits        0.024180
                       ...   
ebe_benefits         0.000002
efd_benefits         0.000002
efff_benefits        0.000002
cdc_benefits         0.000002
charit_benefits      0.000001
Length: 11247, dtype: float64
select_benefits = mean_benefits > 0.003
selected_features_benefits = text_features_benefits.loc[:, select_benefits]
selected_features_benefits.head()
advanc_benefits also_benefits appli_benefits applic_benefits avail_benefits base_benefits benefit_benefits best_benefits bonu_benefits bonus_benefits ... us_benefits vacat_benefits vision_benefits want_benefits week_benefits well_benefits within_benefits work_benefits world_benefits year_benefits
0 0.0 0.0 0.0 0.0 0.0 0.0 0.074620 0.000000 0.0 0.142779 ... 0.0 0.204903 0.000000 0.0 0.000000 0.348124 0.0 0.0 0.0 0.112979
1 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000
2 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000
3 0.0 0.0 0.0 0.0 0.0 0.0 0.109283 0.000000 0.0 0.000000 ... 0.0 0.150045 0.150367 0.0 0.183764 0.000000 0.0 0.0 0.0 0.165463
4 0.0 0.0 0.0 0.0 0.0 0.0 0.042046 0.131633 0.0 0.000000 ... 0.0 0.057729 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.063661

5 rows × 143 columns

We will combine all four dataframes to get a smaller version of text_features_train.

text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)
text_feature
abil_desc abl_desc accept_desc access_desc accord_desc account_desc accur_desc achiev_desc acquisit_desc across_desc ... us_benefits vacat_benefits vision_benefits want_benefits week_benefits well_benefits within_benefits work_benefits world_benefits year_benefits
0 0.043316 0.04441 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.204903 0.000000 0.0 0.000000 0.348124 0.0 0.000000 0.0 0.112979
1 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
2 0.050161 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
3 0.000000 0.00000 0.0 0.0 0.0 0.052053 0.0 0.0 0.0 0.000000 ... 0.00000 0.150045 0.150367 0.0 0.183764 0.000000 0.0 0.000000 0.0 0.165463
4 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.036083 ... 0.00000 0.057729 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.063661
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.230658 0.0 0.000000
14300 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.09475 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14301 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14302 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14303 0.000000 0.00000 0.0 0.0 0.0 0.077250 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000

14304 rows × 1391 columns

Supervised Feature Selection Using Chi-Square Statistics

In this part, we will perform supervised feature selection using Chi-Square statistics, so that we can eliminate the features that are most likely to be independent of the fraudulent column and therefore irrelevant for classification.

# Importing processed train data from previous step to get a fraudulent column.
processed_train = joblib.load('./data/processed_train_jlib')
target = processed_train["fraudulent"]
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

supervised_text_features = SelectKBest(chi2, k = 100).fit(text_feature, target)
df_text_features = text_feature.iloc[: , supervised_text_features.get_support()]
df_text_features
administr_desc answer_desc asia_desc assist_desc bill_desc call_desc cash_desc desir_desc duti_desc earn_desc ... life_benefits need_benefits per_benefits posit_benefits prospect_benefits see_benefits share_benefits skill_benefits start_benefits train_benefits
0 0.000000 0.0 0.0 0.000000 0.0 0.092456 0.000000 0.0 0.000000 0.000000 ... 0.103912 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.045662 0.0 0.0 0.034465 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.047975 0.0 0.000000 0.085044 0.0 0.051481 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.053905 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14300 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14301 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14302 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.120147 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14303 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

14304 rows × 100 columns

We will save these selected text features to a CSV file.

df_text_features.to_csv("./data/selected_text_features.csv")
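One caveat when this file is read back in a later step: `to_csv` writes the row index as the first column, so reloading likely needs `index_col=0` to avoid an extra "Unnamed: 0" column. A round-trip sketch on a toy frame (using an in-memory buffer instead of the real file path):

```python
import io
import pandas as pd

# Toy stand-in for df_text_features; the real frame has 100 columns.
df = pd.DataFrame({"earn_desc": [0.0, 0.05], "cash_desc": [0.08, 0.0]})

buf = io.StringIO()
df.to_csv(buf)          # the index is written as the first CSV column
buf.seek(0)

reloaded = pd.read_csv(buf, index_col=0)  # restore the index on read
print(reloaded.equals(df))  # → True
```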