Text Feature Selection¶
As we discussed earlier, we must perform feature selection on the text features first, because text_features_train causes a MemoryError due to its massive size (10GB). Since we cannot use the fraudulent column at this stage, we will use column means to select features based on several assumptions. After we get a smaller version of text_features_train, we will combine it with the fraudulent column and perform supervised feature selection using Chi-Square statistics.
Feature Selection Using Column Mean¶
import pandas as pd
import joblib

# Load the TF-IDF text features saved in the previous step
text_features_train = joblib.load('./data/text_features_train_jlib')
text_features_train.head(5)
 | aa_desc | aaa_desc | aaab_desc | aab_desc | aabc_desc | aabd_desc | aabf_desc | aac_desc | aaccd_desc | aachen_desc | ... | zodat_benefits | zollman_benefits | zombi_benefits | zone_benefits | zoo_benefits | zowel_benefits | zu_benefits | zult_benefits | zutrifft_benefits | zweig_benefits
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.165596 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 89527 columns
Let’s carefully observe the dataframe above. It is easy to see that some of these features have unusual names, such as zutrifft and aabd, and it is hard to believe they stem from standard English words. There are two main reasons why these unusual names appear as features:
1. Although we removed URLs and HTML markup in the data pre-processing step, it is still possible that some formatting was not perfectly removed. The data can also include other non-English tokens, such as email addresses or file names.
2. In the original dataset, the text was saved with no space between lines, so “I love dog” (line 1) and “Cat ate fish” (line 2) become “I love dogCat ate fish”, creating the abnormal token “dogCat”, as the snippet below demonstrates.
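A toy illustration of the second point (the strings are made up, mirroring the example above):
lines = ["I love dog", "Cat ate fish"]
print("".join(lines))   # 'I love dogCat ate fish' -> spurious token 'dogCat'
print(" ".join(lines))  # joining with a space would have kept the words apart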
One simple way to remove these unusual words at the lowest computational cost is to use the column mean and filter out features with exceptionally low means. This rests on the assumption that unusual words appear less frequently than normal words. For instance, a token like “havecommunication” will rarely appear across the dataset, and words that appear infrequently have low column means. The sketch below illustrates the idea on a toy dataframe.
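A minimal sketch, using a made-up two-column dataframe and a made-up threshold:
import pandas as pd

toy = pd.DataFrame({
    'work':   [0.3, 0.0, 0.5, 0.2],   # common word: healthy column mean
    'dogcat': [0.0, 0.0, 0.1, 0.0],   # merged-word artifact: tiny column mean
})
keep = toy.mean() > 0.05              # boolean mask over the columns
print(toy.loc[:, keep].columns.tolist())   # ['work']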
Warning
This method also rests on the risky assumption that low-frequency words are less important than frequent ones. Feature selection using column means can therefore eliminate words that matter for machine learning. However, since more than 10,000 of our features are unusual words, I am trading some accuracy for efficiency. I am aware that feature selection has to be done very carefully, and this is not the best possible way.
However, we must not perform this column-mean feature selection on the entire dataset at once, since text_features_train is a combination of four different source columns: description, requirements, title, and benefits. Because TF-IDF values can vary with the characteristics of each source, we must compute the column means and select features separately for each subset, and combine the results later to get the best outcome.
# Count how many columns come from each source; the counts give the slice
# boundaries used below (40607 + 34256 + 3417 + 11247 = 89527).
sum(text_features_train.columns.str.contains('_desc'))      # 40607
text_features_desc = text_features_train.iloc[:, 0:40607]
sum(text_features_train.columns.str.contains('_req'))       # 34256
text_features_req = text_features_train.iloc[:, 40607:74863]
sum(text_features_train.columns.str.contains('_title'))     # 3417
text_features_title = text_features_train.iloc[:, 74863:78280]
sum(text_features_train.columns.str.contains('_benefits'))  # 11247
text_features_benefits = text_features_train.iloc[:, 78280:89527]
Note
We are separating the dataframe into iloc slices like this to avoid a MemoryError.
# 14304 is the number of rows, so this is the column mean (equivalent to .mean())
mean_desc = text_features_desc.sum() / 14304
mean_desc.sort_values(ascending = False)
work_desc 3.387162e-02
develop_desc 3.251693e-02
team_desc 3.203152e-02
manag_desc 3.167333e-02
custom_desc 3.149140e-02
...
peugeot_desc 9.292221e-07
bencki_desc 9.292221e-07
sanofi_desc 9.292221e-07
qmetric_desc 5.049230e-07
gra_desc 5.049230e-07
Length: 40607, dtype: float64
The column means for the description features show that the assumptions we made previously are reasonably sound: the higher the mean, the more normal the word looks, such as “work” and “develop”.
Since the highest mean is 0.03387162, let’s choose all features with a mean higher than 0.002.
Note
I tried many different thresholds and found that 0.002 works best. One quick way to compare candidates is shown below.
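To compare candidate thresholds, count how many features each would retain (the candidate values here are illustrative):
for threshold in [0.0005, 0.001, 0.002, 0.005]:
    print(threshold, (mean_desc > threshold).sum())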
select_desc = mean_desc > 0.002                       # boolean mask over columns
selected_features_desc = text_features_desc.loc[:, select_desc]
selected_features_desc.head()
 | abil_desc | abl_desc | accept_desc | access_desc | accord_desc | account_desc | accur_desc | achiev_desc | acquisit_desc | across_desc | ... | without_desc | word_desc | work_desc | world_desc | would_desc | write_desc | written_desc | year_desc | york_desc | young_desc
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.043316 | 0.04441 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.060186 | 0.0 | 0.021943 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.038188 | 0.0 | 0.0 |
1 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
2 | 0.050161 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.000000 | 0.0 | 0.025411 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.044223 | 0.0 | 0.0 |
3 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.052053 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
4 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.036083 | ... | 0.000000 | 0.0 | 0.000000 | 0.069462 | 0.0 | 0.0 | 0.0 | 0.092432 | 0.0 | 0.0 |
5 rows × 812 columns
This looks much better. We will repeat the process for the other dataframes as well.
mean_req = text_features_req.sum() / 14304
mean_req.sort_values(ascending = False)
experi_req 0.052954
work_req 0.034828
skill_req 0.033378
requir_req 0.031967
year_req 0.027195
...
cano_req 0.000002
mcnz_req 0.000002
orthopaed_req 0.000002
inhabit_req 0.000002
zeta_req 0.000001
Length: 34256, dtype: float64
Since the TF-IDF values are a bit higher for text_features_req, we will raise the threshold slightly to account for the difference.
select_req = mean_req > 0.003
selected_features_req = text_features_req.loc[:, select_req]
selected_features_req.head()
 | abil_req | abl_req | account_req | across_req | activ_req | adapt_req | addit_req | administr_req | adob_req | advanc_req | ... | willing_req | window_req | within_req | without_req | word_req | work_req | would_req | write_req | written_req | year_req
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.081650 | 0.105596 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.064795 | 0.0 | 0.0 | 0.000000 | 0.068928 |
1 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.061549 | 0.151344 | 0.0 | 0.063024 | 0.0 | 0.0 | 0.000000 | 0.033522 |
2 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
3 | 0.047567 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.087269 | 0.000000 | 0.000000 | 0.0 | 0.075495 | 0.0 | 0.0 | 0.054646 | 0.040155 |
4 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.072623 | 0.0 | 0.0 | 0.000000 | 0.038628 |
5 rows × 350 columns
mean_title = text_features_title.sum() / 14304
mean_title.sort_values(ascending = False)
manag_title 0.048922
develop_title 0.046175
engin_title 0.038562
sale_title 0.029780
servic_title 0.024249
...
maharashtra_title 0.000024
barri_title 0.000024
peterborough_title 0.000024
haliburton_title 0.000024
elgin_title 0.000021
Length: 3417, dtype: float64
select_title = mean_title > 0.004
selected_features_title = text_features_title.loc[:, select_title]
selected_features_title.head()
 | abroad_title | account_title | admin_title | administr_title | agent_title | analyst_title | android_title | applic_title | apprenticeship_title | architect_title | ... | system_title | teacher_title | team_title | technic_title | technician_title | time_title | ui_title | ux_title | web_title | year_title
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.440388 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 86 columns
mean_benefits = text_features_benefits.sum() / 14304
mean_benefits.sort_values(ascending = False)
job_benefits 0.026567
descript_benefits 0.026070
see_benefits 0.025900
benefit_benefits 0.025386
work_benefits 0.024180
...
ebe_benefits 0.000002
efd_benefits 0.000002
efff_benefits 0.000002
cdc_benefits 0.000002
charit_benefits 0.000001
Length: 11247, dtype: float64
select_benefits = mean_benefits > 0.003
selected_features_benefits = text_features_benefits.loc[:, select_benefits]
selected_features_benefits.head()
 | advanc_benefits | also_benefits | appli_benefits | applic_benefits | avail_benefits | base_benefits | benefit_benefits | best_benefits | bonu_benefits | bonus_benefits | ... | us_benefits | vacat_benefits | vision_benefits | want_benefits | week_benefits | well_benefits | within_benefits | work_benefits | world_benefits | year_benefits
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.074620 | 0.000000 | 0.0 | 0.142779 | ... | 0.0 | 0.204903 | 0.000000 | 0.0 | 0.000000 | 0.348124 | 0.0 | 0.0 | 0.0 | 0.112979 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.109283 | 0.000000 | 0.0 | 0.000000 | ... | 0.0 | 0.150045 | 0.150367 | 0.0 | 0.183764 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.165463 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.042046 | 0.131633 | 0.0 | 0.000000 | ... | 0.0 | 0.057729 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.063661 |
5 rows × 143 columns
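For reference, the four repeated selections above could be condensed into one loop. This is only a sketch under the assumption that each suffix uniquely identifies its source (which holds here); the thresholds are the same ones used above:
thresholds = {'_desc': 0.002, '_req': 0.003, '_title': 0.004, '_benefits': 0.003}
selected = []
for suffix, threshold in thresholds.items():
    subset = text_features_train.loc[:, text_features_train.columns.str.contains(suffix)]
    selected.append(subset.loc[:, subset.sum() / 14304 > threshold])
# pd.concat(selected, axis=1) then reproduces the combined frame built below.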
We will combine all four dataframes to get a smaller version of text_features_train.
text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)
text_feature
 | abil_desc | abl_desc | accept_desc | access_desc | accord_desc | account_desc | accur_desc | achiev_desc | acquisit_desc | across_desc | ... | us_benefits | vacat_benefits | vision_benefits | want_benefits | week_benefits | well_benefits | within_benefits | work_benefits | world_benefits | year_benefits
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.043316 | 0.04441 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.204903 | 0.000000 | 0.0 | 0.000000 | 0.348124 | 0.0 | 0.000000 | 0.0 | 0.112979 |
1 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
2 | 0.050161 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
3 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.052053 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.150045 | 0.150367 | 0.0 | 0.183764 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.165463 |
4 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.036083 | ... | 0.00000 | 0.057729 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.063661 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14299 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.230658 | 0.0 | 0.000000 |
14300 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.09475 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
14301 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
14302 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
14303 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.077250 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 |
14304 rows × 1391 columns
Supervised Feature Selection Using Chi-Square Statistics¶
In this part, we will perform supervised feature selection using Chi-Square statistics so that we can eliminate the features that are most likely to be independent of the fraudulent column and therefore irrelevant for classification.
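As a refresher (standard chi-square background, not specific to this dataset), for each feature the test compares the observed class-conditional feature totals $O_i$ against the totals $E_i$ expected if the feature were independent of the target:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

A large statistic means the feature's distribution differs between fraudulent and legitimate postings, so SelectKBest keeps the k features with the largest scores.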
# Importing processed train data from previous step to get a fraudulent column.
processed_train = joblib.load('./data/processed_train_jlib')
target = processed_train["fraudulent"]
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Keep the 100 features with the highest chi-square scores against the target
supervised_text_features = SelectKBest(chi2, k = 100).fit(text_feature, target)
# get_support() returns a boolean mask marking the selected columns
df_text_features = text_feature.iloc[: , supervised_text_features.get_support()]
df_text_features
 | administr_desc | answer_desc | asia_desc | assist_desc | bill_desc | call_desc | cash_desc | desir_desc | duti_desc | earn_desc | ... | life_benefits | need_benefits | per_benefits | posit_benefits | prospect_benefits | see_benefits | share_benefits | skill_benefits | start_benefits | train_benefits
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.092456 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.103912 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.045662 | 0.0 | 0.0 | 0.034465 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.000000 | 0.0 | 0.0 | 0.047975 | 0.0 | 0.000000 | 0.085044 | 0.0 | 0.051481 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.053905 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14299 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14300 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14301 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14302 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.120147 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14303 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
14304 rows × 100 columns
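If you want to sanity-check the selection, the fitted selector exposes per-feature chi-square scores through its scores_ attribute; sorting them shows which features drove the choice:
scores = pd.Series(supervised_text_features.scores_, index = text_feature.columns)
scores.sort_values(ascending = False).head(10)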
We will save the selected text features to a CSV file.
df_text_features.to_csv("./data/selected_text_features.csv")
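If the CSV round-trip ever proves lossy (dtypes, index), dumping with joblib would mirror the loading convention used earlier; the file name here is a placeholder:
joblib.dump(df_text_features, './data/df_text_features_jlib')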