Text Feature Selection

As we discussed earlier, we must perform feature selection on the text features first because their massive file size (10 GB) causes a MemoryError. Since we cannot use the fraudulent column yet, we will use column means to select features based on several assumptions. After we obtain a smaller version of text_features_train, we will combine it with the fraudulent column and perform supervised feature selection using Chi-Square statistics.

Feature Selection Using Column Mean

import pandas as pd 
import joblib
text_features_train = joblib.load('./data/text_features_train_jlib')
text_features_train.head(5)
aa_desc aaa_desc aaab_desc aab_desc aabc_desc aabd_desc aabf_desc aac_desc aaccd_desc aachen_desc ... zodat_benefits zollman_benefits zombi_benefits zone_benefits zoo_benefits zowel_benefits zu_benefits zult_benefits zutrifft_benefits zweig_benefits
0 0.165596 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 89527 columns

Let’s carefully observe the dataframe above. Some of these features have unusual names, such as zutrifft and aabd, which are unlikely to be stems of standard English words. There are two main reasons why these unusual names appear as features.

  1. Although we removed URLs and HTML formatting in the data pre-processing step, it is possible that some formatting was not perfectly removed. The data can also include other non-English tokens, such as email addresses or file names.

  2. If we look over the original dataset, we can observe that the text was saved with no space between lines. For example, “I love dog” (line 1) and “Cat ate fish” (line 2) become “I love dogCat ate fish”, creating the abnormal word “dogCat”.

One simple way to remove these unusual words at the lowest computational cost is to use the column mean and filter out features with exceptionally low means. This is based on the assumption that unusual words appear less frequently than normal words. For instance, a word like “havecommunication” will rarely appear across the dataset, and words that appear infrequently will have a low column mean.
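The idea above can be sketched on a toy TF-IDF matrix. The column names and values here are made up for illustration; a junk token that appears in only one row gets a very low column mean and is dropped:

```python
import pandas as pd

# Toy TF-IDF matrix: two common words and one junk token ("dogcat")
# that appears in a single row, giving it a very low column mean.
df = pd.DataFrame({
    "work":   [0.3, 0.2, 0.0, 0.4],
    "team":   [0.0, 0.5, 0.3, 0.1],
    "dogcat": [0.0, 0.0, 0.02, 0.0],
})

keep = df.mean() > 0.05       # boolean mask over the columns
filtered = df.loc[:, keep]    # drop the low-mean columns
print(list(filtered.columns))  # → ['work', 'team']
```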

Warning

This method also rests on the risky assumption that low-frequency words are less important than frequent ones. Feature selection using column means can eliminate words that matter for machine learning. However, since more than 10,000 of our features are unusual words, I am trading some accuracy for efficiency. I am aware that feature selection must be done very carefully, and this is not the ideal approach.

However, we must not perform this feature selection by column means on the entire dataset at once, since text_features_train is a combination of four different source columns: description, title, requirements, and benefits. Because TF-IDF values can vary with the characteristics of each source, we should compute the column means and select features separately for each subset, then combine the results later.

# Count the columns from each source, then slice the dataframe by position.
sum(text_features_train.columns.str.contains('_desc'))      # 40607 columns
text_features_desc = text_features_train.iloc[:, 0:40607]
sum(text_features_train.columns.str.contains('_req'))       # 34256 columns
text_features_req = text_features_train.iloc[:, 40607:74863]
sum(text_features_train.columns.str.contains('_title'))     # 3417 columns
text_features_title = text_features_train.iloc[:, 74863:78280]
sum(text_features_train.columns.str.contains('_benefits'))  # 11247 columns
text_features_benefits = text_features_train.iloc[:, 78280:89527]
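As an aside, the positional slices above depend on the column order staying fixed. Since each column name already carries its source as a suffix, the same split could be done by suffix matching instead. A minimal sketch on a stand-in frame (the real one has 89,527 columns):

```python
import pandas as pd

# Small stand-in for text_features_train, one column per source suffix.
df = pd.DataFrame(
    [[0.1, 0.0, 0.2, 0.0]],
    columns=["work_desc", "skill_req", "manag_title", "vacat_benefits"],
)

# Select each sub-frame by its column-name suffix rather than by position,
# so the split stays correct even if the column order changes.
sub_frames = {
    suffix: df.loc[:, df.columns.str.endswith(suffix)]
    for suffix in ("_desc", "_req", "_title", "_benefits")
}
print({k: list(v.columns) for k, v in sub_frames.items()})
```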

Note

We are separating the dataframe like this to avoid a MemoryError.

# Divide the column sums by the number of rows (14,304) to get the column means.
mean_desc = text_features_desc.sum() / 14304
mean_desc.sort_values(ascending = False)
work_desc       3.387162e-02
develop_desc    3.251693e-02
team_desc       3.203152e-02
manag_desc      3.167333e-02
custom_desc     3.149140e-02
                    ...     
peugeot_desc    9.292221e-07
bencki_desc     9.292221e-07
sanofi_desc     9.292221e-07
qmetric_desc    5.049230e-07
gra_desc        5.049230e-07
Length: 40607, dtype: float64

The column means for the features from the description dataset show that the assumptions we made earlier are somewhat reasonable: the higher the mean, the more normal the word looks, such as “work” and “develop”.

Since the highest mean is about 0.0339, let’s select all features with a mean higher than 0.002.

Note

I tried many different thresholds and found that 0.002 works best here.
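One cheap way to compare candidate thresholds during that trial-and-error is to count how many features each cutoff would keep. A sketch on a toy mean series (the real mean_desc has 40,607 entries):

```python
import pandas as pd

# Toy stand-in for mean_desc; the values are illustrative only.
means = pd.Series([0.034, 0.01, 0.003, 0.0019, 0.0001])

# Count how many features survive each candidate threshold.
for t in (0.001, 0.002, 0.004):
    print(t, (means > t).sum())  # → 0.001 4 / 0.002 3 / 0.004 2
```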

select_desc = mean_desc > 0.002
selected_features_desc = text_features_desc.loc[:, select_desc]
selected_features_desc.head()
abil_desc abl_desc accept_desc access_desc accord_desc account_desc accur_desc achiev_desc acquisit_desc across_desc ... without_desc word_desc work_desc world_desc would_desc write_desc written_desc year_desc york_desc young_desc
0 0.043316 0.04441 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.060186 0.0 0.021943 0.000000 0.0 0.0 0.0 0.038188 0.0 0.0
1 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0
2 0.050161 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.025411 0.000000 0.0 0.0 0.0 0.044223 0.0 0.0
3 0.000000 0.00000 0.0 0.0 0.0 0.052053 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0
4 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.036083 ... 0.000000 0.0 0.000000 0.069462 0.0 0.0 0.0 0.092432 0.0 0.0

5 rows × 812 columns

This looks much better. We will repeat the process for the other dataframes as well.

mean_req = text_features_req.sum() / 14304
mean_req.sort_values(ascending = False)
experi_req       0.052954
work_req         0.034828
skill_req        0.033378
requir_req       0.031967
year_req         0.027195
                   ...   
cano_req         0.000002
mcnz_req         0.000002
orthopaed_req    0.000002
inhabit_req      0.000002
zeta_req         0.000001
Length: 34256, dtype: float64

Since the TF-IDF values are a bit higher for text_features_req, we will raise the threshold slightly to account for the difference.

select_req = mean_req > 0.003
selected_features_req = text_features_req.loc[:, select_req]
selected_features_req.head()
abil_req abl_req account_req across_req activ_req adapt_req addit_req administr_req adob_req advanc_req ... willing_req window_req within_req without_req word_req work_req would_req write_req written_req year_req
0 0.081650 0.105596 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.064795 0.0 0.0 0.000000 0.068928
1 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.061549 0.151344 0.0 0.063024 0.0 0.0 0.000000 0.033522
2 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000
3 0.047567 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.087269 0.000000 0.000000 0.0 0.075495 0.0 0.0 0.054646 0.040155
4 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.000000 0.000000 0.000000 0.0 0.072623 0.0 0.0 0.000000 0.038628

5 rows × 350 columns

mean_title = text_features_title.sum() / 14304
mean_title.sort_values(ascending = False)
manag_title           0.048922
develop_title         0.046175
engin_title           0.038562
sale_title            0.029780
servic_title          0.024249
                        ...   
maharashtra_title     0.000024
barri_title           0.000024
peterborough_title    0.000024
haliburton_title      0.000024
elgin_title           0.000021
Length: 3417, dtype: float64
select_title = mean_title > 0.004
selected_features_title = text_features_title.loc[:, select_title]
selected_features_title.head()
abroad_title account_title admin_title administr_title agent_title analyst_title android_title applic_title apprenticeship_title architect_title ... system_title teacher_title team_title technic_title technician_title time_title ui_title ux_title web_title year_title
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.440388 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 86 columns

mean_benefits = text_features_benefits.sum() / 14304
mean_benefits.sort_values(ascending = False)
job_benefits         0.026567
descript_benefits    0.026070
see_benefits         0.025900
benefit_benefits     0.025386
work_benefits        0.024180
                       ...   
ebe_benefits         0.000002
efd_benefits         0.000002
efff_benefits        0.000002
cdc_benefits         0.000002
charit_benefits      0.000001
Length: 11247, dtype: float64
select_benefits = mean_benefits > 0.003
selected_features_benefits = text_features_benefits.loc[:, select_benefits]
selected_features_benefits.head()
advanc_benefits also_benefits appli_benefits applic_benefits avail_benefits base_benefits benefit_benefits best_benefits bonu_benefits bonus_benefits ... us_benefits vacat_benefits vision_benefits want_benefits week_benefits well_benefits within_benefits work_benefits world_benefits year_benefits
0 0.0 0.0 0.0 0.0 0.0 0.0 0.074620 0.000000 0.0 0.142779 ... 0.0 0.204903 0.000000 0.0 0.000000 0.348124 0.0 0.0 0.0 0.112979
1 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000
2 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 ... 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000
3 0.0 0.0 0.0 0.0 0.0 0.0 0.109283 0.000000 0.0 0.000000 ... 0.0 0.150045 0.150367 0.0 0.183764 0.000000 0.0 0.0 0.0 0.165463
4 0.0 0.0 0.0 0.0 0.0 0.0 0.042046 0.131633 0.0 0.000000 ... 0.0 0.057729 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.063661

5 rows × 143 columns

We will combine all four dataframes to get a smaller version of text_features_train.

text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)
text_feature
abil_desc abl_desc accept_desc access_desc accord_desc account_desc accur_desc achiev_desc acquisit_desc across_desc ... us_benefits vacat_benefits vision_benefits want_benefits week_benefits well_benefits within_benefits work_benefits world_benefits year_benefits
0 0.043316 0.04441 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.204903 0.000000 0.0 0.000000 0.348124 0.0 0.000000 0.0 0.112979
1 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
2 0.050161 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
3 0.000000 0.00000 0.0 0.0 0.0 0.052053 0.0 0.0 0.0 0.000000 ... 0.00000 0.150045 0.150367 0.0 0.183764 0.000000 0.0 0.000000 0.0 0.165463
4 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.036083 ... 0.00000 0.057729 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.063661
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.230658 0.0 0.000000
14300 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.09475 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14301 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14302 0.000000 0.00000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000
14303 0.000000 0.00000 0.0 0.0 0.0 0.077250 0.0 0.0 0.0 0.000000 ... 0.00000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000

14304 rows × 1391 columns

Supervised Feature Selection Using Chi-Square Statistics

In this part, we will perform supervised feature selection using Chi-Square statistics, so that we can eliminate the features that are most likely to be independent of the fraudulent column and therefore irrelevant for classification.

# Importing processed train data from previous step to get a fraudulent column.
processed_train = joblib.load('./data/processed_train_jlib')
target = processed_train["fraudulent"]
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

supervised_text_features = SelectKBest(chi2, k = 100).fit(text_feature, target)
df_text_features = text_feature.iloc[: , supervised_text_features.get_support()]
df_text_features
administr_desc answer_desc asia_desc assist_desc bill_desc call_desc cash_desc desir_desc duti_desc earn_desc ... life_benefits need_benefits per_benefits posit_benefits prospect_benefits see_benefits share_benefits skill_benefits start_benefits train_benefits
0 0.000000 0.0 0.0 0.000000 0.0 0.092456 0.000000 0.0 0.000000 0.000000 ... 0.103912 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.045662 0.0 0.0 0.034465 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.047975 0.0 0.000000 0.085044 0.0 0.051481 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.053905 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14300 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14301 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14302 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.120147 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14303 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

14304 rows × 100 columns

We will save these selected text features to a CSV file.

df_text_features.to_csv("./data/selected_text_features.csv")
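One caveat when this file is read back in a later step: `to_csv` writes the row index as the first column, so reloading likely needs `index_col=0` to avoid an extra "Unnamed: 0" column. A round-trip sketch on a toy frame (using an in-memory buffer instead of the real file path):

```python
import io
import pandas as pd

# Toy stand-in for df_text_features; the real frame has 100 columns.
df = pd.DataFrame({"earn_desc": [0.0, 0.05], "cash_desc": [0.08, 0.0]})

buf = io.StringIO()
df.to_csv(buf)          # the index is written as the first CSV column
buf.seek(0)

reloaded = pd.read_csv(buf, index_col=0)  # restore the index on read
print(reloaded.equals(df))  # → True
```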