Feature Selection

In this chapter, we carefully examine our pre-processed training dataset and select the best features for the machine learning algorithms. I have already processed the training dataset and saved it with joblib. See here if you want to know the whole process. Because the dataframe was massive, I had to split it into three separate dataframes to save it.

import pandas as pd 
import pickle
import joblib

text_features_train = joblib.load('./data/text_features_train_jlib')
OHE_features_train = joblib.load('./data/OHE_features_train_jlib')
processed_train = joblib.load('./data/processed_train_jlib')

However, we face a big problem here. If we try to combine these three dataframes into a single dataframe with

train_features = pd.concat([text_features_train, OHE_features_train, processed_train], axis = 1)

we get a MemoryError, because text_features_train is a massive dataframe with 89527 columns (about 10 GB).

text_features_train

	aa_desc	aaa_desc	aaab_desc	aab_desc	aabc_desc	aabd_desc	aabf_desc	aac_desc	aaccd_desc	aachen_desc	...	zodat_benefits	zollman_benefits	zombi_benefits	zone_benefits	zoo_benefits	zowel_benefits	zu_benefits	zult_benefits	zutrifft_benefits	zweig_benefits
0	0.165596	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14299	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
14300	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
14301	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
14302	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
14303	0.000000	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

14304 rows × 89527 columns
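The size quoted above can be checked with a little arithmetic. The sketch below assumes every value is stored as a dense float64 (8 bytes), which is the usual default for TF-IDF-style output but is not confirmed by the saved file. Index and column-label overhead is ignored.

```python
# Rough dense size of a frame with the shape shown above.
# Assumption: every cell is a float64, i.e. 8 bytes per value.
n_rows, n_cols = 14304, 89527
bytes_per_value = 8

total_bytes = n_rows * n_cols * bytes_per_value
total_gb = total_bytes / 1e9      # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.2f} GB ({total_gib:.2f} GiB)")
```

Concatenating along axis = 1 typically has to allocate a new block for the combined result, so peak memory during pd.concat can be well above the size of text_features_train alone. The frame is simply too large to combine with the other two as it stands.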

Because the three dataframes cannot be combined in memory, we cannot perform any supervised feature selection on the full feature set until we have selected features from text_features_train. We must reduce its dimension significantly.

In this section, we discuss how to reduce the dimension of text_features_train significantly, and which features we should select to get the most efficient and precise machine learning results.
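As one illustration of what an unsupervised reduction step could look like, the sketch below drops text columns that are non-zero in only a handful of rows. It is not necessarily the method used later in this chapter. The helper name drop_rare_columns, the min_doc_count threshold, and the toy data are all hypothetical and chosen only for the example.

```python
import pandas as pd


def drop_rare_columns(df: pd.DataFrame, min_doc_count: int) -> pd.DataFrame:
    """Keep only the columns that are non-zero in at least `min_doc_count` rows.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # Count, per column, how many rows have a non-zero weight.
    doc_counts = (df != 0).sum(axis=0)
    return df.loc[:, doc_counts >= min_doc_count]


# Toy stand-in for text_features_train (made-up values, same naming pattern).
toy = pd.DataFrame(
    {
        "aa_desc": [0.17, 0.0, 0.0, 0.0],       # non-zero in 1 row
        "team_desc": [0.30, 0.20, 0.0, 0.50],   # non-zero in 3 rows
        "zoo_benefits": [0.0, 0.0, 0.0, 0.0],   # never non-zero
    }
)

reduced = drop_rare_columns(toy, min_doc_count=2)
print(list(reduced.columns))
```

A filter like this needs no labels, so it can run on text_features_train by itself before the three dataframes are combined. On the real frame, the boolean mask is a temporary object of the same shape, so applying the filter to a subset of columns at a time may be necessary.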
