Feature Selection

In this chapter, we will carefully examine our pre-processed training dataset and select the best features for machine learning algorithms. I already processed the training dataset and saved it in joblib format. See here if you want to know the whole process. Since the dataframe was massive, I had to split it into three separate dataframes to save it.

import pandas as pd 
import pickle
import joblib
# Load the three pieces of the pre-processed training set saved earlier.
text_features_train = joblib.load('./data/text_features_train_jlib')
OHE_features_train = joblib.load('./data/OHE_features_train_jlib')
processed_train = joblib.load('./data/processed_train_jlib')

However, here we face a big problem. If we try to combine these three dataframes into a single dataframe with

train_features = pd.concat([text_features_train, OHE_features_train, processed_train], axis=1)

then we get a MemoryError, because text_features_train is a massive dataframe with 89527 columns (about 10 GB).
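The 10 GB figure can be sanity-checked with simple arithmetic from the dataframe's shape (14304 rows × 89527 columns). The sketch below assumes every column is stored as float64 (8 bytes per value), which is consistent with the values in this chapter but is not confirmed by them. dense_size_gb is a hypothetical helper written for illustration, not a pandas function.

```python
# Shape of text_features_train as reported in this chapter.
n_rows, n_cols = 14304, 89527

# Assumption: every column is float64, i.e. 8 bytes per value.
BYTES_PER_FLOAT64 = 8


def dense_size_gb(rows, cols, itemsize=BYTES_PER_FLOAT64):
    """Approximate size of a dense rows x cols block in GB (10**9 bytes)."""
    return rows * cols * itemsize / 1e9


n_cells = n_rows * n_cols
text_gb = dense_size_gb(n_rows, n_cols)

print(f"cells: {n_cells:,}")
print(f"approx. dense size: {text_gb:.2f} GB")
```

Under that assumption the values alone take a little over 10 GB. Concatenating along the columns generally has to build a new frame alongside the originals, so the peak memory demand during pd.concat can be noticeably higher than the size of text_features_train itself, which would explain why the call fails.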

text_features_train
aa_desc aaa_desc aaab_desc aab_desc aabc_desc aabd_desc aabf_desc aac_desc aaccd_desc aachen_desc ... zodat_benefits zollman_benefits zombi_benefits zone_benefits zoo_benefits zowel_benefits zu_benefits zult_benefits zutrifft_benefits zweig_benefits
0 0.165596 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14299 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14300 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14301 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14302 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14303 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

14304 rows × 89527 columns

This means that we cannot perform any supervised feature selection on the combined data until we first select features from text_features_train. We must reduce its dimensionality significantly.

In this section, we will discuss how to reduce the dimensionality of text_features_train significantly, and which features we should select to get the most efficient and accurate machine learning results.