Feature Creation

This chapter focuses on creating features from cleaned text data by counting word frequencies, along with some limitations of that method. We will also discuss how to handle non-text columns in the data, especially columns with categorical variables. Finally, we will take a closer look at our training dataset to see whether we can create more helpful features beyond the given columns.

First, we will load the data we cleaned on the previous page and split it into train and test datasets. We split the dataset up front so that the test dataset does not influence how we build the pipeline. Technically, we are not supposed to know anything about our test set until the pipeline is complete. Once the pipeline is built on the train set, we will apply that same pipeline to the test set so that both datasets end up with the same feature columns. Please refer to here if you want to know how I split the dataset into train and test sets.
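Before loading the real file, the point about applying the same pipeline to both sets can be sketched on toy data. The snippet below is only an illustration under assumptions: the sentences and variable names are made up, and `CountVectorizer` stands in for whatever word-counting step the pipeline ends up using. The vocabulary is learned from the train text only, and the test text is transformed with that same vocabulary, so both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy text, not taken from the job-postings data
toy_train_text = ["remote data entry job", "senior data engineer"]
toy_test_text = ["data entry clerk wanted"]

# Learn the vocabulary from the train text only
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(toy_train_text)

# Reuse the fitted vocabulary on the test text; unseen words are ignored
test_counts = vectorizer.transform(toy_test_text)

print(train_counts.shape)  # (number of train documents, vocabulary size)
print(test_counts.shape)   # (number of test documents, same vocabulary size)
```

Words that appear only in the test text ("clerk" and "wanted" here) get no column, which is exactly why the test set cannot leak into the features.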

import pandas as pd 
from sklearn.model_selection import train_test_split
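# Load the cleaned data from the previous page
data = pd.read_csv("./data/cleaned_fake_job_postings.csv")
# The last column (fraudulent) is the target; all other columns are features
y = data.iloc[:, -1]
X = data.iloc[:, :-1]
# 80/20 split, stratified on y so both sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)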
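
To see what `stratify = y` does, here is a minimal sketch on made-up data. The toy dataframe, its 10% positive rate, and the `toy_` names are assumptions for illustration only and say nothing about the real class balance in the job-postings data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of them labeled 1
toy = pd.DataFrame({
    "text": [f"posting {i}" for i in range(100)],
    "fraudulent": [1] * 10 + [0] * 90,
})
toy_X = toy.iloc[:, :-1]
toy_y = toy.iloc[:, -1]

# Same call as above: 80/20 split, stratified on the target
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positive labels in each split
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(train_rate, test_rate)
```

With stratification, the positive rate in the train and test splits matches the rate in the full toy data. Without it, a rare class could end up over- or under-represented in the test set by chance.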

Before moving on to the next step, we will combine the features and the target back into a single dataframe for each of the train and test sets, and save both as CSV files in the data folder. The following code does this.

# Resetting Index before we create new csv file for train and test
X_train.reset_index(inplace = True, drop = True)
y_train.reset_index(inplace = True, drop = True)
X_test.reset_index(inplace = True, drop = True)
y_test.reset_index(inplace = True, drop = True)

# Combining X and y (which is just fraudulent column) 
X_train["fraudulent"] = y_train
X_test["fraudulent"] = y_test

# Save the combined dataframe to csv
X_train.to_csv("./data/train_set.csv")
X_test.to_csv("./data/test_set.csv")
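
One detail to keep in mind when reading these files back: by default, `to_csv` also writes the row index as an unnamed first column. The sketch below shows the effect on a hypothetical two-row dataframe, using an in-memory buffer instead of the data folder; the `toy_train` name and its contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the combined train dataframe
toy_train = pd.DataFrame({"title": ["a", "b"], "fraudulent": [0, 1]})

# Write to an in-memory buffer with the default settings (index is included)
buffer = io.StringIO()
toy_train.to_csv(buffer)

# A plain read turns the saved index into an extra column
buffer.seek(0)
with_index_col = pd.read_csv(buffer)
print(list(with_index_col.columns))

# Passing index_col=0 restores the original columns
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```

Either reading with `index_col=0` or saving with `index=False` avoids carrying the extra column into later steps.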