Train and Test Split

Because fraudulent postings are rare compared with non-fraudulent ones, we want the train and test sets to contain the same percentage of fraudulent postings as the original dataset. About 5% of the postings in the original dataset are fraudulent, so we keep that percentage in both sets to make sure each one stays representative. We will use an 80%/20% split. Preserving the class proportions means the model is trained and evaluated on the same class distribution, which helps avoid a misleading picture of problems such as overfitting or underfitting.

To do this split correctly, we will perform a stratified train-test split with the train_test_split() function from the scikit-learn Python machine-learning library, passing the labels to its stratify parameter.

Here is the code I used to create the train and test sets.

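import pandas as pd
from sklearn.model_selection import train_test_split
# Load the job postings dataset.
data = pd.read_csv("./data/fake_job_postings.csv")
# The last column is used as the label; all other columns are features.
y = data.iloc[:, -1]
X = data.iloc[:, :-1]
# 80%/20% split, stratified on y so both sets keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)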

We will use random_state = 42 so that we get the same train and test sets throughout the book.
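As a quick sanity check, the effect of stratify and random_state can be sketched on synthetic data. The arrays X_demo and y_demo and the helper split() below are hypothetical stand-ins, not the job postings dataset; the labels are built with 5% positives to mirror the proportion described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 5% labeled fraudulent (1).
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

def split(seed):
    """Stratified 80%/20% split of the synthetic data with a fixed seed."""
    return train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )

X_tr, X_te, y_tr, y_te = split(42)

# Share of fraudulent rows in each set; both should sit close to 5%.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train: {train_rate:.1%}, test: {test_rate:.1%}")

# Repeating the split with the same seed should select the same rows.
X_tr2, X_te2, _, _ = split(42)
same_split = np.array_equal(X_te, X_te2)
print("same test rows with the same seed:", same_split)
```

The same check can be run on the real y_train and y_test by comparing their value_counts(normalize=True).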

Note

In the original project, the raw training set had 5362 rows, 259 of them fraudulent, and the test set had 5000 rows, 50 of them fraudulent. That is about 5% fraudulent postings in the training set but only 1% in the test set, indicating that the train and test sets for the original project were not created through a stratified train-test split.
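The percentages in this note follow directly from the row counts. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Row counts quoted in the note for the original project.
train_rows, train_fraud = 5362, 259
test_rows, test_fraud = 5000, 50

# Fraction of fraudulent postings in each set.
train_rate = train_fraud / train_rows
test_rate = test_fraud / test_rows

print(f"train: {train_rate:.1%}")  # train: 4.8%
print(f"test:  {test_rate:.1%}")   # test:  1.0%
```

A stratified split would have produced roughly the same rate in both sets, so the gap between the two is what points to a non-stratified split.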