Data Description

import pandas as pd

In this project, we aim to classify job postings as fraudulent or non-fraudulent using several features in the dataset. Let’s import the dataset and look at the first five rows.

Note

This data description covers the entire dataset, before it is split into training and test sets. In the original project, the data description was written based only on the training dataset.

data = pd.read_csv("./data/fake_job_postings.csv")
data.head()
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
0 1 Marketing Intern US, NY, New York Marketing NaN We're Food52, and we've created a groundbreaki... Food52, a fast-growing, James Beard Award-winn... Experience with content management systems a m... NaN 0 1 0 Other Internship NaN NaN Marketing 0
1 2 Customer Service - Cloud Video Production NZ, , Auckland Success NaN 90 Seconds, the worlds Cloud Video Production ... Organised - Focused - Vibrant - Awesome!Do you... What we expect from you:Your key responsibilit... What you will get from usThrough being part of... 0 1 0 Full-time Not Applicable NaN Marketing and Advertising Customer Service 0
2 3 Commissioning Machinery Assistant (CMA) US, IA, Wever NaN NaN Valor Services provides Workforce Solutions th... Our client, located in Houston, is actively se... Implement pre-commissioning and commissioning ... NaN 0 1 0 NaN NaN NaN NaN NaN 0
3 4 Account Executive - Washington DC US, DC, Washington Sales NaN Our passion for improving quality of life thro... THE COMPANY: ESRI – Environmental Systems Rese... EDUCATION: Bachelor’s or Master’s in GIS, busi... Our culture is anything but corporate—we have ... 0 1 0 Full-time Mid-Senior level Bachelor's Degree Computer Software Sales 0
4 5 Bill Review Manager US, FL, Fort Worth NaN NaN SpotSource Solutions LLC is a Global Human Cap... JOB TITLE: Itemization Review ManagerLOCATION:... QUALIFICATIONS:RN license in the State of Texa... Full Benefits Offered 0 1 1 Full-time Mid-Senior level Bachelor's Degree Hospital & Health Care Health Care Provider 0

From the preview above, we can note several important characteristics of the dataset that we need to keep in mind while performing the analysis.

  1. The dataset consists of job posting information and a label (0 for non-fraudulent, 1 for fraudulent). The last column should therefore not be used as a feature, since it is the variable we want to predict. It will be helpful to find out how many fraudulent and non-fraudulent rows we have.

data[data["fraudulent"] == 1].shape[0]

866

data[data["fraudulent"] == 0].shape[0]

17014
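For reference, the same class balance can be read in a single call:

data["fraudulent"].value_counts()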

We have 866 fraudulent and 17,014 non-fraudulent job postings, so the dataset is heavily imbalanced: only about 5% of the postings are fraudulent. This information will be helpful when we split the data into training and test sets.
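As a minimal sketch of such a split, assuming scikit-learn is available (the test size and random seed here are illustrative, not the project's actual choices), stratifying on the label keeps the roughly 5% fraud rate in both sets:

from sklearn.model_selection import train_test_split

# Illustrative split: stratify on the label so the ~5% fraud rate
# is preserved in both the training set and the test set.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data["fraudulent"], random_state=42
)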

  2. We can observe several null values across the dataset. We need to handle these, for example through imputation, before we start the analysis.

  3. Some columns contain text data, so we will need some natural language processing to clean and transform those features. To avoid overfitting, we may also have to single out important words to use as “super features.”
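One possible sketch of that text processing, assuming scikit-learn is available (the column choice and parameters are illustrative, not the project's final pipeline): fill missing text with empty strings, then let a TF-IDF vectorizer surface candidate words.

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative only: vectorize one text column and list candidate words.
text = data["description"].fillna("")  # empty string for missing text
vectorizer = TfidfVectorizer(stop_words="english", max_features=100)
tfidf = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out()[:20])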

The summary below shows more details about the dataset, including each column’s name, non-null count, and data type.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  function             11425 non-null  object
 17  fraudulent           17880 non-null  int64 
dtypes: int64(5), object(13)
memory usage: 2.5+ MB

From the summary above, we can see that some columns have far more NA values than others. Knowing the percentage of missing values in each column can be a helpful tool for feature selection.

percent_missing = data.isnull().sum() * 100 / len(data)
missing_value_data = pd.DataFrame({'percent_missing': percent_missing})
missing_value_data.sort_values('percent_missing', ascending=False, inplace=True)
missing_value_data
percent_missing
salary_range 83.959732
department 64.580537
required_education 45.329978
benefits 40.324385
required_experience 39.429530
function 36.101790
industry 27.421700
employment_type 19.412752
company_profile 18.501119
requirements 15.072707
location 1.935123
description 0.005593
job_id 0.000000
telecommuting 0.000000
has_questions 0.000000
has_company_logo 0.000000
title 0.000000
fraudulent 0.000000

It looks like salary_range and department have a high percentage of NAs. The presence of a missing value can itself be informative for prediction, but unlike required_education, benefits, and required_experience, the missingness in salary_range and department does not look significant. Since these two columns are mostly empty, excluding them from our analysis will be the safer way to avoid overfitting.
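A minimal sketch of that exclusion (assuming we simply drop the two sparsest columns before any further feature work):

# Drop the two sparsest columns (~84% and ~65% missing).
data = data.drop(columns=["salary_range", "department"])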