Data Description

import pandas as pd

In this project, we aim to classify job postings as fraudulent or non-fraudulent using several features in the dataset. Let’s import the dataset and look at the first five rows.

Note

This data description covers the entire dataset, before it is split into training and test sets. In the original project, the data description was written based only on the training dataset.

data = pd.read_csv("./data/fake_job_postings.csv")
data.head()
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
0 1 Marketing Intern US, NY, New York Marketing NaN We're Food52, and we've created a groundbreaki... Food52, a fast-growing, James Beard Award-winn... Experience with content management systems a m... NaN 0 1 0 Other Internship NaN NaN Marketing 0
1 2 Customer Service - Cloud Video Production NZ, , Auckland Success NaN 90 Seconds, the worlds Cloud Video Production ... Organised - Focused - Vibrant - Awesome!Do you... What we expect from you:Your key responsibilit... What you will get from usThrough being part of... 0 1 0 Full-time Not Applicable NaN Marketing and Advertising Customer Service 0
2 3 Commissioning Machinery Assistant (CMA) US, IA, Wever NaN NaN Valor Services provides Workforce Solutions th... Our client, located in Houston, is actively se... Implement pre-commissioning and commissioning ... NaN 0 1 0 NaN NaN NaN NaN NaN 0
3 4 Account Executive - Washington DC US, DC, Washington Sales NaN Our passion for improving quality of life thro... THE COMPANY: ESRI – Environmental Systems Rese... EDUCATION: Bachelor’s or Master’s in GIS, busi... Our culture is anything but corporate—we have ... 0 1 0 Full-time Mid-Senior level Bachelor's Degree Computer Software Sales 0
4 5 Bill Review Manager US, FL, Fort Worth NaN NaN SpotSource Solutions LLC is a Global Human Cap... JOB TITLE: Itemization Review ManagerLOCATION:... QUALIFICATIONS:RN license in the State of Texa... Full Benefits Offered 0 1 1 Full-time Mid-Senior level Bachelor's Degree Hospital & Health Care Health Care Provider 0

From the preview above, we can note several important characteristics of the dataset that we need to keep in mind while performing the analysis.

  1. The dataset consists of job posting information and a label (0 for non-fraudulent, 1 for fraudulent). The last column should therefore not be used as a feature, since it is the variable we want to predict. It will be helpful to find out how many fraudulent and non-fraudulent rows we have.

data[data["fraudulent"] == 1].shape[0]

866

data[data["fraudulent"] == 0].shape[0]

17014
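For reference, the same class balance can be read in a single call:

data["fraudulent"].value_counts()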

We have 866 fraudulent and 17,014 non-fraudulent job postings, so the dataset is heavily imbalanced: only about 5% of the postings are fraudulent. This information will be helpful when we split the data into training and test sets.
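As a minimal sketch of such a split, assuming scikit-learn is available (the test size and random seed here are illustrative, not the project's actual choices), stratifying on the label keeps the roughly 5% fraud rate in both sets:

from sklearn.model_selection import train_test_split

# Illustrative split: stratify on the label so the ~5% fraud rate
# is preserved in both the training set and the test set.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data["fraudulent"], random_state=42
)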

  2. We can observe several null values across the dataset. We need to handle these, for example through imputation, before we start the analysis.

  3. Some columns contain text data, so we will need some natural language processing to clean and transform those features. To avoid overfitting, we may also have to single out important words to use as “super features.”
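One possible sketch of that text processing, assuming scikit-learn is available (the column choice and parameters are illustrative, not the project's final pipeline): fill missing text with empty strings, then let a TF-IDF vectorizer surface candidate words.

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative only: vectorize one text column and list candidate words.
text = data["description"].fillna("")  # empty string for missing text
vectorizer = TfidfVectorizer(stop_words="english", max_features=100)
tfidf = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out()[:20])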

The summary below shows more details about the dataset, including each column’s name, non-null count, and data type.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  function             11425 non-null  object
 17  fraudulent           17880 non-null  int64 
dtypes: int64(5), object(13)
memory usage: 2.5+ MB

From the summary above, we can see that some columns have far more NA values than others. Knowing the percentage of missing values in each column can be a helpful tool for feature selection.

percent_missing = data.isnull().sum() * 100 / len(data)
missing_value_data = pd.DataFrame({'percent_missing': percent_missing})
missing_value_data.sort_values('percent_missing', ascending=False, inplace=True)
missing_value_data
percent_missing
salary_range 83.959732
department 64.580537
required_education 45.329978
benefits 40.324385
required_experience 39.429530
function 36.101790
industry 27.421700
employment_type 19.412752
company_profile 18.501119
requirements 15.072707
location 1.935123
description 0.005593
job_id 0.000000
telecommuting 0.000000
has_questions 0.000000
has_company_logo 0.000000
title 0.000000
fraudulent 0.000000

It looks like salary_range and department have a high percentage of NAs. The presence of a missing value can itself be informative for prediction, but unlike required_education, benefits, and required_experience, the missingness in salary_range and department does not look significant. Since these two columns are mostly empty, excluding them from our analysis will be the safer way to avoid overfitting.
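A minimal sketch of that exclusion (assuming we simply drop the two sparsest columns before any further feature work):

# Drop the two sparsest columns (~84% and ~65% missing).
data = data.drop(columns=["salary_range", "department"])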