Data Description
import pandas as pd
In this project, we aim to classify job postings as fraudulent or non-fraudulent using the features in the dataset. Let’s import the dataset and look at the first five rows.
Note
This description covers the entire dataset, before it is split into training and test sets. In the original project, the data description was based only on the training set.
data = pd.read_csv("./data/fake_job_postings.csv")
data.head()
| | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internship | NaN | NaN | Marketing | 0 |
1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
2 | 3 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | NaN | NaN | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | NaN | 0 | 1 | 0 | NaN | NaN | NaN | NaN | NaN | 0 |
3 | 4 | Account Executive - Washington DC | US, DC, Washington | Sales | NaN | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | 0 | 1 | 0 | Full-time | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 |
4 | 5 | Bill Review Manager | US, FL, Fort Worth | NaN | NaN | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 |
From the rows above, we can note several characteristics of the dataset that we need to keep in mind during the analysis.
The dataset consists of job posting information and a label in the fraudulent column (0 for non-fraudulent, 1 for fraudulent). This last column is the variable we want to predict, so it should not be used as a feature. It is helpful to find out how many fraudulent and non-fraudulent rows we have.
data[data["fraudulent"] == 1].shape[0]
866
data[data["fraudulent"] == 0].shape[0]
17014
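We have 866 fraudulent and 17,014 non-fraudulent job postings in the dataset, so only about 4.8% of the postings are fraudulent. This imbalance matters when we split the data into training and test sets, because both sets should contain a similar share of fraudulent postings.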
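As a minimal sketch of how the class balance could be checked and preserved in a split, the example below uses a small made-up DataFrame rather than the real file. The 50% test fraction, the `random_state`, and the toy rows are assumptions chosen for illustration only; the original project may have split the data differently.

```python
import pandas as pd

# Toy stand-in for the real dataset: 20 rows, 2 of them fraudulent (made-up data).
toy = pd.DataFrame({
    "job_id": range(1, 21),
    "fraudulent": [1, 1] + [0] * 18,
})

# Count and share of each class.
class_counts = toy["fraudulent"].value_counts()
class_share = toy["fraudulent"].value_counts(normalize=True)

# Stratified hold-out: sample 50% of the rows within each class so that the
# test set keeps the same fraud rate as the full data.
test = toy.groupby("fraudulent", group_keys=False).sample(frac=0.5, random_state=0)
train = toy.drop(test.index)

print(class_counts)
print(class_share)
print(test["fraudulent"].mean(), train["fraudulent"].mean())
```

Sampling within each class keeps the rare fraudulent rows from ending up almost entirely in one of the two sets, which can happen with a purely random split on imbalanced data.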
Null values appear in several columns, so we need to impute them before starting the analysis.
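One possible imputation scheme is sketched below on made-up rows. The choice of an empty string for free-text columns and a "Missing" category for categorical columns is an assumption for illustration, not necessarily the strategy used in the original project.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the column types in the dataset.
toy = pd.DataFrame({
    "description": ["Great role", np.nan, "Work from home"],
    "employment_type": ["Full-time", np.nan, "Part-time"],
    "telecommuting": [0, 1, 1],
})

text_cols = ["description"]      # free-text columns
cat_cols = ["employment_type"]   # categorical columns

# Work on a copy so the raw data stays available for comparison.
imputed = toy.copy()
imputed[text_cols] = imputed[text_cols].fillna("")        # no text at all
imputed[cat_cols] = imputed[cat_cols].fillna("Missing")   # explicit category

print(imputed)
```

Using an explicit "Missing" category keeps the information that a value was absent, instead of hiding it behind the most frequent value.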
Some columns contain free text, so we need some natural language processing to clean them and extract features. To avoid overfitting, we might also have to select a small set of important words to use as “super features.”
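The text describes the idea of keyword-based features but not how the words are chosen. The sketch below shows one simple way such features could be built: lowercase the text, replace non-letter characters with spaces, and add a 0/1 column per keyword. The keyword list and the example descriptions are hypothetical.

```python
import pandas as pd

# Made-up descriptions, including a missing one.
toy = pd.DataFrame({
    "description": [
        "Earn money FAST!!! No experience needed.",
        "We are hiring a data analyst for our Toronto office.",
        None,
    ]
})

# Hypothetical keyword list; in practice it would be chosen from the training data.
keywords = ["money", "experience", "analyst"]

# Basic cleaning: fill missing text, lowercase, keep only letters and whitespace.
cleaned = (
    toy["description"]
    .fillna("")
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
)

# One binary indicator column per keyword (whole-word match).
for word in keywords:
    toy[f"has_{word}"] = cleaned.str.contains(rf"\b{word}\b", regex=True).astype(int)

print(toy[[f"has_{word}" for word in keywords]])
```

Restricting the model to a handful of such indicator columns, rather than the full vocabulary, is one way to limit the number of text features.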
The summary below gives more detail about the dataset, including the name and data type of every column.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 job_id 17880 non-null int64
1 title 17880 non-null object
2 location 17534 non-null object
3 department 6333 non-null object
4 salary_range 2868 non-null object
5 company_profile 14572 non-null object
6 description 17879 non-null object
7 requirements 15185 non-null object
8 benefits 10670 non-null object
9 telecommuting 17880 non-null int64
10 has_company_logo 17880 non-null int64
11 has_questions 17880 non-null int64
12 employment_type 14409 non-null object
13 required_experience 10830 non-null object
14 required_education 9775 non-null object
15 industry 12977 non-null object
16 function 11425 non-null object
17 fraudulent 17880 non-null int64
dtypes: int64(5), object(13)
memory usage: 2.5+ MB
From the summary above, we can see that some columns have more missing (NA) values than others. Knowing the percentage of NA values in each column can help with feature selection.
# Percentage of missing values per column, sorted from highest to lowest
percent_missing = data.isnull().sum() * 100 / len(data)
missing_value_data = pd.DataFrame({'percent_missing': percent_missing})
missing_value_data = missing_value_data.sort_values('percent_missing', ascending=False)
missing_value_data
| | percent_missing |
---|---|
salary_range | 83.959732 |
department | 64.580537 |
required_education | 45.329978 |
benefits | 40.324385 |
required_experience | 39.429530 |
function | 36.101790 |
industry | 27.421700 |
employment_type | 19.412752 |
company_profile | 18.501119 |
requirements | 15.072707 |
location | 1.935123 |
description | 0.005593 |
job_id | 0.000000 |
telecommuting | 0.000000 |
has_questions | 0.000000 |
has_company_logo | 0.000000 |
title | 0.000000 |
fraudulent | 0.000000 |
It looks like salary_range and department have a high percentage of NAs. The presence of an NA can itself be informative for prediction, but unlike for required_education, benefits, and required_experience, the presence of NA in salary_range and department does not appear to be significant. To avoid overfitting, it is safer to exclude those two columns from our analysis.
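One way to check whether the presence of NA in a column carries information is to compare the fraud rate of rows where the value is missing against rows where it is present. The sketch below does this on made-up rows; the helper name `fraud_rate_by_missingness` and all values are hypothetical, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "salary_range": [np.nan, "20000-30000", np.nan, np.nan, "50000-60000", np.nan],
    "department": [np.nan, "Sales", np.nan, "Marketing", np.nan, np.nan],
    "required_education": [np.nan, "Bachelor's Degree", np.nan,
                           "Bachelor's Degree", "Master's Degree", np.nan],
    "fraudulent": [1, 0, 1, 0, 0, 0],
})

def fraud_rate_by_missingness(df, column, target="fraudulent"):
    """Mean of the target for rows where `column` is missing vs. present."""
    status = df[column].isna().map({True: "missing", False: "present"})
    return df.groupby(status)[target].mean()

rates = {
    col: fraud_rate_by_missingness(toy, col)
    for col in ["salary_range", "required_education"]
}
for col, rate in rates.items():
    print(col, rate.to_dict())

# Exclude the two sparse columns from the feature set.
reduced = toy.drop(columns=["salary_range", "department"])
print(reduced.columns.tolist())
```

On the real data, a large gap between the "missing" and "present" rates for a column would suggest keeping a missing-value indicator for it, while a small gap would support dropping the column as described above.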