{ "cells": [ { "cell_type": "markdown", "id": "e7377006-3392-4138-8f7d-696a67350e5b", "metadata": {}, "source": [ "# Data Description" ] }, { "cell_type": "code", "execution_count": 1, "id": "fe8d8bd2-c4b8-4481-8b6e-a110958f41ea", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "cabba208-7f1e-4e7c-b053-d977ed248a4b", "metadata": {}, "source": [ "In this project, **we aim to classify job postings as fraudulent or non-fraudulent using several features in the dataset.** Let's import the dataset and look at the first five rows.\n", "\n", "```{note}\n", "This data description covers the entire dataset, before it is split into training and test sets. In the original project, the data description was based on the training set only.\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "b2552b9a-53ae-4f2e-8384-fa9ca53afbf2", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>job_id</th><th>title</th><th>location</th><th>department</th><th>salary_range</th><th>company_profile</th><th>description</th><th>requirements</th><th>benefits</th><th>telecommuting</th><th>has_company_logo</th><th>has_questions</th><th>employment_type</th><th>required_experience</th><th>required_education</th><th>industry</th><th>function</th><th>fraudulent</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Marketing Intern</td><td>US, NY, New York</td><td>Marketing</td><td>NaN</td><td>We're Food52, and we've created a groundbreaki...</td><td>Food52, a fast-growing, James Beard Award-winn...</td><td>Experience with content management systems a m...</td><td>NaN</td><td>0</td><td>1</td><td>0</td><td>Other</td><td>Internship</td><td>NaN</td><td>NaN</td><td>Marketing</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>Customer Service - Cloud Video Production</td><td>NZ, , Auckland</td><td>Success</td><td>NaN</td><td>90 Seconds, the worlds Cloud Video Production ...</td><td>Organised - Focused - Vibrant - Awesome!Do you...</td><td>What we expect from you:Your key responsibilit...</td><td>What you will get from usThrough being part of...</td><td>0</td><td>1</td><td>0</td><td>Full-time</td><td>Not Applicable</td><td>NaN</td><td>Marketing and Advertising</td><td>Customer Service</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>Commissioning Machinery Assistant (CMA)</td><td>US, IA, Wever</td><td>NaN</td><td>NaN</td><td>Valor Services provides Workforce Solutions th...</td><td>Our client, located in Houston, is actively se...</td><td>Implement pre-commissioning and commissioning ...</td><td>NaN</td><td>0</td><td>1</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>Account Executive - Washington DC</td><td>US, DC, Washington</td><td>Sales</td><td>NaN</td><td>Our passion for improving quality of life thro...</td><td>THE COMPANY: ESRI – Environmental Systems Rese...</td><td>EDUCATION: Bachelor’s or Master’s in GIS, busi...</td><td>Our culture is anything but corporate—we have ...</td><td>0</td><td>1</td><td>0</td><td>Full-time</td><td>Mid-Senior level</td><td>Bachelor's Degree</td><td>Computer Software</td><td>Sales</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>Bill Review Manager</td><td>US, FL, Fort Worth</td><td>NaN</td><td>NaN</td><td>SpotSource Solutions LLC is a Global Human Cap...</td><td>JOB TITLE: Itemization Review ManagerLOCATION:...</td><td>QUALIFICATIONS:RN license in the State of Texa...</td><td>Full Benefits Offered</td><td>0</td><td>1</td><td>1</td><td>Full-time</td><td>Mid-Senior level</td><td>Bachelor's Degree</td><td>Hospital &amp; Health Care</td><td>Health Care Provider</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " job_id title location \\\n", "0 1 Marketing Intern US, NY, New York \n", "1 2 Customer Service - Cloud Video Production NZ, , Auckland \n", "2 3 Commissioning Machinery Assistant (CMA) US, IA, Wever \n", "3 4 Account Executive - Washington DC US, DC, Washington \n", "4 5 Bill Review Manager US, FL, Fort Worth \n", "\n", " department salary_range company_profile \\\n", "0 Marketing NaN We're Food52, and we've created a groundbreaki... \n", "1 Success NaN 90 Seconds, the worlds Cloud Video Production ... \n", "2 NaN NaN Valor Services provides Workforce Solutions th... \n", "3 Sales NaN Our passion for improving quality of life thro... \n", "4 NaN NaN SpotSource Solutions LLC is a Global Human Cap... \n", "\n", " description \\\n", "0 Food52, a fast-growing, James Beard Award-winn... \n", "1 Organised - Focused - Vibrant - Awesome!Do you... \n", "2 Our client, located in Houston, is actively se... \n", "3 THE COMPANY: ESRI – Environmental Systems Rese... \n", "4 JOB TITLE: Itemization Review ManagerLOCATION:... \n", "\n", " requirements \\\n", "0 Experience with content management systems a m... \n", "1 What we expect from you:Your key responsibilit... \n", "2 Implement pre-commissioning and commissioning ... \n", "3 EDUCATION: Bachelor’s or Master’s in GIS, busi... \n", "4 QUALIFICATIONS:RN license in the State of Texa... \n", "\n", " benefits telecommuting \\\n", "0 NaN 0 \n", "1 What you will get from usThrough being part of... 0 \n", "2 NaN 0 \n", "3 Our culture is anything but corporate—we have ... 
0 \n", "4 Full Benefits Offered 0 \n", "\n", " has_company_logo has_questions employment_type required_experience \\\n", "0 1 0 Other Internship \n", "1 1 0 Full-time Not Applicable \n", "2 1 0 NaN NaN \n", "3 1 0 Full-time Mid-Senior level \n", "4 1 1 Full-time Mid-Senior level \n", "\n", " required_education industry function \\\n", "0 NaN NaN Marketing \n", "1 NaN Marketing and Advertising Customer Service \n", "2 NaN NaN NaN \n", "3 Bachelor's Degree Computer Software Sales \n", "4 Bachelor's Degree Hospital & Health Care Health Care Provider \n", "\n", " fraudulent \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(\"./data/fake_job_postings.csv\")\n", "data.head()" ] }, { "cell_type": "markdown", "id": "9c0662a5-dff1-4e9d-a472-305fd1260da8", "metadata": {}, "source": [ "The preview above reveals several characteristics of the dataset that we need to keep in mind during the analysis.\n", "\n", "1. The dataset consists of job posting information and a label (0 for non-fraudulent, 1 for fraudulent). The last column, `fraudulent`, is the variable we want to predict, so it must not be used as a feature. It is helpful to know how many rows belong to each class.\n", "\n", "```python\n", "# Number of fraudulent postings (label 1)\n", "data[data[\"fraudulent\"] == 1].shape[0]  # 866\n", "\n", "# Number of non-fraudulent postings (label 0)\n", "data[data[\"fraudulent\"] == 0].shape[0]  # 17014\n", "```\n", "\n", "> We have **866 fraudulent and 17014 non-fraudulent** job postings in the dataset, so only about 4.8% of the postings are fraudulent. This class imbalance is worth keeping in mind when we split the data into training and test sets.\n", "\n", "2. Several columns contain null values, so we must impute or otherwise handle them before we start the analysis.\n", "\n", "3. Some columns contain free text, so we need some natural language processing to clean the text and turn it into features. 
To avoid overfitting, we might also have to pick out a small set of important words to use as \"super features.\" " ] }, { "cell_type": "markdown", "id": "ed69e50e-ef16-4669-8ecf-b64b40c14968", "metadata": {}, "source": [ "The summary below gives more detail about the dataset, including each column's name, non-null count, and data type. " ] }, { "cell_type": "code", "execution_count": 3, "id": "667dc7e1-926a-4e7b-9454-b7b5dde55a5c", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 17880 entries, 0 to 17879\n", "Data columns (total 18 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 job_id 17880 non-null int64 \n", " 1 title 17880 non-null object\n", " 2 location 17534 non-null object\n", " 3 department 6333 non-null object\n", " 4 salary_range 2868 non-null object\n", " 5 company_profile 14572 non-null object\n", " 6 description 17879 non-null object\n", " 7 requirements 15185 non-null object\n", " 8 benefits 10670 non-null object\n", " 9 telecommuting 17880 non-null int64 \n", " 10 has_company_logo 17880 non-null int64 \n", " 11 has_questions 17880 non-null int64 \n", " 12 employment_type 14409 non-null object\n", " 13 required_experience 10830 non-null object\n", " 14 required_education 9775 non-null object\n", " 15 industry 12977 non-null object\n", " 16 function 11425 non-null object\n", " 17 fraudulent 17880 non-null int64 \n", "dtypes: int64(5), object(13)\n", "memory usage: 2.5+ MB\n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "id": "928d3c07-2141-407a-aabf-8d7abb748aad", "metadata": {}, "source": [ "The summary above shows that some columns have far more missing values than others. Knowing the percentage of missing values in each column helps with feature selection. 
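
As a rough illustration of how missingness can feed into feature selection, the sketch below uses a small made-up frame (`toy`, not the real job postings data) and a hypothetical helper, `fraud_rate_by_missingness`. It computes the share of missing values per feature and then compares the fraud rate between rows where a column is missing and rows where it is present.

```python
import pandas as pd

# Toy stand-in for the job postings data; all values here are made up.
toy = pd.DataFrame({
    'salary_range': [None, None, '20000-30000', None],
    'benefits': ['Health plan', None, None, 'Gym'],
    'fraudulent': [0, 1, 1, 0],
})

# Share of missing values per feature column, in percent.
features = toy.drop(columns='fraudulent')
percent_missing = features.isna().mean() * 100


def fraud_rate_by_missingness(frame, column, target='fraudulent'):
    # Hypothetical helper: mean of the 0/1 target, grouped by whether
    # the given column is missing in each row.
    is_missing = frame[column].isna().rename('is_missing')
    return frame.groupby(is_missing)[target].mean()


benefits_rates = fraud_rate_by_missingness(toy, 'benefits')
print(percent_missing)
print(benefits_rates)
```

The toy frame is contrived so that missing `benefits` values line up exactly with the label. On real data, a column whose fraud rate barely differs between the two groups carries little signal in its missingness.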
" ] }, { "cell_type": "code", "execution_count": 4, "id": "d98c05d8-6459-44fc-a231-7712a68b5c87", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>percent_missing</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>salary_range</th><td>83.959732</td></tr>\n",
"    <tr><th>department</th><td>64.580537</td></tr>\n",
"    <tr><th>required_education</th><td>45.329978</td></tr>\n",
"    <tr><th>benefits</th><td>40.324385</td></tr>\n",
"    <tr><th>required_experience</th><td>39.429530</td></tr>\n",
"    <tr><th>function</th><td>36.101790</td></tr>\n",
"    <tr><th>industry</th><td>27.421700</td></tr>\n",
"    <tr><th>employment_type</th><td>19.412752</td></tr>\n",
"    <tr><th>company_profile</th><td>18.501119</td></tr>\n",
"    <tr><th>requirements</th><td>15.072707</td></tr>\n",
"    <tr><th>location</th><td>1.935123</td></tr>\n",
"    <tr><th>description</th><td>0.005593</td></tr>\n",
"    <tr><th>job_id</th><td>0.000000</td></tr>\n",
"    <tr><th>telecommuting</th><td>0.000000</td></tr>\n",
"    <tr><th>has_questions</th><td>0.000000</td></tr>\n",
"    <tr><th>has_company_logo</th><td>0.000000</td></tr>\n",
"    <tr><th>title</th><td>0.000000</td></tr>\n",
"    <tr><th>fraudulent</th><td>0.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " percent_missing\n", "salary_range 83.959732\n", "department 64.580537\n", "required_education 45.329978\n", "benefits 40.324385\n", "required_experience 39.429530\n", "function 36.101790\n", "industry 27.421700\n", "employment_type 19.412752\n", "company_profile 18.501119\n", "requirements 15.072707\n", "location 1.935123\n", "description 0.005593\n", "job_id 0.000000\n", "telecommuting 0.000000\n", "has_questions 0.000000\n", "has_company_logo 0.000000\n", "title 0.000000\n", "fraudulent 0.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Percentage of missing values in each column\n", "percent_missing = data.isnull().sum() * 100 / len(data)\n", "missing_value_data = pd.DataFrame({'percent_missing': percent_missing})\n", "# Sort so that the columns with the most missing values come first\n", "missing_value_data.sort_values('percent_missing', inplace=True, ascending=False)\n", "missing_value_data" ] }, { "cell_type": "markdown", "id": "12e6f100-369d-426c-9402-c251d63bd85f", "metadata": {}, "source": [ "`salary_range` and `department` have the highest percentages of missing values (about 84% and 65%, respectively). Whether a value is missing can itself be a useful signal for prediction, but unlike for `required_education`, `benefits`, and `required_experience`, the missingness in `salary_range` and `department` does not appear to be significant. To avoid overfitting, it is safer to exclude these two columns from our analysis." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }