{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e7377006-3392-4138-8f7d-696a67350e5b",
   "metadata": {},
   "source": [
    "# Data Description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fe8d8bd2-c4b8-4481-8b6e-a110958f41ea",
   "metadata": {
    "tags": [
     "hide-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cabba208-7f1e-4e7c-b053-d977ed248a4b",
   "metadata": {},
   "source": [
    "In this project, **we aim to classify job postings as fraudulent and non-fraudulent using several features in the dataset.** Let's import the dataset and observe the first five rows. \n",
    "\n",
    "```{note}\n",
    "This is a data description for the entire dataset before we split the dataset into the train and the test set. In the original project, the data description was written based on only the training dataset.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b2552b9a-53ae-4f2e-8384-fa9ca53afbf2",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>job_id</th>\n",
       "      <th>title</th>\n",
       "      <th>location</th>\n",
       "      <th>department</th>\n",
       "      <th>salary_range</th>\n",
       "      <th>company_profile</th>\n",
       "      <th>description</th>\n",
       "      <th>requirements</th>\n",
       "      <th>benefits</th>\n",
       "      <th>telecommuting</th>\n",
       "      <th>has_company_logo</th>\n",
       "      <th>has_questions</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>required_experience</th>\n",
       "      <th>required_education</th>\n",
       "      <th>industry</th>\n",
       "      <th>function</th>\n",
       "      <th>fraudulent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Marketing Intern</td>\n",
       "      <td>US, NY, New York</td>\n",
       "      <td>Marketing</td>\n",
       "      <td>NaN</td>\n",
       "      <td>We're Food52, and we've created a groundbreaki...</td>\n",
       "      <td>Food52, a fast-growing, James Beard Award-winn...</td>\n",
       "      <td>Experience with content management systems a m...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Other</td>\n",
       "      <td>Internship</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Marketing</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Customer Service - Cloud Video Production</td>\n",
       "      <td>NZ, , Auckland</td>\n",
       "      <td>Success</td>\n",
       "      <td>NaN</td>\n",
       "      <td>90 Seconds, the worlds Cloud Video Production ...</td>\n",
       "      <td>Organised - Focused - Vibrant - Awesome!Do you...</td>\n",
       "      <td>What we expect from you:Your key responsibilit...</td>\n",
       "      <td>What you will get from usThrough being part of...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Not Applicable</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Marketing and Advertising</td>\n",
       "      <td>Customer Service</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Commissioning Machinery Assistant (CMA)</td>\n",
       "      <td>US, IA, Wever</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Valor Services provides Workforce Solutions th...</td>\n",
       "      <td>Our client, located in Houston, is actively se...</td>\n",
       "      <td>Implement pre-commissioning and commissioning ...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Account Executive - Washington DC</td>\n",
       "      <td>US, DC, Washington</td>\n",
       "      <td>Sales</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Our passion for improving quality of life thro...</td>\n",
       "      <td>THE COMPANY: ESRI – Environmental Systems Rese...</td>\n",
       "      <td>EDUCATION: Bachelor’s or Master’s in GIS, busi...</td>\n",
       "      <td>Our culture is anything but corporate—we have ...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Mid-Senior level</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Computer Software</td>\n",
       "      <td>Sales</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Bill Review Manager</td>\n",
       "      <td>US, FL, Fort Worth</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SpotSource Solutions LLC is a Global Human Cap...</td>\n",
       "      <td>JOB TITLE: Itemization Review ManagerLOCATION:...</td>\n",
       "      <td>QUALIFICATIONS:RN license in the State of Texa...</td>\n",
       "      <td>Full Benefits Offered</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Mid-Senior level</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Hospital &amp; Health Care</td>\n",
       "      <td>Health Care Provider</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   job_id                                      title            location  \\\n",
       "0       1                           Marketing Intern    US, NY, New York   \n",
       "1       2  Customer Service - Cloud Video Production      NZ, , Auckland   \n",
       "2       3    Commissioning Machinery Assistant (CMA)       US, IA, Wever   \n",
       "3       4          Account Executive - Washington DC  US, DC, Washington   \n",
       "4       5                        Bill Review Manager  US, FL, Fort Worth   \n",
       "\n",
       "  department salary_range                                    company_profile  \\\n",
       "0  Marketing          NaN  We're Food52, and we've created a groundbreaki...   \n",
       "1    Success          NaN  90 Seconds, the worlds Cloud Video Production ...   \n",
       "2        NaN          NaN  Valor Services provides Workforce Solutions th...   \n",
       "3      Sales          NaN  Our passion for improving quality of life thro...   \n",
       "4        NaN          NaN  SpotSource Solutions LLC is a Global Human Cap...   \n",
       "\n",
       "                                         description  \\\n",
       "0  Food52, a fast-growing, James Beard Award-winn...   \n",
       "1  Organised - Focused - Vibrant - Awesome!Do you...   \n",
       "2  Our client, located in Houston, is actively se...   \n",
       "3  THE COMPANY: ESRI – Environmental Systems Rese...   \n",
       "4  JOB TITLE: Itemization Review ManagerLOCATION:...   \n",
       "\n",
       "                                        requirements  \\\n",
       "0  Experience with content management systems a m...   \n",
       "1  What we expect from you:Your key responsibilit...   \n",
       "2  Implement pre-commissioning and commissioning ...   \n",
       "3  EDUCATION: Bachelor’s or Master’s in GIS, busi...   \n",
       "4  QUALIFICATIONS:RN license in the State of Texa...   \n",
       "\n",
       "                                            benefits  telecommuting  \\\n",
       "0                                                NaN              0   \n",
       "1  What you will get from usThrough being part of...              0   \n",
       "2                                                NaN              0   \n",
       "3  Our culture is anything but corporate—we have ...              0   \n",
       "4                              Full Benefits Offered              0   \n",
       "\n",
       "   has_company_logo  has_questions employment_type required_experience  \\\n",
       "0                 1              0           Other          Internship   \n",
       "1                 1              0       Full-time      Not Applicable   \n",
       "2                 1              0             NaN                 NaN   \n",
       "3                 1              0       Full-time    Mid-Senior level   \n",
       "4                 1              1       Full-time    Mid-Senior level   \n",
       "\n",
       "  required_education                   industry              function  \\\n",
       "0                NaN                        NaN             Marketing   \n",
       "1                NaN  Marketing and Advertising      Customer Service   \n",
       "2                NaN                        NaN                   NaN   \n",
       "3  Bachelor's Degree          Computer Software                 Sales   \n",
       "4  Bachelor's Degree     Hospital & Health Care  Health Care Provider   \n",
       "\n",
       "   fraudulent  \n",
       "0           0  \n",
       "1           0  \n",
       "2           0  \n",
       "3           0  \n",
       "4           0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv(\"./data/fake_job_postings.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c0662a5-dff1-4e9d-a472-305fd1260da8",
   "metadata": {},
   "source": [
    "From the dataset above, we can notice several important dataset characteristics we need to keep in mind while performing the analysis.\n",
    "\n",
    "1. The dataset consists of job posting information and labels (0 for non-fraudulent, 1 for fraudulent). Therefore,  The last column should not be used as a feature since it is the variable we want to predict. It will be helpful to find out how many fraudulent/non-fraudulent rows we have for each. \n",
    "\n",
    "```python \n",
    "data[data[\"fraudulent\"] == 1].shape[0]\n",
    "\n",
    "866\n",
    "\n",
    "data[data[\"fraudulent\"] == 0].shape[0]\n",
    "\n",
    "17014\n",
    "```\n",
    "\n",
    "> We have **866 fraudulent and 17014 non-fraudulent** job postings in the dataset. This information can be helpful when we split the data into the test-set and the train-set.\n",
    "\n",
    "2. We can observe several null values across the dataset. We must remember that we must do the imputation before we start the analysis. \n",
    "\n",
    "3. Some columns contain text data, so we must go through some natural language processing to clean and sort the features. To avoid overfitting, we might also have to sort out important words to use as \"super features.\" "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed69e50e-ef16-4669-8ecf-b64b40c14968",
   "metadata": {},
   "source": [
    "The dataframe below shows more details about dataset including name of all columns and its data type. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "667dc7e1-926a-4e7b-9454-b7b5dde55a5c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 17880 entries, 0 to 17879\n",
      "Data columns (total 18 columns):\n",
      " #   Column               Non-Null Count  Dtype \n",
      "---  ------               --------------  ----- \n",
      " 0   job_id               17880 non-null  int64 \n",
      " 1   title                17880 non-null  object\n",
      " 2   location             17534 non-null  object\n",
      " 3   department           6333 non-null   object\n",
      " 4   salary_range         2868 non-null   object\n",
      " 5   company_profile      14572 non-null  object\n",
      " 6   description          17879 non-null  object\n",
      " 7   requirements         15185 non-null  object\n",
      " 8   benefits             10670 non-null  object\n",
      " 9   telecommuting        17880 non-null  int64 \n",
      " 10  has_company_logo     17880 non-null  int64 \n",
      " 11  has_questions        17880 non-null  int64 \n",
      " 12  employment_type      14409 non-null  object\n",
      " 13  required_experience  10830 non-null  object\n",
      " 14  required_education   9775 non-null   object\n",
      " 15  industry             12977 non-null  object\n",
      " 16  function             11425 non-null  object\n",
      " 17  fraudulent           17880 non-null  int64 \n",
      "dtypes: int64(5), object(13)\n",
      "memory usage: 2.5+ MB\n"
     ]
    }
   ],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "928d3c07-2141-407a-aabf-8d7abb748aad",
   "metadata": {},
   "source": [
    "From the dataframe above, we realize that some columns have more NA values than others. Knowing the percentage of NA for each column can be a helpful tool for feature selection. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d98c05d8-6459-44fc-a231-7712a68b5c87",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>percent_missing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>salary_range</th>\n",
       "      <td>83.959732</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>department</th>\n",
       "      <td>64.580537</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>required_education</th>\n",
       "      <td>45.329978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>benefits</th>\n",
       "      <td>40.324385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>required_experience</th>\n",
       "      <td>39.429530</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>function</th>\n",
       "      <td>36.101790</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>industry</th>\n",
       "      <td>27.421700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>employment_type</th>\n",
       "      <td>19.412752</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>company_profile</th>\n",
       "      <td>18.501119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>requirements</th>\n",
       "      <td>15.072707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>location</th>\n",
       "      <td>1.935123</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>description</th>\n",
       "      <td>0.005593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>job_id</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>telecommuting</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_questions</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_company_logo</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fraudulent</th>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     percent_missing\n",
       "salary_range               83.959732\n",
       "department                 64.580537\n",
       "required_education         45.329978\n",
       "benefits                   40.324385\n",
       "required_experience        39.429530\n",
       "function                   36.101790\n",
       "industry                   27.421700\n",
       "employment_type            19.412752\n",
       "company_profile            18.501119\n",
       "requirements               15.072707\n",
       "location                    1.935123\n",
       "description                 0.005593\n",
       "job_id                      0.000000\n",
       "telecommuting               0.000000\n",
       "has_questions               0.000000\n",
       "has_company_logo            0.000000\n",
       "title                       0.000000\n",
       "fraudulent                  0.000000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "percent_missing = data.isnull().sum() * 100 / len(data)\n",
    "missing_value_data = pd.DataFrame({'percent_missing': percent_missing})\n",
    "missing_value_data.sort_values('percent_missing', inplace=True, ascending = False)\n",
    "missing_value_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12e6f100-369d-426c-9402-c251d63bd85f",
   "metadata": {},
   "source": [
    "It looks like `salary_range` and `department` have a high percentage of NAs. The existence of NA in the data can be helpful in prediction, but unlike the variables `required_education`, `benefits`, and `required_experience`, the presence of NA in `salary_range` and `department` looks not significant. To avoid overfitting, excluding those columns for our analysis will be safer."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}