Creating Features Through Investigation

On this page, we will focus on creating new features beyond the columns given in the dataset. Let’s start with a simple hypothesis and investigate whether we can use it to create a new feature.

Note

This page corresponds to the “Power Feature” part of the original project. I will revise and fix some errors in the original project and add new investigations for more features.

First Investigation: Relationship Between Location and Fraudulent Posting

The first hypothesis we will investigate is that location, especially the state, is related to whether a posting is fraudulent. For example, many fake job postings might come from California or New York, since those states have more job openings than others due to their large populations. If location is significantly related to the fraudulent variable, we can extract the state from the location and use it as a feature. Let’s investigate the relationship and see if the hypothesis is correct.

Note

This part is not in the original project.

import pandas as pd 
import numpy as np
import re
from matplotlib import pyplot as plt
train_data = pd.read_csv("./data/train_set.csv")

To extract the state from the location, let’s define a function to make the process easier.

def extract_state(s):
    """Extract the state from the location column.

    The function only works when the state is formatted as two capital letters.
    Input: Series (or other iterable of location strings)
    Output: list of states
    """
    # Treat missing locations as "No Location" (note: fills the Series in place)
    s.fillna("No Location", inplace=True)
    result = []
    for i in np.arange(len(s)):
        if "US" in s[i]:
            # Strip the 'U' and 'S' characters (the country code) before searching
            # for a two-letter state abbreviation
            extracted = re.findall(r'[A-Z]{2}', re.sub(r'[US]', '', s[i]))
            # Edge Case 1: Posting is from the US but the state is not posted
            if extracted == []:
                extracted = ["Domestic"]
            # Edge Case 2: Regex may detect a city name as a state name; keep the first match
            if len(extracted) > 1:
                extracted = extracted[:1]
            result += extracted
        else:
            # Edge Case 3: Location is not given
            if s[i] == "No Location":
                result.append("No Location")
            # Edge Case 4: Location is given but not in the US
            elif re.findall(r'[A-Z]{2}', s[i]) != []:
                result.append("Foreign")
            # Edge Case 5: Location cannot be identified from the given information
            else:
                result.append("No Location")
    return result
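
Before applying the function to the full dataset, here is a quick sanity check on a small Series of example locations (a hypothetical illustration, not part of the original project) that exercises each category the function can return:

sample_locations = pd.Series(["US, VA, Virginia Beach",  # regular US posting -> 'VA'
                              "US, , ",                  # US posting without a state -> 'Domestic'
                              "NZ, , Auckland",          # non-US posting -> 'Foreign'
                              None])                     # missing location -> 'No Location'
extract_state(sample_locations)
# Expected output: ['VA', 'Domestic', 'Foreign', 'No Location']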
result = extract_state(train_data["location"])
train_data["state"] = result
train_data[["location", "state", "fraudulent"]]
location state fraudulent
0 US, VA, Virginia Beach VA 0
1 US, TX, Dallas TX 0
2 NZ, , Auckland Foreign 0
3 US, NE, Omaha NE 0
4 US, CA, Los Angeles CA 0
... ... ... ...
14299 US, CA, Irvine CA 0
14300 GR, I, Athens Foreign 0
14301 US, NJ, Elmwood Park NJ 0
14302 US, AZ, Phoenix AZ 0
14303 CA, ON, Toronto Foreign 0

14304 rows × 3 columns

Now, let’s use the extracted state to validate our hypothesis. First of all, let’s observe the distribution of state in the entire dataset.

Distribution of Location: Entire Dataset

combined_data = train_data["state"].value_counts().to_frame()
combined_data = combined_data.reset_index().rename(columns = {'index':'state', 'state':'count'})
combined_data["percentage"] = (combined_data["count"] / combined_data["count"].sum()) * 100
combined_data.head(10)
state count percentage
0 Foreign 5504 38.478747
1 CA 1617 11.304530
2 NY 1003 7.012025
3 TX 806 5.634787
4 Domestic 584 4.082774
5 IL 326 2.279083
6 FL 322 2.251119
7 OH 304 2.125280
8 No Location 280 1.957494
9 MA 262 1.831655
sorted_df = combined_data.sort_values('percentage')
sorted_df = sorted_df[sorted_df['percentage'] > 1]
x = sorted_df['state']
y = sorted_df['percentage']

plt.figure(figsize=(5,5))
plt.barh(x,y)
plt.xlabel("Percentage", size=12)
plt.ylabel("State", size=12)
plt.title("Percent Distribution of Location: Entire Dataset", size=13)
[Figure: Percent Distribution of Location: Entire Dataset (horizontal bar chart; x-axis: Percentage, y-axis: State)]

Note

Foreign: the job posting has a location outside of the US

Domestic: the job posting has a US location, but a specific state name is not given

No Location: the job posting did not include a location

The distribution of states over the entire dataset shows that about 40% of job postings come from companies outside the US, and about 20% come from either California or New York. The result is quite surprising because I was not expecting to see such a high share of ‘Foreign’ postings.

Let’s see if we see a notable difference in distribution when we split the dataset into fraudulent and non-fraudulent.

Distribution of Location: Fraudulent and Non-Fraudulent Dataset

# Convenient helper function for displaying data frames side by side

from IPython.display import display, HTML

def display_side_by_side(dfs:list, captions:list, tablespacing=5):
    """Display tables side by side to save vertical space
    Input:
        dfs: list of pandas.DataFrame
        captions: list of table captions
    """
    output = ""
    for (caption, df) in zip(captions, dfs):
        output += df.style.set_table_attributes("style='display:inline'").set_caption(caption)._repr_html_()
        output += tablespacing * "\xa0"
    display(HTML(output))
fraud_data = train_data[train_data['fraudulent'] == 1]
non_fraud_data = train_data[train_data['fraudulent'] == 0]

f_combined_data = fraud_data["state"].value_counts().to_frame()
f_combined_data = f_combined_data.reset_index().rename(columns = {'index':'state', 'state':'count'})
f_combined_data["percentage"] = (f_combined_data["count"] / f_combined_data["count"].sum()) * 100

nf_combined_data = non_fraud_data["state"].value_counts().to_frame()
nf_combined_data = nf_combined_data.reset_index().rename(columns = {'index':'state', 'state':'count'})
nf_combined_data["percentage"] = (nf_combined_data["count"] / nf_combined_data["count"].sum()) * 100

display_side_by_side([nf_combined_data[0:10], f_combined_data[0:10]], ['Non-Fraudulent', 'Fraudulent'], tablespacing =20)
Non-Fraudulent
  state count percentage
0 Foreign 5407 39.725222
1 CA 1509 11.086621
2 NY 947 6.957608
3 TX 681 5.003306
4 Domestic 536 3.937991
5 IL 310 2.277570
6 FL 298 2.189406
7 OH 290 2.130630
8 No Location 264 1.939608
9 MA 253 1.858791
                    
Fraudulent
  state count percentage
0 TX 125 18.037518
1 CA 108 15.584416
2 Foreign 97 13.997114
3 NY 56 8.080808
4 Domestic 48 6.926407
5 MD 27 3.896104
6 FL 24 3.463203
7 GA 17 2.453102
8 No Location 16 2.308802
9 IL 16 2.308802
                    
f_sorted_df = f_combined_data.sort_values('percentage')
f_sorted_df = f_sorted_df[f_sorted_df['percentage'] > 1]
x1 = f_sorted_df['state']
y1 = f_sorted_df['percentage']

nf_sorted_df = nf_combined_data.sort_values('percentage')
nf_sorted_df = nf_sorted_df[nf_sorted_df['percentage'] > 1]
x2 = nf_sorted_df['state']
y2 = nf_sorted_df['percentage']

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axes[0].barh(x2,y2)
axes[0].set_xlabel("Percentage", size=12)
axes[0].set_ylabel("State", size=12)
axes[0].set_title("Percent Distribution of Location: Non-Fraudulent", size=13)

axes[1].barh(x1,y1)
axes[1].set_xlabel("Percentage", size=12)
axes[1].set_ylabel("State", size=12)
axes[1].set_title("Percent Distribution of Location: Fraudulent", size=13)

fig.tight_layout()
[Figure: Percent Distribution of Location: Non-Fraudulent (left) and Fraudulent (right) (horizontal bar charts; x-axis: Percentage, y-axis: State)]

The result is fascinating in many ways!

  1. The distributions of the fraudulent and non-fraudulent datasets are significantly different. For example, about 40% of non-fraudulent postings come from outside the US, while only about 14% of fraudulent postings do. Texas also shows a notable difference: about 5% of non-fraudulent postings but 18% of fraudulent postings.

  2. We also need to pay close attention to Maryland. Only 0.4% of non-fraudulent postings are from Maryland, but the share jumps to 3.9% in the fraudulent dataset, nearly a tenfold increase.

  3. Interestingly, the location distribution of the non-fraudulent dataset closely resembles that of the entire dataset, while the fraudulent dataset’s does not. This resemblance may seem obvious since fraudulent postings make up only about 5% of the whole dataset. However, the difference in location distribution between the two datasets is still striking, which implies that the location (state) might serve as a strong feature in our machine-learning analysis.

These observations show that our hypothesis about location is sound, and we should include the state column as a feature.
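
As an optional check that is not part of the original project, we can also quantify the relationship with a chi-square test of independence between the extracted state and the fraudulent label. The following is a minimal sketch, assuming scipy is available; a very small p-value would be consistent with the conclusion above. (Rare states with very small counts may violate the test’s expected-count assumption, so the result should be read as a rough confirmation rather than a precise measurement.)

from scipy.stats import chi2_contingency

# Contingency table of state vs. fraudulent, followed by a chi-square test of independence
contingency = pd.crosstab(train_data["state"], train_data["fraudulent"])
chi2_stat, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi-square statistic = {chi2_stat:.1f}, p-value = {p_value:.3g}")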

Second Investigation: Relationship between NAs and Fraudulent Posting

Before we proceed with the categorical variables, we must impute all NA values. However, some of these categorical variables may require a special imputation to become better features.

The hypothesis we will explore in this section is that, for some columns, the presence of an NA itself has more predictive power than the actual categorical values. For example, a fraudulent job posting is likely to omit its company information to make the scam harder to spot. If this hypothesis is confirmed, we can binarize those columns into NA and NOT_NA instead of one-hot encoding all of their values, which will reduce our computational cost and increase predictive power.

Let’s make a table so that we can validate our hypothesis.

train_fraudulent = train_data[train_data['fraudulent'] == 1]
train_non_fraudulent = train_data[train_data['fraudulent'] == 0]

percent_missing = train_data.isnull().sum() * 100 / len(train_data)
percent_missing_f = train_fraudulent.isnull().sum() * 100 / len(train_fraudulent)
percent_missing_nf = train_non_fraudulent.isnull().sum() * 100 / len(train_non_fraudulent)
NA_diff = abs(percent_missing_f - percent_missing_nf)

missing_value_data = pd.DataFrame({'Total NA %': percent_missing , 
                                   'Fraudulent NA %': percent_missing_f , 
                                   'Non Fraudulent NA %': percent_missing_nf,
                                  'Difference': NA_diff })
missing_value_data.sort_values('Total NA %', inplace=True, ascending = False)
missing_value_data[missing_value_data['Total NA %'] > 0]
Total NA % Fraudulent NA % Non Fraudulent NA % Difference
salary_range 84.039430 74.458874 84.527221 10.068346
department 64.842002 62.626263 64.954816 2.328553
required_education 45.434843 53.535354 45.022408 8.512945
benefits 40.555089 43.001443 40.430534 2.570909
required_experience 39.674217 50.937951 39.100727 11.837224
function 36.136745 39.393939 35.970906 3.423034
industry 27.446868 32.611833 27.183895 5.427937
employment_type 19.274329 28.427128 18.808317 9.618812
company_profile 18.680089 69.841270 16.075233 53.766037
requirements 14.904922 17.460317 14.774814 2.685503
location 1.957494 2.308802 1.939608 0.369195
description 0.006991 0.144300 0.000000 0.144300

The table above clearly gives us several important insights:

  1. The colossal difference (about 54 percentage points) for company_profile implies that fraudulent postings are far more likely to omit the company profile (about 70% of them) than non-fraudulent postings (about 16%); therefore, the presence of an NA in company_profile can serve as a great feature. company_profile will be binarized (1 = NA, 0 = NOT NA), as sketched at the end of this page.

  2. Some other columns also show meaningful differences: salary_range, required_education, required_experience, and employment_type. Since those differences are not as dramatic as company_profile’s, we need to weigh the trade-off between computational cost and accuracy when deciding whether binarization is needed. Let’s take a look at these columns.

  • required_experience and employment_type contain only a few factors, so one-hot encoding (OHE) them is inexpensive; we don’t need binarization.

train_data['required_experience'].value_counts()

Mid-Senior level    3011
Entry level         2177
Associate           1832
Not Applicable       884
Director             314
Internship           301
Executive            110
Name: required_experience, dtype: int64

train_data['employment_type'].value_counts()

Full-time    9306
Contract     1224
Part-time     626
Temporary     200
Other         191
Name: employment_type, dtype: int64
  • However, required_education contains 13 factors, so OHE is more costly. We will binarize this column to reduce the computational cost.

train_data['required_education'].value_counts()

Bachelors Degree                     4116
High School or equivalent            1650
Unspecified                          1095
Masters Degree                        339
Associate Degree                      227
Certification                         148
Some College Coursework Completed      78
Professional                           57
Vocational                             38
Some High School Coursework            24
Doctorate                              20
Vocational - HS Diploma                 8
Vocational - Degree                     5
Name: required_education, dtype: int64
  3. Some columns contain too many NAs to be useful. For example, salary_range and department are more than 50% NA, which makes them unusable. We will eliminate those columns in the feature selection section.

In conclusion, the table supports our hypothesis, and we can turn company_profile and required_education into much more powerful features.
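
To make the plan concrete, here is a minimal sketch of the binarization described above; the *_missing column names and the helper code are illustrative placeholders introduced here, not part of the original project:

# Binarize the columns where the presence of an NA itself is informative (1 = NA, 0 = NOT NA)
for col in ["company_profile", "required_education"]:
    train_data[col + "_missing"] = train_data[col].isna().astype(int)

# Columns with only a few factors can simply be one-hot encoded, for example:
ohe_columns = pd.get_dummies(train_data[["required_experience", "employment_type"]])

train_data[["company_profile_missing", "required_education_missing", "fraudulent"]].head()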