Creating Features from the Description

As we mentioned earlier, we want each word to be a feature of the dataset. To do this, we will use the Bag of Words method, which represents each description by its word counts (each column holds the number of times that word appears in the description). However, using raw Bag of Words counts alone is problematic. Because frequency depends on the length of the text, longer texts can produce inflated counts for unimportant words that rarely appear across the data as a whole. This can cause serious trouble when we come to feature selection.
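To see the length problem concretely, here is a minimal sketch using sklearn's CountVectorizer on two made-up descriptions; the longer text inflates the raw count of a word even though the word is no more informative there.

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up descriptions of very different lengths
docs = [
    "software engineer needed",
    "software engineer needed now because we need a software engineer for our software team",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())  # raw counts: the longer description inflates "software"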

To remedy this problem, we will reweight the word frequencies using TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection, and it is calculated as follows:

\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]

where

\[\text{TF} = \frac{\text{number of times the word appears in the description}}{\text{total number of words in the description}}\]
\[\text{IDF} = \log \left( \frac{\text{number of descriptions in the dataset}}{\text{number of descriptions that contain the word}} \right)\]

Term Frequency (TF) measures how frequently a term occurs in a document. Inverse Document Frequency (IDF) is a factor that diminishes the weight of terms that occur in many documents and increases the weight of words that occur rarely. As you can see from the formula, the less frequently a word appears across the dataset, the larger its IDF, which increases its TF-IDF as a result. In other words, we give more weight to words that are distinctive to a few descriptions and less weight to words that appear everywhere. This way, we can avoid possible outlier/confounding features in our dataset.
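To make the formulas concrete, here is a minimal hand calculation on a made-up three-description corpus (the texts and numbers are for illustration only; note that sklearn's TfidfVectorizer uses a smoothed variant of IDF, so its values will differ slightly).

import math

# Three made-up descriptions
docs = [
    "great job great pay",
    "remote job",
    "great benefits",
]

word = "great"
tokens = docs[0].split()

# TF: occurrences of the word in this description / total words in this description
tf = tokens.count(word) / len(tokens)               # 2 / 4 = 0.5

# IDF: log(number of descriptions / number of descriptions containing the word)
n_containing = sum(word in d.split() for d in docs)
idf = math.log(len(docs) / n_containing)            # log(3 / 2) ≈ 0.405

print(tf * idf)                                     # TF-IDF ≈ 0.203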

Note

In the original project, extracting the features and converting them with TF-IDF were separate steps. In Python, sklearn has a very useful feature extraction encoder, TfidfVectorizer, which combines the extraction and conversion into a single step.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd 
import pickle
train_data = pd.read_csv("./data/train_set.csv")

# Replace missing descriptions before vectorizing (see the warning below)
train_data["description"] = train_data["description"].fillna('no_description')

# Fit the vectorizer and transform the descriptions in one step
vec = TfidfVectorizer(smooth_idf=True)
tfidf_description = vec.fit_transform(train_data["description"])

Warning

We have some NAs in the description column, which will cause an error in TfidfVectorizer if we don’t impute them. I decided to replace all NAs with “no_description” because a missing description may itself be a strong signal of fraud. For instance, a fraudulent posting might have no job description at all.
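As a quick sanity check, one might first count the missing descriptions and see whether a missing description really co-occurs with fraud; a hedged sketch, assuming the label column is named "fraudulent" (to be run before the fillna above):

# How many descriptions are missing? (run before the fillna above)
print(train_data["description"].isna().sum())

# Fraud rate for postings with vs. without a description
# (assumes the label column is named "fraudulent")
print(train_data.groupby(train_data["description"].isna())["fraudulent"].mean())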

# Inspect the result: one row per description, one column per vocabulary word
features_name = vec.get_feature_names_out()
df_tfidfvect = pd.DataFrame(data = tfidf_description.toarray(), columns = features_name)
df_tfidfvect.head()
aa aaa aaab aab aabc aabd aabf aac aaccd aachen ... zumero zur zurb zurich zusammenarbeitest zusammenbringt zweig zyfax zyka zynga
0 0.164839 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 40989 columns

We will save our TfidfVectorizer as a pickle so that we can apply it later to the test dataset.

with open('./pickle/tfidfvec.pkl', 'wb') as f:
    pickle.dump(vec, f)

Warning

This step is essential because if we don’t use the same encoder for the training and test sets, we might get a dimension mismatch error in any machine learning algorithm. For example, if we fit a separate encoder on the test set, a feature such as “planner” that exists in the training vocabulary may not appear in the test vocabulary, so the two feature matrices will have different numbers of columns. To avoid this error, we should apply the encoder that was fitted on the training set to the test set. Sometimes people combine the train and test sets and run TfidfVectorizer on both to avoid this error, but we must always assume that we don’t have any test set yet.
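A minimal sketch of that reuse at test time, assuming a test file at ./data/test_set.csv with the same layout (the path is illustrative):

import pickle
import pandas as pd

# Load the vectorizer that was fitted on the training set
with open('./pickle/tfidfvec.pkl', 'rb') as f:
    vec = pickle.load(f)

test_data = pd.read_csv("./data/test_set.csv")
test_data["description"] = test_data["description"].fillna('no_description')

# transform (not fit_transform): reuse the training vocabulary and IDF weights,
# so the test matrix has exactly the same 40,989 columns as the training matrix
tfidf_test = vec.transform(test_data["description"])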

We will apply these same steps to the requirement, title, and benefit columns, as sketched below. If you are curious why we should exclude company_profile even though it also contains text data, please refer to here.
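For reference, one way to repeat the procedure over the remaining text columns is a short loop; the column names below follow the text above and may need to be adjusted to match the dataset exactly.

# Hypothetical helper loop; column names taken from the text above
text_columns = ["requirement", "title", "benefit"]

for col in text_columns:
    train_data[col] = train_data[col].fillna(f"no_{col}")     # same NA strategy as description
    col_vec = TfidfVectorizer(smooth_idf=True)
    tfidf_col = col_vec.fit_transform(train_data[col])
    with open(f'./pickle/tfidfvec_{col}.pkl', 'wb') as f:     # one pickle per column
        pickle.dump(col_vec, f)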