Creating Features from the Description
As we mentioned earlier, we want each word to be a feature of the dataset. To do this, we will use the Bag of Words method, which represents each description by its word counts: each column holds the number of times a word appears in the description (or, in the simplest binary variant, 1 if the word is present and 0 otherwise). However, using Bag of Words alone is somewhat problematic. Because raw frequency depends on the length of the text, longer descriptions can produce inflated counts for words that rarely appear across the dataset as a whole. This can cause real trouble when we come to feature selection.
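To make the counting concrete, here is a minimal sketch (not part of the original project) that runs sklearn's CountVectorizer on a tiny made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical two-document corpus to illustrate plain Bag of Words counts
toy_corpus = [
    "remote work remote pay",
    "office work",
]
bow = CountVectorizer()
counts = bow.fit_transform(toy_corpus)

print(bow.get_feature_names_out())  # ['office' 'pay' 'remote' 'work']
print(counts.toarray())             # [[0 1 2 1]
                                    #  [1 0 0 1]]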
To remedy this problem, we will reweight the word frequencies using TF-IDF (Term Frequency-Inverse Document Frequency). It is a numerical statistic intended to reflect how important a word is to a document in a collection, and here is how we calculate it:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \log\frac{N}{\text{df}(t)}
$$

where $N$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain the term $t$.
Term Frequency (TF) measures how frequently a term occurs in a document. Inverse Document Frequency (IDF) is a factor that diminishes the weight of terms that occur in many documents and increases the weight of terms that occur rarely. As you can see, the less frequently a word appears across the dataset, the higher its IDF, and therefore the higher its TF-IDF score. In other words, we give less weight to words that are common across the entire dataset. This way, we can avoid possible outlier/confounding features in our dataset.
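As a quick illustration of this weighting, the sketch below computes IDF by hand on a hypothetical three-document corpus using the textbook formula above (sklearn's smoothed variant differs slightly):

import math

# A hypothetical tokenized corpus
docs = [
    ["earn", "money", "fast"],
    ["software", "engineer", "money"],
    ["data", "engineer"],
]
N = len(docs)

def idf(term):
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log(N / df)

print(idf("money"))  # appears in 2 of 3 docs: log(3/2) ≈ 0.41
print(idf("fast"))   # appears in 1 of 3 docs: log(3/1) ≈ 1.10 -> rarer word, larger weight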
Note
In the original project, extracting the features and converting them with TF-IDF were separate steps. In Python, sklearn has a very useful feature extraction encoder, “TfidfVectorizer”, which combines the extraction and conversion into a single step.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import pickle
# Load the training data and impute missing descriptions
train_data = pd.read_csv("./data/train_set.csv")
train_data["description"] = train_data["description"].fillna("no_description")

# Fit the TF-IDF encoder on the description column
vec = TfidfVectorizer(smooth_idf=True)
tfidf_description = vec.fit_transform(train_data["description"])
Warning
We have some NAs in the description column, which will cause an error in TfidfVectorizer if we don’t impute them. I decided to replace all NAs with “no_description” because an NA may itself be a strong signal of fraud. For instance, a fraudulent posting might have no job description.
# Inspect the resulting feature matrix as a DataFrame
features_name = vec.get_feature_names_out()
df_tfidfvect = pd.DataFrame(data=tfidf_description.toarray(), columns=features_name)
df_tfidfvect.head()
| | aa | aaa | aaab | aab | aabc | aabd | aabf | aac | aaccd | aachen | ... | zumero | zur | zurb | zurich | zusammenarbeitest | zusammenbringt | zweig | zyfax | zyka | zynga |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.164839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 40989 columns
We will save our TfidfVectorizer as a pickle so that we can apply it later to the test dataset.
# Persist the fitted vectorizer for reuse on the test set
with open('./pickle/tfidfvec.pkl', 'wb') as f:
    pickle.dump(vec, f)
Warning
This step is essential: if we don’t use the same encoder on the training and test sets, machine learning algorithms may raise a dimension-mismatch error. For example, if we fit a separate encoder on the test set, a feature such as “planner” that exists in the training vocabulary may not appear in the test vocabulary, so the two feature matrices end up with different numbers of columns. To avoid this error, we should apply the encoder that was fitted on the training set to the test set. Sometimes people combine the train and test sets and run TfidfVectorizer on both to avoid this error, but we must always assume that we don’t have any test set yet.
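As an illustration, at prediction time we would load the pickled vectorizer and call transform (never fit_transform) so the test matrix gets exactly the same columns as the training matrix; the test-set path below mirrors the training code and is an assumption:

import pickle
import pandas as pd

# Load the vectorizer that was fitted on the training set
with open('./pickle/tfidfvec.pkl', 'rb') as f:
    vec = pickle.load(f)

test_data = pd.read_csv("./data/test_set.csv")  # assumed path
test_data["description"] = test_data["description"].fillna("no_description")

# transform() reuses the training vocabulary; fit_transform() would rebuild it
tfidf_test = vec.transform(test_data["description"])
print(tfidf_test.shape)  # (n_test_rows, 40989) -- same width as the training matrix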
We will apply these same steps to the “requirement”, “title”, and “benefit” columns. If you are curious why we exclude “company_profile” even though it also contains text data, please refer to here.
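A loop along these lines could repeat the procedure for the remaining text columns; the column names are taken from the text above and should be checked against the actual CSV headers:

# Hypothetical sketch: one TF-IDF encoder per text column, each saved for the test set
# (relies on the imports and train_data defined earlier)
text_columns = ["requirement", "title", "benefit"]

for col in text_columns:
    train_data[col] = train_data[col].fillna(f"no_{col}")
    col_vec = TfidfVectorizer(smooth_idf=True)
    tfidf_col = col_vec.fit_transform(train_data[col])
    with open(f'./pickle/tfidfvec_{col}.pkl', 'wb') as f:
        pickle.dump(col_vec, f)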