Data Cleaning
Feature creation is one of the most critical parts of this project, since the data largely consists of text, which requires careful text mining and data cleaning. The most common way of creating features from text data is to use word frequencies, and recording frequencies accurately requires careful cleaning. On this page, we will perform data cleaning only on the "description" column and create features from this column alone. Note that we only want to use the text in the description column, not the "company profile", "requirement", and "benefit" columns, since nearly every job posting has a description while the "company profile", "requirement", and "benefit" columns are often missing. Please refer to here if you want to see how many values are missing in each column.
Warning
The other project combined 'description' with 'company profile', 'requirement', and 'benefit' into a single column and performed the analysis on it, but I believe this can confound our results. Even though all of these columns consist of text data, each column should be treated independently and not be combined without a specific reason.
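To make the word-frequency idea concrete, here is a toy sketch of frequency-based features on two made-up sentences. It assumes scikit-learn is available and uses its CountVectorizer purely for illustration; the actual feature-building step for this project happens later, on the cleaned description column.

# Toy illustration of word-frequency features (not part of this pipeline);
# assumes scikit-learn is installed and uses made-up sentences.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["we offer great benefits", "great pay, great benefits"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # ['benefits' 'great' 'offer' 'pay' 'we']
print(X.toarray())                  # [[1 1 1 0 1]
                                    #  [1 2 0 1 0]]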
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
import re, string, unicodedata

# Download the NLTK resources used for stopword removal, tokenization, and stemming
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
data = pd.read_csv("./data/fake_job_postings.csv")
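As a quick sanity check on the missingness claim above, we can count how many postings lack each of these text columns. The column names below are assumed to follow the Kaggle fake_job_postings schema (company_profile, requirements, benefits); adjust them if the file uses different names.

# Count missing values in each text column; column names are assumed
# from the fake_job_postings dataset and may need adjusting.
text_cols = ["description", "company_profile", "requirements", "benefits"]
print(data[text_cols].isna().sum())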
1. Decapitalization
This step must be done before removing stopwords; otherwise, Python will treat "And" and "and" as different words. Decapitalizing the text makes our word frequencies more accurate.
data["description"] = data["description"].str.lower()
2. Remove Insignificant Words and Punctuation
The stopwords and punctuation must be removed since their frequencies would not meaningfully improve prediction. We also remove any strings related to URLs, HTML tags, and emojis. Note that we substitute the removed URLs and HTML tags with the tokens "url" and "html", since the presence of a URL or HTML tag can itself be predictive, so we still want to count them. (For example, many spam emails include more URLs/HTML than ham emails.) Even after these removals, tokens such as "\xa0" and "amp" still appear in some sentences, so we will manually remove those as much as we can. We will also remove all numbers from the text.
# Load the English stopword list from NLTK (kept as a set for fast membership checks)
stop = set(stopwords.words('english'))

def remove_URL(text):
    # The dataset masks URLs as placeholders like #url_...#; replace them with the token "url"
    url = re.compile(r'#url_\w*#')
    return url.sub(r'url ', str(text))

def remove_emoji(text):
    # Strip emoji and pictographic characters
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r' ', str(text))

def remove_html(text):
    # Replace HTML tags with the token "html" so their presence is still counted
    html = re.compile(r'<.*?>')
    return html.sub(r'html ', str(text))

def remove_stopwords(words):
    # Drop common English stopwords
    return ' '.join(word for word in str(words).split() if word not in stop)

def remove_punctuation(words):
    # Strip punctuation attached to the start or end of each word
    return ' '.join(word.strip(string.punctuation) for word in str(words).split())

def remove_dirty_words(words):
    # Remove non-ASCII characters (e.g. \xa0), ampersands, digits, and other non-word symbols
    dirty_words = re.compile(r'[^\x00-\x7F]+|(&)|\d|[^\w\s]')
    return dirty_words.sub(r' ', str(words))
data["description"] = data.description.apply(remove_URL)
data["description"] = data.description.apply(remove_html)
data["description"] = data.description.apply(remove_emoji)
data["description"] = data.description.apply(remove_dirty_words)
data["description"] = data.description.apply(remove_punctuation)
data["description"] = data.description.apply(remove_stopwords)
3. Lemmatization vs. Stemming
The primary goal of data cleaning in this project is to get accurate word frequencies. Therefore, we need lemmatization or stemming so that words such as "love" and "loving" are counted as the same word. In the original project, we used lemmatization instead of stemming, but this time we will use stemming, for these reasons:
The stemming algorithm is more efficient and faster than lemmatization. Both methods reduce a word to its root, but lemmatization looks words up in the WordNet corpus to produce a valid lemma, which makes it slower than the rule-based suffix stripping that stemming performs.
Stemming is enough for this project. The main difference between the two methods is that lemmatization returns an actual dictionary word, while stemming does not guarantee one. Since we do not need real words, only consistent tokens, stemming is the more straightforward choice (see the short comparison below).
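To make this trade-off concrete, here is a small comparison sketch on a few sample words, pitting the Porter stemmer against NLTK's WordNetLemmatizer (the kind of lemmatizer the original project relied on). Note that the lemmatizer needs a part-of-speech hint to handle verb forms well, while the stemmer may return non-words such as "studi".

# Compare stemming and lemmatization on a few sample words
from nltk.stem import WordNetLemmatizer, PorterStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in ["loving", "studies", "benefits"]:
    # pos='v' tells the lemmatizer to treat each word as a verb
    print(w, "-> stem:", stemmer.stem(w), "| lemma:", lemmatizer.lemmatize(w, pos="v"))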
Note
The original project used lemmatization without giving a reason.
porter = PorterStemmer()

def stemSentence(sentence):
    # Tokenize the sentence and stem each token with the Porter stemmer
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)
data["description"] = data.description.apply(stemSentence)
We will store this cleaned dataset as a CSV file so that we can use it in the next step!
data.to_csv("./data/cleaned_fake_job_postings.csv")
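One small caveat, left here as an optional sketch: to_csv writes the DataFrame index as an extra unnamed column by default, so when loading the file in the next step it may be worth dropping it again with index_col=0 (or saving with index=False instead).

# Optional sanity check: reload the saved file and drop the written index column
cleaned = pd.read_csv("./data/cleaned_fake_job_postings.csv", index_col=0)
print(cleaned["description"].head())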