Label Encoding vs. One-Hot Encoding

Label encoding and one-hot encoding are two popular ways to process categorical variables in a dataset. This step is required for machine learning analysis because many algorithms cannot handle categorical data directly. Each method has its pros and cons, so we need to choose the one that suits our dataset best.

Label Encoding

  • Label encoding assigns an integer to each categorical value based on alphabetical order.

  • Pro: Label encoding has a significantly lower computational cost: it does not add any columns, so the dimensions of the feature matrix stay the same.

  • Con: Label encoding creates an unnecessary ranking between categorical values. For example, if the number 1 is assigned to “doctor” and the number 2 is assigned to “intern”, the encoding implies that an intern is twice as important as a doctor. The problem gets worse as the number of categorical values grows.

  • Label encoding can be useful if the categorical values are ordinal. For example, with ordinal levels such as Not Satisfied, Satisfied, and Highly Satisfied, assigning numbers to the levels is not a big problem. Even then, interpretation gets harder: if the number 1 is assigned to “Not Satisfied” and the number 3 to “Highly Satisfied”, can we really say that “Highly Satisfied” is three times as strong as “Not Satisfied”? (A minimal sketch of label encoding follows this list.)
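
A minimal sketch of this behavior, using scikit-learn's LabelEncoder on a hypothetical job-title list (the titles are illustrative, not from our dataset):

from sklearn.preprocessing import LabelEncoder

titles = ["intern", "doctor", "nurse", "doctor"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(titles)

print(encoder.classes_)  # ['doctor' 'intern' 'nurse'] -- categories sorted alphabetically
print(encoded)           # [1 0 2 0] -- 'doctor' -> 0, 'intern' -> 1, 'nurse' -> 2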

One Hot Encoding

  • One-hot encoding creates dummy variables that use 0 and 1 to indicate which categorical value each row takes.

  • Pro: It is a more straightforward way of representing categorical values, and it does not impose any ranking between categories, so it works with any categorical variable. For a random forest model, using OHE can also make the model simpler (more generalized) and better. For example, suppose I label encode the sex feature (male: 1, female: 2, other: 3). For the tree to isolate “female”, it needs two branches: one that selects values greater than 1 and one that selects values less than 3. Using OHE, the tree only needs one branch that selects 1 in the female column (see the sketch after this list).

  • Con: One-hot encoding can be computationally costly since it creates one dummy column per category, which results in a larger and sparser feature matrix.
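
A minimal sketch of the branching example above, with a hypothetical sex column (pandas' get_dummies is used here for brevity; the dataset itself is encoded with scikit-learn's OneHotEncoder below):

import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "other", "female"]})

# One 0/1 dummy column per category: a tree can isolate "female" with a single
# split on sex_female, instead of two threshold splits on an integer code.
dummies = pd.get_dummies(df["sex"], prefix="sex", dtype=int)
print(dummies)
#    sex_female  sex_male  sex_other
# 0           0         1          0
# 1           1         0          0
# 2           0         0          1
# 3           1         0          0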

For our dataset, I will use one-hot encoding instead of label encoding for two reasons:

  1. None of the categorical columns in our dataset are ordinal. Although some columns, such as function, contain many categorical values, I expect a better result if we apply OHE first and then select features to reduce the dimensionality (a sketch of this idea follows the list).

  2. For this project, I want to improve the performance of the random forest model. Using OHE instead of label encoding lets the trees split on individual categories, which should keep the model simpler and help it generalize better.
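
A sketch of the feature-selection idea in point 1. The data here is synthetic and stands in for the one-hot-encoded matrix, and the "mean importance" threshold is illustrative, not the project's final pipeline:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a 38-column one-hot-encoded feature matrix.
X, y = make_classification(n_samples=200, n_features=38, random_state=0)

# Keep only the columns whose random forest importance exceeds the mean importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(rf, threshold="mean").fit(X, y)
print(X.shape, "->", selector.transform(X).shape)  # (200, 38) -> (200, k); k depends on the fit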

Example: One-Hot Encoding the function column

from sklearn.preprocessing import OneHotEncoder
import pandas as pd 
import pickle

train_data = pd.read_csv("./data/train_set.csv")

# Treat missing values as their own "NAN" category so the encoder keeps them.
train_data["function"] = train_data["function"].fillna("NAN")

# Fit the encoder on the training column; fit_transform returns a sparse matrix.
encoder = OneHotEncoder()
encode_function = encoder.fit_transform(train_data[['function']])

# Build a readable DataFrame with one dummy column per function value.
feature_name = encoder.get_feature_names_out()
encoder_df = pd.DataFrame(encode_function.toarray(), columns=feature_name)
encoder_df.head(10)
   function_Accounting/Auditing  function_Administrative  ...  function_Customer Service  ...  function_Sales  ...  function_Writing/Editing
0                           0.0                      0.0  ...                        1.0  ...             0.0  ...                       0.0
1                           0.0                      0.0  ...                        1.0  ...             0.0  ...                       0.0
2                           0.0                      0.0  ...                        0.0  ...             0.0  ...                       0.0
3                           0.0                      0.0  ...                        0.0  ...             1.0  ...                       0.0
4                           0.0                      0.0  ...                        0.0  ...             0.0  ...                       0.0
5                           0.0                      0.0  ...                        0.0  ...             1.0  ...                       0.0
6                           0.0                      0.0  ...                        0.0  ...             0.0  ...                       0.0
7                           0.0                      0.0  ...                        1.0  ...             0.0  ...                       0.0
8                           0.0                      0.0  ...                        0.0  ...             0.0  ...                       0.0
9                           0.0                      0.0  ...                        0.0  ...             0.0  ...                       0.0

10 rows × 38 columns (each row has a single 1.0 in the dummy column for its function value: Customer Service for rows 0, 1, and 7, Sales for rows 3 and 5, and columns hidden by the ... for the rest)

We will save the encoder as a pickle so that we can apply the same encoder to the test set.

with open('./pickle/function_encode.pkl', 'wb') as f:
    pickle.dump(encoder, f)
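
A minimal sketch of the intended reuse (the test-set path ./data/test_set.csv is an assumption that mirrors the training path):

with open('./pickle/function_encode.pkl', 'rb') as f:
    encoder = pickle.load(f)

# Use transform(), not fit_transform(), so the dummy columns line up with training.
test_data = pd.read_csv("./data/test_set.csv")  # assumed path
test_data["function"] = test_data["function"].fillna("NAN")
encode_test = encoder.transform(test_data[['function']])

Note that transform() raises an error if the test set contains a function value never seen during training; constructing the encoder as OneHotEncoder(handle_unknown="ignore") avoids this.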