{ "cells": [ { "cell_type": "markdown", "id": "af3b7839-37f7-43ce-a12e-ccb723c1774e", "metadata": {}, "source": [ "# Feature Selection" ] }, { "cell_type": "markdown", "id": "8d477afd-c2ab-46e4-9472-9f2000595d20", "metadata": {}, "source": [ "In this chapter, we will carefully examine our pre-processed training dataset and select the best features for machine learning algorithms. I have already processed the training dataset and saved it with joblib. See [here](Pipeline.ipynb) if you want to know the whole process. Since the dataframe was massive, I had to split it into three separate dataframes to save it." ] }, { "cell_type": "code", "execution_count": 2, "id": "c1dff93c-c6c2-4ad3-bfdc-13a96a43d5a2", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "import pandas as pd\n", "import pickle\n", "import joblib" ] }, { "cell_type": "code", "execution_count": 3, "id": "9dcad22e-cbb9-48d8-bdae-6675cbb2935e", "metadata": {}, "outputs": [], "source": [ "text_features_train = joblib.load('./data/text_features_train_jlib')\n", "OHE_features_train = joblib.load('./data/OHE_features_train_jlib')\n", "processed_train = joblib.load('./data/processed_train_jlib')" ] }, { "cell_type": "markdown", "id": "e641e902-6fa2-4b2b-8333-22dfea106c0a", "metadata": {}, "source": [ "However, here we face a big problem. If we try to combine these three dataframes into a single dataframe by doing\n", "\n", "```python\n", "train_features = pd.concat([text_features_train, OHE_features_train, processed_train], axis=1)\n", "```\n", "\n", "then we get a **MemoryError**, because **text_features_train is a massive dataframe with 89527 columns (10 GB).**" ] }, { "cell_type": "code", "execution_count": 4, "id": "76f5de60-ba66-4b3c-9fbf-615f05fedf6c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | aa_desc | \n", "aaa_desc | \n", "aaab_desc | \n", "aab_desc | \n", "aabc_desc | \n", "aabd_desc | \n", "aabf_desc | \n", "aac_desc | \n", "aaccd_desc | \n", "aachen_desc | \n", "... | \n", "zodat_benefits | \n", "zollman_benefits | \n", "zombi_benefits | \n", "zone_benefits | \n", "zoo_benefits | \n", "zowel_benefits | \n", "zu_benefits | \n", "zult_benefits | \n", "zutrifft_benefits | \n", "zweig_benefits | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "0.165596 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
1 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
2 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
3 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
4 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
14299 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
14300 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
14301 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
14302 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
14303 | \n", "0.000000 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
14304 rows × 89527 columns
\n", "