{ "cells": [ { "cell_type": "markdown", "id": "af3b7839-37f7-43ce-a12e-ccb723c1774e", "metadata": {}, "source": [ "# Feature Selection" ] }, { "cell_type": "markdown", "id": "8d477afd-c2ab-46e4-9472-9f2000595d20", "metadata": {}, "source": [ "In this chapter, we carefully examine our pre-processed training dataset and select the best features for the machine learning algorithms. I have already processed the training dataset and saved it with joblib; see [here](Pipeline.ipynb) for the whole process. Because the dataframe was massive, I had to split it into three dataframes to save it." ] }, { "cell_type": "code", "execution_count": 2, "id": "c1dff93c-c6c2-4ad3-bfdc-13a96a43d5a2", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "import pandas as pd\n", "import pickle\n", "import joblib" ] }, { "cell_type": "code", "execution_count": 3, "id": "9dcad22e-cbb9-48d8-bdae-6675cbb2935e", "metadata": {}, "outputs": [], "source": [ "text_features_train = joblib.load('./data/text_features_train_jlib')\n", "OHE_features_train = joblib.load('./data/OHE_features_train_jlib')\n", "processed_train = joblib.load('./data/processed_train_jlib')" ] }, { "cell_type": "markdown", "id": "e641e902-6fa2-4b2b-8333-22dfea106c0a", "metadata": {}, "source": [ "However, we face a big problem here. If we try to combine these three dataframes into a single dataframe with\n", "\n", "```python\n", "train_features = pd.concat([text_features_train, OHE_features_train, processed_train], axis=1)\n", "```\n", "\n", "we get a **MemoryError**, because **`text_features_train` is a massive dataframe with 89527 columns (about 10 GB).**" ] }, { "cell_type": "code", "execution_count": 4, "id": "76f5de60-ba66-4b3c-9fbf-615f05fedf6c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>aa_desc</th><th>aaa_desc</th><th>aaab_desc</th><th>aab_desc</th><th>aabc_desc</th><th>aabd_desc</th><th>aabf_desc</th><th>aac_desc</th><th>aaccd_desc</th><th>aachen_desc</th><th>...</th><th>zodat_benefits</th><th>zollman_benefits</th><th>zombi_benefits</th><th>zone_benefits</th><th>zoo_benefits</th><th>zowel_benefits</th><th>zu_benefits</th><th>zult_benefits</th><th>zutrifft_benefits</th><th>zweig_benefits</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0.165596</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>14299</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>14300</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>14301</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>14302</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>14303</th><td>0.000000</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>14304 rows × 89527 columns</p>\n",
"</div>
" ], "text/plain": [ " aa_desc aaa_desc aaab_desc aab_desc aabc_desc aabd_desc \\\n", "0 0.165596 0.0 0.0 0.0 0.0 0.0 \n", "1 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "2 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "3 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "4 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... \n", "14299 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "14300 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "14301 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "14302 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "14303 0.000000 0.0 0.0 0.0 0.0 0.0 \n", "\n", " aabf_desc aac_desc aaccd_desc aachen_desc ... zodat_benefits \\\n", "0 0.0 0.0 0.0 0.0 ... 0.0 \n", "1 0.0 0.0 0.0 0.0 ... 0.0 \n", "2 0.0 0.0 0.0 0.0 ... 0.0 \n", "3 0.0 0.0 0.0 0.0 ... 0.0 \n", "4 0.0 0.0 0.0 0.0 ... 0.0 \n", "... ... ... ... ... ... ... \n", "14299 0.0 0.0 0.0 0.0 ... 0.0 \n", "14300 0.0 0.0 0.0 0.0 ... 0.0 \n", "14301 0.0 0.0 0.0 0.0 ... 0.0 \n", "14302 0.0 0.0 0.0 0.0 ... 0.0 \n", "14303 0.0 0.0 0.0 0.0 ... 0.0 \n", "\n", " zollman_benefits zombi_benefits zone_benefits zoo_benefits \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... \n", "14299 0.0 0.0 0.0 0.0 \n", "14300 0.0 0.0 0.0 0.0 \n", "14301 0.0 0.0 0.0 0.0 \n", "14302 0.0 0.0 0.0 0.0 \n", "14303 0.0 0.0 0.0 0.0 \n", "\n", " zowel_benefits zu_benefits zult_benefits zutrifft_benefits \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... \n", "14299 0.0 0.0 0.0 0.0 \n", "14300 0.0 0.0 0.0 0.0 \n", "14301 0.0 0.0 0.0 0.0 \n", "14302 0.0 0.0 0.0 0.0 \n", "14303 0.0 0.0 0.0 0.0 \n", "\n", " zweig_benefits \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "... ... 
\n", "14299 0.0 \n", "14300 0.0 \n", "14301 0.0 \n", "14302 0.0 \n", "14303 0.0 \n", "\n", "[14304 rows x 89527 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_features_train" ] }, { "cell_type": "markdown", "id": "761884f7-fcea-4c36-abe6-8771929d1b0e", "metadata": {}, "source": [ "This means that we cannot perform any supervised feature selection on the combined data until we first select features from `text_features_train`. We must reduce its dimensionality significantly.\n", "\n", "**In this section, we discuss how to reduce the dimensionality of `text_features_train` and which features to select to get efficient and precise machine learning results.**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }