{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b33a2141-9cee-4595-811a-cb25e36c56c8",
   "metadata": {},
   "source": [
    "# Text Feature Selection"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f871ae0-1a1b-4927-93a4-16556e69dcc6",
   "metadata": {},
   "source": [
    "As discussed in the [feature selection notebook](featureselection.ipynb), we must perform feature selection on the text features first, because their massive file size (10GB) causes a `MemoryError`. Since we cannot use the `fraudulent` column, we will first select features by column mean, based on several assumptions. Once we have a smaller version of `text_features_train`, we will combine it with the `fraudulent` column and perform supervised feature selection using the chi-square statistic."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f9f3725-c151-43ea-af1b-e4069d7d393a",
   "metadata": {},
   "source": [
    "### Feature Selection Using Column Mean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d5643cb1-502b-484d-a009-1ef6acdb048b",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd \n",
    "import joblib\n",
    "text_features_train = joblib.load('./data/text_features_train_jlib')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ef0d544b-1234-454d-940a-5fd177ea517a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>aa_desc</th>\n",
       "      <th>aaa_desc</th>\n",
       "      <th>aaab_desc</th>\n",
       "      <th>aab_desc</th>\n",
       "      <th>aabc_desc</th>\n",
       "      <th>aabd_desc</th>\n",
       "      <th>aabf_desc</th>\n",
       "      <th>aac_desc</th>\n",
       "      <th>aaccd_desc</th>\n",
       "      <th>aachen_desc</th>\n",
       "      <th>...</th>\n",
       "      <th>zodat_benefits</th>\n",
       "      <th>zollman_benefits</th>\n",
       "      <th>zombi_benefits</th>\n",
       "      <th>zone_benefits</th>\n",
       "      <th>zoo_benefits</th>\n",
       "      <th>zowel_benefits</th>\n",
       "      <th>zu_benefits</th>\n",
       "      <th>zult_benefits</th>\n",
       "      <th>zutrifft_benefits</th>\n",
       "      <th>zweig_benefits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.165596</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 89527 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    aa_desc  aaa_desc  aaab_desc  aab_desc  aabc_desc  aabd_desc  aabf_desc  \\\n",
       "0  0.165596       0.0        0.0       0.0        0.0        0.0        0.0   \n",
       "1  0.000000       0.0        0.0       0.0        0.0        0.0        0.0   \n",
       "2  0.000000       0.0        0.0       0.0        0.0        0.0        0.0   \n",
       "3  0.000000       0.0        0.0       0.0        0.0        0.0        0.0   \n",
       "4  0.000000       0.0        0.0       0.0        0.0        0.0        0.0   \n",
       "\n",
       "   aac_desc  aaccd_desc  aachen_desc  ...  zodat_benefits  zollman_benefits  \\\n",
       "0       0.0         0.0          0.0  ...             0.0               0.0   \n",
       "1       0.0         0.0          0.0  ...             0.0               0.0   \n",
       "2       0.0         0.0          0.0  ...             0.0               0.0   \n",
       "3       0.0         0.0          0.0  ...             0.0               0.0   \n",
       "4       0.0         0.0          0.0  ...             0.0               0.0   \n",
       "\n",
       "   zombi_benefits  zone_benefits  zoo_benefits  zowel_benefits  zu_benefits  \\\n",
       "0             0.0            0.0           0.0             0.0          0.0   \n",
       "1             0.0            0.0           0.0             0.0          0.0   \n",
       "2             0.0            0.0           0.0             0.0          0.0   \n",
       "3             0.0            0.0           0.0             0.0          0.0   \n",
       "4             0.0            0.0           0.0             0.0          0.0   \n",
       "\n",
       "   zult_benefits  zutrifft_benefits  zweig_benefits  \n",
       "0            0.0                0.0             0.0  \n",
       "1            0.0                0.0             0.0  \n",
       "2            0.0                0.0             0.0  \n",
       "3            0.0                0.0             0.0  \n",
       "4            0.0                0.0             0.0  \n",
       "\n",
       "[5 rows x 89527 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_features_train.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1099e1c-2093-4479-8f66-bb6538ee830b",
   "metadata": {},
   "source": [
    "Let's look carefully at the dataframe above. Some of these features have unusual names, such as `zutrifft` and `aabd`, and it is hard to believe that they were stemmed from standard English words. There are two main reasons why such names appear as features. \n",
    "\n",
    "1. Although we removed URLs and HTML markup in the preprocessing step, some of it may not have been removed completely. The data can also include other non-English tokens, such as email addresses or file names.  \n",
    "2. In the original dataset, the text was saved with no space between lines. For example, \"I love dog\" (line 1) and \"Cat ate fish\" (line 2) become \"I love dogCat ate fish\", which creates the abnormal word \"dogCat\".  \n",
    "\n",
    "**A simple way to remove these unusual words at low computational cost is to compute the column means and filter out the features with exceptionally low means.** This is based on the assumption that unusual words appear less frequently than normal words. For instance, a word like \"havecommunication\" will not appear frequently across the dataset, and a word that appears infrequently has a low mean.\n",
    "\n",
    "```{warning}\n",
    "This method also rests on the dangerous assumption that low-frequency words are less important than more frequent words. Feature selection using column means can eliminate words that are important for machine learning. However, since we have more than 10,000 unusual words as features, I am trading some accuracy for efficiency. **I am aware that feature selection has to be done very carefully, and this is not the best way.**\n",
    "```\n",
    "\n",
    "However, we must not apply this column-mean selection to the entire dataset at once, since `text_features_train` combines features from four different text columns: `description`, `title`, `requirements` and `benefits`. Because TF-IDF values vary with the characteristics of each text column, we compute the column means and select features separately for each group, then combine the results later. "
   ]
  },
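  {
   "cell_type": "markdown",
   "id": "sketch-column-mean-filter",
   "metadata": {},
   "source": [
    "The per-group filtering described above can be sketched on a toy example. This is only an illustration under stated assumptions: the `toy` frame, its values, the `filter_by_column_mean` helper, and the 0.05 threshold are all made up and are not part of this project's data or code.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy TF-IDF matrix; the column suffixes mimic the real layout.\n",
    "toy = pd.DataFrame({\n",
    "    'work_desc':   [0.20, 0.10, 0.30],\n",
    "    'dogcat_desc': [0.00, 0.00, 0.03],\n",
    "    'skill_req':   [0.40, 0.00, 0.20],\n",
    "    'zeta_req':    [0.00, 0.06, 0.00],\n",
    "})\n",
    "\n",
    "def filter_by_column_mean(df, suffix, threshold):\n",
    "    # Keep only the columns ending in `suffix` whose mean exceeds `threshold`.\n",
    "    group = df.loc[:, df.columns.str.endswith(suffix)]\n",
    "    return group.loc[:, group.mean() > threshold]\n",
    "\n",
    "# Filter each group on its own, then put the survivors back together.\n",
    "kept = pd.concat(\n",
    "    [filter_by_column_mean(toy, s, 0.05) for s in ('_desc', '_req')],\n",
    "    axis=1,\n",
    ")\n",
    "print(list(kept.columns))\n",
    "```\n",
    "\n",
    "Filtering each suffix group separately means every text column can be given its own threshold before the groups are concatenated."
   ]
  },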
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "76591b4e-127f-4d56-ba9c-5d5ac32234a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(text_features_train.columns.str.contains('_desc'))\n",
    "text_features_desc = text_features_train.iloc[:, 0:40607]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e122c9ec-ec33-44d5-84d6-aa3d637c1408",
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(text_features_train.columns.str.contains('_req'))\n",
    "text_features_req = text_features_train.iloc[:, 40607:74863]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "93fa71de-6a8d-4f67-a0fa-ed8208af4875",
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(text_features_train.columns.str.contains('_title'))\n",
    "text_features_title = text_features_train.iloc[:, 74863:78280]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bac9cd13-0590-40ea-b9f2-0e4fafd79392",
   "metadata": {},
   "outputs": [],
   "source": [
    "sum(text_features_train.columns.str.contains('_benefits'))\n",
    "text_features_benefits = text_features_train.iloc[:, 78280:89527]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "371a6487-037a-48b7-8b5a-677e33b35847",
   "metadata": {},
   "source": [
    "```{note}\n",
    "We are seperating the dataframe like this to avoid the MemoryError. \n",
    "```"
   ]
  },
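  {
   "cell_type": "markdown",
   "id": "sketch-suffix-slice-bounds",
   "metadata": {},
   "source": [
    "The slice boundaries typed by hand above could, in principle, be derived from the suffix counts. The following is a minimal sketch, assuming the columns are stored group by group in the order `_desc`, `_req`, `_title`, `_benefits`; the `cols` index and the `bounds` dictionary are hypothetical names used only for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical column index standing in for `text_features_train.columns`.\n",
    "cols = pd.Index(['a_desc', 'b_desc', 'a_req', 'a_title', 'b_title', 'a_benefits'])\n",
    "\n",
    "bounds = {}\n",
    "start = 0\n",
    "for suffix in ('_desc', '_req', '_title', '_benefits'):\n",
    "    # Number of columns in this group.\n",
    "    count = int(cols.str.endswith(suffix).sum())\n",
    "    # Half-open (start, stop) positions, usable as `.iloc[:, start:stop]`.\n",
    "    bounds[suffix] = (start, start + count)\n",
    "    start += count\n",
    "\n",
    "print(bounds)\n",
    "```\n",
    "\n",
    "If the final `start` equals the total number of columns, the four groups account for every feature."
   ]
  },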
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "262a772c-5d0d-471f-98cb-581ba3da6c11",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "work_desc       3.387162e-02\n",
       "develop_desc    3.251693e-02\n",
       "team_desc       3.203152e-02\n",
       "manag_desc      3.167333e-02\n",
       "custom_desc     3.149140e-02\n",
       "                    ...     \n",
       "peugeot_desc    9.292221e-07\n",
       "bencki_desc     9.292221e-07\n",
       "sanofi_desc     9.292221e-07\n",
       "qmetric_desc    5.049230e-07\n",
       "gra_desc        5.049230e-07\n",
       "Length: 40607, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_desc = text_features_desc.sum() / 14304\n",
    "mean_desc.sort_values(ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "200144b0-6d17-4d45-9532-55f11f39047d",
   "metadata": {},
   "source": [
    "The column means for features from `description` dataset shows that the assumptions we made previously are somewhat reasonable. As we see here, more the average, the words look more normal, such as \"work\" and \"develope\".\n",
    "\n",
    "Since the highest mean is 0.03387162, let's choose all features with average mean higher than 0.002. \n",
    "\n",
    "```{note}\n",
    "I tried many different thresholds and figured out 0.002 is the best number.\n",
    "```"
   ]
  },
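  {
   "cell_type": "markdown",
   "id": "sketch-threshold-sweep",
   "metadata": {},
   "source": [
    "One way to compare candidate thresholds is to count how many features survive each one. The sketch below uses made-up means and hypothetical names (`means`, `counts`); it is not the procedure that was actually used to choose 0.002.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up column means; in the notebook these would come from `mean_desc`.\n",
    "means = pd.Series({\n",
    "    'common_desc': 0.034,\n",
    "    'frequent_desc': 0.032,\n",
    "    'borderline_desc': 0.0025,\n",
    "    'rare_desc': 0.0000009,\n",
    "})\n",
    "\n",
    "# Number of features kept at each candidate threshold.\n",
    "counts = {t: int((means > t).sum()) for t in (0.001, 0.002, 0.003)}\n",
    "print(counts)\n",
    "```\n",
    "\n",
    "Inspecting which words sit just above and just below a threshold is a useful sanity check alongside the raw counts."
   ]
  },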
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "457311ff-0e89-4549-84fc-2578e6eacea8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abil_desc</th>\n",
       "      <th>abl_desc</th>\n",
       "      <th>accept_desc</th>\n",
       "      <th>access_desc</th>\n",
       "      <th>accord_desc</th>\n",
       "      <th>account_desc</th>\n",
       "      <th>accur_desc</th>\n",
       "      <th>achiev_desc</th>\n",
       "      <th>acquisit_desc</th>\n",
       "      <th>across_desc</th>\n",
       "      <th>...</th>\n",
       "      <th>without_desc</th>\n",
       "      <th>word_desc</th>\n",
       "      <th>work_desc</th>\n",
       "      <th>world_desc</th>\n",
       "      <th>would_desc</th>\n",
       "      <th>write_desc</th>\n",
       "      <th>written_desc</th>\n",
       "      <th>year_desc</th>\n",
       "      <th>york_desc</th>\n",
       "      <th>young_desc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.043316</td>\n",
       "      <td>0.04441</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.060186</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.021943</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.038188</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.050161</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.025411</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.044223</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.052053</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.036083</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.069462</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.092432</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 812 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   abil_desc  abl_desc  accept_desc  access_desc  accord_desc  account_desc  \\\n",
       "0   0.043316   0.04441          0.0          0.0          0.0      0.000000   \n",
       "1   0.000000   0.00000          0.0          0.0          0.0      0.000000   \n",
       "2   0.050161   0.00000          0.0          0.0          0.0      0.000000   \n",
       "3   0.000000   0.00000          0.0          0.0          0.0      0.052053   \n",
       "4   0.000000   0.00000          0.0          0.0          0.0      0.000000   \n",
       "\n",
       "   accur_desc  achiev_desc  acquisit_desc  across_desc  ...  without_desc  \\\n",
       "0         0.0          0.0            0.0     0.000000  ...      0.060186   \n",
       "1         0.0          0.0            0.0     0.000000  ...      0.000000   \n",
       "2         0.0          0.0            0.0     0.000000  ...      0.000000   \n",
       "3         0.0          0.0            0.0     0.000000  ...      0.000000   \n",
       "4         0.0          0.0            0.0     0.036083  ...      0.000000   \n",
       "\n",
       "   word_desc  work_desc  world_desc  would_desc  write_desc  written_desc  \\\n",
       "0        0.0   0.021943    0.000000         0.0         0.0           0.0   \n",
       "1        0.0   0.000000    0.000000         0.0         0.0           0.0   \n",
       "2        0.0   0.025411    0.000000         0.0         0.0           0.0   \n",
       "3        0.0   0.000000    0.000000         0.0         0.0           0.0   \n",
       "4        0.0   0.000000    0.069462         0.0         0.0           0.0   \n",
       "\n",
       "   year_desc  york_desc  young_desc  \n",
       "0   0.038188        0.0         0.0  \n",
       "1   0.000000        0.0         0.0  \n",
       "2   0.044223        0.0         0.0  \n",
       "3   0.000000        0.0         0.0  \n",
       "4   0.092432        0.0         0.0  \n",
       "\n",
       "[5 rows x 812 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "select_desc = mean_desc > 0.002\n",
    "selected_features_desc = text_features_desc.loc[:, select_desc]\n",
    "selected_features_desc.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "433133cb-4584-4dfe-b59a-d997e1435186",
   "metadata": {},
   "source": [
    "This looks much better. We will repeat the process for the other dataframes as well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "21b24224-4225-4bff-a90f-294d33f74500",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "experi_req       0.052954\n",
       "work_req         0.034828\n",
       "skill_req        0.033378\n",
       "requir_req       0.031967\n",
       "year_req         0.027195\n",
       "                   ...   \n",
       "cano_req         0.000002\n",
       "mcnz_req         0.000002\n",
       "orthopaed_req    0.000002\n",
       "inhabit_req      0.000002\n",
       "zeta_req         0.000001\n",
       "Length: 34256, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_req = text_features_req.sum() / 14304\n",
    "mean_req.sort_values(ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0c9f23c-1dd1-4761-b121-5e777bb16dcc",
   "metadata": {},
   "source": [
    "Since the TF-IDF means are a bit higher for `text_features_req`, we will raise the threshold slightly (to 0.003) to account for the difference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e1a1a2b5-f1c9-4736-a56e-c8c276276770",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abil_req</th>\n",
       "      <th>abl_req</th>\n",
       "      <th>account_req</th>\n",
       "      <th>across_req</th>\n",
       "      <th>activ_req</th>\n",
       "      <th>adapt_req</th>\n",
       "      <th>addit_req</th>\n",
       "      <th>administr_req</th>\n",
       "      <th>adob_req</th>\n",
       "      <th>advanc_req</th>\n",
       "      <th>...</th>\n",
       "      <th>willing_req</th>\n",
       "      <th>window_req</th>\n",
       "      <th>within_req</th>\n",
       "      <th>without_req</th>\n",
       "      <th>word_req</th>\n",
       "      <th>work_req</th>\n",
       "      <th>would_req</th>\n",
       "      <th>write_req</th>\n",
       "      <th>written_req</th>\n",
       "      <th>year_req</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.081650</td>\n",
       "      <td>0.105596</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.064795</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.068928</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.061549</td>\n",
       "      <td>0.151344</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.063024</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.033522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.047567</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.087269</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.075495</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.054646</td>\n",
       "      <td>0.040155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.072623</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.038628</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 350 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   abil_req   abl_req  account_req  across_req  activ_req  adapt_req  \\\n",
       "0  0.081650  0.105596          0.0         0.0        0.0        0.0   \n",
       "1  0.000000  0.000000          0.0         0.0        0.0        0.0   \n",
       "2  0.000000  0.000000          0.0         0.0        0.0        0.0   \n",
       "3  0.047567  0.000000          0.0         0.0        0.0        0.0   \n",
       "4  0.000000  0.000000          0.0         0.0        0.0        0.0   \n",
       "\n",
       "   addit_req  administr_req  adob_req  advanc_req  ...  willing_req  \\\n",
       "0        0.0            0.0       0.0         0.0  ...          0.0   \n",
       "1        0.0            0.0       0.0         0.0  ...          0.0   \n",
       "2        0.0            0.0       0.0         0.0  ...          0.0   \n",
       "3        0.0            0.0       0.0         0.0  ...          0.0   \n",
       "4        0.0            0.0       0.0         0.0  ...          0.0   \n",
       "\n",
       "   window_req  within_req  without_req  word_req  work_req  would_req  \\\n",
       "0    0.000000    0.000000     0.000000       0.0  0.064795        0.0   \n",
       "1    0.000000    0.061549     0.151344       0.0  0.063024        0.0   \n",
       "2    0.000000    0.000000     0.000000       0.0  0.000000        0.0   \n",
       "3    0.087269    0.000000     0.000000       0.0  0.075495        0.0   \n",
       "4    0.000000    0.000000     0.000000       0.0  0.072623        0.0   \n",
       "\n",
       "   write_req  written_req  year_req  \n",
       "0        0.0     0.000000  0.068928  \n",
       "1        0.0     0.000000  0.033522  \n",
       "2        0.0     0.000000  0.000000  \n",
       "3        0.0     0.054646  0.040155  \n",
       "4        0.0     0.000000  0.038628  \n",
       "\n",
       "[5 rows x 350 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "select_req = mean_req > 0.003\n",
    "selected_features_req = text_features_req.loc[:, select_req]\n",
    "selected_features_req.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b11df688-7adf-4a49-985c-6124a0679532",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "manag_title           0.048922\n",
       "develop_title         0.046175\n",
       "engin_title           0.038562\n",
       "sale_title            0.029780\n",
       "servic_title          0.024249\n",
       "                        ...   \n",
       "maharashtra_title     0.000024\n",
       "barri_title           0.000024\n",
       "peterborough_title    0.000024\n",
       "haliburton_title      0.000024\n",
       "elgin_title           0.000021\n",
       "Length: 3417, dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
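   ],
   "source": [
    "# Column means over all training rows (14,304 here), without hard-coding the row count.\n",
    "mean_title = text_features_title.sum() / len(text_features_title)\n",
    "mean_title.sort_values(ascending=False)"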
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7dcd6a44-320c-48d0-ab80-edf973195be4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abroad_title</th>\n",
       "      <th>account_title</th>\n",
       "      <th>admin_title</th>\n",
       "      <th>administr_title</th>\n",
       "      <th>agent_title</th>\n",
       "      <th>analyst_title</th>\n",
       "      <th>android_title</th>\n",
       "      <th>applic_title</th>\n",
       "      <th>apprenticeship_title</th>\n",
       "      <th>architect_title</th>\n",
       "      <th>...</th>\n",
       "      <th>system_title</th>\n",
       "      <th>teacher_title</th>\n",
       "      <th>team_title</th>\n",
       "      <th>technic_title</th>\n",
       "      <th>technician_title</th>\n",
       "      <th>time_title</th>\n",
       "      <th>ui_title</th>\n",
       "      <th>ux_title</th>\n",
       "      <th>web_title</th>\n",
       "      <th>year_title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.440388</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 86 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   abroad_title  account_title  admin_title  administr_title  agent_title  \\\n",
       "0           0.0            0.0          0.0              0.0          0.0   \n",
       "1           0.0            0.0          0.0              0.0          0.0   \n",
       "2           0.0            0.0          0.0              0.0          0.0   \n",
       "3           0.0            0.0          0.0              0.0          0.0   \n",
       "4           0.0            0.0          0.0              0.0          0.0   \n",
       "\n",
       "   analyst_title  android_title  applic_title  apprenticeship_title  \\\n",
       "0       0.000000            0.0           0.0                   0.0   \n",
       "1       0.000000            0.0           0.0                   0.0   \n",
       "2       0.440388            0.0           0.0                   0.0   \n",
       "3       0.000000            0.0           0.0                   0.0   \n",
       "4       0.000000            0.0           0.0                   0.0   \n",
       "\n",
       "   architect_title  ...  system_title  teacher_title  team_title  \\\n",
       "0              0.0  ...           0.0            0.0         0.0   \n",
       "1              0.0  ...           0.0            0.0         0.0   \n",
       "2              0.0  ...           0.0            0.0         0.0   \n",
       "3              0.0  ...           0.0            0.0         0.0   \n",
       "4              0.0  ...           0.0            0.0         0.0   \n",
       "\n",
       "   technic_title  technician_title  time_title  ui_title  ux_title  web_title  \\\n",
       "0            0.0               0.0         0.0       0.0       0.0        0.0   \n",
       "1            0.0               0.0         0.0       0.0       0.0        0.0   \n",
       "2            0.0               0.0         0.0       0.0       0.0        0.0   \n",
       "3            0.0               0.0         0.0       0.0       0.0        0.0   \n",
       "4            0.0               0.0         0.0       0.0       0.0        0.0   \n",
       "\n",
       "   year_title  \n",
       "0         0.0  \n",
       "1         0.0  \n",
       "2         0.0  \n",
       "3         0.0  \n",
       "4         0.0  \n",
       "\n",
       "[5 rows x 86 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the title terms whose column mean exceeds 0.004.\n",
    "select_title = mean_title > 0.004\n",
    "selected_features_title = text_features_title.loc[:, select_title]\n",
    "selected_features_title.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "86f295ad-e056-4b4e-a98c-15cd83182162",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "job_benefits         0.026567\n",
       "descript_benefits    0.026070\n",
       "see_benefits         0.025900\n",
       "benefit_benefits     0.025386\n",
       "work_benefits        0.024180\n",
       "                       ...   \n",
       "ebe_benefits         0.000002\n",
       "efd_benefits         0.000002\n",
       "efff_benefits        0.000002\n",
       "cdc_benefits         0.000002\n",
       "charit_benefits      0.000001\n",
       "Length: 11247, dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
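   ],
   "source": [
    "# Column means over all training rows (14,304 here), without hard-coding the row count.\n",
    "mean_benefits = text_features_benefits.sum() / len(text_features_benefits)\n",
    "mean_benefits.sort_values(ascending=False)"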
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "827767db-1b8e-4012-ad22-5eebe00592c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>advanc_benefits</th>\n",
       "      <th>also_benefits</th>\n",
       "      <th>appli_benefits</th>\n",
       "      <th>applic_benefits</th>\n",
       "      <th>avail_benefits</th>\n",
       "      <th>base_benefits</th>\n",
       "      <th>benefit_benefits</th>\n",
       "      <th>best_benefits</th>\n",
       "      <th>bonu_benefits</th>\n",
       "      <th>bonus_benefits</th>\n",
       "      <th>...</th>\n",
       "      <th>us_benefits</th>\n",
       "      <th>vacat_benefits</th>\n",
       "      <th>vision_benefits</th>\n",
       "      <th>want_benefits</th>\n",
       "      <th>week_benefits</th>\n",
       "      <th>well_benefits</th>\n",
       "      <th>within_benefits</th>\n",
       "      <th>work_benefits</th>\n",
       "      <th>world_benefits</th>\n",
       "      <th>year_benefits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.074620</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.142779</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.204903</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.348124</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.112979</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.109283</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.150045</td>\n",
       "      <td>0.150367</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.183764</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.165463</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.042046</td>\n",
       "      <td>0.131633</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.057729</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.063661</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 143 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   advanc_benefits  also_benefits  appli_benefits  applic_benefits  \\\n",
       "0              0.0            0.0             0.0              0.0   \n",
       "1              0.0            0.0             0.0              0.0   \n",
       "2              0.0            0.0             0.0              0.0   \n",
       "3              0.0            0.0             0.0              0.0   \n",
       "4              0.0            0.0             0.0              0.0   \n",
       "\n",
       "   avail_benefits  base_benefits  benefit_benefits  best_benefits  \\\n",
       "0             0.0            0.0          0.074620       0.000000   \n",
       "1             0.0            0.0          0.000000       0.000000   \n",
       "2             0.0            0.0          0.000000       0.000000   \n",
       "3             0.0            0.0          0.109283       0.000000   \n",
       "4             0.0            0.0          0.042046       0.131633   \n",
       "\n",
       "   bonu_benefits  bonus_benefits  ...  us_benefits  vacat_benefits  \\\n",
       "0            0.0        0.142779  ...          0.0        0.204903   \n",
       "1            0.0        0.000000  ...          0.0        0.000000   \n",
       "2            0.0        0.000000  ...          0.0        0.000000   \n",
       "3            0.0        0.000000  ...          0.0        0.150045   \n",
       "4            0.0        0.000000  ...          0.0        0.057729   \n",
       "\n",
       "   vision_benefits  want_benefits  week_benefits  well_benefits  \\\n",
       "0         0.000000            0.0       0.000000       0.348124   \n",
       "1         0.000000            0.0       0.000000       0.000000   \n",
       "2         0.000000            0.0       0.000000       0.000000   \n",
       "3         0.150367            0.0       0.183764       0.000000   \n",
       "4         0.000000            0.0       0.000000       0.000000   \n",
       "\n",
       "   within_benefits  work_benefits  world_benefits  year_benefits  \n",
       "0              0.0            0.0             0.0       0.112979  \n",
       "1              0.0            0.0             0.0       0.000000  \n",
       "2              0.0            0.0             0.0       0.000000  \n",
       "3              0.0            0.0             0.0       0.165463  \n",
       "4              0.0            0.0             0.0       0.063661  \n",
       "\n",
       "[5 rows x 143 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the benefits terms whose column mean exceeds 0.003.\n",
    "select_benefits = mean_benefits > 0.003\n",
    "selected_features_benefits = text_features_benefits.loc[:, select_benefits]\n",
    "selected_features_benefits.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c823651-8c76-4093-8099-c4e1e7f2903e",
   "metadata": {},
   "source": [
    "We now concatenate the four selected dataframes column-wise to get a smaller version of `text_features_train` with 1,391 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c9cfe7fc-2c37-4154-bd45-f77fd07b3b92",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abil_desc</th>\n",
       "      <th>abl_desc</th>\n",
       "      <th>accept_desc</th>\n",
       "      <th>access_desc</th>\n",
       "      <th>accord_desc</th>\n",
       "      <th>account_desc</th>\n",
       "      <th>accur_desc</th>\n",
       "      <th>achiev_desc</th>\n",
       "      <th>acquisit_desc</th>\n",
       "      <th>across_desc</th>\n",
       "      <th>...</th>\n",
       "      <th>us_benefits</th>\n",
       "      <th>vacat_benefits</th>\n",
       "      <th>vision_benefits</th>\n",
       "      <th>want_benefits</th>\n",
       "      <th>week_benefits</th>\n",
       "      <th>well_benefits</th>\n",
       "      <th>within_benefits</th>\n",
       "      <th>work_benefits</th>\n",
       "      <th>world_benefits</th>\n",
       "      <th>year_benefits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.043316</td>\n",
       "      <td>0.04441</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.204903</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.348124</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.112979</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.050161</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.052053</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.150045</td>\n",
       "      <td>0.150367</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.183764</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.165463</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.036083</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.057729</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.063661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14299</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.230658</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14300</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.09475</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14301</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14302</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14303</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.077250</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>14304 rows × 1391 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       abil_desc  abl_desc  accept_desc  access_desc  accord_desc  \\\n",
       "0       0.043316   0.04441          0.0          0.0          0.0   \n",
       "1       0.000000   0.00000          0.0          0.0          0.0   \n",
       "2       0.050161   0.00000          0.0          0.0          0.0   \n",
       "3       0.000000   0.00000          0.0          0.0          0.0   \n",
       "4       0.000000   0.00000          0.0          0.0          0.0   \n",
       "...          ...       ...          ...          ...          ...   \n",
       "14299   0.000000   0.00000          0.0          0.0          0.0   \n",
       "14300   0.000000   0.00000          0.0          0.0          0.0   \n",
       "14301   0.000000   0.00000          0.0          0.0          0.0   \n",
       "14302   0.000000   0.00000          0.0          0.0          0.0   \n",
       "14303   0.000000   0.00000          0.0          0.0          0.0   \n",
       "\n",
       "       account_desc  accur_desc  achiev_desc  acquisit_desc  across_desc  ...  \\\n",
       "0          0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "1          0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "2          0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "3          0.052053         0.0          0.0            0.0     0.000000  ...   \n",
       "4          0.000000         0.0          0.0            0.0     0.036083  ...   \n",
       "...             ...         ...          ...            ...          ...  ...   \n",
       "14299      0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "14300      0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "14301      0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "14302      0.000000         0.0          0.0            0.0     0.000000  ...   \n",
       "14303      0.077250         0.0          0.0            0.0     0.000000  ...   \n",
       "\n",
       "       us_benefits  vacat_benefits  vision_benefits  want_benefits  \\\n",
       "0          0.00000        0.204903         0.000000            0.0   \n",
       "1          0.00000        0.000000         0.000000            0.0   \n",
       "2          0.00000        0.000000         0.000000            0.0   \n",
       "3          0.00000        0.150045         0.150367            0.0   \n",
       "4          0.00000        0.057729         0.000000            0.0   \n",
       "...            ...             ...              ...            ...   \n",
       "14299      0.00000        0.000000         0.000000            0.0   \n",
       "14300      0.09475        0.000000         0.000000            0.0   \n",
       "14301      0.00000        0.000000         0.000000            0.0   \n",
       "14302      0.00000        0.000000         0.000000            0.0   \n",
       "14303      0.00000        0.000000         0.000000            0.0   \n",
       "\n",
       "       week_benefits  well_benefits  within_benefits  work_benefits  \\\n",
       "0           0.000000       0.348124              0.0       0.000000   \n",
       "1           0.000000       0.000000              0.0       0.000000   \n",
       "2           0.000000       0.000000              0.0       0.000000   \n",
       "3           0.183764       0.000000              0.0       0.000000   \n",
       "4           0.000000       0.000000              0.0       0.000000   \n",
       "...              ...            ...              ...            ...   \n",
       "14299       0.000000       0.000000              0.0       0.230658   \n",
       "14300       0.000000       0.000000              0.0       0.000000   \n",
       "14301       0.000000       0.000000              0.0       0.000000   \n",
       "14302       0.000000       0.000000              0.0       0.000000   \n",
       "14303       0.000000       0.000000              0.0       0.000000   \n",
       "\n",
       "       world_benefits  year_benefits  \n",
       "0                 0.0       0.112979  \n",
       "1                 0.0       0.000000  \n",
       "2                 0.0       0.000000  \n",
       "3                 0.0       0.165463  \n",
       "4                 0.0       0.063661  \n",
       "...               ...            ...  \n",
       "14299             0.0       0.000000  \n",
       "14300             0.0       0.000000  \n",
       "14301             0.0       0.000000  \n",
       "14302             0.0       0.000000  \n",
       "14303             0.0       0.000000  \n",
       "\n",
       "[14304 rows x 1391 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)\n",
    "text_feature"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8cb8eb-3fdf-4835-a144-21b1cfbf48a5",
   "metadata": {},
   "source": [
    "## Supervised Feature Selection Using Chi-Square Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83cf9894-e89c-45c3-9375-4cc3ba4f3e98",
   "metadata": {},
   "source": [
    "In this part, we perform supervised feature selection using the chi-square statistic. Features with low chi-square scores are the most likely to be independent of the `fraudulent` column, and therefore irrelevant for classification, so we eliminate them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d5f5a051-4953-4e7a-bd35-3246b3538261",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the processed training data from the previous step to get the `fraudulent` target column.\n",
    "processed_train = joblib.load('./data/processed_train_jlib')\n",
    "target = processed_train[\"fraudulent\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b920f20b-2720-4e77-9bed-c5161808f08f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>administr_desc</th>\n",
       "      <th>answer_desc</th>\n",
       "      <th>asia_desc</th>\n",
       "      <th>assist_desc</th>\n",
       "      <th>bill_desc</th>\n",
       "      <th>call_desc</th>\n",
       "      <th>cash_desc</th>\n",
       "      <th>desir_desc</th>\n",
       "      <th>duti_desc</th>\n",
       "      <th>earn_desc</th>\n",
       "      <th>...</th>\n",
       "      <th>life_benefits</th>\n",
       "      <th>need_benefits</th>\n",
       "      <th>per_benefits</th>\n",
       "      <th>posit_benefits</th>\n",
       "      <th>prospect_benefits</th>\n",
       "      <th>see_benefits</th>\n",
       "      <th>share_benefits</th>\n",
       "      <th>skill_benefits</th>\n",
       "      <th>start_benefits</th>\n",
       "      <th>train_benefits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.092456</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.103912</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.045662</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.034465</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.047975</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.085044</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.051481</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.053905</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14299</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14300</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14301</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14302</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.120147</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14303</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>14304 rows × 100 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       administr_desc  answer_desc  asia_desc  assist_desc  bill_desc  \\\n",
       "0            0.000000          0.0        0.0     0.000000        0.0   \n",
       "1            0.045662          0.0        0.0     0.034465        0.0   \n",
       "2            0.000000          0.0        0.0     0.000000        0.0   \n",
       "3            0.000000          0.0        0.0     0.047975        0.0   \n",
       "4            0.000000          0.0        0.0     0.000000        0.0   \n",
       "...               ...          ...        ...          ...        ...   \n",
       "14299        0.000000          0.0        0.0     0.000000        0.0   \n",
       "14300        0.000000          0.0        0.0     0.000000        0.0   \n",
       "14301        0.000000          0.0        0.0     0.000000        0.0   \n",
       "14302        0.000000          0.0        0.0     0.000000        0.0   \n",
       "14303        0.000000          0.0        0.0     0.000000        0.0   \n",
       "\n",
       "       call_desc  cash_desc  desir_desc  duti_desc  earn_desc  ...  \\\n",
       "0       0.092456   0.000000         0.0   0.000000   0.000000  ...   \n",
       "1       0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "2       0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "3       0.000000   0.085044         0.0   0.051481   0.000000  ...   \n",
       "4       0.000000   0.000000         0.0   0.000000   0.053905  ...   \n",
       "...          ...        ...         ...        ...        ...  ...   \n",
       "14299   0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "14300   0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "14301   0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "14302   0.000000   0.000000         0.0   0.120147   0.000000  ...   \n",
       "14303   0.000000   0.000000         0.0   0.000000   0.000000  ...   \n",
       "\n",
       "       life_benefits  need_benefits  per_benefits  posit_benefits  \\\n",
       "0           0.103912            0.0           0.0             0.0   \n",
       "1           0.000000            0.0           0.0             0.0   \n",
       "2           0.000000            0.0           0.0             0.0   \n",
       "3           0.000000            0.0           0.0             0.0   \n",
       "4           0.000000            0.0           0.0             0.0   \n",
       "...              ...            ...           ...             ...   \n",
       "14299       0.000000            0.0           0.0             0.0   \n",
       "14300       0.000000            0.0           0.0             0.0   \n",
       "14301       0.000000            0.0           0.0             0.0   \n",
       "14302       0.000000            0.0           0.0             0.0   \n",
       "14303       0.000000            0.0           0.0             0.0   \n",
       "\n",
       "       prospect_benefits  see_benefits  share_benefits  skill_benefits  \\\n",
       "0                    0.0           0.0             0.0             0.0   \n",
       "1                    0.0           0.0             0.0             0.0   \n",
       "2                    0.0           0.0             0.0             0.0   \n",
       "3                    0.0           0.0             0.0             0.0   \n",
       "4                    0.0           0.0             0.0             0.0   \n",
       "...                  ...           ...             ...             ...   \n",
       "14299                0.0           0.0             0.0             0.0   \n",
       "14300                0.0           0.0             0.0             0.0   \n",
       "14301                0.0           0.0             0.0             0.0   \n",
       "14302                0.0           0.0             0.0             0.0   \n",
       "14303                0.0           0.0             0.0             0.0   \n",
       "\n",
       "       start_benefits  train_benefits  \n",
       "0                 0.0             0.0  \n",
       "1                 0.0             0.0  \n",
       "2                 0.0             0.0  \n",
       "3                 0.0             0.0  \n",
       "4                 0.0             0.0  \n",
       "...               ...             ...  \n",
       "14299             0.0             0.0  \n",
       "14300             0.0             0.0  \n",
       "14301             0.0             0.0  \n",
       "14302             0.0             0.0  \n",
       "14303             0.0             0.0  \n",
       "\n",
       "[14304 rows x 100 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
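   ],
   "source": [
    "from sklearn.feature_selection import SelectKBest, chi2\n",
    "\n",
    "# Score every feature against the target with chi2 (which requires non-negative\n",
    "# feature values) and keep the 100 highest-scoring features.\n",
    "supervised_text_features = SelectKBest(chi2, k=100).fit(text_feature, target)\n",
    "\n",
    "# get_support() returns a boolean mask over the columns of text_feature.\n",
    "df_text_features = text_feature.iloc[:, supervised_text_features.get_support()]\n",
    "df_text_features"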
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bc05fa8-0832-4254-85c3-b58542ab494c",
   "metadata": {},
   "source": [
    "We will save the selected text features to a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "415dd380-525c-4dc9-93a1-5b7400f6461e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_text_features.to_csv(\"./data/selected_text_features.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}