{ "cells": [ { "cell_type": "markdown", "id": "b33a2141-9cee-4595-811a-cb25e36c56c8", "metadata": {}, "source": [ "# Text Feature Selection" ] }, { "cell_type": "markdown", "id": "1f871ae0-1a1b-4927-93a4-16556e69dcc6", "metadata": {}, "source": [ "As discussed [here](featureselection.ipynb), we must perform feature selection on the text features first, because the full dataset causes a `MemoryError` due to its massive file size (10 GB). Since we cannot use the `fraudulent` column at this stage, we will use column means to select features, based on several assumptions. Once we have a smaller version of `text_features_train`, we will combine it with the `fraudulent` column and perform supervised feature selection using chi-square statistics." ] }, { "cell_type": "markdown", "id": "4f9f3725-c151-43ea-af1b-e4069d7d393a", "metadata": {}, "source": [ "### Feature Selection Using Column Mean" ] }, { "cell_type": "code", "execution_count": 1, "id": "d5643cb1-502b-484d-a009-1ef6acdb048b", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "import pandas as pd \n", "import joblib\n", "text_features_train = joblib.load('./data/text_features_train_jlib')" ] }, { "cell_type": "code", "execution_count": 2, "id": "ef0d544b-1234-454d-940a-5fd177ea517a", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
5 rows × 89527 columns
" ], "text/plain": [ " aa_desc aaa_desc aaab_desc aab_desc aabc_desc aabd_desc aabf_desc \\\n", "0 0.165596 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " aac_desc aaccd_desc aachen_desc ... zodat_benefits zollman_benefits \\\n", "0 0.0 0.0 0.0 ... 0.0 0.0 \n", "1 0.0 0.0 0.0 ... 0.0 0.0 \n", "2 0.0 0.0 0.0 ... 0.0 0.0 \n", "3 0.0 0.0 0.0 ... 0.0 0.0 \n", "4 0.0 0.0 0.0 ... 0.0 0.0 \n", "\n", " zombi_benefits zone_benefits zoo_benefits zowel_benefits zu_benefits \\\n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 \n", "\n", " zult_benefits zutrifft_benefits zweig_benefits \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", "[5 rows x 89527 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_features_train.head(5)" ] }, { "cell_type": "markdown", "id": "b1099e1c-2093-4479-8f66-bb6538ee830b", "metadata": {}, "source": [ "Let's carefully observe the dataframe above. It is straightforward to realize that some of these features have unusual names, such as `zutrifft` and `aabd`, and it is hard to believe that they are stemmed from a standard English word. There are two main reasons why those unusual names appear as a feature. \n", "\n", "1. Although we removed the URL and HTML format in the pre-data processing step, it is still possible that some formats are not perfectly removed. Also, the data can include other non-English words, such as email or file names. \n", "2. If we look over the original dataset, we can observe that text data was saved with no space between the lines. For example, \"I love dog\" (Line 1), \"Cat ate fish\" (Line 2) to \"I love dogCat ate fish\". Then it creates the abnormal word \"dogCat\". 
\n", "\n", "**One simple way to remove these unusual words at low computational cost is to compute column means and filter out the features with exceptionally low means.** This is based on the assumption that unusual words appear less frequently than normal words. For instance, a word like \"havecommunication\" will not appear frequently across the dataset, and a word that appears infrequently will have a low mean TF-IDF value.\n", "\n", "```{warning}\n", "This method also relies on the dangerous assumption that low-frequency words are less important than more frequent words. Feature selection using column means can eliminate words that are important for machine learning. However, since we have more than 10,000 unusual words as features, I am trading some accuracy for efficiency. **I am aware that feature selection has to be done very carefully, and this is not the best way.**\n", "```\n", "\n", "However, we must not perform this selection on the entire dataset at once, since `text_features_train` combines features from four different columns: `description`, `title`, `requirements` and `benefits`. Since TF-IDF values vary with the characteristics of each column, we compute the column means and select features separately for each group, then combine the results afterwards. 
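
The per-group selection described above can be sketched as follows. This is only a minimal illustration on a toy dataframe, not the pipeline used in this notebook: the helper name `select_by_mean`, the toy values, and the idea of picking columns by name suffix (instead of by position) are assumptions made for the example.

```python
import pandas as pd

def select_by_mean(features, suffix, threshold):
    # Hypothetical helper: keep the columns of one group (identified by
    # its name suffix) whose column mean is above the given threshold.
    group = features.loc[:, features.columns.str.endswith(suffix)]
    return group.loc[:, group.mean() > threshold]

# Toy TF-IDF-like matrix with two column groups (values are made up).
toy = pd.DataFrame({
    'work_desc': [0.30, 0.20, 0.10],
    'gra_desc': [0.00, 0.00, 0.003],
    'skill_req': [0.40, 0.00, 0.20],
    'zeta_req': [0.00, 0.006, 0.00],
})

# Each group gets its own threshold, then the results are recombined.
thresholds = {'_desc': 0.002, '_req': 0.003}
selected = pd.concat(
    [select_by_mean(toy, suffix, t) for suffix, t in thresholds.items()],
    axis=1,
)
print(list(selected.columns))
```

Selecting by suffix avoids hard-coding column positions, which would have to be recomputed whenever the vocabulary changes.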
" ] }, { "cell_type": "code", "execution_count": 2, "id": "76591b4e-127f-4d56-ba9c-5d5ac32234a2", "metadata": {}, "outputs": [], "source": [ "sum(text_features_train.columns.str.contains('_desc'))\n", "text_features_desc = text_features_train.iloc[:, 0:40607]" ] }, { "cell_type": "code", "execution_count": 3, "id": "e122c9ec-ec33-44d5-84d6-aa3d637c1408", "metadata": {}, "outputs": [], "source": [ "sum(text_features_train.columns.str.contains('_req'))\n", "text_features_req = text_features_train.iloc[:, 40607:74863]" ] }, { "cell_type": "code", "execution_count": 4, "id": "93fa71de-6a8d-4f67-a0fa-ed8208af4875", "metadata": {}, "outputs": [], "source": [ "sum(text_features_train.columns.str.contains('_title'))\n", "text_features_title = text_features_train.iloc[:, 74863:78280]" ] }, { "cell_type": "code", "execution_count": 5, "id": "bac9cd13-0590-40ea-b9f2-0e4fafd79392", "metadata": {}, "outputs": [], "source": [ "sum(text_features_train.columns.str.contains('_benefits'))\n", "text_features_benefits = text_features_train.iloc[:, 78280:89527]" ] }, { "cell_type": "markdown", "id": "371a6487-037a-48b7-8b5a-677e33b35847", "metadata": {}, "source": [ "```{note}\n", "We are seperating the dataframe like this to avoid the MemoryError. \n", "```" ] }, { "cell_type": "code", "execution_count": 6, "id": "262a772c-5d0d-471f-98cb-581ba3da6c11", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "work_desc 3.387162e-02\n", "develop_desc 3.251693e-02\n", "team_desc 3.203152e-02\n", "manag_desc 3.167333e-02\n", "custom_desc 3.149140e-02\n", " ... 
\n", "peugeot_desc 9.292221e-07\n", "bencki_desc 9.292221e-07\n", "sanofi_desc 9.292221e-07\n", "qmetric_desc 5.049230e-07\n", "gra_desc 5.049230e-07\n", "Length: 40607, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_desc = text_features_desc.sum() / 14304\n", "mean_desc.sort_values(ascending = False)" ] }, { "cell_type": "markdown", "id": "200144b0-6d17-4d45-9532-55f11f39047d", "metadata": {}, "source": [ "The column means for the `description` features show that our earlier assumptions are somewhat reasonable. The higher the mean, the more normal the words look, such as \"work\" and \"develop\".\n", "\n", "Since the highest mean is 0.03387162, let's keep all features with a mean higher than 0.002. \n", "\n", "```{note}\n", "I tried many different thresholds and found that 0.002 worked best.\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "id": "457311ff-0e89-4549-84fc-2578e6eacea8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
5 rows × 812 columns
" ], "text/plain": [ " abil_desc abl_desc accept_desc access_desc accord_desc account_desc \\\n", "0 0.043316 0.04441 0.0 0.0 0.0 0.000000 \n", "1 0.000000 0.00000 0.0 0.0 0.0 0.000000 \n", "2 0.050161 0.00000 0.0 0.0 0.0 0.000000 \n", "3 0.000000 0.00000 0.0 0.0 0.0 0.052053 \n", "4 0.000000 0.00000 0.0 0.0 0.0 0.000000 \n", "\n", " accur_desc achiev_desc acquisit_desc across_desc ... without_desc \\\n", "0 0.0 0.0 0.0 0.000000 ... 0.060186 \n", "1 0.0 0.0 0.0 0.000000 ... 0.000000 \n", "2 0.0 0.0 0.0 0.000000 ... 0.000000 \n", "3 0.0 0.0 0.0 0.000000 ... 0.000000 \n", "4 0.0 0.0 0.0 0.036083 ... 0.000000 \n", "\n", " word_desc work_desc world_desc would_desc write_desc written_desc \\\n", "0 0.0 0.021943 0.000000 0.0 0.0 0.0 \n", "1 0.0 0.000000 0.000000 0.0 0.0 0.0 \n", "2 0.0 0.025411 0.000000 0.0 0.0 0.0 \n", "3 0.0 0.000000 0.000000 0.0 0.0 0.0 \n", "4 0.0 0.000000 0.069462 0.0 0.0 0.0 \n", "\n", " year_desc york_desc young_desc \n", "0 0.038188 0.0 0.0 \n", "1 0.000000 0.0 0.0 \n", "2 0.044223 0.0 0.0 \n", "3 0.000000 0.0 0.0 \n", "4 0.092432 0.0 0.0 \n", "\n", "[5 rows x 812 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "select_desc = mean_desc > 0.002\n", "selected_features_desc = text_features_desc.loc[:, select_desc]\n", "selected_features_desc.head()" ] }, { "cell_type": "markdown", "id": "433133cb-4584-4dfe-b59a-d997e1435186", "metadata": {}, "source": [ "This looks much better. We will repeat the process for the other dataframes as well. " ] }, { "cell_type": "code", "execution_count": 8, "id": "21b24224-4225-4bff-a90f-294d33f74500", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "experi_req 0.052954\n", "work_req 0.034828\n", "skill_req 0.033378\n", "requir_req 0.031967\n", "year_req 0.027195\n", " ... 
\n", "cano_req 0.000002\n", "mcnz_req 0.000002\n", "orthopaed_req 0.000002\n", "inhabit_req 0.000002\n", "zeta_req 0.000001\n", "Length: 34256, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_req = text_features_req.sum() / 14304\n", "mean_req.sort_values(ascending = False)" ] }, { "cell_type": "markdown", "id": "e0c9f23c-1dd1-4761-b121-5e777bb16dcc", "metadata": {}, "source": [ "Since the mean TF-IDF values are a bit higher for `text_features_req`, we raise the threshold slightly to account for the difference." ] }, { "cell_type": "code", "execution_count": 9, "id": "e1a1a2b5-f1c9-4736-a56e-c8c276276770", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
5 rows × 350 columns
" ], "text/plain": [ " abil_req abl_req account_req across_req activ_req adapt_req \\\n", "0 0.081650 0.105596 0.0 0.0 0.0 0.0 \n", "1 0.000000 0.000000 0.0 0.0 0.0 0.0 \n", "2 0.000000 0.000000 0.0 0.0 0.0 0.0 \n", "3 0.047567 0.000000 0.0 0.0 0.0 0.0 \n", "4 0.000000 0.000000 0.0 0.0 0.0 0.0 \n", "\n", " addit_req administr_req adob_req advanc_req ... willing_req \\\n", "0 0.0 0.0 0.0 0.0 ... 0.0 \n", "1 0.0 0.0 0.0 0.0 ... 0.0 \n", "2 0.0 0.0 0.0 0.0 ... 0.0 \n", "3 0.0 0.0 0.0 0.0 ... 0.0 \n", "4 0.0 0.0 0.0 0.0 ... 0.0 \n", "\n", " window_req within_req without_req word_req work_req would_req \\\n", "0 0.000000 0.000000 0.000000 0.0 0.064795 0.0 \n", "1 0.000000 0.061549 0.151344 0.0 0.063024 0.0 \n", "2 0.000000 0.000000 0.000000 0.0 0.000000 0.0 \n", "3 0.087269 0.000000 0.000000 0.0 0.075495 0.0 \n", "4 0.000000 0.000000 0.000000 0.0 0.072623 0.0 \n", "\n", " write_req written_req year_req \n", "0 0.0 0.000000 0.068928 \n", "1 0.0 0.000000 0.033522 \n", "2 0.0 0.000000 0.000000 \n", "3 0.0 0.054646 0.040155 \n", "4 0.0 0.000000 0.038628 \n", "\n", "[5 rows x 350 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "select_req = mean_req > 0.003\n", "selected_features_req = text_features_req.loc[:, select_req]\n", "selected_features_req.head()" ] }, { "cell_type": "code", "execution_count": 10, "id": "b11df688-7adf-4a49-985c-6124a0679532", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "manag_title 0.048922\n", "develop_title 0.046175\n", "engin_title 0.038562\n", "sale_title 0.029780\n", "servic_title 0.024249\n", " ... 
\n", "maharashtra_title 0.000024\n", "barri_title 0.000024\n", "peterborough_title 0.000024\n", "haliburton_title 0.000024\n", "elgin_title 0.000021\n", "Length: 3417, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_title = text_features_title.sum() / 14304\n", "mean_title.sort_values(ascending = False)" ] }, { "cell_type": "code", "execution_count": 11, "id": "7dcd6a44-320c-48d0-ab80-edf973195be4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
5 rows × 86 columns
" ], "text/plain": [ " abroad_title account_title admin_title administr_title agent_title \\\n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 \n", "\n", " analyst_title android_title applic_title apprenticeship_title \\\n", "0 0.000000 0.0 0.0 0.0 \n", "1 0.000000 0.0 0.0 0.0 \n", "2 0.440388 0.0 0.0 0.0 \n", "3 0.000000 0.0 0.0 0.0 \n", "4 0.000000 0.0 0.0 0.0 \n", "\n", " architect_title ... system_title teacher_title team_title \\\n", "0 0.0 ... 0.0 0.0 0.0 \n", "1 0.0 ... 0.0 0.0 0.0 \n", "2 0.0 ... 0.0 0.0 0.0 \n", "3 0.0 ... 0.0 0.0 0.0 \n", "4 0.0 ... 0.0 0.0 0.0 \n", "\n", " technic_title technician_title time_title ui_title ux_title web_title \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " year_title \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "\n", "[5 rows x 86 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "select_title = mean_title > 0.004\n", "selected_features_title = text_features_title.loc[:, select_title]\n", "selected_features_title.head()" ] }, { "cell_type": "code", "execution_count": 12, "id": "86f295ad-e056-4b4e-a98c-15cd83182162", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "job_benefits 0.026567\n", "descript_benefits 0.026070\n", "see_benefits 0.025900\n", "benefit_benefits 0.025386\n", "work_benefits 0.024180\n", " ... 
\n", "ebe_benefits 0.000002\n", "efd_benefits 0.000002\n", "efff_benefits 0.000002\n", "cdc_benefits 0.000002\n", "charit_benefits 0.000001\n", "Length: 11247, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_benefits = text_features_benefits.sum() / 14304\n", "mean_benefits.sort_values(ascending = False)" ] }, { "cell_type": "code", "execution_count": 13, "id": "827767db-1b8e-4012-ad22-5eebe00592c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
5 rows × 143 columns
" ], "text/plain": [ " advanc_benefits also_benefits appli_benefits applic_benefits \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "\n", " avail_benefits base_benefits benefit_benefits best_benefits \\\n", "0 0.0 0.0 0.074620 0.000000 \n", "1 0.0 0.0 0.000000 0.000000 \n", "2 0.0 0.0 0.000000 0.000000 \n", "3 0.0 0.0 0.109283 0.000000 \n", "4 0.0 0.0 0.042046 0.131633 \n", "\n", " bonu_benefits bonus_benefits ... us_benefits vacat_benefits \\\n", "0 0.0 0.142779 ... 0.0 0.204903 \n", "1 0.0 0.000000 ... 0.0 0.000000 \n", "2 0.0 0.000000 ... 0.0 0.000000 \n", "3 0.0 0.000000 ... 0.0 0.150045 \n", "4 0.0 0.000000 ... 0.0 0.057729 \n", "\n", " vision_benefits want_benefits week_benefits well_benefits \\\n", "0 0.000000 0.0 0.000000 0.348124 \n", "1 0.000000 0.0 0.000000 0.000000 \n", "2 0.000000 0.0 0.000000 0.000000 \n", "3 0.150367 0.0 0.183764 0.000000 \n", "4 0.000000 0.0 0.000000 0.000000 \n", "\n", " within_benefits work_benefits world_benefits year_benefits \n", "0 0.0 0.0 0.0 0.112979 \n", "1 0.0 0.0 0.0 0.000000 \n", "2 0.0 0.0 0.0 0.000000 \n", "3 0.0 0.0 0.0 0.165463 \n", "4 0.0 0.0 0.0 0.063661 \n", "\n", "[5 rows x 143 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "select_benefits = mean_benefits > 0.003\n", "selected_features_benefits = text_features_benefits.loc[:, select_benefits]\n", "selected_features_benefits.head()" ] }, { "cell_type": "markdown", "id": "0c823651-8c76-4093-8099-c4e1e7f2903e", "metadata": {}, "source": [ "We will combine all dataframe to get a smaller version of `text_feature_train`. " ] }, { "cell_type": "code", "execution_count": 14, "id": "c9cfe7fc-2c37-4154-bd45-f77fd07b3b92", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
14304 rows × 1391 columns
" ], "text/plain": [ " abil_desc abl_desc accept_desc access_desc accord_desc \\\n", "0 0.043316 0.04441 0.0 0.0 0.0 \n", "1 0.000000 0.00000 0.0 0.0 0.0 \n", "2 0.050161 0.00000 0.0 0.0 0.0 \n", "3 0.000000 0.00000 0.0 0.0 0.0 \n", "4 0.000000 0.00000 0.0 0.0 0.0 \n", "... ... ... ... ... ... \n", "14299 0.000000 0.00000 0.0 0.0 0.0 \n", "14300 0.000000 0.00000 0.0 0.0 0.0 \n", "14301 0.000000 0.00000 0.0 0.0 0.0 \n", "14302 0.000000 0.00000 0.0 0.0 0.0 \n", "14303 0.000000 0.00000 0.0 0.0 0.0 \n", "\n", " account_desc accur_desc achiev_desc acquisit_desc across_desc ... \\\n", "0 0.000000 0.0 0.0 0.0 0.000000 ... \n", "1 0.000000 0.0 0.0 0.0 0.000000 ... \n", "2 0.000000 0.0 0.0 0.0 0.000000 ... \n", "3 0.052053 0.0 0.0 0.0 0.000000 ... \n", "4 0.000000 0.0 0.0 0.0 0.036083 ... \n", "... ... ... ... ... ... ... \n", "14299 0.000000 0.0 0.0 0.0 0.000000 ... \n", "14300 0.000000 0.0 0.0 0.0 0.000000 ... \n", "14301 0.000000 0.0 0.0 0.0 0.000000 ... \n", "14302 0.000000 0.0 0.0 0.0 0.000000 ... \n", "14303 0.077250 0.0 0.0 0.0 0.000000 ... \n", "\n", " us_benefits vacat_benefits vision_benefits want_benefits \\\n", "0 0.00000 0.204903 0.000000 0.0 \n", "1 0.00000 0.000000 0.000000 0.0 \n", "2 0.00000 0.000000 0.000000 0.0 \n", "3 0.00000 0.150045 0.150367 0.0 \n", "4 0.00000 0.057729 0.000000 0.0 \n", "... ... ... ... ... \n", "14299 0.00000 0.000000 0.000000 0.0 \n", "14300 0.09475 0.000000 0.000000 0.0 \n", "14301 0.00000 0.000000 0.000000 0.0 \n", "14302 0.00000 0.000000 0.000000 0.0 \n", "14303 0.00000 0.000000 0.000000 0.0 \n", "\n", " week_benefits well_benefits within_benefits work_benefits \\\n", "0 0.000000 0.348124 0.0 0.000000 \n", "1 0.000000 0.000000 0.0 0.000000 \n", "2 0.000000 0.000000 0.0 0.000000 \n", "3 0.183764 0.000000 0.0 0.000000 \n", "4 0.000000 0.000000 0.0 0.000000 \n", "... ... ... ... ... 
\n", "14299 0.000000 0.000000 0.0 0.230658 \n", "14300 0.000000 0.000000 0.0 0.000000 \n", "14301 0.000000 0.000000 0.0 0.000000 \n", "14302 0.000000 0.000000 0.0 0.000000 \n", "14303 0.000000 0.000000 0.0 0.000000 \n", "\n", " world_benefits year_benefits \n", "0 0.0 0.112979 \n", "1 0.0 0.000000 \n", "2 0.0 0.000000 \n", "3 0.0 0.165463 \n", "4 0.0 0.063661 \n", "... ... ... \n", "14299 0.0 0.000000 \n", "14300 0.0 0.000000 \n", "14301 0.0 0.000000 \n", "14302 0.0 0.000000 \n", "14303 0.0 0.000000 \n", "\n", "[14304 rows x 1391 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)\n", "text_feature" ] }, { "cell_type": "markdown", "id": "cf8cb8eb-3fdf-4835-a144-21b1cfbf48a5", "metadata": {}, "source": [ "## Supervised Feature Selection Using Chi-Square Statistics" ] }, { "cell_type": "markdown", "id": "83cf9894-e89c-45c3-9375-4cc3ba4f3e98", "metadata": {}, "source": [ "In this part, we perform supervised feature selection using chi-square statistics, so that we can eliminate the features that are most likely to be independent of the `fraudulent` column and therefore irrelevant for classification." ] }, { "cell_type": "code", "execution_count": 15, "id": "d5f5a051-4953-4e7a-bd35-3246b3538261", "metadata": {}, "outputs": [], "source": [ "# Load the processed training data from the previous step to get the `fraudulent` column.\n", "processed_train = joblib.load('./data/processed_train_jlib')\n", "target = processed_train[\"fraudulent\"]" ] }, { "cell_type": "code", "execution_count": 16, "id": "b920f20b-2720-4e77-9bed-c5161808f08f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " administr_desc answer_desc asia_desc assist_desc bill_desc \\\n", "0 0.000000 0.0 0.0 0.000000 0.0 \n", "1 0.045662 0.0 0.0 0.034465 0.0 \n", "2 0.000000 0.0 0.0 0.000000 0.0 \n", "3 0.000000 0.0 0.0 0.047975 0.0 \n", "4 0.000000 0.0 0.0 0.000000 0.0 \n", "... ... ... ... ... ... \n", "14299 0.000000 0.0 0.0 0.000000 0.0 \n", "14300 0.000000 0.0 0.0 0.000000 0.0 \n", "14301 0.000000 0.0 0.0 0.000000 0.0 \n", "14302 0.000000 0.0 0.0 0.000000 0.0 \n", "14303 0.000000 0.0 0.0 0.000000 0.0 \n", "\n", " call_desc cash_desc desir_desc duti_desc earn_desc ... \\\n", "0 0.092456 0.000000 0.0 0.000000 0.000000 ... \n", "1 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "2 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "3 0.000000 0.085044 0.0 0.051481 0.000000 ... \n", "4 0.000000 0.000000 0.0 0.000000 0.053905 ... \n", "... ... ... ... ... ... ... \n", "14299 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "14300 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "14301 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "14302 0.000000 0.000000 0.0 0.120147 0.000000 ... \n", "14303 0.000000 0.000000 0.0 0.000000 0.000000 ... \n", "\n", " life_benefits need_benefits per_benefits posit_benefits \\\n", "0 0.103912 0.0 0.0 0.0 \n", "1 0.000000 0.0 0.0 0.0 \n", "2 0.000000 0.0 0.0 0.0 \n", "3 0.000000 0.0 0.0 0.0 \n", "4 0.000000 0.0 0.0 0.0 \n", "... ... ... ... ... \n", "14299 0.000000 0.0 0.0 0.0 \n", "14300 0.000000 0.0 0.0 0.0 \n", "14301 0.000000 0.0 0.0 0.0 \n", "14302 0.000000 0.0 0.0 0.0 \n", "14303 0.000000 0.0 0.0 0.0 \n", "\n", " prospect_benefits see_benefits share_benefits skill_benefits \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... 
\n", "14299 0.0 0.0 0.0 0.0 \n", "14300 0.0 0.0 0.0 0.0 \n", "14301 0.0 0.0 0.0 0.0 \n", "14302 0.0 0.0 0.0 0.0 \n", "14303 0.0 0.0 0.0 0.0 \n", "\n", " start_benefits train_benefits \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "... ... ... \n", "14299 0.0 0.0 \n", "14300 0.0 0.0 \n", "14301 0.0 0.0 \n", "14302 0.0 0.0 \n", "14303 0.0 0.0 \n", "\n", "[14304 rows x 100 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_selection import SelectKBest, chi2\n", "\n", "# Score every text feature against the target with the chi-square test and keep the 100 best.\n", "# Note: chi2 requires non-negative feature values.\n", "supervised_text_features = SelectKBest(chi2, k=100).fit(text_feature, target)\n", "\n", "# get_support() returns a boolean mask over the columns of text_feature.\n", "df_text_features = text_feature.iloc[:, supervised_text_features.get_support()]\n", "df_text_features" ] }, { "cell_type": "markdown", "id": "8bc05fa8-0832-4254-85c3-b58542ab494c", "metadata": {}, "source": [ "We save the selected text features to a CSV file." ] }, { "cell_type": "code", "execution_count": 17, "id": "415dd380-525c-4dc9-93a1-5b7400f6461e", "metadata": {}, "outputs": [], "source": [ "df_text_features.to_csv(\"./data/selected_text_features.csv\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }