Conclusion

Improvement

  1. Feature Engineering: Feature engineering improved substantially: we produced the same or better results than the previous model while using far fewer features. The previous work used 500 features, whereas this project used only 85. The main reason is that the previous feature engineering relied on many unsupported assumptions and did not use an efficient, systematic feature selection algorithm. In this project, I tried to avoid such assumptions and relied on established feature selection algorithms such as Boruta to overcome that weakness.

  2. Tuning Hyperparameters: The tuning process also improved notably. The previous work was limited by the tuning tools available in the R machine learning packages (for example, the random forest package in R has a function that tunes the number of features considered at each split, but no function for tuning the number of trees), so I had to write my own tuning function, which took three to four hours or more to tune a single hyperparameter. To address this, this project used Hyperopt in Python to tune several hyperparameters at the same time. Using grid search to tune the SVM hyperparameters also improved the efficiency and accuracy of the tuning process (for example, the previous work used cost = 500 for the SVM, and this poorly tuned value caused the SVM to overfit severely).

  3. Use of Encoder: The encoders in the Python scikit-learn package helped process the train and test datasets efficiently. The previous work suffered because some categorical variables had different factor levels in the train and test datasets; combining encoders with pickle handled this problem easily. Encoders also made the pipeline cleaner.

  4. Narrative: This project provides a more detailed narrative and explanation for each step than the previous work.
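The category-level mismatch mentioned in item 3 can be illustrated with a small sketch. This is not the project's actual pipeline: it is a plain-Python stand-in that learns the levels from the training data only, saves them with pickle, and encodes any unseen test level as all zeros. The column name `employment_type`, the toy rows, and the helper names `fit_categories` and `one_hot` are all hypothetical. In scikit-learn, `OneHotEncoder(handle_unknown="ignore")` provides comparable behaviour.

```python
import pickle


def fit_categories(rows, columns):
    """Record the sorted set of levels seen in each categorical column."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def one_hot(rows, categories):
    """Encode rows using only the levels learned at fit time.

    A level that was not seen during fitting becomes an all-zero block,
    so the train and test matrices always have the same width.
    """
    encoded = []
    for row in rows:
        vector = []
        for col, levels in categories.items():
            vector.extend(1 if row[col] == level else 0 for level in levels)
        encoded.append(vector)
    return encoded


# Hypothetical toy data: "Internship" appears only in the test set.
train = [{"employment_type": "Full-time"}, {"employment_type": "Part-time"}]
test = [{"employment_type": "Internship"}, {"employment_type": "Full-time"}]

# Learn the levels from the training data only.
categories = fit_categories(train, ["employment_type"])

# Persist the fitted levels so the test pipeline reuses the same mapping.
blob = pickle.dumps(categories)
restored = pickle.loads(blob)

train_matrix = one_hot(train, restored)
test_matrix = one_hot(test, restored)
```

Because the mapping is fitted once and reloaded, the test matrix has the same columns as the training matrix even when the test data contains a level the training data never had.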

Areas That Need More Improvement

  1. Data Preprocessing: In the data preprocessing step, I used the column means to eliminate the unusual words created by missing spaces between sentences in the raw dataset. However, this method is risky because it can also eliminate important word features. The result could have been better if I had known a more targeted way of handling such unusual words.

  2. Feature Selection: I applied two different feature selection techniques separately: chi-square statistics to select 100 word features from more than 80,000 word features, and the Boruta algorithm to build the final feature matrix. I am still unsure whether this is the correct way to select features. It would be preferable to use a more efficient algorithm that can take all of the more than 80,000 features at once and perform the selection in a single step.

  3. Recall Rate: Although the model in this project performed better than the model in the previous work, its recall rate still needs improvement. Performance might improve with a more advanced machine learning algorithm that is better suited to text feature datasets. It is interesting that the model consistently misclassifies roughly 30% to 40% of fraudulent postings. If I can identify distinguishing characteristics of those misclassified postings through clustering or graphical analysis, I may find a way to improve the algorithm further.
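As a starting point for the analysis suggested in item 3, the sketch below computes recall and then compares per-feature means between the fraudulent postings the model caught and the ones it missed. It is a minimal illustration under stated assumptions, not the project's code: the toy feature rows, the labels (1 = fraudulent), and the helper names `recall` and `mean_by_outcome` are hypothetical.

```python
def recall(y_true, y_pred):
    """Fraction of truly fraudulent postings (label 1) that were flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def mean_by_outcome(features, y_true, y_pred):
    """Per-feature means for caught frauds versus missed frauds."""
    caught = [x for x, t, p in zip(features, y_true, y_pred) if t == 1 and p == 1]
    missed = [x for x, t, p in zip(features, y_true, y_pred) if t == 1 and p == 0]

    def column_means(rows):
        # An empty group has no means to report.
        if not rows:
            return []
        return [sum(col) / len(rows) for col in zip(*rows)]

    return column_means(caught), column_means(missed)


# Hypothetical toy data: two features per posting, 1 = fraudulent.
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
y_true = [1, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0]

model_recall = recall(y_true, y_pred)
caught_means, missed_means = mean_by_outcome(features, y_true, y_pred)
```

Features whose means differ sharply between the two groups are candidates for closer inspection, for example as inputs to a clustering step over the missed postings.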

Concluding Remark

Though the best model, the final random forest model, is far from perfect, I have set foot on a long but incredible journey of applying machine learning to solve problems. On my quest to tackle the countless problems in this vast world, I am confident that with larger samples and more time to delve into the data and tools, I can develop even better models than I have here.