Welcome


Welcome to the Classifying Fake Job Postings Using Machine Learning Algorithms Book

This document is a renovated version of my STAT 154 (Machine Learning) final project at UC Berkeley. It examines the job posting data, walks through the pre-processing steps in detail, and applies machine learning analysis to build a model that classifies fake job postings efficiently and accurately. I decided to recreate my previous project because…

  1. The original project had too many critical flaws and errors. Even though the paper scored the highest in the entire class, I could still spot many errors in both computation and explanation. For example, word stemming is better suited to the project’s goal, but we used lemmatization instead. Also, the feature selection process in the original work rested on many unsupported assumptions that could have eliminated important features. By renovating my old project, I hope to improve the model’s performance by fixing these flaws and errors.

  2. I wanted to recreate the original project using Python instead of R. The initial project was done in R, but we ran into many difficulties during data pre-processing that could have been handled easily with Python. Also, after taking a course on reproducibility, I wanted to rebuild the project in Jupyter Notebook and publish it through Jupyter Book. Through this renovation project, I hope to gain a deeper understanding of machine learning in Python and to review what I have learned during my undergraduate studies.

  3. Currently, it is impossible to reproduce or replicate the results of the original project. First, the original report was written in Google Docs, which severely limited how much of the computation and workflow could be shown. This hinders effective communication with readers and makes it impossible for them to reproduce or replicate our work. Second, the current state of the GitHub repository for the original project makes reproducing or replicating my original work challenging. Recreating the project with Jupyter Notebook and Jupyter Book will ensure that my report combines narrative and computation in a more readable format, and reconstructing the GitHub repository will further improve the reproducibility and replicability of the project.


Original Project

Use this link to see my original work!

Original Work


Thanks To…

This project would not have been possible without the instruction of many professors at UC Berkeley.

  • Professor Nusrat Rabbee (STAT 154: Machine Learning)

  • Professor Josh Hug (DATA 100: Principles and Techniques of Data Science)

  • Professor Fernando Pérez (STAT 159: Reproducible and Collaborative Data Science)

And special thanks to my four beloved friends who were the group members of the original project. It was the best group I have ever had. Check the original project paper for their names.