Understanding Naive Bayes Classifier and its Application in Detecting AI-Generated Text
Introduction
Naive Bayes Classifier (NBC) is a probabilistic algorithm widely used in machine learning for classification tasks. It is based on Bayes' theorem and makes a naive assumption that the features used to describe an observation are conditionally independent given the class label. Despite its simplicity and the naive assumption, NBC often performs surprisingly well, especially in text classification tasks.
Bayes' theorem relates the conditional and marginal probabilities of random events.
Conditional Probability
P(A|B) = P(A and B) / P(B)
P(B|A) = P(A and B) / P(A)
Bayes' Theorem
P(A|B) = P(B|A) ⋅ P(A) / P(B)
In the context of NBC:
P(Class | Observation) = P(Observation | Class) ⋅ P(Class) / P(Observation)
The naive assumption is that the features used to describe an observation are conditionally independent given the class label:
P(Observation | Class) = P(Feature1 | Class) ⋅ P(Feature2 | Class) ⋅ … ⋅ P(FeatureN | Class)
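As a toy illustration with made-up numbers: suppose P(Human) = 0.5, P("people" | Human) = 0.02, and P("furthermore" | Human) = 0.001. For an observation containing just those two words, the naive assumption gives P(Observation | Human) = 0.02 ⋅ 0.001 = 0.00002. The classifier then compares P(Observation | Human) ⋅ P(Human) with the corresponding product for the LLM class; the shared denominator P(Observation) can be ignored, since it is the same for both classes.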
Step 1: Set Up the Kaggle Environment and Import Libraries
In this initial step, we set up the Kaggle environment by exploring the available datasets and files, and import the libraries needed to build the Naive Bayes classifier.
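A minimal sketch of this setup cell (the /kaggle/input path is the platform's standard mount point; the exact set of imports is an assumption about the original notebook):

```python
import os
import re
from collections import Counter

import numpy as np
import pandas as pd

# Standard Kaggle starter: list the files mounted under /kaggle/input.
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```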
Step 2: Load the Kaggle Data
Next, we use pandas to read the data files, which are in CSV format.
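A sketch of the loading step; the folder and file names below follow the usual layout of the competition data and are assumptions here:

```python
# Paths are assumptions based on the typical layout of the competition data.
train_df = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
test_df = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/test_essays.csv")

print(train_df.shape, test_df.shape)
print(train_df.head())
```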
Step 3: Text Preprocessing
We define a preprocess function that takes a list of sentences and applies various cleaning steps, including removing non-alphabetic characters, digits, extra spaces, and stopwords. We then apply this preprocessing function to the 'text' column in both the training and test datasets. [1]
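A sketch of such a preprocess function, assuming NLTK's English stopword list and simple regex-based cleaning (the original cleaning steps may differ in detail):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(sentences):
    """Lowercase, keep only letters, collapse extra spaces, and drop stopwords."""
    cleaned = []
    for sentence in sentences:
        text = sentence.lower()
        text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and punctuation
        text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
        cleaned.append([w for w in text.split() if w not in STOPWORDS])
    return cleaned

train_tokens = preprocess(train_df["text"].tolist())
test_tokens = preprocess(test_df["text"].tolist())
```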
Step 4: Building the Vocabulary
Here, we flatten the list of preprocessed sentences and count the occurrences of each word. The vocabulary is then constructed by selecting words that occur at least min_occurrence times. [1]
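A sketch of the vocabulary construction, reusing the preprocessed tokens from the previous step:

```python
min_occurrence = 40  # threshold tuned later on the validation set

# Count every word across all preprocessed training essays.
word_counts = Counter(word for tokens in train_tokens for word in tokens)

# Keep only the words that appear at least min_occurrence times.
vocabulary = {word for word, count in word_counts.items() if count >= min_occurrence}
print("Vocabulary size:", len(vocabulary))
```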
Step 5: Probability Calculation with Laplace Smoothing
To account for words that may not occur in certain classes, Laplace smoothing is applied. This ensures that even if a word is not present in a particular class during training, it still has a non-zero probability. [1]
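With add-one (Laplace) smoothing, the estimate becomes P(word | class) = (count of the word in that class + 1) / (total words in that class + vocabulary size). A sketch of a helper that computes these probabilities for one class:

```python
def word_probabilities(token_lists, vocab):
    """Estimate P(word | class) with add-one (Laplace) smoothing."""
    counts = Counter(w for tokens in token_lists for w in tokens if w in vocab)
    total = sum(counts.values())
    # Every vocabulary word gets a non-zero probability, even if unseen in this class.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
```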
Step 6: Conditional Probability based on the class (human or LLM)
Contribution:
Now that we have preprocessed our text data and built a vocabulary, the next crucial step in our Naive Bayes Classifier implementation is calculating conditional probabilities based on the class labels—whether an essay is written by a human (class 0) or generated by a language model (LLM) (class 1). These probabilities play a key role in our classification decisions.
These conditional probabilities will serve as the basis for our Naive Bayes Classifier in the subsequent steps. Stay tuned as we move forward with the classification process!
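A sketch of this step, assuming the training file labels each essay in a 'generated' column (0 = human, 1 = LLM) and reusing the word_probabilities helper from Step 5:

```python
# The 'generated' label column (0 = human, 1 = LLM) is an assumption about the data.
labels = train_df["generated"].values

human_tokens = [t for t, y in zip(train_tokens, labels) if y == 0]
llm_tokens = [t for t, y in zip(train_tokens, labels) if y == 1]

# Class priors P(class).
prior_human = len(human_tokens) / len(train_tokens)
prior_llm = len(llm_tokens) / len(train_tokens)

# Conditional probabilities P(word | class) for each class, with Laplace smoothing.
probs_human = word_probabilities(human_tokens, vocabulary)
probs_llm = word_probabilities(llm_tokens, vocabulary)
```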
Step 7: Classification and Validation Accuracy
Finally, the NBC classifies an essay as either human-written or LLM-generated based on the calculated conditional probabilities, and we measure accuracy on the validation dataset.
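A sketch of the classification and validation step; the shuffled 80/20 hold-out split and log-probability scoring are assumptions about the original setup (the class statistics are re-estimated on the training portion so the validation essays stay held out):

```python
import math

# Shuffled 80/20 hold-out split (the split details are an assumption).
rng = np.random.default_rng(0)
order = rng.permutation(len(train_tokens))
split = int(0.8 * len(order))
tr_idx, val_idx = order[:split], order[split:]
tr_tokens = [train_tokens[i] for i in tr_idx]
val_tokens = [train_tokens[i] for i in val_idx]
tr_labels, val_labels = labels[tr_idx], labels[val_idx]

tr_human = [t for t, y in zip(tr_tokens, tr_labels) if y == 0]
tr_llm = [t for t, y in zip(tr_tokens, tr_labels) if y == 1]
prior_h, prior_l = len(tr_human) / len(tr_tokens), len(tr_llm) / len(tr_tokens)
p_human = word_probabilities(tr_human, vocabulary)
p_llm = word_probabilities(tr_llm, vocabulary)

def classify(tokens, vocab, p_hum, p_gen, prior_hum, prior_gen):
    """Return 1 (LLM) or 0 (human) by comparing log posterior scores."""
    score_hum, score_gen = math.log(prior_hum), math.log(prior_gen)
    for word in tokens:
        if word in vocab:                       # ignore out-of-vocabulary words
            score_hum += math.log(p_hum[word])
            score_gen += math.log(p_gen[word])
    return 1 if score_gen > score_hum else 0

val_preds = [classify(t, vocabulary, p_human, p_llm, prior_h, prior_l) for t in val_tokens]
val_acc = sum(int(p == y) for p, y in zip(val_preds, val_labels)) / len(val_labels)
print(f"Validation accuracy: {val_acc:.2%}")
```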
I achieved 89.13% accuracy on the validation dataset.
Contribution:
To find optimal hyperparameters, such as the minimum occurrence threshold, I experimented with a range of values; the code and graph below show the results.
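Since the sweep is only summarized here, below is a sketch of how it might look, reusing the hold-out split from Step 7 (the candidate grid of thresholds is an assumption):

```python
import matplotlib.pyplot as plt

candidate_values = [5, 10, 20, 30, 40, 60, 80, 100]   # assumed grid of thresholds
accuracies = []

for min_occ in candidate_values:
    # Rebuild the vocabulary and per-class probabilities for each threshold.
    vocab = {w for w, c in word_counts.items() if c >= min_occ}
    p_hum = word_probabilities(tr_human, vocab)
    p_gen = word_probabilities(tr_llm, vocab)
    preds = [classify(t, vocab, p_hum, p_gen, prior_h, prior_l) for t in val_tokens]
    accuracies.append(sum(int(p == y) for p, y in zip(preds, val_labels)) / len(val_labels))

plt.plot(candidate_values, accuracies, marker="o")
plt.xlabel("min_occurrence")
plt.ylabel("Validation accuracy")
plt.title("Validation accuracy vs. minimum occurrence threshold")
plt.show()
```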
From the graph above, you can see that as I increased the minimum occurrence value, the validation accuracy also increased. Still, I did not choose the value that gave the maximum validation accuracy, in order to avoid overfitting. To be realistic, I chose a minimum occurrence value of 40, which gave a validation accuracy of 89.13%.
Derive the top 10 words that predict each class. Which word most strongly predicts human essays?
Contribution:
Below are the Top 10 words that predict each class.
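One way to derive class-indicative words is to rank vocabulary words by how much more probable they are under one class than the other; the exact ranking criterion used in the notebook may differ:

```python
def top_words(p_target, p_other, k=10):
    """Words whose smoothed probability in the target class most exceeds the other class."""
    ratio = {w: p_target[w] / p_other[w] for w in p_target}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]

print("Top 10 human-indicative words:", top_words(probs_human, probs_llm))
print("Top 10 LLM-indicative words:  ", top_words(probs_llm, probs_human))
```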
The word most likely to predict human essays was "people".
Step 8: Kaggle submission and competition results
Below is the submission script for the Kaggle competition. We have successfully implemented a Naive Bayes classifier to detect whether an essay was written by a human or generated by an LLM.
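A sketch of such a submission cell; the 'id' and 'generated' column names follow the usual sample_submission.csv layout and are assumptions here:

```python
# Predict on the test essays using the class statistics fit on the full training set.
test_preds = [classify(t, vocabulary, probs_human, probs_llm, prior_human, prior_llm)
              for t in test_tokens]

# Column names are assumptions based on the usual sample_submission.csv layout.
submission = pd.DataFrame({"id": test_df["id"], "generated": test_preds})
submission.to_csv("submission.csv", index=False)
print(submission.head())
```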
The classification accuracy achieved on the Kaggle competition test dataset is 49.99%, as shown in the Kaggle submission report below.
Links:
Links to my Kaggle notebooks:
Initial Notebook: https://www.kaggle.com/code/vinodpgowda/initial
Updated Notebook: https://www.kaggle.com/code/vinodpgowda/updated
Link to my GitHub repository: https://github.com/vinodpgowda/Naive-Bayes-Classifier/tree/main
Link to my GitHub Profile: https://github.com/vinodpgowda
References
[1] harshitha-ravi/datamining: Text classification using a Naive Bayes Classifier. A detailed from-scratch implementation of the Naive Bayes classifier without using any inbuilt libraries. https://github.com/harshitha-ravi/datamining