Vinod Puttamadegowda

Understanding Naive Bayes Classifier and its Application in Detecting AI-Generated Text


Introduction


Naive Bayes Classifier (NBC) is a probabilistic algorithm widely used in machine learning for classification tasks. It is based on Bayes' theorem and makes a naive assumption that the features used to describe an observation are conditionally independent given the class label. Despite its simplicity and the naive assumption, NBC often performs surprisingly well, especially in text classification tasks.

The classifier is based on Bayes' theorem, which relates the conditional and marginal probabilities of random events.


Conditional Probability

P(A|B) = P(A and B) / P(B)

P(B|A) = P(A and B) / P(A)


Bayes' Theorem

P(A|B) = {P(B|A) ⋅ P(A)} / P(B)


In the context of NBC:

P(Class | Observation) = {P(Observation | Class) ⋅ P(Class)} / P(Observation)


The naive assumption is that the features used to describe an observation are conditionally independent given the class label:

P(Observation | Class) = P(Feature1 | Class) ⋅ P(Feature2 | Class) ⋅ … ⋅ P(FeatureN | Class)
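
As a quick illustration with made-up numbers: if P(Class) = 0.4, P(Feature1 | Class) = 0.3 and P(Feature2 | Class) = 0.2, the naive assumption gives P(Observation | Class) = 0.3 ⋅ 0.2 = 0.06, so the unnormalized score for that class is P(Observation | Class) ⋅ P(Class) = 0.06 ⋅ 0.4 = 0.024. The classifier picks whichever class scores higher; P(Observation) can be ignored because it is the same for every class.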



Step 1: Setup Kaggle Environment and Import libraries


In this initial step, we set up the Kaggle environment by exploring the available datasets and files, and import the libraries needed to build the Naive Bayes classifier.
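
A minimal sketch of this step, using the standard Kaggle starter pattern:

import os
import re
import pandas as pd
from collections import Counter

# List every data file available under the Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))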




Step 2: Load the Kaggle Data


Next, we use pandas to read the data files, which are in CSV format.
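
For example (the file paths below are assumptions based on the competition's data layout, not necessarily the exact ones used):

train_df = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_essays.csv')
test_df = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')

# Quick sanity check of shapes and columns.
print(train_df.shape, test_df.shape)
print(train_df.columns.tolist())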


Step 3: Text Preprocessing


We define a preprocess function that takes a list of sentences and applies various cleaning steps, including removing non-alphabetic characters, digits, extra spaces, and stopwords. We then apply this preprocessing function to the 'text' column in both the training and test datasets. [1]
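
A sketch of such a preprocess function (the tiny stopword set here is a stand-in; the original may well use a full list such as NLTK's):

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'to', 'of', 'in', 'on', 'it', 'this', 'that'}

def preprocess(sentences):
    # Lowercase, strip digits and non-alphabetic characters, collapse
    # extra spaces, drop stopwords; return one token list per sentence.
    cleaned = []
    for sentence in sentences:
        text = re.sub(r'[^a-z\s]', ' ', sentence.lower())
        text = re.sub(r'\s+', ' ', text).strip()
        cleaned.append([w for w in text.split() if w not in STOPWORDS])
    return cleaned

train_tokens = preprocess(train_df['text'].tolist())
test_tokens = preprocess(test_df['text'].tolist())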




Step 4: Building the Vocabulary


Here, we flatten the list of preprocessed sentences and count the occurrences of each word. The vocabulary is then constructed by selecting words that occur at least min_occurrence times. [1]
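
Sketched in code, assuming the token lists from the previous step:

min_occurrence = 40  # tuned later in the hyperparameter experiment

# Flatten all training token lists and count how often each word appears.
word_counts = Counter(word for tokens in train_tokens for word in tokens)

# Keep only words that occur at least min_occurrence times.
vocabulary = {word for word, count in word_counts.items() if count >= min_occurrence}
print(f'Vocabulary size: {len(vocabulary)}')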



Step 5: Probability Calculation with Laplace Smoothing


To account for words that may not occur in certain classes, Laplace smoothing is applied. This ensures that even if a word is not present in a particular class during training, it still has a non-zero probability. [1]
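
With add-one (Laplace) smoothing, the estimate becomes P(word | class) = (count(word, class) + 1) / (total words in class + |V|), where |V| is the vocabulary size. A sketch:

def word_probabilities(token_lists, vocabulary):
    # Class-conditional word probabilities with Laplace smoothing.
    counts = Counter(word for tokens in token_lists for word in tokens if word in vocabulary)
    total = sum(counts.values())
    vocab_size = len(vocabulary)
    # Adding 1 to every count keeps unseen words from getting probability zero.
    return {word: (counts[word] + 1) / (total + vocab_size) for word in vocabulary}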



Step 6: Conditional Probability based on the class (human or LLM)


Contribution:


Now that we have preprocessed our text data and built a vocabulary, the next crucial step in our Naive Bayes Classifier implementation is calculating conditional probabilities based on the class labels: whether an essay is written by a human (class 0) or generated by a language model (LLM) (class 1). These probabilities play a key role in our classification decisions.
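
A sketch of this step, assuming the training labels live in a 'generated' column (0 = human, 1 = LLM):

labels = train_df['generated'].tolist()
human_tokens = [t for t, y in zip(train_tokens, labels) if y == 0]
llm_tokens = [t for t, y in zip(train_tokens, labels) if y == 1]

# Class-conditional word probabilities for each class.
probs_human = word_probabilities(human_tokens, vocabulary)
probs_llm = word_probabilities(llm_tokens, vocabulary)

# Class priors from the label distribution.
prior_human = len(human_tokens) / len(train_tokens)
prior_llm = len(llm_tokens) / len(train_tokens)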



These conditional probabilities will serve as the basis for our Naive Bayes Classifier in the subsequent steps. Stay tuned as we move forward with the classification process!


Step 7: Classification and Validation Accuracy


Finally, the NBC classifies each essay in the validation dataset as either human-written or LLM-generated, based on the calculated conditional probabilities.
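
A sketch of the classification step, assuming a held-out validation split (val_tokens, val_labels) was set aside from the training data; those names are placeholders of mine:

import math

def classify(tokens):
    # Score both classes in log space to avoid floating-point underflow
    # when multiplying many small probabilities.
    log_human = math.log(prior_human)
    log_llm = math.log(prior_llm)
    for word in tokens:
        if word in vocabulary:
            log_human += math.log(probs_human[word])
            log_llm += math.log(probs_llm[word])
    return 1 if log_llm > log_human else 0

predictions = [classify(tokens) for tokens in val_tokens]
accuracy = sum(p == y for p, y in zip(predictions, val_labels)) / len(val_labels)
print(f'Validation accuracy: {accuracy:.2%}')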



I achieved 89.13% accuracy on the validation dataset.


Contribution:


To find optimal hyperparameters, such as the minimum occurrence threshold, I experimented with a range of values; the code and graph below show the results.
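
A sketch of the sweep (the candidate grid and the rebuild_and_evaluate helper are hypothetical names of mine, not the notebook's):

import matplotlib.pyplot as plt

candidates = [5, 10, 20, 40, 80, 160]  # hypothetical grid of thresholds
accuracies = []
for m in candidates:
    vocab_m = {w for w, c in word_counts.items() if c >= m}
    # rebuild_and_evaluate is a hypothetical helper: retrain the per-class
    # probabilities on vocab_m and return accuracy on the validation split.
    accuracies.append(rebuild_and_evaluate(vocab_m))

plt.plot(candidates, accuracies, marker='o')
plt.xlabel('min_occurrence')
plt.ylabel('Validation accuracy')
plt.title('Validation accuracy vs. minimum occurrence')
plt.show()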




From the graph above, you can see that the validation accuracy kept increasing as I raised the minimum occurrence value. Still, I did not choose the value that maximized validation accuracy, in order to avoid overfitting. To be realistic, I chose a minimum occurrence value of 40, which gave a validation accuracy of 89.13%.



Derive the top 10 words that predict each class. Which word most strongly predicts the human essays?


Contribution:


Below are the top 10 words that predict each class, derived with a sketch like the one that follows.
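
One way to rank them (scoring by the ratio of class-conditional probabilities is my assumption; the original may rank differently):

top_human = sorted(vocabulary, key=lambda w: probs_human[w] / probs_llm[w], reverse=True)[:10]
top_llm = sorted(vocabulary, key=lambda w: probs_llm[w] / probs_human[w], reverse=True)[:10]
print('Top 10 human-predicting words:', top_human)
print('Top 10 LLM-predicting words:', top_llm)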

The word most strongly predicting the human essays was "people".


Step 8: Kaggle submission and competition results


Below is the submission script for the Kaggle competition; with it, we have successfully implemented a Naive Bayes Classifier to detect whether an essay was written by a human or generated by an LLM.
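
A sketch of such a script (the 'id' and 'generated' column names are assumed from the competition's sample submission file):

test_predictions = [classify(tokens) for tokens in test_tokens]
submission = pd.DataFrame({'id': test_df['id'], 'generated': test_predictions})
submission.to_csv('submission.csv', index=False)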

The classification accuracy achieved on the Kaggle competition test dataset is 49.99%, as shown in the Kaggle submission report below.



Links:


  1. Links to my Kaggle notebooks: Initial notebook: https://www.kaggle.com/code/vinodpgowda/initial Updated notebook: https://www.kaggle.com/code/vinodpgowda/updated

  2. Link to my GitHub repository: https://github.com/vinodpgowda/Naive-Bayes-Classifier/tree/main

  3. Link to my GitHub Profile: https://github.com/vinodpgowda


References


  1. harshitha-ravi/datamining: Text classification using a Naive Bayes Classifier, implemented from scratch without any inbuilt libraries. https://github.com/harshitha-ravi/datamining
