This article covers the basic concepts of the Naive Bayes algorithm in Machine Learning, including the following points:
- What is Naive Bayes Classifier?
- Types of Naive Bayes Classifiers
- Common Use Cases of Naive Bayes Classifier
- Advantages of Naive Bayes
- Limitations/Challenges of Naive Bayes
- High-Level Implementation Steps for Naive Bayes Classifier in Python
- When to Use and When Not to Use Naive Bayes Classifier
- Naive Bayes Classifiers vs Logistic Regression
1. What is Naive Bayes Classifier?
The Naive Bayes Classifier is a probabilistic supervised machine learning algorithm.
Naive Bayes classifiers are effective in various real-world applications, particularly in text classification and spam filtering. To comprehend the nomenclature, let's deconstruct it into two terms: "Naive" and "Bayes."
It is “naive” in the sense that, given the class, the value of one attribute is assumed to be independent of the values of the other attributes.
For instance, when identifying a fruit based on attributes such as shape, color, and taste, the algorithm considers each feature on its own; a round, red, sweet fruit is classified as an apple without relying on any interdependence between those features.
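Concretely, this independence assumption lets the classifier factor the likelihood of all the features given a class into a product of per-feature likelihoods. For features x1, x2, …, xn and class y:

P(x1, x2, …, xn | y) = P(x1 | y) × P(x2 | y) × … × P(xn | y)

This factorization is what keeps the probability estimates cheap to compute, even when there are many features.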
Now, why "Bayes"? The algorithm employs Bayes' theorem to calculate probabilities, thus earning the name Naive Bayes.
The algorithm is based on Bayes' theorem, a result from probability theory that relates the probability of an event to prior knowledge of conditions that might be related to the event.
The formula for Bayes' theorem is:

P(A|B) = P(B|A) × P(A) / P(B)

In this equation, A stands for the class and B stands for the attributes (predictors). P(A|B) is the posterior probability of the class given the predictor, P(B|A) is the likelihood of the predictor given the class, P(A) is the prior probability of the class, and P(B) is the prior probability of the predictor.
In the context of classification, this can be written as:

P(class | features) = P(features | class) × P(class) / P(features)
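As a toy illustration of the classification form above (the numbers are made up purely to show the arithmetic): suppose 40% of messages are spam, the word "free" appears in 50% of spam messages and in 5% of non-spam messages, and we want the probability that a message containing "free" is spam.

# Hypothetical figures for illustration only, not taken from any real dataset
p_spam = 0.40               # P(spam): prior probability of the spam class
p_free_given_spam = 0.50    # P("free" | spam): likelihood of the word given spam
p_free_given_ham = 0.05     # P("free" | not spam)

# P("free") via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # ~0.87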
2. Types of Naive Bayes Classifiers
- Multinomial Naive Bayes: Used for discrete data, often in text classification where the features represent word counts or term frequencies.
- Gaussian Naive Bayes: Assumes that the features follow a normal distribution. It is suitable for continuous data.
- Bernoulli Naive Bayes: Designed for binary or Boolean features, commonly used in document classification tasks.
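As a minimal sketch of how these variants are used, scikit-learn provides all three as separate estimators in sklearn.naive_bayes, and the choice simply follows the type of features in your data:

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

multinomial_nb = MultinomialNB()  # discrete counts, e.g. word/term frequencies
gaussian_nb = GaussianNB()        # continuous features assumed to be normally distributed
bernoulli_nb = BernoulliNB()      # binary (present/absent) features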
3. Common Use Cases of Naive Bayes Classifier
Here are some common use cases for Naive Bayes classifiers (though the list is not exhaustive):
- Text Classification and Spam Filtering
- Sentiment analysis
- Medical Diagnosis
- Customer Segmentation
- Weather prediction
- Face recognition
- Recommendation Systems
4. Advantages of Naive Bayes
- Effective with Small Training Data: Naive Bayes works well with limited training data, making it useful when it's challenging to gather a large, labeled dataset.
- Quick Training and Prediction: Naive Bayes models train rapidly and make predictions efficiently, saving computational time.
- Applicable to Binary and Multi-class Classifications: It can be used for both binary (two-class) and multi-class classifications, providing versatility across different types of problems.
- Interpretable Probability Results: Naive Bayes produces easily interpretable probabilities, giving a clear indication of the likelihood that a specific instance belongs to a particular class. This is valuable for decision-making.
- Resistant to Overfitting: Naive Bayes is less likely to overfit, particularly when the independence assumption holds approximately true. This makes it robust even in the presence of noisy data.
- High-Dimensional Data Performance: Performs well in datasets with many features, such as text classification with a large number of words. It handles a large feature set without a significant increase in computational complexity.
5. Limitations/Challenges of Naive Bayes
Naive Bayes classifiers have some drawbacks, including but not limited to the following:
- Zero Probability Issue: If a particular feature never appears with a specific class in the training data, the estimated conditional probability for that combination is zero, which forces the entire posterior for that class to zero. This causes problems whenever new data introduces previously unseen feature-class combinations. The phenomenon is known as the 'zero-frequency' problem, and a smoothing technique (such as Laplace smoothing) is needed to solve it; a minimal sketch appears after this list.
- Assumption of Feature Independence: Naive Bayes assumes features are unrelated, which might not be true in real situations, affecting prediction accuracy.
- Challenges with Continuous Data: The multinomial and Bernoulli variants assume discrete features, so continuous numerical data must either be discretized or handled with Gaussian Naive Bayes, whose normality assumption may not match the actual distribution of values within each feature.
- Sensitive to Irrelevant Features: It treats all features equally, making it sensitive to irrelevant features that could impact its performance.
- May Not Handle Imbalanced Data Well: Naive Bayes may face challenges with imbalanced datasets, affecting its ability to predict minority classes accurately.
- Highly Non-Normal Distributions: If the data distribution is highly non-normal, especially in the presence of outliers, Naive Bayes may not provide accurate predictions.
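To illustrate the zero-frequency point above: scikit-learn's Naive Bayes estimators expose an alpha parameter for additive (Laplace/Lidstone) smoothing, so a feature that never appeared with a class during training does not force the whole posterior to zero. A minimal sketch:

from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 applies Laplace (add-one) smoothing and is the library default;
# setting alpha close to 0 removes smoothing and reintroduces the zero-frequency problem
smoothed_model = MultinomialNB(alpha=1.0)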
6. High-Level Implementation Steps for Naive Bayes Classifier in Python
To see how the Naive Bayes classifier works, we will build a classification model using scikit-learn: a spam classifier that labels a given SMS message as spam or not spam.
Step 1. Importing and Understanding Data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
# reading the training data
docs = pd.read_csv('/content/smsspamcollection.csv',names=['label','sms_message'])
docs.head()
Step 2. Data preprocessing
# mapping labels to 0 and 1
docs['label'] = docs.label.map({'ham':0, 'spam':1})
Step 3. Split data into separate training and test set
# defining the feature and target variables
X = docs['sms_message']
y = docs['label']

# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Step 4. Transforming the train and test datasets
# vectorizing the sentences; removing stop words
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')
vect.fit(X_train)
# transforming the train and test datasets
X_train_transformed = vect.transform(X_train)
# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
X_test_transformed = vect.transform(X_test)
Step 5. Model Building & Predict the results
# training the NB model and making predictions
from sklearn.naive_bayes import MultinomialNB
mnb_model = MultinomialNB()
# fit
mnb_model.fit(X_train_transformed,y_train)
# predict class
y_pred = mnb_model.predict(X_test_transformed)
Step 6. Evaluating the Model
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm,cmap='BuPu',annot=True,fmt='d')
Here,
- 1201 Non-Spam SMSs have been correctly classified
- 175 Spam SMSs have been correctly classified
- 7 Non-Spam SMSs have been classified as Spam SMSs (False Positives or Type I Error)
- 10 Spam SMSs have been classified as Non-Spam (False Negatives or Type II Error)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {0:0.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision score: {0:0.2f}'.format(precision_score(y_test, y_pred)))
print('Recall score: {0:0.2f}'.format(recall_score(y_test, y_pred)))
print('F1 score: {0:0.2f}'.format(f1_score(y_test, y_pred)))
# ROC AUC is computed from predicted probabilities rather than hard class labels
print('The area under the curve is: {0:0.2f}'.format(roc_auc_score(y_test, mnb_model.predict_proba(X_test_transformed)[:, 1])))
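Once the model is trained, classifying a new message only requires passing it through the same fitted vectorizer before predicting. A minimal sketch, using a made-up example message:

# classifying a new, unseen SMS with the already fitted vectorizer and model
new_sms = ["Congratulations! You have won a free prize, call now"]  # hypothetical example
new_sms_transformed = vect.transform(new_sms)
print(mnb_model.predict(new_sms_transformed))        # predicted label: 1 = spam, 0 = ham
print(mnb_model.predict_proba(new_sms_transformed))  # class membership probabilities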
For detailed steps, please refer to this Python Notebook.
7. When to Use and When Not to Use Naive Bayes Classifier
When to use Naive Bayes Classifier:
- Suitable for problems with categorical or binary features.
- Effective in text classification tasks like spam detection, sentiment analysis, and topic categorization, especially with a large number of features (words).
- A good choice when simplicity and speed are crucial.
- Performs well with small datasets, making it useful when obtaining a large labeled dataset is challenging.
- Commonly used for spam filtering due to its efficiency in classifying emails based on word occurrences.
- Effective for predictive modeling in situations where feature independence is a reasonable assumption.
When not to use Naive Bayes Classifier:
- Avoid Naive Bayes when the predictive signal depends on complex interactions between features, which the independence assumption cannot capture.
- It may not perform well with continuous or numerical data without proper preprocessing.
- Naive Bayes may not be the best choice when features are highly correlated.
- Not ideal for highly imbalanced datasets where one class significantly outnumbers the other.
- Naive Bayes treats all features equally, leading to sensitivity to irrelevant features.
8. Naive Bayes Classifiers vs Logistic Regression
| | Naive Bayes | Logistic Regression |
|---|---|---|
| Probability Model | Based on Bayes' theorem; assumes independence between features. | Uses the logistic function to model the probability of a binary outcome. |
| Feature Independence | Assumes the features are conditionally independent given the class, which may not hold in real-world scenarios; as a result, Naive Bayes has higher bias but lower variance than logistic regression. | Does not assume feature independence, allowing more flexibility in capturing relationships. |
| Data Requirements | Can perform well with small datasets and is less prone to overfitting. | Generally requires more data to avoid overfitting. |
| Handling Features | Well suited to categorical or binary features. | Can handle both categorical and continuous features. |
| Robustness to Irrelevant Features | Sensitive to irrelevant features, as it treats all features equally. | Can handle irrelevant features better by adjusting their coefficients. |
| Model Type | Generative model: it models how the data is generated for each class, i.e. the class prior and the class-conditional likelihood of the features, and applies Bayes' theorem to obtain the posterior probability of the class. | Discriminative model: it models the posterior probability of the class given the features directly, learning the decision boundary between classes rather than how the data was generated. |
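As a rough sketch of how the two compare in practice, you could train a logistic regression model on the same bag-of-words features from Section 6 and compare its metrics with the Naive Bayes results (this assumes X_train_transformed, X_test_transformed, y_train, and y_test from the earlier steps are still in scope):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# training logistic regression on the same vectorized SMS data
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_transformed, y_train)
y_pred_lr = lr_model.predict(X_test_transformed)

print('Logistic Regression accuracy: {0:0.2f}'.format(accuracy_score(y_test, y_pred_lr)))
print('Logistic Regression F1 score: {0:0.2f}'.format(f1_score(y_test, y_pred_lr)))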
Happy Learning!