Machine Learning: Naive Bayes

Introduction

 
In the previous chapter, we studied Decision Trees.
 
In this chapter, we will study Naive Bayes.
 
Note: concepts are easier to understand when you can relate them to yourself or your own life, so try to connect each idea to a familiar, everyday situation.
 

Key Terms

 
1. Coin
A coin has two sides, Head and Tail. If an experiment involves more than one coin, the coins are considered distinct unless stated otherwise.
 
2. Die
A die has six faces marked 1, 2, 3, 4, 5, and 6. If we have more than one die, all dice are considered distinct unless stated otherwise.
 
3. Playing Cards
A pack of playing cards has 52 cards. There are 4 suits (spade, heart, diamond, and club), each having 13 cards. There are two colors, red (heart and diamond) and black (spade and club), each having 26 cards. In the 13 cards of each suit there are 3 face cards, namely the king, queen, and jack, so there are 12 face cards in all. Also, there are 16 honor cards, 4 of each suit, namely the ace, king, queen, and jack.
 

Types of Experiments

 
1. Deterministic Experiment
Experiments which, when repeated under identical conditions, produce the same result or outcome every time are known as deterministic experiments.
 
2. Probabilistic/Random Experiment
Experiments which, when repeated under identical conditions, do not produce the same outcome every time, but whose outcome in each trial is one of several possible outcomes, are called random experiments.
 

Important Definitions

 
(i)     Trial
When a random experiment is repeated under identical conditions, each repetition of the experiment is called a trial.
(ii)    Sample Space
The set of all possible outcomes of an experiment is called the sample space of the experiment and it is denoted by S.
(iii)   Event
A subset of the sample space associated with a random experiment is called an event or case.
(iv)   Sample Points
The outcomes of an experiment are called the sample point.
(v)    Certain Event
An event that must occur, whatever be the outcomes, is called a certain or sure event.
(vi)   Impossible Event
An event that cannot occur in a particular random experiment is called an impossible event.
(vii)  Elementary Event
An event containing only one sample point is called an elementary (or indecomposable) event.
(viii) Favorable Event
Let S be the sample space associated with a random experiment and let E ⊂ S. Then the elementary events belonging to E are known as the events favorable to E.
(ix)  Compound Events
An event containing more than one sample point is called a compound (or decomposable) event.
 

Probability

 
If there are n elementary events associated with a random experiment and m of them are favorable to an event A, then the probability of happening or occurrence of A, denoted by P(A), is given by P(A) = m / n = Number of favourable cases / Total number of possible cases
 

Types of Events

 
(i) Equally Likely Events
The given events are said to be equally likely if none of them is expected to occur in preference to the other.
 
(ii) Mutually Exclusive Events
A set of events is said to be mutually exclusive if the happening of one excludes the happening of the other. If A and B are mutually exclusive, then P(A ∩ B) = 0
 
(iii) Exhaustive Events
A set of events is said to be exhaustive if the performance of the experiment always results in the occurrence of at least one of them. If E1, E2, … , En are exhaustive events, then E1 ∪ E2 ∪ … ∪ En = S, i.e., P(E1 ∪ E2 ∪ E3 ∪ … ∪ En) = 1
 
(iv) Independent Events
Two events A and B associated with a random experiment are independent if the probability of occurrence or non-occurrence of A is not affected by the occurrence or non-occurrence of B. i.e., P(A ∩ B) = P(A) P(B)
 

The complement of an Event

 
Let A be an event in a sample space S. The complement of A is the set of all sample points of S other than the sample points in A, and it is denoted by A' = {n : n ∈ S, n ∉ A}
(i) A ∪ A' = S, so P(A ∪ A') = 1
(ii) A ∩ A' = φ, so P(A ∩ A') = 0
(iii) (A')' = A
 

Partition of a Sample Space

 
The events A1, A2,…., An represent a partition of the sample space S, if they are pairwise disjoint, exhaustive and have non-zero probabilities. i.e.,
(i) Ai ∩ Aj = φ; i ≠ j; i,j= 1,2, …. ,n
(ii) A1 ∪ A2 ∪ … ∪ An = S
(iii) P(Ai) > 0, ∀ i = 1,2, …. ,n
 

Important Results on Probability

 
(i) If a set of events A1, A2, …, An are mutually exclusive, then
Ai ∩ Aj = φ for i ≠ j (in particular, A1 ∩ A2 ∩ … ∩ An = φ) and
P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)
 
(ii) If a set of events A1, A2,…., An are exhaustive, then P(A1 ∪ A2 ∪ … ∪ An) = 1
 
(iii) The probability of an impossible event is 0, i.e., P(A) = 0 if A is an impossible event.
 
(iv) The probability of a sure event (the whole sample space) is 1, i.e., P(S) = 1.
 
(v) Odds in favour of A = P(A) / P(A')
 
(vi) Odds against A = P(A') / P(A)
 
(vii) Addition Theorem of Probability
(a) For two events A and B
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
(b) For three events A, B and C
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) -P(A ∩ B) – P(B ∩ C) – P(A ∩ C) + P(A ∩ B ∩ C)
(c) For n events A1, A2, …, An
P(A1 ∪ A2 ∪ … ∪ An) = Σi P(Ai) – Σi<j P(Ai ∩ Aj) + Σi<j<k P(Ai ∩ Aj ∩ Ak) – … + (–1)^(n–1) P(A1 ∩ A2 ∩ … ∩ An)
 
(viii) If A and B are two events, then P(A ∩ B) ≤ P(A) ≤ P(A ∪ B) ≤ P(A) + P(B)
 
(ix) If A and B are two events associated with a random experiment, then
(a) P(A' ∩ B) = P(B) – P(A ∩ B)
(b) P(A ∩ B') = P(A) – P(A ∩ B)
(c) P[(A ∩ B') ∪ (A' ∩ B)] = P(A) + P(B) – 2P(A ∩ B)
(d) P(A' ∩ B') = 1 – P(A ∪ B)
(e) P(A' ∪ B') = 1 – P(A ∩ B)
(f) P(A) = P(A ∩ B) + P(A ∩ B')
(g) P(B) = P(A ∩ B) + P(A' ∩ B)
 
(x)
(a) P (exactly one of A, B occurs) = P(A) + P(B) – 2P(A ∩ B) = P(A ∪ B) – P(A ∩ B)
(b) P(neither A nor B) = P(A’ ∩ B’) = 1 – P(A ∪ B)
 
(xi) If A, B and C are three events, then
(a) P(exactly one of A, B, C occurs) = P(A) + P(B) + P(C) – 2P(A ∩ B) – 2P(B ∩ C) – 2P(A ∩ C) + 3P(A ∩ B ∩ C)
(b) P (at least two of A, B, C occurs) = P(A ∩ B) + P(B ∩ C) + P(C ∩ A) – 2P(A ∩ B ∩ C)
(c) P(exactly two of A, B, C occurs) = P(A ∩ B) + P(B ∩ C) + P(A ∩ C) – 3P(A ∩ B ∩ C)
 
(xii)
(a) P(A ∪ B) = P(A) + P(B), if A and B are mutually exclusive events.
(b) P(A ∪ B ∪ C) = P(A) + P(B) + P(C), if A, B and C are mutually exclusive events.
 
(xiii) P(A') = 1 – P(A)
 
(xiv) P(A ∪ A') = P(S) = 1, P(φ) = 0
 
(xv) P(A ∩ B) = P(A) x P(B), if A and B are independent events.
 
(xvi) If A1, A2, …, An are independent events associated with a random experiment, the probability of occurrence of at least one of them
= P(A1 ∪ A2 ∪ … ∪ An)
= 1 – P(A1' ∩ A2' ∩ … ∩ An')
= 1 – P(A1')P(A2')…P(An')
 
(xvii) If B ⊆ A, then P(A ∩ B') = P(A) – P(B)
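
Several of these results can be checked mechanically on a small sample space. Below is a minimal plain-Python sketch (the two events chosen are purely illustrative) that verifies the addition theorem (vii a) and the independence rule (xv) on the two-dice sample space:

from fractions import Fraction

# Sample space: all ordered pairs from rolling two distinct dice
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def P(event):
    # Probability = number of favourable cases / total number of cases
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] + s[1] == 7        # sum is 7
B = lambda s: s[0] % 2 == 0           # first die shows an even number

# Addition theorem: P(A U B) = P(A) + P(B) - P(A n B)
lhs = P(lambda s: A(s) or B(s))
rhs = P(A) + P(B) - P(lambda s: A(s) and B(s))
print(lhs == rhs)                                        # True

# Independence: P(A n B) = P(A) * P(B)
print(P(lambda s: A(s) and B(s)) == P(A) * P(B))         # True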
 

Conditional Probability

 
Let A and B be two events associated with a random experiment. Then the probability of occurrence of event A under the condition that B has already occurred, where P(B) ≠ 0, is called the conditional probability,
i.e., P(A/B) = P(A ∩ B) / P(B)
 
If A has already occurred and P (A) ≠ 0, then
P(B/A) = P(A ∩ B) / P(A)
 
Also, P(A / B) + P(A' / B) = 1
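
As a minimal sketch in plain Python, here is the conditional probability P(king | face card) = 4/12 = 1/3, computed from the playing-card sample space described in the Key Terms section:

from fractions import Fraction

# 52-card deck: (rank, suit)
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spade', 'heart', 'diamond', 'club']
deck = [(r, s) for r in ranks for s in suits]

def P(event):
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

A = lambda card: card[0] == 'K'                  # card is a king
B = lambda card: card[0] in ('J', 'Q', 'K')      # card is a face card

# P(A/B) = P(A n B) / P(B)
print(P(lambda c: A(c) and B(c)) / P(B))         # 1/3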
 

Multiplication Theorem on Probability

 
(i) If A and B are two events associated with a random experiment, then
P(A ∩ B) = P(A) P(B/A), if P(A) ≠ 0
or
P(A ∩ B) = P(B) P(A/B), if P(B) ≠ 0
 
(ii) If A1, A2,…., An are n events associated with a random experiment, then 
P(A1 ∩ A2 ∩…. ∩ An) = P(A1) P(A2 / A1) P(A3 / (A1 ∩ A2)) …P(An / (A1 ∩ A2 ∩ A3 ∩…∩A n – 1))
 

Total Probability

 
Let S be the sample space and let E1, E2,…., En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En then
 
P(A) = P(E1)P(A / E1) + P(E2)P(A / E2) + … + P(En) P(A / En)
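
A minimal sketch of the law of total probability in plain Python (the three events and all the numbers below are purely illustrative):

# P(A) = sum_i P(Ei) * P(A|Ei)
# Illustrative setup: three machines E1, E2, E3 and A = "item is defective"
priors      = [0.5, 0.3, 0.2]        # P(E1), P(E2), P(E3)  -- must sum to 1
likelihoods = [0.01, 0.02, 0.05]     # P(A|E1), P(A|E2), P(A|E3)

p_a = sum(p * l for p, l in zip(priors, likelihoods))
print(p_a)                            # 0.021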
 

Bayes’ Theorem

 
Let S be the sample space and let E1, E2, …, En be n mutually exclusive and exhaustive events associated with a random experiment. If A is any event which occurs with E1 or E2 or … or En, then the probability of occurrence of Ei when A has occurred is
 
P(Ei / A) = P(Ei) P(A / Ei) / [P(E1) P(A / E1) + P(E2) P(A / E2) + … + P(En) P(A / En)],  for i = 1, 2, …, n
 
where,
  1. P(Ei), i = 1, 2, …, n are known as the prior probabilities
  2. P(A / Ei), i = 1, 2, …, n are called the likelihood probabilities
  3. P(Ei / A), i = 1, 2, …, n are called the posterior probabilities
 
In short: Posterior = (Likelihood × Prior) / Evidence
 
OR
 
P(A ∣ B) = P(B ∣ A) P(A) / P(B)
 
where A and B are events and P ( B ) ≠ 0.
  1. P ( A ∣ B ) is a conditional probability: the likelihood of event A occurring given that B is true.
  2. P ( B ∣ A ) is also a conditional probability: the likelihood of event B occurring given that A is true.
  3. P ( A ) and P ( B ) are the probabilities of observing A and B independently of each other; these are known as the marginal probabilities.

Random Variable

 
Let U or S be a sample space associated with a given random experiment. A real-valued function X defined on U or S, i.e.,
 
X: U → R is called a random variable.
 
There are two types of random variables.
 
(i) Discrete Random Variable
If the range of the real-valued function X: U → R is a finite or countably infinite set of real numbers, X is called a discrete random variable, e.g., in tossing two coins, S = {HH, HT, TH, TT}; if X denotes the number of heads, then X(HH) = 2, X(HT) = X(TH) = 1 and X(TT) = 0.
 
(ii) Continuous Random Variable
If the range of X is an interval (a, b) of R, then X is called a continuous random variable, e.g., the height of a randomly chosen person.
 

Probability Distribution

 
If a random variable X takes values X1, X2, …, Xn with respective probabilities P1, P2, …, Pn, then the set of pairs (Xi, Pi) is known as the probability distribution of X. In other words, a probability distribution gives the values of the random variable along with their corresponding probabilities, e.g., for X = the number of heads in two coin tosses, X takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
 

Mathematical Expectation/Mean

 
If X is a discrete random variable which assumes values X1, X2, …, Xn with respective probabilities P1, P2, …, Pn, then the mean (or expectation) of X is defined as
E(X) = X̄ = P1X1 + P2X2 + … + PnXn = Σ(i=1 to n) PiXi
 
Important Results
 
(i) Variance: V(X) = σX² = E(X²) – (E(X))²,
where E(X²) = Σ(i=1 to n) Xi² P(Xi)
 
(ii) Standard Deviation: σX = √V(X) = √(E(X²) – (E(X))²)
 
(iii) If Y = a X + b, then
(a) E(Y) = E(aX + b) = aE(X) + b
(b) V(Y) = σY² = a²V(X) = a²σX²
(c) σY = √V(Y) = |a|σX
 
(iv) If Z = aX² + bX + c, then
E(Z) = E(aX² + bX + c) = aE(X²) + bE(X) + c
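
A small plain-Python sketch of these results, using the two-coin example above (X = number of heads), including a check of E(aX + b) = aE(X) + b:

# Discrete random variable: X = number of heads in two coin tosses
xs = [0, 1, 2]
ps = [0.25, 0.5, 0.25]

E   = sum(p * x for p, x in zip(ps, xs))        # mean E(X) = 1.0
E2  = sum(p * x**2 for p, x in zip(ps, xs))     # E(X^2)   = 1.5
var = E2 - E**2                                 # V(X)     = 0.5
std = var ** 0.5                                # sigma    = 0.707...

# Linearity: E(aX + b) = a*E(X) + b, with a = 3, b = 2
a, b = 3, 2
E_y = sum(p * (a * x + b) for p, x in zip(ps, xs))
print(E, var, std, E_y == a * E + b)            # 1.0 0.5 0.7071... True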
 

Bayes' Theorem Explained Using an Example

 
Let us try to understand the above formula through an example:
 
Question
 
Consider C-SharpCorner: if a person visits the site, the chance that he/she revisits is 60%, and the chance that a person likes a particular article is 75%. The chance that a person who likes an article comes back is 75%. We need to find the probability that a person liked the article given that he/she does not revisit the website.
 
Solution
 
A: A person re-visits the website
B:  A person likes an article
 
So, P(A) = 0.6, P(A') = 0.4
P(B) = 0.75 , P(B') =0.25
P(A|B) = 0.75, P(A'|B) = 0.25
 
P(B|A') = P(A'|B) × P(B) / P(A')
            = (0.25 × 0.75) / 0.4
            = 0.46875
 
So, from the above calculation, given that a person does not revisit the website, there is a ~47% chance that he/she liked the article.
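
The same calculation as a small plain-Python sketch (the numbers are exactly those stated in the question above):

# Given quantities from the example
p_a         = 0.6           # P(A):  person re-visits
p_not_a     = 1 - p_a       # P(A') = 0.4
p_b         = 0.75          # P(B):  person likes an article
p_a_given_b = 0.75          # P(A|B): revisits, given he/she liked the article

# Bayes' theorem: P(B|A') = P(A'|B) * P(B) / P(A')
p_not_a_given_b = 1 - p_a_given_b               # P(A'|B) = 0.25
p_b_given_not_a = p_not_a_given_b * p_b / p_not_a
print(round(p_b_given_not_a, 5))                # 0.46875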
 

When is Naive Bayes Classifier Used?

 
1. Real-time prediction
The Naive Bayes algorithm is fast to train and fast to evaluate, hence it is well suited for real-time predictions.
 
2. Multi-class prediction
The probability of each of several classes of a target variable can be predicted using the Naive Bayes algorithm.
 
3. Recommendation system
A Naive Bayes classifier, together with collaborative filtering, can be used to build a recommendation system. Such a system uses data mining and machine learning techniques to filter previously unseen information and predict whether a user would appreciate a given resource or not.
 
4. Text classification/ Sentiment Analysis/ Spam Filtering
Thanks to its good performance on multi-class problems and its independence assumption, the Naive Bayes algorithm has a high success rate in text classification; therefore, it is widely used for sentiment analysis and spam filtering.
 

Difference between Bayes and Naive Bayes Algorithm

 
The naive Bayes classifier is an approximation to the Bayes classifier, in which we assume that the features are conditionally independent given the class instead of modeling their full conditional distribution given the class. A Bayes classifier is best interpreted as a decision rule. Suppose we seek to estimate the class of (i.e., classify) an observation given a vector of features. Denote the class C and the vector of features (F1, F2,…, Fk). Given a probability model underlying the data (that is, given the joint distribution of (C, F1, F2,…, Fk)), the Bayes classification function chooses a class by maximizing the probability of the class given the observed features: argmaxc P(C=c∣F1=f1,…,Fk=fk)
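
To make the contrast concrete, here is a minimal sketch in plain Python (the helper name, the dictionaries, and the toy numbers are illustrative, not part of any library) of the naive Bayes decision rule, which scores each class by its prior times the product of per-feature conditional probabilities rather than a full joint model:

from math import prod   # Python 3.8+

def naive_bayes_predict(priors, likelihoods):
    # priors: {class: P(class)}
    # likelihoods: {class: [P(f1 = observed value | class), ..., P(fk = observed value | class)]}
    # Decision rule: argmax_c P(c) * prod_i P(fi | c)
    scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    return max(scores, key=scores.get)

# Example with two classes and two observed features
print(naive_bayes_predict({'spam': 0.4, 'ham': 0.6},
                          {'spam': [0.8, 0.7], 'ham': [0.1, 0.3]}))   # 'spam'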
 

Assumptions of Naive Bayes Algorithm

  1. All the features are conditionally independent given the class, that is, there are no dependencies between any of the features.
  2. Each of the features is given equal importance (equal weight).

Things to keep in mind to get the best out of Naive Bayes Algorithm

  1. Categorical Inputs
    Naive Bayes assumes categorical input attributes, such as binary, categorical, or nominal values.
  2. Gaussian Inputs
    If the input variables are real-valued, a Gaussian distribution is assumed, in which case the algorithm performs better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g., values that are more than 3 or 4 standard deviations from the mean).
  3. Classification Problems
    Naive Bayes works best with binary and multiclass classification.
  4. Log Probabilities
    The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. We should use a log transform of the probabilities to avoid numerical underflow (a minimal sketch follows this list).
  5. Kernel Functions
    Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used such as a variety of kernel density functions.
  6. Update Probabilities
    When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
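
As a minimal illustration of point 4 above (plain Python; the numbers are illustrative), multiplying many small likelihoods underflows to zero, while summing their logarithms stays comfortably within floating-point range:

from math import log

probs = [1e-5] * 100            # 100 small per-feature likelihoods

direct = 1.0
for p in probs:
    direct *= p                 # underflows to 0.0

log_score = sum(log(p) for p in probs)   # about -1151.3, perfectly representable
print(direct, log_score)
# Classes can be compared by their log scores directly, since log is monotonic.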

Types of Naive Bayes Algorithm

 

1. Multinomial Naive Bayes

 
This is mostly used for document classification problems, i.e., whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
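
As a minimal sketch of this idea (assuming scikit-learn is available; the toy documents and labels below are made up purely for illustration), word counts from a CountVectorizer can be fed to MultinomialNB:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["the team won the match", "parliament passed the bill",
          "new phone released today", "election results announced"]
labels = ["sports", "politics", "technology", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # word-count features

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["the bill on election reform"])))   # likely 'politics'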
 

2. Bernoulli Naive Bayes

 
This is similar to multinomial naive Bayes, but the predictors are Boolean variables: the parameters used to predict the class take only the values yes or no, for example, whether a word occurs in the text or not.
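
A similar sketch with binary word-presence features (again assuming scikit-learn; the toy data is illustrative) uses BernoulliNB:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs   = ["free prize click now", "meeting at noon",
          "win a free phone", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer(binary=True)      # 1 if the word occurs, 0 otherwise
X = vec.fit_transform(docs)

clf = BernoulliNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["claim your free prize"])))   # likely 'spam'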
 

3. Gaussian Naive Bayes

 
When the predictors take continuous values and are not discrete, we assume that these values are sampled from a Gaussian (normal) distribution.
 

4. Semi-supervised parameter estimation

 
Given a way to train a naive Bayes classifier from labeled data, it is possible to construct a semi-supervised training algorithm that learns from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop.
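
One simple way to realize such a loop is self-training: fit on the labeled data, pseudo-label the unlabeled data, keep only confident predictions, and refit. Below is a minimal sketch (assuming scikit-learn; the 0.95 confidence threshold and 5 iterations are arbitrary illustrative choices, and the classical formulation of this idea uses EM instead):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
labeled = np.zeros(len(y), dtype=bool)
labeled[::5] = True                         # pretend only every 5th label is known
X_lab, y_lab = X[labeled], y[labeled]
X_unl = X[~labeled]

model = GaussianNB().fit(X_lab, y_lab)
for _ in range(5):                          # supervised learner run in a loop
    proba = model.predict_proba(X_unl)
    pseudo = proba.argmax(axis=1)
    confident = proba.max(axis=1) > 0.95    # keep only confident pseudo-labels
    model = GaussianNB().fit(np.vstack([X_lab, X_unl[confident]]),
                             np.concatenate([y_lab, pseudo[confident]]))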
 

What is Naive Bayes?

 
Suppose we have the following table of 1,000 fruits, each described by three binary features (Long, Sweet, Yellow):
 
Type     Long   Not Long   Sweet   Not Sweet   Yellow   Not Yellow   Total
Banana    400        100     350         150      450           50     500
Orange      0        300     150         150      300            0     300
Other     100        100     150          50       50          150     200
Total     500        500     650         350      800          200    1000
 
So the objective of the classifier is to predict if a given fruit is a ‘Banana’ or ‘Orange’ or ‘Other’ when only the 3 features (long, sweet, and yellow) are known.
 
So to predict this we need to find 3 probabilities. Let's start
 
1. We first calculate the "prior" probabilities for each class of fruit
P[Y=Banana] = 500/1000 = 0.5
P[Y=Orange] = 300/1000 = 0.3
P[Y=Other] = 200/1000 = 0.2
 
2. We then compute the probabilities of the evidence that go in the denominator
P[x1=Long] = 500/1000 = 0.5
P[x2=Sweet] = 650/1000 = 0.65
P[x3=Yellow] = 800/1000 = 0.8
 
3. Now we calculate the likelihoods of the evidence that go in the numerator
P[x1=Long | Y=Banana] = 400/500 = 0.8
P[x2=Sweet | Y=Banana] = 350/500 = 0.7
P[x3=Yellow | Y=Banana] = 450/500 = 0.9
 
4. At the end we substitute all the values in the Naive Bayes Formula,
 
a. P(Banana | Long, Sweet and Yellow)
= ((P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana))*P(Banana))/ (P(Long) * P(Sweet) * P(Yellow))
= ((0.8 * 0.7 * 0.9) * 0.5)/(0.5 * 0.65 * 0.8)
=  0.97
 
b. P(Orange | Long, Sweet and Yellow)
=  ((P(Long | Orange) * P(Sweet | Orange) * P(Yellow | Orange))*P(Orange))/ (P(Long) * P(Sweet) * P(Yellow))
= 0
 
c. P(Others | Long, Sweet and Yellow)
= ((P(Long | Others) * P(Sweet | Others) * P(Yellow | Others))*P(Others))/ (P(Long) * P(Sweet) * P(Yellow))
= 0.07
 
So, from the Naive Bayes Classifier, we predict the fruit is a Banana.
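
The same arithmetic can be written as a short plain-Python sketch that reads the counts straight from the table above:

# Counts taken from the fruit table above
counts = {
    'Banana': {'Long': 400, 'Sweet': 350, 'Yellow': 450, 'Total': 500},
    'Orange': {'Long': 0,   'Sweet': 150, 'Yellow': 300, 'Total': 300},
    'Other':  {'Long': 100, 'Sweet': 150, 'Yellow': 50,  'Total': 200},
}
grand_total = 1000
# Evidence: P(Long) * P(Sweet) * P(Yellow)
evidence = (500 / grand_total) * (650 / grand_total) * (800 / grand_total)

posteriors = {}
for fruit, c in counts.items():
    prior = c['Total'] / grand_total
    likelihood = (c['Long'] / c['Total']) * (c['Sweet'] / c['Total']) * (c['Yellow'] / c['Total'])
    posteriors[fruit] = likelihood * prior / evidence

print({f: round(p, 2) for f, p in posteriors.items()})   # {'Banana': 0.97, 'Orange': 0.0, 'Other': 0.07}
print(max(posteriors, key=posteriors.get))               # Banana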
 

Python Implementation of Naive Bayes

 
Let's take the example of the Iris dataset; you can load it directly from the sklearn dataset repository, or read it from a CSV file as in the first implementation below. Feel free to use any dataset; there are some very good datasets available on Kaggle and with Google Colab.
 

1. Using functions

from csv import reader
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print('[%s] => %d' % (value, i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)

# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Split dataset by class then calculate statistics for each class
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

# Make a prediction with Naive Bayes on the Iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# fit model
model = summarize_by_class(dataset)
# define a new record
row = [5.7, 2.9, 4.2, 1.3]
# predict the label
label = predict(model, row)
print('Data=%s, Predicted: %s' % (row, label))
The output that I am getting is
 
[Iris-versicolor] => 0
[Iris-setosa] => 1
[Iris-virginica] => 2
 
Data=[5.7, 2.9, 4.2, 1.3], Predicted: 0
 

2. Using Sklearn

from sklearn import datasets

# Load dataset
wine = datasets.load_wine()
# print the names of the 13 features
print("Features: ", wine.feature_names)

# print the label types of wine (class_0, class_1, class_2)
print("Labels: ", wine.target_names)

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=109)

# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian classifier
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = gnb.predict(X_test)

print("y_pred: ", y_pred)

# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
The output that I got is
 
Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
 
Labels: ['class_0' 'class_1' 'class_2']
 
y_pred: [0 0 1 2 0 1 0 0 1 0 2 2 2 2 0 1 1 0 0 1 2 1 0 2 0 0 1 2 0 1 2 1 1 0 1 1 0 2 2 0 2 1 0 0 0 2 2 0 1 1 2 0 0 2]
 
Accuracy: 0.9074074074074074
 

3. Using TensorFlow

from matplotlib import colors
from matplotlib import pyplot as plt
from sklearn import datasets
import numpy as np
import tensorflow as tf

# Note: this example uses the TensorFlow 1.x API (tf.distributions, tf.Session)

class TFNaiveBayesClassifier:
    dist = None

    def fit(self, X, y):
        # Separate training points by class (nb_classes * nb_samples * nb_features)
        unique_y = np.unique(y)
        points_by_class = np.array([
            [x for x, t in zip(X, y) if t == c]
            for c in unique_y])

        # Estimate mean and variance for each class / feature
        # shape: nb_classes * nb_features
        mean, var = tf.nn.moments(tf.constant(points_by_class), axes=[1])

        # Create a univariate normal distribution with the
        # known mean and variance
        self.dist = tf.distributions.Normal(loc=mean, scale=tf.sqrt(var))

    def predict(self, X):
        assert self.dist is not None
        nb_classes, nb_features = map(int, self.dist.scale.shape)

        # Conditional probabilities log P(x|c) with shape
        # (nb_samples, nb_classes)
        cond_probs = tf.reduce_sum(
            self.dist.log_prob(
                tf.reshape(
                    tf.tile(X, [1, nb_classes]), [-1, nb_classes, nb_features])),
            axis=2)

        # uniform priors
        priors = np.log(np.array([1. / nb_classes] * nb_classes))

        # posterior log probability, log P(c) + log P(x|c)
        joint_likelihood = tf.add(priors, cond_probs)

        # normalize to get (log)-probabilities
        norm_factor = tf.reduce_logsumexp(
            joint_likelihood, axis=1, keep_dims=True)
        log_prob = joint_likelihood - norm_factor
        # exp to get the actual probabilities
        return tf.exp(log_prob)


if __name__ == '__main__':
    iris = datasets.load_iris()
    # Only take the first two features
    X = iris.data[:, :2]
    y = iris.target

    tf_nb = TFNaiveBayesClassifier()
    tf_nb.fit(X, y)

    # Create a regular grid and classify each point
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 30),
                         np.linspace(y_min, y_max, 30))
    s = tf.Session()
    Z = s.run(tf_nb.predict(np.c_[xx.ravel(), yy.ravel()]))
    # Extract probabilities of class 2 and 3
    Z1 = Z[:, 1].reshape(xx.shape)
    Z2 = Z[:, 2].reshape(xx.shape)

    # Plot
    fig = plt.figure(figsize=(5, 3.75))
    ax = fig.add_subplot(111)

    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
               edgecolor='k')
    # Swap signs to make the contour dashed (MPL default)
    ax.contour(xx, yy, -Z1, [-0.5], colors='k')
    ax.contour(xx, yy, -Z2, [-0.5], colors='k')

    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_title('TensorFlow decision boundary')
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())

    plt.tight_layout()
    fig.savefig('tf_iris.png', bbox_inches='tight')
The output is a scatter plot of the iris data with the TensorFlow decision boundaries, saved as tf_iris.png.
 

Conclusion

 
In this article, we studied Naive Bayes.
 
In the next chapter, we will study k-means clustering.
Author
Rohit Gupta