Naive Bayes implementation in Python from scratch

Naive Bayes (NB) is one of the most basic algorithms in the class of classification algorithms in machine learning. It is popular because it is not only straightforward but also often produces effective results, even on hard problems. In this blog, I try to explain the NB algorithm from scratch and keep it simple even for readers with very little background in machine learning. Later, I’ll compare the results obtained with our custom algorithm against the ones obtained from scikit-learn.

There are two main classes of classification algorithms, i.e. discriminative and generative. NB belongs to the category of generative learning algorithms. The good thing about generative algorithms is that they are fast, remain effective with a small number of training examples, and can therefore be implemented very easily at scale.


Introduction

The basis of the NB algorithm is Bayes' rule: \[Pr(h|D) = \frac{Pr(D|h)Pr(h)}{Pr(D)}\] where h is the hypothesis and D is the data, with one or more features impacting the hypothesis. In simple words, the probability (Pr) of a hypothesis h given data D is determined by the above formula: Pr(D|h) is the likelihood, Pr(h) is the prior, and Pr(h|D) is the posterior probability.
In the NB algorithm we make a very strong assumption: the features in the data are conditionally independent of each other given the class. Based on Bayes' rule we can compute the maximum a posteriori (MAP) hypothesis for any data set \[h_{MAP} = argmax_{h\in H}Pr(h|D)\] \[h_{MAP} = argmax_{h\in H}\frac{Pr(D|h)Pr(h)}{Pr(D)} --- (1)\] \[h_{MAP} = argmax_{h\in H}Pr(D|h)Pr(h) --- (2)\]
We can eliminate Pr(D) because it does not depend on h. Further, if the class probabilities Pr(h) are uniformly distributed, the formula can be simplified to \[h_{MAP} = argmax_{h\in H}Pr(D|h) --- (3)\]
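Concretely, the conditional independence assumption is what lets us factor the likelihood of a data point with features x_1, ..., x_n into a product of per-feature likelihoods: \[Pr(D|h) = Pr(x_1, x_2, \dots, x_n|h) = \prod_{i=1}^{n} Pr(x_i|h)\] This factorization is what makes the algorithm "naive", and it is exactly what we compute in the examples below.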


Understanding Naive Bayes using examples

The above formula can be understood better if we work through a couple of simple examples with actual data.

Example 1
Days   Weather   Play?
1      SUNNY     YES
2      RAINY     YES
3      CLOUDY    NO
4      SUNNY     YES
5      RAINY     NO
6      SUNNY     YES
7      SUNNY     YES
8      CLOUDY    NO
9      SUNNY     YES
10     RAINY     YES


Find → P(Play=YES|RAINY) = P(RAINY|Play=YES) * P(Play=YES) / P(RAINY)
→ P(Play=YES|RAINY) = (2/7 * 7/10) / (3/10) = 0.667
→ P(Play=NO|RAINY) = (1/3 * 3/10) / (3/10) = 0.333

The 2/7 comes from the RAINY days on which play happened (YES): 2 of the 7 YES days. The 7/10 is simply the total probability of play (YES), which is 7 days out of 10. Similarly, the 3/10 is the total probability of RAINY weather, which is 3 days out of 10. The resulting probability is 0.667, which means there is a 66.7% chance that there will be play if the weather is rainy.
Taking the argmax over the classes, argmax(0.667, 0.333) = 0.667, which corresponds to the class Play=YES.
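To make the arithmetic concrete, here is a minimal sketch in plain Python (the toy table above hard-coded as a list of tuples) that reproduces the Example 1 numbers:

# Toy data from Example 1: (weather, play)
data = [("SUNNY", "YES"), ("RAINY", "YES"), ("CLOUDY", "NO"), ("SUNNY", "YES"),
        ("RAINY", "NO"), ("SUNNY", "YES"), ("SUNNY", "YES"), ("CLOUDY", "NO"),
        ("SUNNY", "YES"), ("RAINY", "YES")]

def posterior(play, weather):
    # P(play | weather) = P(weather | play) * P(play) / P(weather)
    n = len(data)
    n_play = sum(1 for w, p in data if p == play)
    n_weather = sum(1 for w, p in data if w == weather)
    n_both = sum(1 for w, p in data if w == weather and p == play)
    return (n_both / n_play) * (n_play / n) / (n_weather / n)

print(posterior("YES", "RAINY"))  # 0.667
print(posterior("NO", "RAINY"))   # 0.333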


Example 2

Let's add some more features to the data that might be responsible for the decision whether to play or not. We have added two more features to the same data, i.e. temperature and windy.

Days   Outlook   Temperature   Windy   Play?
1      SUNNY     HOT           TRUE    YES
2      RAINY     MILD          TRUE    YES
3      CLOUDY    HOT           FALSE   NO
4      SUNNY     HOT           TRUE    YES
5      RAINY     MILD          FALSE   NO
6      SUNNY     HOT           TRUE    YES
7      SUNNY     COOL          FALSE   YES
8      CLOUDY    HOT           TRUE    NO
9      SUNNY     COOL          TRUE    YES
10     RAINY     HOT           TRUE    YES

Now suppose that on day 11 we observe the following features: outlook=RAINY, temperature=MILD and windy=TRUE. We have to determine the more probable class: play or no play.
P(Play=YES | outlook=RAINY, temperature=MILD, windy=TRUE) ∝ P(Play=YES) * P(RAINY|YES) * P(MILD|YES) * P(TRUE|YES)

→ 7/10 * 2/7 * 1/7 * 6/7 = 0.0245
P(Play=NO | outlook=RAINY, temperature=MILD, windy=TRUE) ∝ P(Play=NO) * P(RAINY|NO) * P(MILD|NO) * P(TRUE|NO)
→ 3/10 * 1/3 * 1/3 * 1/3 = 0.0111

We calculated the above scores using the independence assumption to factor the likelihood and equation (2) to drop Pr(D). Using the argmax again, argmax(0.0245, 0.0111) = 0.0245, which corresponds to the class Play=YES.
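As a minimal sketch again (toy table hard-coded), the same scores can be reproduced with a few lines of Python:

# Toy data from Example 2: (outlook, temperature, windy, play)
rows = [("SUNNY", "HOT", "TRUE", "YES"), ("RAINY", "MILD", "TRUE", "YES"),
        ("CLOUDY", "HOT", "FALSE", "NO"), ("SUNNY", "HOT", "TRUE", "YES"),
        ("RAINY", "MILD", "FALSE", "NO"), ("SUNNY", "HOT", "TRUE", "YES"),
        ("SUNNY", "COOL", "FALSE", "YES"), ("CLOUDY", "HOT", "TRUE", "NO"),
        ("SUNNY", "COOL", "TRUE", "YES"), ("RAINY", "HOT", "TRUE", "YES")]

def class_score(label, features):
    # P(class) multiplied by the product of P(feature_i | class), as in equation (2)
    in_class = [r for r in rows if r[-1] == label]
    score = len(in_class) / len(rows)
    for i, value in enumerate(features):
        score *= sum(1 for r in in_class if r[i] == value) / len(in_class)
    return score

print(class_score("YES", ("RAINY", "MILD", "TRUE")))  # 0.0245
print(class_score("NO", ("RAINY", "MILD", "TRUE")))   # 0.0111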

A relatively detailed example

Let's work through a relatively large, practical and detailed example so we can understand how to implement Naive Bayes in an effective way. For our implementation of the Bayes algorithm, we'll use the Pima Indians Diabetes data set. From this file you can download the whole data to your local drive. Metadata can be found in this file.
The data set contains 768 observations of Pima Indian patients. Each row of the data set contains the following measurements from the patients:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

A data set can contain various types of variables in combination, but for simplicity our data set contains only numerical variables that lead to an outcome, i.e. whether a patient is diabetic or not.


Python implementation of Naive Bayes Algorithm

Using the above example, we can write a Python implementation of this problem. With real data sets we first have to work hard on preprocessing, i.e. cleaning the data so that it makes sense, but in our example we are already provided with a clean data set, which cuts a good half of the work. We therefore proceed with the following steps.

1. Loading data

We start by loading the data into memory from a CSV file. There are several built-in Python packages for this purpose; we use the csv package. We iterate through each line in the data file and convert the whole data set into a list of lists, where each sublist is a single row with all features as floats.
The next thing we do straight away is split our data set into a training set and a testing set. In this example, our split ratio is 70/30, i.e. 70% of the data is randomly selected for training the model and 30% for testing it. Following is the code that does this

import csv
import random
import math
# load the data into memory and return a list of lists
def loadData(filename):
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
# split the data into training and testing
def train_test_data(dataset, splitRatio):
    trainData = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainData:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]


2. Data Summary
We discussed earlier that Naive Bayes is a generative algorithm, so the model is composed of per-class summaries of each feature in the data set, which are then used for making predictions on a new data point. For a continuous attribute the summary consists of its mean and standard deviation. Remember that the summary will be different for different types of features: for continuous variables the mean and standard deviation are straightforward; for boolean variables a binomial distribution is used, so the summary follows its own formula; and similarly, for categorical variables a multinomial distribution is used, so the summary is the probability of each category.

We start by separating the data by class and calculating the summary separately for each class. Then we calculate the statistics for each attribute: the statistics_continuous() method calculates the mean and standard deviation of a continuous variable, and the statistics_discrete() method calculates the probability of each category of a discrete variable. After that we summarize the features by class. If there are any categorical variables, we have to pass their column indices to the summarize() method as a list. Following is the code that does this

def splitByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def statistics_continuous(numbers):
    mean = sum(numbers) / float(len(numbers))
    variance = sum([pow(x - mean, 2) for x in numbers]) / float(len(numbers) - 1)
    return [mean, math.sqrt(variance)]

def statistics_discrete(feature):
    n = len(feature)
    vals = list(set(feature))
    feature_stats = {}
    for val in vals:
        n_i = feature.count(val)
        p_i = n_i/n # probability for a specific object
        feature_stats[val] = (p_i)
    return feature_stats

def summarize(dataset, categoricalList=[]):
    unpack_data = [x for x in zip(*dataset)] # one tuple per column
    summaries = []
    # insert the categorical summaries first, in column order
    for cat in sorted(categoricalList):
        summaries.insert(cat, statistics_discrete(unpack_data[cat]))
    # remaining (continuous) columns, sorted so the insert positions match the column indices
    continuous_var = sorted(set(range(len(unpack_data))) - set(categoricalList))
    continuous_data = [x for i, x in enumerate(unpack_data) if i in continuous_var]
    for i in range(len(continuous_data)):
        mean, stdev = statistics_continuous(continuous_data[i])
        summaries.insert(continuous_var[i], (mean, stdev))
    del summaries[-1] # the decision variable must be the last column of the dataset
    return summaries

def summarizeByClass(dataset, categoricals):
    separated = splitByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances, categoricals) # categoricals is the list of categorical column indices
    return summaries



3. Learning the model

After summarizing each class, we calculate the Gaussian probability of each input value, using the mean and standard deviation estimated from the training data set. Gaussian probabilities are calculated with the probability density function (PDF) of the Gaussian distribution.
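For reference, the Gaussian PDF with mean \mu and standard deviation \sigma is \[f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] Following is the code that implements it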

def estimateProbability_gaussian(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent


We’ll then calculate the conditional probability of an instance belonging to a particular class by multiplying the individual probabilities of each feature (the class priors are treated as uniform, as in equation (3)). This results in a map of class values to probabilities. The probabilities for categorical variables were calculated earlier in the statistics_discrete() method; for a new input vector we look up the category of the variable and use its estimated probability in the model. If a category has not been observed in the training data, its probability could be estimated as 1/lengthOfTrainingData, but we simply use 0.0001. Following is the code that does this

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            if isinstance(classSummaries[i], dict):  # categorical feature: {category: probability}
                category = inputVector[i]
                # unseen categories get a small non-zero probability instead of zero
                probabilities[classValue] *= classSummaries[i].get(category, 0.0001)
            else:  # continuous feature: (mean, stdev)
                mean, stdev = classSummaries[i]
                probabilities[classValue] *= estimateProbability_gaussian(inputVector[i], mean, stdev)
    return probabilities

After calculating the class probabilities, we determine the class with the highest probability for an input vector using predict() and classify the new input vector as that class. We then use getPredictions() to record all classifications for the test set in a list. Following is the code that does this

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel


def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions


4. Accuracy

The final step when building any machine learning model is to determine its accuracy, and so it is for Naive Bayes. We compare the predictions we collected with getPredictions() against the original test data and determine how well our model has performed. Following is the code that does this

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
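
Putting it all together, a minimal driver might look like the sketch below. The file name pima-indians-diabetes.csv is only an assumption here; use whatever name you saved the downloaded data under. The Pima data set has no categorical features, so the list of categorical column indices is empty.

# minimal end-to-end driver; the file name is an assumption, adjust it to your local copy
def main():
    filename = 'pima-indians-diabetes.csv'
    splitRatio = 0.70
    dataset = loadData(filename)
    trainSet, testSet = train_test_data(dataset, splitRatio)
    print('Split %d rows into train=%d and test=%d rows' % (len(dataset), len(trainSet), len(testSet)))
    summaries = summarizeByClass(trainSet, [])  # no categorical columns in the Pima data set
    predictions = getPredictions(summaries, testSet)
    print('Accuracy: %.2f%%' % getAccuracy(testSet, predictions))

if __name__ == '__main__':
    main()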


Conclusion

There are many Naive Bayes implementations out there, but I have added support for a very common scenario: working with categorical features. I did not find enough examples like this, so I thought I should add one. This allows the reader to use continuous, boolean or categorical variables in the feature set simply by listing the boolean or categorical feature columns by index.


Note: If you want to understand the above implementation and run the code yourself, download the data from the link above. Then use this github link to clone the repository on your computer and place the data file in the same repository.
