Naive Bayes (NB) is one of the most basic algorithms in the family of classification algorithms in machine learning. It is popular because it is not only straightforward but also produces effective results, sometimes even on hard problems. In this blog, I will try to explain the NB algorithm from scratch and keep it simple even for readers with very little background in machine learning. Later, I'll compare the results obtained with our custom implementation against those obtained from scikit-learn.
There are two main classes of classification algorithms, discriminative and generative. NB belongs to the category of generative learning algorithms. The good thing about generative algorithms is that they are fast and efficient even with a small number of training examples, and can therefore be implemented very easily at scale.
Introduction
The basis of the NB algorithm is Bayes' rule: \[Pr(h|D) = \frac{Pr(D|h)Pr(h)}{Pr(D)}\] where h is the hypothesis and D is the data, with one or more features impacting the hypothesis. In simple words, the probability (Pr) of a hypothesis h given data D is determined by the above formula. Pr(D|h) is the likelihood, Pr(h) is the prior probability, and Pr(h|D) is the posterior probability.
In the NB algorithm we make a very strong assumption: all the features are independent of each other given the class. Based on Bayes' rule we can compute the maximum a posteriori (MAP) hypothesis for any data set: \[h_{MAP} = \operatorname{argmax}_{h\in H} Pr(h|D)\] \[h_{MAP} = \operatorname{argmax}_{h\in H}\frac{Pr(D|h)Pr(h)}{Pr(D)} \quad (1)\] \[h_{MAP} = \operatorname{argmax}_{h\in H} Pr(D|h)Pr(h) \quad (2)\]
We can drop Pr(D) because it does not depend on h and therefore does not change the argmax. Further, if the class priors Pr(h) are uniformly distributed, the formula can be simplified even further: \[h_{MAP} = \operatorname{argmax}_{h\in H} Pr(D|h) \quad (3)\]
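The "naive" part of the algorithm is the independence assumption itself: for a data point with features x1, x2, ..., xn, the likelihood factorizes into a product of per-feature probabilities, which is exactly what we will compute in the examples below. \[Pr(D|h) = Pr(x_1, x_2, \ldots, x_n|h) = \prod_{i=1}^{n} Pr(x_i|h)\]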
Understanding Naive Bayes using examples
The above formulas can be understood better if we solve a couple of simple examples with data.
Example 1
| Days | Weather | Play? |
|------|---------|-------|
| 1    | SUNNY   | YES   |
| 2    | RAINY   | YES   |
| 3    | CLOUDY  | NO    |
| 4    | SUNNY   | YES   |
| 5    | RAINY   | NO    |
| 6    | SUNNY   | YES   |
| 7    | SUNNY   | YES   |
| 8    | CLOUDY  | NO    |
| 9    | SUNNY   | YES   |
| 10   | RAINY   | YES   |
Find the probability of play given that the weather is RAINY.

P(Play=YES|RAINY) = P(RAINY|Play=YES) * P(Play=YES) / P(RAINY) = (2/7 * 7/10) / (3/10) = 0.667

P(Play=NO|RAINY) = P(RAINY|Play=NO) * P(Play=NO) / P(RAINY) = (1/3 * 3/10) / (3/10) = 0.333

The 2/7 comes from the RAINY days on which play happened: 2 of the 7 days with Play=YES. The 7/10 is the overall probability of play, 7 days out of 10, and the 3/10 is the overall probability of RAINY weather, 3 days out of 10. The resulting probability of 0.667 means there is a 66.7% chance of play when the weather is rainy.

Now, taking the argmax as in (1), argmax(0.667, 0.333) = 0.667, which corresponds to the class Play=YES.
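As a quick check, the same numbers can be reproduced with a few lines of Python that simply count over the table above (the variable names weather and play are just for illustration):

# Weather and Play columns from the Example 1 table
weather = ["SUNNY", "RAINY", "CLOUDY", "SUNNY", "RAINY",
           "SUNNY", "SUNNY", "CLOUDY", "SUNNY", "RAINY"]
play = ["YES", "YES", "NO", "YES", "NO",
        "YES", "YES", "NO", "YES", "YES"]

n = len(weather)
p_rainy = weather.count("RAINY") / n                            # P(RAINY) = 3/10

for label in ("YES", "NO"):
    p_label = play.count(label) / n                             # P(Play=label)
    rainy_and_label = sum(1 for w, p in zip(weather, play)
                          if w == "RAINY" and p == label)
    p_rainy_given_label = rainy_and_label / play.count(label)   # P(RAINY|Play=label)
    posterior = p_rainy_given_label * p_label / p_rainy         # Bayes' rule
    print(label, round(posterior, 3))                           # YES 0.667, NO 0.333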
Example 2
Let's add some more features to the data that might influence the decision whether to play or not. We have added two more features to the same data, i.e. temperature and windy.
| Days | Outlook | Temperature | Windy | Play? |
|------|---------|-------------|-------|-------|
| 1    | SUNNY   | HOT         | TRUE  | YES   |
| 2    | RAINY   | MILD        | TRUE  | YES   |
| 3    | CLOUDY  | HOT         | FALSE | NO    |
| 4    | SUNNY   | HOT         | TRUE  | YES   |
| 5    | RAINY   | MILD        | FALSE | NO    |
| 6    | SUNNY   | HOT         | TRUE  | YES   |
| 7    | SUNNY   | COOL        | FALSE | YES   |
| 8    | CLOUDY  | HOT         | TRUE  | NO    |
| 9    | SUNNY   | COOL        | TRUE  | YES   |
| 10   | RAINY   | HOT         | TRUE  | YES   |
Now suppose that on day 11 we observe outlook=RAINY, temperature=MILD and windy=TRUE. We have to determine whether there will be play or not.

P(Play=YES | RAINY, MILD, TRUE) ∝ P(Play=YES) * P(RAINY|YES) * P(MILD|YES) * P(TRUE|YES) = 7/10 * 2/7 * 1/7 * 6/7 ≈ 0.0245

P(Play=NO | RAINY, MILD, TRUE) ∝ P(Play=NO) * P(RAINY|NO) * P(MILD|NO) * P(TRUE|NO) = 3/10 * 1/3 * 1/3 * 1/3 ≈ 0.0111

We calculated the above values using the independence assumption, keeping the class priors and dropping the common denominator as in (2). Taking the argmax again, argmax(0.0245, 0.0111) = 0.0245, which corresponds to the class Play=YES.
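Again, the same values can be reproduced with a short snippet that counts over the Example 2 table and multiplies the per-feature probabilities as in the factorization above (the names data and new_day are just for illustration):

# Outlook, Temperature, Windy and Play columns from the Example 2 table
data = [
    ("SUNNY",  "HOT",  "TRUE",  "YES"),
    ("RAINY",  "MILD", "TRUE",  "YES"),
    ("CLOUDY", "HOT",  "FALSE", "NO"),
    ("SUNNY",  "HOT",  "TRUE",  "YES"),
    ("RAINY",  "MILD", "FALSE", "NO"),
    ("SUNNY",  "HOT",  "TRUE",  "YES"),
    ("SUNNY",  "COOL", "FALSE", "YES"),
    ("CLOUDY", "HOT",  "TRUE",  "NO"),
    ("SUNNY",  "COOL", "TRUE",  "YES"),
    ("RAINY",  "HOT",  "TRUE",  "YES"),
]

new_day = ("RAINY", "MILD", "TRUE")  # features observed on day 11

for label in ("YES", "NO"):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                # prior P(Play=label)
    for i, value in enumerate(new_day):          # multiply the per-feature likelihoods
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(label, round(score, 4))                # YES 0.0245, NO 0.0111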
A relatively detailed example
Let's work through a relatively large, practical and detailed example so we can understand how to implement Naive Bayes in an effective way. For our implementation of the Bayes algorithm, we'll use the Pima Indians Diabetes data set. From this file you can download the whole data to your local drive; the metadata can be found in this file.
A short description of the data set: it contains 768 observations of Pima Indian patients. Each row of the data set contains the following measurements from the patients:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
Various types of variables can occur in combination in a data set, but for simplicity our dataset contains only numerical variables that lead to an outcome, i.e. whether a patient is diabetic or not.
Python implementation of Naive Bayes Algorithm
Using the above example, we can write a Python implementation of the problem. With real datasets we usually have to work hard at preprocessing first, i.e. cleaning the data so that it makes sense, but in our case we are provided with an already clean data set, which removes a big part of the work. We proceed with the following steps.
1. Loading data
We start by loading the data into memory from a CSV file. There are many Python packages for this purpose, but we use the built-in csv package. We iterate through each line in the data file and convert the whole data set into a list of lists, where each sublist is a single line with all features as floats.
The next thing we do straight away is to split our data set into training and testing sets. In this example, our split ratio is 70/30, i.e. 70% of the data will be randomly selected for training the model and 30% for testing it. Following is the code that does this:
import csv
import random
import math

# load the data into memory and return a list of lists
def loadData(filename):
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

# split the data into training and testing sets
def train_test_data(dataset, splitRatio):
    trainData = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainData:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]
2. Data Summary
We discussed earlier that Naive Bayes is a generative algorithm, so the model is comprised of summaries of the data set for each feature, which are then used for making predictions for a new data point. The summary consists of the mean and standard deviation of each attribute. Remember that the summary will be different for different types of features: for continuous variables the mean and standard deviation are straightforward; for boolean variables a binomial distribution is used, so its parameters are calculated from its own formula; and similarly, for categorical variables a multinomial distribution is used.
We start by separating the data by class and calculating the summary separately for each class, i.e. the mean and standard deviation of each attribute. The statistics_continuous() method computes the summary for continuous variables and the statistics_discrete() method computes the probability of each category of a discrete variable. After that we summarize the features by class. If there are any categorical variables, we have to pass their column indexes as a list to the summarize() method. Following is the code that does this:
def splitByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def statistics_continuous(numbers):
    mean = sum(numbers) / float(len(numbers))
    variance = sum([pow(x - mean, 2) for x in numbers]) / float(len(numbers) - 1)
    return [mean, math.sqrt(variance)]

def statistics_discrete(feature):
    n = len(feature)
    vals = list(set(feature))
    feature_stats = {}
    for val in vals:
        n_i = feature.count(val)
        p_i = n_i / n  # probability of a specific category
        feature_stats[val] = p_i
    return feature_stats

def summarize(dataset, categoricalList=[]):
    unpack_data = [x for x in zip(*dataset)]
    summaries = []
    if categoricalList != []:
        for cat in categoricalList:
            summaries.insert(cat, statistics_discrete(unpack_data[cat]))
    continuous_var = list(set([x for x in range(len(unpack_data))]) - set(categoricalList))
    continuous_data = [x for i, x in enumerate(unpack_data) if i in continuous_var]
    for i in range(len(continuous_data)):
        mean, stdev = statistics_continuous(continuous_data[i])
        summaries.insert(continuous_var[i], (mean, stdev))
    del summaries[-1]  # the decision variable must be the last column of the dataset
    return summaries

def summarizeByClass(dataset, categoricals):
    separated = splitByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        # categoricals is the list of column indexes of the categorical variable(s)
        summaries[classValue] = summarize(instances, categoricals)
    return summaries
3. Learning the model
After summarizing each class, we have to calculate Gaussian probabilities for each attribute of an input vector, using the mean and standard deviation estimated from the training dataset. Gaussian probabilities are calculated using the probability density function (PDF) of the Gaussian distribution, shown below, followed by the code that implements it.
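For reference, the Gaussian PDF for an attribute value x, where μ is the attribute's mean and σ its standard deviation within a class, is \[Pr(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]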
def estimateProbability_gaussian(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
We then calculate the conditional probability of an instance belonging to a particular class by multiplying the individual probabilities of each feature. This results in a map of class values to probabilities. The probabilities for categorical variables were calculated earlier in the statistics_discrete() method; for a new input vector we look up the category of the variable and use its probability from the model. If a category has not been observed in the training data, its probability could be estimated as 1/lengthOfTrainingData, but we have used 0.0001 for simplicity. Following is the code that does this:
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            try:
                # continuous feature: the summary is a (mean, stdev) tuple
                mean, stdev = classSummaries[i]
                x = inputVector[i]
                probabilities[classValue] *= estimateProbability_gaussian(x, mean, stdev)
            except (ValueError, TypeError):
                # categorical feature: the summary is a dict of category probabilities
                category = inputVector[i]
                try:
                    probability_categorical = classSummaries[i][category]
                    probabilities[classValue] *= probability_categorical
                except KeyError:
                    # the category has not been observed in the training data
                    probabilities[classValue] *= 0.0001
    return probabilities
After calculating the class probabilities, we determine the class with the highest probability for an input vector using predict() and classify the input vector to that class. We then use getPredictions() to record the classifications of the whole test set in a list. Following is the code that does this:
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions
4. Accuracy
The final step in building any machine learning model is to measure its accuracy, and so it is for Naive Bayes. We compare the predictions in the list returned by getPredictions() with the original labels in the testing data and determine how well our model has performed. Following is the code that does this:
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
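To tie the pieces together, here is a minimal driver sketch. The filename and the main() function name are just for illustration; it assumes the Pima data has been saved locally as pima-indians-diabetes.data.csv, and, since the introduction promised a comparison with scikit-learn, it also runs scikit-learn's GaussianNB on the same split, assuming scikit-learn is installed:

# scikit-learn is only needed for the comparison at the end
from sklearn.naive_bayes import GaussianNB

def main():
    filename = 'pima-indians-diabetes.data.csv'  # hypothetical name; use your local copy
    splitRatio = 0.7
    dataset = loadData(filename)
    trainingSet, testSet = train_test_data(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(
        len(dataset), len(trainingSet), len(testSet)))

    # our custom Naive Bayes: the Pima data has no categorical columns, so pass []
    summaries = summarizeByClass(trainingSet, [])
    predictions = getPredictions(summaries, testSet)
    print('Custom NB accuracy: {0:.2f}%'.format(getAccuracy(testSet, predictions)))

    # scikit-learn GaussianNB on the same split, for comparison
    X_train = [row[:-1] for row in trainingSet]
    y_train = [row[-1] for row in trainingSet]
    X_test = [row[:-1] for row in testSet]
    y_test = [row[-1] for row in testSet]
    model = GaussianNB()
    model.fit(X_train, y_train)
    print('scikit-learn GaussianNB accuracy: {0:.2f}%'.format(
        model.score(X_test, y_test) * 100.0))

if __name__ == '__main__':
    main()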
Conclusion
There are many Naive Bayes implementations out there, but I have added a very common scenario: working with categorical features. I did not find enough examples like this, so I thought I should add one. This allows the reader to use continuous, boolean or categorical variables in the feature set by simply listing the boolean or categorical feature columns in a list.
Note: If you want to understand the above implementation and run the code yourself, download the data from the link above. Then use this github link to clone the repository on your computer and place the data file in the same repository.