A typical example of a problem ML tries to solve is classification. It can be described as the ability, given some input data, to assign a ‘class label’ to a sample.

To make things clearer, let’s look at an example. Imagine we analyzed samples of objects and collected their specs. Now, given this information, we would like to know whether an object is window glass (from a vehicle or building) or not window glass (containers, tableware, or headlamps). Unfortunately, we do not have a formula that, given these values, will provide us with the answer.

Someone who has handled glass might be able to tell, just by looking at or touching an object, whether it is window glass or not. That is because they have acquired experience by examining many examples of different kinds of glass. That is exactly what happens with machine learning: we say that we ‘train’ the algorithm to learn from known examples.

We provide a ‘training set’ where we specify both the input attributes of each example and its category. The algorithm goes through the examples, learns the distinctive features of window glass, and can then infer the class of a given uncategorized example.

We will use a dataset titled ‘Glass Identification Database’, created by B. German of the Central Research Establishment, Home Office Forensic Science Service. The original dataset classified the glass into 7 classes: 4 types of window glass and 3 types of non-window glass. Our version treats all 4 window glass types as one class, and all 3 non-window glass types as another.

Attribute and Class Information:

Every row is an example and contains 11 attributes as listed below.

  1. Example number
  2. RI: refractive index
  3. Na: Sodium (unit of measurement: weight percent in corresponding oxide, as are attributes 4-10)
  4. Mg: Magnesium
  5. Al: Aluminum
  6. Si: Silicon
  7. K: Potassium
  8. Ca: Calcium
  9. Ba: Barium
  10. Fe: Iron
  11. Type of glass (class): 1 = window glass (from vehicle or building); 2 = not window glass (containers, tableware, or headlamps)

The following is an extract showing what the dataset looks like:

1,1.51824,12.87,3.48,1.29,72.95,0.6,8.43,0,0,1
2,1.51832,13.33,3.34,1.54,72.14,0.56,8.99,0,0,1
3,1.51747,12.84,3.5,1.14,73.27,0.56,8.55,0,0,1
...
196,1.52315,13.44,3.34,1.23,72.38,0.6,8.83,0,0,2
197,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0,0.32,1
198,1.523,13.31,3.58,0.82,71.99,0.12,10.17,0,0.03,1
199,1.51905,13.6,3.62,1.11,72.64,0.14,8.76,0,0,1
200,1.52213,14.21,3.82,0.47,71.77,0.11,9.57,0,0,1

Naive Bayes classifier

One of the simplest yet most effective algorithms worth trying for a classification problem is Naive Bayes. It is a probabilistic method based on Bayes’ theorem with a naive independence assumption between the input attributes.

We define C as the class we are analyzing and x as the input data or observation. Bayes’ theorem, shown below, gives the probability of class C given the observation x. This is equal to the probability of class C (without looking at the input), multiplied by the probability of the observation given class C, divided by the probability of the observation.

P(C) is also called the ‘prior probability’ because it is the knowledge we have as to the value of C before looking at the observables x. We also know that P(C = 0) + P(C = 1) = 1.

P(x | C) is called the class likelihood, which is the probability that an event belonging to C has the associated observation value x. In statistical inference, we make a decision using the information provided by a sample. In this case, we assume that the sample is drawn from some distribution that obeys a known model, for example, Gaussian. Part of this task is to generate the Gaussian that describes our data, so we can use the probability density function to compute the probability for a given attribute value. As already mentioned, every attribute will be treated as independent from the others.
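Thanks to the naive independence assumption, the likelihood of the whole observation factorizes into a product of per-attribute likelihoods, one Gaussian per attribute:

P(x|C) = p(x₁|C) ⋅ p(x₂|C) ⋅ … ⋅ p(x₉|C) = ∏ᵢ p(xᵢ|C), with i = 1, …, 9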

Finally, P(x), also called the evidence, is the probability that an observation x is seen, regardless of the class C of the example.

P(C|x) = P(C) ⋅ P(x|C) / P(x)

The left-hand side of the above equation is the ‘posterior probability’, which is the probability of class C after having seen the observation x.

At this point, given the posterior probability of each class, we are able to decide which one is the most likely. It is interesting to note that the denominator is the same for all classes, so we can simplify the calculation by comparing only the numerator of Bayes’ theorem.
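In other words, the predicted class C* is simply the one that maximizes the numerator:

C* = argmax_C P(C) ⋅ P(x|C)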

Read data

First things first, we want to read our dataset so we are able to perform analysis on it. It is a CSV file, so we could use the csv Python library, but I personally prefer to use something more powerful like pandas.

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas.read_csv will read our CSV file into a DataFrame, which is a two-dimensional tabular data structure with labeled axes. In this way, our dataset will be damn easy to manipulate. I also decided to label my columns so everything will be much clearer.

import pandas

ATTR_NAMES = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
FIELD_NAMES = ["Num"] + ATTR_NAMES + ["Class"]

data = pandas.read_csv(args.filename, names=FIELD_NAMES)
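Just as a quick sanity check (not part of the original script), one could peek at the first few rows of the resulting DataFrame and count the records:

print(data.head())
print(len(data))  # 200 records in our dataset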

Now that we have our dataset in memory we want to split it into two parts: the training set and the test set. The former will be used to train our ML model, while the latter to check how accurate the model is.

The following code splits the data by dividing the dataset into chunks (based on blocks_num) and choosing as the test set the chunk at position test_block, which is also removed from the training set. If nothing is provided apart from the dataset, the function will just use the same data for both the training and test sets.

import numpy

def split_data(data, blocks_num=1, test_block=0):
    # Divide the DataFrame into blocks_num chunks
    blocks = numpy.array_split(data, blocks_num)
    # The chunk at position test_block becomes the test set
    test_set = blocks[test_block]
    if blocks_num > 1:
        del blocks[test_block]
    # The remaining chunks form the training set
    training_set = pandas.concat(blocks)
    return training_set, test_set
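As a minimal usage sketch, a 5-way split where the second chunk is held out for testing would look like this:

training_set, test_set = split_data(data, blocks_num=5, test_block=1)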

Prior

Estimating P(C) from a given training set is pretty straightforward. Prior probabilities are based on previous experience; in this case, the percentage of each class in the dataset.

We want to count the frequency of each class and get the ratio by dividing by the number of examples. The code to do so is extremely concise, also because the pandas library makes the calculation of frequencies trivial.

def __prior(self):
    # Frequency of each class in the training set
    counts = self.__training_set["Class"].value_counts().to_dict()
    # self.__n is the total number of training examples
    self.__priors = {k: v / self.__n for k, v in counts.items()}
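As a concrete illustration (the exact counts here are just an assumption for the example): if 149 of the 200 examples were labeled window glass and 51 non-window glass, the resulting priors would be P(C=1) = 149/200 = 0.745 and P(C=2) = 51/200 = 0.255, which indeed sum to 1.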

Mean and variance

To calculate the ‘pdf’ (probability density function) we need to know what the distribution that describes our data looks like. To do that we need to compute the mean and the variance (or, equivalently, the standard deviation) of each attribute for every single class. Since we have 9 attributes and 2 classes in our dataset, we will end up with 18 mean-variance pairs.

Again, for this task we can use the helper functions provided by pandas: we select the column of interest and call its mean() and std() methods.

def __calculate_mean_variance(self):
    self.__mean_variance = {}
    for c in self.__training_set["Class"].unique():
        filtered_set = self.__training_set[(self.__training_set['Class'] == c)]
        m_v = {}
        for attr_name in ATTR_NAMES:
            m_v[attr_name] = []
            m_v[attr_name].append(filtered_set[attr_name].mean())
            m_v[attr_name].append(math.pow(filtered_set[attr_name].std(), 2))
        self.__mean_variance[c] = m_v
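The resulting __mean_variance dictionary is keyed first by class and then by attribute name, so a lookup like self.__mean_variance[1]["RI"] holds the [mean, variance] pair describing the refractive index of the window glass examples.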

Gaussian Probability Density Function

The function to compute the ‘pdf’ is just a static method that takes as input the value of the attribute and the description of the Gaussian (mean and variance) and returns a probability according to the ‘pdf’ equation.
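For reference, the Gaussian probability density function for an attribute value x, given the mean μ and variance σ² estimated for that class and attribute, is:

p(x|C) = 1 / √(2πσ²) ⋅ exp(−(x − μ)² / (2σ²))

which is exactly what the method below computes.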

@staticmethod
def __calculate_probability(x, mean, variance):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * variance)))
    return (1 / (math.sqrt(2 * math.pi * variance))) * exponent

Predict

Now that we have everything in place, it is time to predict our classes.

Basically, what the following code does is iterate through the test set and, for each sample, calculate the probability of every class using Bayes’ theorem. The only difference here is that we use log probabilities, since the probabilities for each class given an attribute value are small and could underflow.

So it becomes: log[p(x|C) ⋅ P(C)] = log P(C) + ∑ᵢ log p(xᵢ|C), with i = 1, …, 9

def predict(self):
    predictions = {}
    for _, row in self.__test_set.iterrows():
        results = {}
        for k, v in self.__priors.items():
            p = 0
            for attr_name in ATTR_NAMES:
                prob = self.__calculate_probability(
                    row[attr_name],
                    self.__mean_variance[k][attr_name][0],
                    self.__mean_variance[k][attr_name][1])
                if prob > 0:
                    p += math.log(prob)
            results[k] = math.log(v) + p
        # Pick the class with the highest log probability (ties broken by the largest class label)
        best = max(results.values())
        predictions[int(row["Num"])] = max(k for k in results if results[k] == best)
    return predictions

As a prediction, we take the class with the highest probability. If two or more classes end up having the same probability, we decided to take the class with the largest label (the one that comes first in reverse alphabetical order), but this was not really needed for the given dataset.

Accuracy

Once we obtain the predictions, we can compare them to the class values present in the test dataset and calculate the ratio of correct ones over the total number of predictions. This measure is called accuracy and allows us to estimate the quality of the ML model used.

def calculate_accuracy(test_set, predictions):
    correct = 0
    for _, t in test_set.iterrows():
        if t["Class"] == predictions[t["Num"]]:
            correct += 1
    return (correct / len(test_set)) * 100.0

In our tests, we obtained a 90% accuracy using the same dataset for both training and test.

Cross validation

Now that we know how to perform a prediction, let’s look at the data again. Does it really make any sense to train an algorithm on something and then test it on the same data? Probably not. We would rather have two different sets, but this is not always possible when you do not have enough data.

Our example dataset contains 200 records. Ideally, we would like to squeeze it as much as we can and test on all 200 samples, but then we would not have anything left to train the model.

The way ML people address this is called cross validation. The dataset is divided into chunks (as shown before), say 5 for example, and the model is trained on 4 of the 5 chunks while the remaining chunk is used for testing. This operation is repeated as many times as the number of chunks, so that the test is performed on every chunk. Finally, the accuracy values collected for every repetition are averaged. A minimal driver loop for this procedure is sketched below.
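This sketch assumes the Naive Bayes code above is wrapped in a class called naive_bayes_classifier (a hypothetical name; the actual class name in the repository may differ) that takes a training set and a test set and exposes predict():

BLOCKS_NUM = 5

accuracies = []
for test_block in range(BLOCKS_NUM):
    training_set, test_set = split_data(data, blocks_num=BLOCKS_NUM, test_block=test_block)
    classifier = naive_bayes_classifier(training_set, test_set)  # hypothetical class name
    predictions = classifier.predict()
    accuracies.append(calculate_accuracy(test_set, predictions))

print(sum(accuracies) / len(accuracies))  # average accuracy over the 5 folds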

Again, even using 5-fold cross validation, we obtained the same accuracy of 90%.

Zero-R classifier

The Zero-R classifier simply predicts the majority class (the class that is most frequent in the training set). Sometimes a not-very-intelligent learning algorithm can achieve high accuracy on a particular learning task simply because the task is easy; for example, it can achieve high accuracy in a 2-class problem if the dataset is very imbalanced.

Running a Zero-R classifier on our dataset, just as a comparison with Naive Bayes, we obtained 74.5% accuracy.

Here is the trivial implementation:

class zero_r_classifier(object):
    def __init__(self, training_set, test_set):
        self.__test_set = test_set
        classes = training_set["Class"].value_counts().to_dict()
        self.__most_freq_class = max(classes, key=classes.get)

    def predict(self):
        predictions = {}
        for _, row in self.__test_set.iterrows():
            predictions[int(row["Num"])] = self.__most_freq_class
        return predictions

Comparing the Zero-R accuracy (74.5%) with the Naive Bayes one (90%), we can see that our model is pretty accurate when compared to such a simplistic baseline.

Popular implementation

One of the most popular Python libraries implementing several ML algorithms for classification, regression and clustering is scikit-learn. The library also has a Gaussian Naive Bayes classifier implementation and its API is fairly easy to use. You can find the documentation and some examples here: http://scikit-learn.org/…/sklearn.naive_bayes.GaussianNB.html
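For comparison, here is a minimal sketch of how the same task could be tackled with scikit-learn’s GaussianNB, assuming data is the DataFrame loaded earlier; training and testing on the full dataset mirrors our first experiment and is kept deliberately simple:

from sklearn.naive_bayes import GaussianNB

X = data[ATTR_NAMES].values   # the 9 numeric attributes
y = data["Class"].values      # the class labels (1 or 2)

clf = GaussianNB()
clf.fit(X, y)                 # train on the whole dataset
predicted = clf.predict(X)
print((predicted == y).mean() * 100)  # accuracy in percent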

This implementation is definitely not production ready, even though it obtains the same predictions as scikit-learn, since what is happening under the hood is the same. On the other hand, it has not been engineered too much, as its scope was only to play with Naive Bayes. Anyway, most of the time looking at a simple implementation might be easier and more effective. You can find the whole source code and the dataset used here: https://github.com/amallia/GaussianNB
