Naive Bayes

The naive Bayes algorithm is one of the most widely used classification algorithms. It combines Bayes' theorem with a conditional-independence assumption on the features to make predictions, which makes it direct and easy to understand. In practice it often gives surprisingly good results.

1. Algorithm Overview

At its core, naive Bayes computes P(class | data), the probability that a data point belongs to a given class.

The heart of the algorithm is Bayes' theorem, which gives us a way to compute the probability that a data point belongs to a particular class. For a sample with features X1 and X2:

P(class | X1, X2) = P(X1 | class) * P(X2 | class) * P(class) / P(X1, X2)

Here P(class | X1, X2) denotes the probability that the sample belongs to the given class. The factorization of the numerator into per-feature terms relies on the assumption that the features are conditionally independent given the class.

We take the class that maximizes P(class | X1, X2) to be the class of the sample, which gives us the predicted label. (The denominator P(X1, X2) is the same for every class, so it can be dropped when comparing classes.)

Below, we walk through the code to see how naive Bayes computes P(class | X1, X2) and uses it to predict the class of a data point.

2. Algorithm Walkthrough

2.1 Separating the Data by Class

First, we group all rows by their class label, producing a new array. The separation function is given below:

double ***separate_by_class(double **dataset, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    /* separated[c] will hold pointers to the rows of class c */
    double ***separated = (double ***)malloc(class_num * sizeof(double **));
    for (i = 0; i < class_num; i++) {
        separated[i] = (double **)malloc(class_num_list[i] * sizeof(double *));
    }
    /* index[c] counts how many rows of class c have been placed so far */
    int *index = (int *)malloc(class_num * sizeof(int));
    for (i = 0; i < class_num; i++) {
        index[i] = 0;
    }
    for (i = 0; i < row; i++) {
        for (j = 0; j < class_num; j++) {
            if (dataset[i][col - 1] == j) {
                /* rows are shared with dataset, not copied */
                separated[j][index[j]] = dataset[i];
                index[j]++;
            }
        }
    }
    free(index);
    return separated;
}

Taking the 10 rows below as an example, we use the function above to separate the data by class.

X1              X2              Label
2.000000        2.000000        0.000000
2.005000        1.995000        0.000000
2.010000        1.990000        0.000000
2.015000        1.985000        0.000000
2.020000        1.980000        0.000000
5.000000        5.000000        1.000000
5.005000        4.995000        1.000000
5.010000        4.990000        1.000000
5.015000        4.985000        1.000000
5.020000        4.980000        1.000000

The full example code:

#include <stdio.h>
#include <stdlib.h>

double ***separate_by_class(double **dataset, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    /* separated[c] will hold pointers to the rows of class c */
    double ***separated = (double ***)malloc(class_num * sizeof(double **));
    for (i = 0; i < class_num; i++) {
        separated[i] = (double **)malloc(class_num_list[i] * sizeof(double *));
    }
    int *index = (int *)malloc(class_num * sizeof(int));
    for (i = 0; i < class_num; i++) {
        index[i] = 0;
    }
    for (i = 0; i < row; i++) {
        for (j = 0; j < class_num; j++) {
            if (dataset[i][col - 1] == j) {
                separated[j][index[j]] = dataset[i];
                index[j]++;
            }
        }
    }
    free(index);
    return separated;
}

int main() {
    int row = 10, col = 3;
    double **dataset = (double **)malloc(row * sizeof(double *));
    for (int i = 0; i < row; ++i) {
        dataset[i] = (double *)malloc(col * sizeof(double));
    }
    for (int i = 0; i < 5; i++) {
        dataset[i][0] = 2 + i * 0.005;
        dataset[i][1] = 2 - i * 0.005;
        dataset[i][2] = 0;
    }
    for (int i = 0; i < 5; i++) {
        dataset[i + 5][0] = 5 + i * 0.005;
        dataset[i + 5][1] = 5 - i * 0.005;
        dataset[i + 5][2] = 1;
    }
    int class_num_list[2] = {5, 5};
    double ***separated = separate_by_class(dataset, 2, class_num_list, 10, 3);
    /* print the rows class by class */
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 5; j++) {
            for (int k = 0; k < 3; k++) {
                printf("%f\t", separated[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
    return 0;
}

The output is:

2.000000        2.000000        0.000000
2.005000        1.995000        0.000000
2.010000        1.990000        0.000000
2.015000        1.985000        0.000000
2.020000        1.980000        0.000000

5.000000        5.000000        1.000000
5.005000        4.995000        1.000000
5.010000        4.990000        1.000000
5.015000        4.985000        1.000000
5.020000        4.980000        1.000000

2.2 Computing Summary Statistics

We need the mean and standard deviation of each feature column, given by:

mean = (x_1 + x_2 + ... + x_n) / n
std  = sqrt( ((x_1 - mean)^2 + ... + (x_n - mean)^2) / (n - 1) )

Note that this is the sample standard deviation, dividing by n - 1.

The corresponding code:

double get_mean(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    return mean / row;
}

double get_std(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    double std = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    mean /= row;
    for (i = 0; i < row; i++) {
        std += pow(dataset[i][col] - mean, 2);
    }
    /* sample standard deviation: divide by (row - 1) */
    return sqrt(std / (row - 1));
}

Again taking the 10 rows below as an example, we use the functions above to compute the statistics per class.

X1              X2              Label
2.000000        2.000000        0.000000
2.005000        1.995000        0.000000
2.010000        1.990000        0.000000
2.015000        1.985000        0.000000
2.020000        1.980000        0.000000
5.000000        5.000000        1.000000
5.005000        4.995000        1.000000
5.010000        4.990000        1.000000
5.015000        4.985000        1.000000
5.020000        4.980000        1.000000

The full example code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double get_mean(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    return mean / row;
}

double get_std(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    double std = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    mean /= row;
    for (i = 0; i < row; i++) {
        std += pow(dataset[i][col] - mean, 2);
    }
    return sqrt(std / (row - 1));
}

double ***separate_by_class(double **dataset, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    double ***separated = (double ***)malloc(class_num * sizeof(double **));
    for (i = 0; i < class_num; i++) {
        separated[i] = (double **)malloc(class_num_list[i] * sizeof(double *));
    }
    int *index = (int *)malloc(class_num * sizeof(int));
    for (i = 0; i < class_num; i++) {
        index[i] = 0;
    }
    for (i = 0; i < row; i++) {
        for (j = 0; j < class_num; j++) {
            if (dataset[i][col - 1] == j) {
                separated[j][index[j]] = dataset[i];
                index[j]++;
            }
        }
    }
    free(index);
    return separated;
}

double **summarize_dataset(double **dataset, int row, int col) {
    int i;
    /* summary[j] = {mean, std} of feature column j */
    double **summary = (double **)malloc((col - 1) * sizeof(double *));
    for (i = 0; i < (col - 1); i++) {
        summary[i] = (double *)malloc(2 * sizeof(double));
        summary[i][0] = get_mean(dataset, row, i);
        summary[i][1] = get_std(dataset, row, i);
    }
    return summary;
}

double ***summarize_by_class(double **train, int class_num, int *class_num_list, int row, int col) {
    int i;
    double ***summarize = (double ***)malloc(class_num * sizeof(double **));
    double ***separate = separate_by_class(train, class_num, class_num_list, row, col);
    for (i = 0; i < class_num; i++) {
        summarize[i] = summarize_dataset(separate[i], class_num_list[i], col);
    }
    return summarize;
}

int main() {
    int row = 10;
    int col = 3;
    int class_num = 2;
    int class_num_list[2] = {5, 5};
    double **dataset = (double **)malloc(row * sizeof(double *));
    for (int i = 0; i < row; ++i) {
        dataset[i] = (double *)malloc(col * sizeof(double));
    }
    for (int i = 0; i < 5; i++) {
        dataset[i][0] = 2 + i * 0.005;
        dataset[i][1] = 2 - i * 0.005;
        dataset[i][2] = 0;
    }
    for (int i = 0; i < 5; i++) {
        dataset[i + 5][0] = 5 + i * 0.005;
        dataset[i + 5][1] = 5 - i * 0.005;
        dataset[i + 5][2] = 1;
    }
    double ***summarize = summarize_by_class(dataset, class_num, class_num_list, row, col);
    /* print {mean, std} of each feature, class by class */
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            for (int k = 0; k < 2; k++) {
                printf("%f\t", summarize[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
    return 0;
}

For each class in turn, we obtain the mean and standard deviation of every feature column:

2.010000        0.007906
1.990000        0.007906

5.010000        0.007906
4.990000        0.007906

2.3 Gaussian Probability Density Function

The Gaussian probability density function is:

f(x) = 1 / (sqrt(2 * pi) * std) * exp(-(x - mean)^2 / (2 * std^2))

The code to compute it:

double calculate_probability(double x, double mean, double std)
{
    double pi = acos(-1.0);
    double p = 1 / (pow(2 * pi, 0.5) * std) * exp(-(pow(x - mean, 2) / (2 * pow(std, 2))));
    return p;
}

2.4 Class Probabilities and Prediction

Now we come to the key step of naive Bayes: computing the probability that a data point belongs to each class. For every class, we start from the class prior P(class) and multiply in the Gaussian likelihood of each feature; the largest entry in the resulting probability array gives the predicted class.

The code:

double *calculate_class_probabilities(double ***summaries, double *test_row, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    double *probabilities = (double *)malloc(class_num * sizeof(double));
    /* start from the class prior P(class) */
    for (i = 0; i < class_num; i++) {
        probabilities[i] = (double)class_num_list[i] / row;
    }
    /* multiply in the Gaussian likelihood of each feature */
    for (i = 0; i < class_num; i++) {
        for (j = 0; j < col - 1; j++) {
            probabilities[i] *= calculate_probability(test_row[j], summaries[i][j][0], summaries[i][j][1]);
        }
    }
    return probabilities;
}

double predict(double ***summaries, double *test_row, int class_num, int *class_num_list, int row, int col) {
    int i;
    double *probabilities = calculate_class_probabilities(summaries, test_row, class_num, class_num_list, row, col);
    double label = 0;
    double best_prob = probabilities[0];
    /* pick the class with the largest probability */
    for (i = 1; i < class_num; i++) {
        if (probabilities[i] > best_prob) {
            label = i;
            best_prob = probabilities[i];
        }
    }
    return label;
}

3. Full Implementation

Below, split across several C files, we apply the naive Bayes algorithm to the Iris dataset and classify its samples.

3.1 read_csv.c

The code for this step is the same as in earlier chapters and is not repeated here.

3.2 k_fold.c

The code for this step is the same as in earlier chapters and is not repeated here.

3.3 test_prediction.c

#include <stdlib.h>
#include <stdio.h>

extern double predict(double ***summaries, double *test_row, int class_num, int *class_num_list, int row, int col);
extern int get_class_num(double **dataset, int row, int col);
extern int *get_class_num_list(double **dataset, int class_num, int row, int col);
extern double ***summarize_by_class(double **train, int class_num, int *class_num_list, int row, int col);

double *get_test_prediction(double **train, int train_size, double **test, int test_size, int col)
{
    int class_num = get_class_num(train, train_size, col);
    int *class_num_list = get_class_num_list(train, class_num, train_size, col);
    /* test_size equals fold_size */
    double *predictions = (double *)malloc(test_size * sizeof(double));
    double ***summaries = summarize_by_class(train, class_num, class_num_list, train_size, col);
    for (int i = 0; i < test_size; i++) {
        predictions[i] = predict(summaries, test[i], class_num, class_num_list, train_size, col);
    }
    return predictions;  /* predictions for the test fold */
}

3.4 score.c

The code for this step is the same as in earlier chapters and is not repeated here.

3.5 evaluate.c

#include <stdlib.h>
#include <stdio.h>

extern double *get_test_prediction(double **train, int train_size, double **test, int test_size, int col);
extern double accuracy_metric(double *actual, double *predicted, int fold_size);
extern double ***cross_validation_split(double **dataset, int row, int n_folds, int fold_size, int col);

void evaluate_algorithm(double **dataset, int row, int col, int n_folds) {
    int fold_size = (int)row / n_folds;
    double ***split = cross_validation_split(dataset, row, n_folds, fold_size, col);
    int i, j, k, l;
    int test_size = fold_size;
    int train_size = fold_size * (n_folds - 1);
    double *score = (double *)malloc(n_folds * sizeof(double));
    for (i = 0; i < n_folds; i++) {
        /* copy the folds so removing the test fold does not disturb split */
        double ***split_copy = (double ***)malloc(n_folds * sizeof(double **));
        for (j = 0; j < n_folds; j++) {
            split_copy[j] = (double **)malloc(fold_size * sizeof(double *));
            for (k = 0; k < fold_size; k++) {
                split_copy[j][k] = (double *)malloc(col * sizeof(double));
            }
        }
        for (j = 0; j < n_folds; j++) {
            for (k = 0; k < fold_size; k++) {
                for (l = 0; l < col; l++) {
                    split_copy[j][k][l] = split[j][k][l];
                }
            }
        }
        /* fold i is the test set */
        double **test_set = (double **)malloc(test_size * sizeof(double *));
        for (j = 0; j < test_size; j++) {
            test_set[j] = (double *)malloc(col * sizeof(double));
            for (k = 0; k < col; k++) {
                test_set[j][k] = split_copy[i][j][k];
            }
        }
        /* shift the remaining folds down; they form the training set */
        for (j = i; j < n_folds - 1; j++) {
            split_copy[j] = split_copy[j + 1];
        }
        double **train_set = (double **)malloc(train_size * sizeof(double *));
        for (k = 0; k < n_folds - 1; k++) {
            for (l = 0; l < fold_size; l++) {
                train_set[k * fold_size + l] = split_copy[k][l];
            }
        }
        double *predicted = get_test_prediction(train_set, train_size, test_set, test_size, col);
        double *actual = (double *)malloc(test_size * sizeof(double));
        for (l = 0; l < test_size; l++) {
            actual[l] = test_set[l][col - 1];
        }
        double acc = accuracy_metric(actual, predicted, test_size);
        score[i] = acc;
        printf("Scores[%d] = %f%%\n", i, score[i]);
        free(split_copy);
    }
    double total = 0;
    for (l = 0; l < n_folds; l++) {
        total += score[l];
    }
    printf("mean_accuracy = %f%%\n", total / n_folds);
}

3.6 main.c

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

extern int get_row(char *filename);
extern int get_col(char *filename);
extern void get_two_dimension(char *line, double **dataset, char *filename);
extern void evaluate_algorithm(double **dataset, int row, int col, int n_folds);

void quicksort(double *arr, int L, int R) {
    int i = L;
    int j = R;
    /* choose the middle element as the pivot */
    int kk = (L + R) / 2;
    double pivot = arr[kk];
    /* scan from both ends until the indices cross */
    while (i <= j) {
        /* find an element on the left that is >= the pivot */
        while (pivot > arr[i]) {
            i++;
        }
        /* find an element on the right that is <= the pivot */
        while (pivot < arr[j]) {
            j--;
        }
        /* swap the out-of-place pair (smaller value on the right,
           larger value on the left) */
        if (i <= j) {
            double temp = arr[i];
            arr[i] = arr[j];
            arr[j] = temp;
            i++;
            j--;
        }
    }
    /* after this partition pass, everything left of the pivot is smaller
       and everything right of it is larger; recurse on both halves until
       each half has a single element */
    if (L < j) {
        quicksort(arr, L, j);
    }
    if (i < R) {
        quicksort(arr, i, R);
    }
}

double get_mean(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    return mean / row;
}

double get_std(double **dataset, int row, int col) {
    int i;
    double mean = 0;
    double std = 0;
    for (i = 0; i < row; i++) {
        mean += dataset[i][col];
    }
    mean /= row;
    for (i = 0; i < row; i++) {
        std += pow(dataset[i][col] - mean, 2);
    }
    return sqrt(std / (row - 1));
}

int get_class_num(double **dataset, int row, int col) {
    int i;
    int num = 1;
    double *class_data = (double *)malloc(row * sizeof(double));
    for (i = 0; i < row; i++) {
        class_data[i] = dataset[i][col - 1];
    }
    /* sort the labels, then count the distinct values */
    quicksort(class_data, 0, row - 1);
    for (i = 0; i < row - 1; i++) {
        if (class_data[i] != class_data[i + 1]) {
            num += 1;
        }
    }
    return num;
}

int *get_class_num_list(double **dataset, int class_num, int row, int col) {
    int i, j;
    int *class_num_list = (int *)malloc(class_num * sizeof(int));
    for (j = 0; j < class_num; j++) {
        class_num_list[j] = 0;
    }
    /* count how many rows carry each class label j */
    for (j = 0; j < class_num; j++) {
        for (i = 0; i < row; i++) {
            if (dataset[i][col - 1] == j) {
                class_num_list[j] += 1;
            }
        }
    }
    return class_num_list;
}

double ***separate_by_class(double **dataset, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    double ***separated = (double ***)malloc(class_num * sizeof(double **));
    for (i = 0; i < class_num; i++) {
        separated[i] = (double **)malloc(class_num_list[i] * sizeof(double *));
    }
    int *index = (int *)malloc(class_num * sizeof(int));
    for (i = 0; i < class_num; i++) {
        index[i] = 0;
    }
    for (i = 0; i < row; i++) {
        for (j = 0; j < class_num; j++) {
            if (dataset[i][col - 1] == j) {
                separated[j][index[j]] = dataset[i];
                index[j]++;
            }
        }
    }
    free(index);
    return separated;
}

double **summarize_dataset(double **dataset, int row, int col) {
    int i;
    double **summary = (double **)malloc((col - 1) * sizeof(double *));
    for (i = 0; i < (col - 1); i++) {
        summary[i] = (double *)malloc(2 * sizeof(double));
        summary[i][0] = get_mean(dataset, row, i);
        summary[i][1] = get_std(dataset, row, i);
    }
    return summary;
}

double ***summarize_by_class(double **train, int class_num, int *class_num_list, int row, int col) {
    int i;
    double ***summarize = (double ***)malloc(class_num * sizeof(double **));
    double ***separate = separate_by_class(train, class_num, class_num_list, row, col);
    for (i = 0; i < class_num; i++) {
        summarize[i] = summarize_dataset(separate[i], class_num_list[i], col);
    }
    return summarize;
}

double calculate_probability(double x, double mean, double std)
{
    double pi = acos(-1.0);
    double p = 1 / (pow(2 * pi, 0.5) * std) * exp(-(pow(x - mean, 2) / (2 * pow(std, 2))));
    return p;
}

double *calculate_class_probabilities(double ***summaries, double *test_row, int class_num, int *class_num_list, int row, int col) {
    int i, j;
    double *probabilities = (double *)malloc(class_num * sizeof(double));
    for (i = 0; i < class_num; i++) {
        probabilities[i] = (double)class_num_list[i] / row;
    }
    for (i = 0; i < class_num; i++) {
        for (j = 0; j < col - 1; j++) {
            probabilities[i] *= calculate_probability(test_row[j], summaries[i][j][0], summaries[i][j][1]);
        }
    }
    return probabilities;
}

double predict(double ***summaries, double *test_row, int class_num, int *class_num_list, int row, int col) {
    int i;
    double *probabilities = calculate_class_probabilities(summaries, test_row, class_num, class_num_list, row, col);
    double label = 0;
    double best_prob = probabilities[0];
    for (i = 1; i < class_num; i++) {
        if (probabilities[i] > best_prob) {
            label = i;
            best_prob = probabilities[i];
        }
    }
    return label;
}

int main() {
    char filename[] = "iris.csv";
    char line[1024];
    int row = get_row(filename);
    int col = get_col(filename);
    double **dataset = (double **)malloc(row * sizeof(double *));
    for (int i = 0; i < row; ++i) {
        dataset[i] = (double *)malloc(col * sizeof(double));
    }
    get_two_dimension(line, dataset, filename);
    int n_folds = 5;
    evaluate_algorithm(dataset, row, col, n_folds);
    return 0;
}

Video walkthrough (Bilibili):

C语言机器学习 (www.bilibili.com)

Complete project files:

https://github.com/Gao-Jianxiong-SDUWH/C-machine-learning/tree/main/Naive%20Bayes
