基于Python的Fisher二分类判别模型实现

1. 完成形式

本Fisher二分类判别模型的代码是利用Python独立完成编写的，基本基于上课所讲内容，没有参考网上代码。

2. 实现算法思路

- 数据集选择与载入初始化
电力行业中，比较适合Fisher分类判别模型的数据集为用户画像的分类。然而电力行业由于国家管控的特殊性，导致网络上能够找到的开源的数据集过少，在Dataju平台原先有的十多个能源客户画像数据集在今年下半年也全部由于版权、客户信息保密原则等原因而下架。在搜索了一段时间之后无奈放弃选择现成的开源分类数据集。本代码的数据集采用的是Scikit-Learn包中的乳腺癌数据集。

#数据
#二分类问题
#直接从sklearn.datasets导入乳腺癌数据集
from sklearn.datasets import load_breast_cancer

以下为乳腺癌数据集中较为重要的信息
Breast cancer wisconsin (diagnostic) dataset

Data Set Characteristics:

:Number of Instances: 569:Number of Attributes: 30 numeric, predictive attributes and the class:Attribute Information:- radius (mean of distances from center to points on the perimeter)- texture (standard deviation of gray-scale values)- perimeter- area- smoothness (local variation in radius lengths)- compactness (perimeter^2 / area - 1.0)- concavity (severity of concave portions of the contour)- concave points (number of concave portions of the contour)- symmetry- fractal dimension ("coastline approximation" - 1)The mean, standard error, and "worst" or largest (mean of the threeworst/largest values) of these features were computed for each image,resulting in 30 features.  For instance, field 0 is Mean Radius, field10 is Radius SE, field 20 is Worst Radius.- class:- WDBC-Malignant- WDBC-Benign:Summary Statistics:===================================== ====== ======Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208===================================== ====== ======

研究数据集可以发现，该数据集由30个指标与一个二分类的target组成表明是否患病。由min&max表可以发现存在方差特别大的指标area，在不确定其对患病影响的情况下，需要考虑是否需要对数据集做标准化处理。
载入数据集之后，将20%的数据作为测试样本。

from sklearn.model_selection import train_test_split
x = breast_cancer['data']
y = breast_cancer['target']
#随机采样，将20%的数据作为测试样本
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.2)

标准化代码如下，在实际运行模型的时候根据最后生成结果选择是否做数据与处理。

# 标准化处理
from sklearn.preprocessing import StandardScaler
ss_x=StandardScaler()
ss_y=StandardScaler()
#分别对训练和测试数据的特征以及目标值进行标准化处理
x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)

def get_mean_vector(target):'''求均值向量:param target::return:'''m_target_list = [0 for i in range(x_train.shape[1])]count = 0for i in range(x_train.shape[0]):if y_train[i] == target:count = count + 1temp = x_train[i].tolist()m_target_list = [m_target_list[j] + temp[j] for j in range(x_train.shape[1])]m_target_list = [x / count for x in m_target_list]# 其实可以用类似torch的压缩维度的函数直接求和return m_target_list

通过target的值选择计算标签为target的均值向量。

def get_dispersion_matrix(target, mean_vector):'''求样本内离散度矩阵:param target::param mean_vector::return:'''s_target_matrix = np.zeros((x_train.shape[1], x_train.shape[1]))for i in range(x_train.shape[0]):if y_train[i] == target:temp = np.multiply(x_train[i] - mean_vector, (x_train[i] - mean_vector).transpose())s_target_matrix = s_target_matrix + tempreturn s_target_matrix

通过target和与其匹配的mean_vector计算求得样本内离散度矩阵。

def get_sample_divergence(mean_vector1, mean_vector2):'''求样本间离散度:param mean_vector1::param mean_vector2::return:'''return np.multiply((mean_vector1 - mean_vector2), (mean_vector1 - mean_vector2).transpose())

计算两个均值向量的样本间离散度。

def get_w_star(dispersion_matrix, mean_vector1, mean_vector2):'''求Fisher准则函数的w_star解:param dispersion_matrix::param mean_vector1::param mean_vector2::return:'''return np.matmul(np.linalg.inv(dispersion_matrix), (mean_vector1 - mean_vector2))

由样本内离散度矩阵和两个均值向量，根据Fisher准则逆向求解被投影向量的最优解w_star。

def get_sample_projection(w_star, x):'''求一特征向量在w_star上的投影:param w_star::param x::return:'''return np.matmul(w_star.transpose(), x)

利用求得的w_star求一特征向量在w_star上的投影值。

def get_segmentation_threshold(w_star, way_flag):'''求分割阈值:param w_star::param way_flag::return:'''if way_flag == 0:y0_list = []y1_list = []for i in range(x_train.shape[0]):if y_train[i] == 0:y0_list.append(get_sample_projection(w_star, x_train[i]))else:y1_list.append(get_sample_projection(w_star, x_train[i]))ny0 = len(y0_list)ny1 = len(y1_list)my0 = sum(y0_list) / ny0my1 = sum(y1_list) / ny1segmentation_threshold = (ny0 * my0 + ny1 * my1) / (ny0 + ny1)return  segmentation_thresholdelif way_flag == 1:y0_list = []y1_list = []for i in range(x_train.shape[0]):if y_train[i] == 0:y0_list.append(get_sample_projection(w_star, x_train[i]))else:y1_list.append(get_sample_projection(w_star, x_train[i]))ny0 = len(y0_list)ny1 = len(y1_list)my0 = sum(y0_list) / ny0my1 = sum(y1_list) / ny1py0 = ny0 / (ny0 + ny1)py1 = ny1 / (ny0 + ny1)segmentation_threshold = (my0 + my1) / 2 + math.log(py0 / py1) / (ny0 - ny1 - 2)return segmentation_thresholdelse:return 0

利用w_star投影标签内的原特征向量用来求分割阈值，该函数提供了两种分割阈值的实现方法。

def test_single_smaple(w_star, y0, test_sample, test_target):'''单例测试:param y0::param x::return:'''y_test = get_sample_projection(w_star, test_sample)predection = 1if y_test > y0:predection = 0print("This x_vector's target is {}, and the predection is {}".format(test_target, predection))

测试函数，该单例测试函数可以由用户输入一个新的特征向量，之后会给出模型的预测结果。

def test_single_smaple_check(w_star, y0, test_sample, test_target):'''单例测试（用于统计）:param y0::param x::return:'''y_test = get_sample_projection(w_star, test_sample)predection = 1if y_test > y0:predection = 0if test_target == predection:return Trueelse:return False

该单例测试函数用于统计数据使用。

def test_check(w_star, y0):'''统计测试样本:param w_star::param y0::return:'''right_count = 0for i in range(x_test.shape[0]):boolean = test_single_smaple_check(w_star, y0, x_test[i], y_test[i])if boolean == True:right_count = right_count + 1return x_test.shape[0], right_count, right_count / x_test.shape[0]

测试数据集的测试统计，通过比对预测结果与实际标签求得统计样本数，正确预测样本数和准确率。

- 算法实现

if __name__ == "__main__":m0 = np.array(get_mean_vector(0)).reshape(-1, 1)m1 = np.array(get_mean_vector(1)).reshape(-1, 1)s0 = get_dispersion_matrix(0, m0)s1 = get_dispersion_matrix(1, m1)sw = s0 + s1sb = get_sample_divergence(m0, m1)w_star = np.array(get_w_star(sw, m0, m1)).reshape(-1, 1)y0 = get_segmentation_threshold(w_star, 0)print("The segmentation_threshold is ", y0)test_sum, right_sum, accuracy = test_check(w_star, y0)print("Total specimen number:{}\nNumber of correctly predicted samples:{}\nAccuracy:{}\n".format(test_sum, right_sum, accuracy))

根据划分的数据集，利用正反标签的训练数据集分别计算均值向量，样本内离散度矩阵，样本间离散度矩阵。利用正反标签的样本内离散度矩阵求得总样本内离散度矩阵。之后利用总样本内离散度矩阵求得被投影向量。根据被投影向量将原特征向量均投影到一维线上，采用阈值割裂的方法加权修正得到分割阈值。之后利用测试样本数据集检验得到的分割阈值的质量。

- 实验结果
采用数据标准化和修正加权阈值算法结果：

采用数据标准化和普通加权阈值算法结果：

不采用数据标准化，采用修正加权阈值算法结果：

不采用数据标准化，不采用修正加权阈值算法结果：

- 实验结论分析

area这一指标对患病与否并没有什么影响，而其他指标会由于标准化而被削弱特征差异导致预测质量的下降。
不采用数据标准化，不采用修正加权阈值算法结果的情况下，模型预测准确率可达到93.8596%，超过笔者对此模型的期望。
修正加权阈值函数在样本分布均匀的情况下可能会有更好的效果，但是该样本数据集可能分布不太均匀。
添加函数：

def analysis_train_set():train_positive_count = 0train_negative_count = 0sum_count = 0for i in range(x_train.shape[0]):if y_train[i] == 0:train_negative_count = train_negative_count + 1else:train_positive_count = train_positive_count + 1sum_count = sum_count + 1print("Train Set Analysis:\nTotal number:{}\nNumber of positive samples:{}\tProportion of positive samples:{""}\nNumber of negative samples:{}\tProportion of negative samples:{}\nPositive and negative sample ratio:{""}\n".format(sum_count, train_positive_count, train_positive_count / sum_count, train_negative_count,train_negative_count / sum_count, train_positive_count / train_negative_count))

观察输出结果：
可以发现，该数据集中，正样本数据占比达到了63.7%，正负样本比率达到了1.76，远超过1.2的比率阈值。导致了修正加权阈值函数的更偏向性加权从而使得预测质量大幅下降。想要解决这个问题可能需要修改数据集或者做数据增强。此处暂且不提。

- 后续试验改进

训练测试样本集划分比例参数调整
在代码中，笔者采用了train_test_split(x,y,random_state=0,test_size=0.2)来划分训练集和测试集，为获取最好的预测准确率，调整test_size参数最终为0.326。训练结果如下：

The segmentation_threshold is 4.033086595385092e-06
Total specimen number:186
Number of correctly predicted samples:179
Accuracy:0.9623655913978495
Train Set Analysis:
Total number:383
Number of positive samples:238 Proportion of positive samples:0.6214099216710183
Number of negative samples:145 Proportion of negative samples:0.3785900783289817
Positive and negative sample ratio:1.6413793103448275