An analysis of sklearn's roc_curve() function

Reposted from: https://blog.csdn.net/Titan0427/article/details/79356290. This copy is kept only as a personal study note; copyright belongs to the original author.

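When using sklearn's roc_curve() function, I found that the returned result was not quite what I expected. In theory the thresholds should run over every value in y_score (the model's predicted scores), but roc_curve() returned only a subset of the thresholds. The reason turned out to be in the source code.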

Initial data:

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989, 0.38769987193780364, 0.3667541015524296, 0.39760831479768338, 0.42017521636505745, 0.41936155918127238, 0.33803961944475219, 0.33998332945141224]

Compute the false positive rate, the true positive rate, and the corresponding thresholds with sklearn's roc_curve function:

from sklearn.metrics import roc_curve

fpr_skl, tpr_skl, thresholds_skl = roc_curve(y_true, y_score, drop_intermediate=False)

The computed values are:

fpr_skl
[ 0.          0.14285714  0.14285714  0.14285714  0.28571429  0.42857143  0.57142857  0.71428571  0.85714286  1.        ]
tpr_skl
[ 0.33333333  0.33333333  0.66666667  1.          1.          1.          1.          1.          1.          1.        ]
thresholds_skl
[ 0.42600527  0.42017522  0.41936156  0.39760831  0.38769987  0.3667541   0.33998333  0.33803962  0.32367439  0.3168962 ]

The roc_curve() function

Let's walk through the roc_curve() code to see how these three values are computed. It is essentially the routine computation used for AUC.

It starts with the _binary_clf_curve() function:

    fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

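fps and tps are the FP and TP counts of the confusion matrix at each threshold; thresholds is y_score sorted in descending order (the values only look different from the printed output because fewer decimal places are shown there; they are the same). In this example the values are: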

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
thresholds = [0.42600526758001989, 0.42017521636505745, 0.41936155918127238, 0.39760831479768338, 0.38769987193780364, 0.3667541015524296, 0.33998332945141224, 0.33803961944475219, 0.32367439192936548, 0.31689620142873609]

To make this easier to follow, here is a more direct way to compute fps and tps:

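for threshold in thresholds:
    # Predict 1 if the score is >= the threshold, otherwise 0
    y_prob = [1 if i >= threshold else 0 for i in y_score]
    # Whether each prediction is correct
    result = [i == j for i, j in zip(y_true, y_prob)]
    # Whether each sample is predicted as the positive class
    positive = [i == 1 for i in y_prob]
    tp = [i and j for i, j in zip(result, positive)]  # predicted positive and correct
    fp = [(not i) and j for i, j in zip(result, positive)]  # predicted positive and wrong
    # Print FP first and TP second, in the same order as fps and tps
    print(fp.count(True), tp.count(True))

# Output
0 1
1 1
1 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3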
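The loop above rescans every score once per threshold. The same counts can be obtained in a single pass by sorting the samples by score in descending order and accumulating the labels. The sketch below is plain Python written for this example, not sklearn's actual code: the helper name cumulative_counts is hypothetical, and it assumes all scores are distinct (as they are here) and that there are no sample weights.

```python
from itertools import accumulate

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
           0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
           0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
           0.33998332945141224]


def cumulative_counts(y_true, y_score):
    """Return (fps, tps, thresholds), highest threshold first.

    Hypothetical helper; assumes distinct scores and 0/1 labels.
    """
    # Sample indices ordered by score, largest score first
    order = sorted(range(len(y_score)), key=lambda k: y_score[k], reverse=True)
    labels = [y_true[k] for k in order]
    thresholds = [y_score[k] for k in order]
    # Positives seen so far = TP when the threshold sits at this score
    tps = list(accumulate(labels))
    # Negatives seen so far = FP when the threshold sits at this score
    fps = list(accumulate(1 - label for label in labels))
    return fps, tps, thresholds


fps, tps, thresholds = cumulative_counts(y_true, y_score)
print(fps)  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
print(tps)  # [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```

Sorting once and accumulating avoids the repeated scan over y_score, and it gives the same fps and tps lists as listed earlier for this example.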

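From fps and tps, the corresponding fpr and tpr follow directly. Index -1 corresponds to the smallest threshold, at which every sample is classified as positive; accordingly, fps[-1] is the total number of negative samples and tps[-1] is the total number of positive samples. The relevant source code, simplified, is: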

fpr = [i/fps[-1] for i in fps] # fps / fps[-1]
tpr = [i/tps[-1] for i in tps] # tps / tps[-1]
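Applied to the fps and tps of this example, the two divisions give the rates below. This is a small pure-Python check rather than sklearn's code; the names n_neg and n_pos are introduced here for readability, and the rounding is only for display.

```python
fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]

# At the lowest threshold everything is predicted positive, so the last
# entries are the total numbers of negatives and positives.
n_neg = fps[-1]
n_pos = tps[-1]

fpr = [fp / n_neg for fp in fps]
tpr = [tp / n_pos for tp in tps]

print([round(x, 4) for x in fpr])
print([round(x, 4) for x in tpr])
```

The first tpr value is 1/3 rather than 0 because the highest threshold already admits the top-scoring sample, which is a positive.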

The drop_intermediate parameter

roc_curve() takes a drop_intermediate parameter. The corresponding source code is:

if drop_intermediate and len(fps) > 2:
    optimal_idxs = np.where(np.r_[True,
                                  np.logical_or(np.diff(fps, 2),
                                                np.diff(tps, 2)),
                                  True])[0]
    fps = fps[optimal_idxs]
    tps = tps[optimal_idxs]
    thresholds = thresholds[optimal_idxs]

In this example, the intermediate values are:

# Second-order differences
np.diff(fps, 2)
[-1  0  1  0  0  0  0  0]
np.diff(tps, 2)
[ 1  0 -1  0  0  0  0  0]
# Element-wise OR
np.logical_or(np.diff(fps, 2), np.diff(tps, 2))
[ True, False,  True, False, False, False, False, False]
# Add a True at the head and at the tail
np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True]
[ True,  True, False,  True, False, False, False, False, False,  True]
# Array indices of the True entries
np.where(np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]
[0, 1, 3, 9]
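To make the second-difference step concrete without NumPy, the sketch below recomputes the same indices in plain Python. The helper names second_diff and corner_indices are made up for this illustration and are not part of sklearn; the code assumes the lists have more than two entries, matching the len(fps) > 2 guard in the source.

```python
fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]


def second_diff(values):
    """Second-order difference of a list, like np.diff(values, 2) on 1-D input."""
    return [values[k + 2] - 2 * values[k + 1] + values[k]
            for k in range(len(values) - 2)]


def corner_indices(fps, tps):
    """Keep both endpoints plus every interior point where the second
    difference of fps or tps is nonzero (hypothetical helper)."""
    changed = [a != 0 or b != 0
               for a, b in zip(second_diff(fps), second_diff(tps))]
    mask = [True] + changed + [True]
    return [k for k, keep in enumerate(mask) if keep]


idxs = corner_indices(fps, tps)
print(idxs)                    # [0, 1, 3, 9]
print([fps[k] for k in idxs])  # [0, 1, 1, 7]
print([tps[k] for k in idxs])  # [1, 1, 3, 3]
```

Entry k of the second difference is centered on point k + 1, which is why padding one True on each side lines the mask up with the original indices.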

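optimal_idxs is in effect the set of corner points of the ROC plot, together with its two endpoints; for drawing the plot, the corner points are all that is needed. If you picture fps and tps as a person's displacement across the plot, the first-order difference is the "velocity" and the second-order difference is the "acceleration". A point where the second difference of either coordinate is nonzero is a point where the path changes direction.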

The "ROC plot" is drawn as follows:

import matplotlib.pyplot as plt

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
plt.plot(fps, tps, 'b')
plt.xlim([-1, 8])
plt.ylim([-1, 8])
plt.ylabel('tps')
plt.xlabel('fps')
plt.show()

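Therefore, the drop_intermediate parameter is in effect an optimization of the ROC computation: it discards points that lie on a straight segment between two corners, so the ROC plot itself is unchanged.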
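One way to check this claim on the example is to compare the area under the piecewise-linear curve before and after dropping the intermediate points. The sketch below uses the trapezoidal rule in plain Python; the helper trapezoid_area is hypothetical, and the kept indices [0, 1, 3, 9] are the optimal_idxs derived earlier for this data.

```python
fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
kept = [0, 1, 3, 9]  # optimal_idxs for this example


def trapezoid_area(xs, ys):
    """Area under the polyline through (xs[k], ys[k]) by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


fpr = [fp / fps[-1] for fp in fps]
tpr = [tp / tps[-1] for tp in tps]

full_area = trapezoid_area(fpr, tpr)
reduced_area = trapezoid_area([fpr[k] for k in kept],
                              [tpr[k] for k in kept])
print(full_area, reduced_area)  # both are 19/21 up to floating-point rounding
```

The two areas agree because every dropped point lies on the straight segment between its neighbours, so removing it does not change the polyline.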