TensorFlow: Evaluating Classification Models

tf.keras.metrics.AUC

Commonly used in TensorFlow 2.x.

tf.keras.metrics.AUC(
num_thresholds=200, curve='ROC',
summation_method='interpolation', name=None, dtype=None,
thresholds=None, multi_label=False, num_labels=None, label_weights=None,
from_logits=False
)

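This metric creates four local variables, true_positives, true_negatives, false_positives and false_negatives, that are used to compute the AUC. To discretize the AUC curve, a linearly spaced set of thresholds is used to compute pairs of recall and precision values. The area under the ROC curve is therefore computed using the height of the recall values by the false positive rate, while the area under the PR curve is computed using the height of the precision values by the recall.

The value is ultimately returned as auc, an idempotent operation that computes the area under a discretized curve of precision versus recall values (computed using the variables above).

The num_thresholds argument controls the degree of discretization: a larger number of thresholds approximates the true AUC more closely. The quality of the approximation may vary dramatically depending on num_thresholds.

The thresholds argument can be used to specify thresholds manually so that they split the predictions more evenly.

If sample_weight is None, weights default to 1. Use a sample_weight of 0 to mask values.

For best results, predictions should be distributed approximately uniformly over the range [0, 1] and not peaked around 0 or 1. If this is not the case, the quality of the AUC approximation may be poor. Setting summation_method to 'minoring' or 'majoring' helps quantify the error in the approximation by providing a lower or upper bound estimate of the AUC.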
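The thresholding idea above can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the library's implementation: the function name approx_roc_auc and the sample data are made up for illustration, and the "interpolation" branch is assumed to behave like a simple trapezoidal rule on the ROC curve.

```python
def approx_roc_auc(labels, scores, num_thresholds=200, method="interpolation"):
    """Approximate ROC AUC from evenly spaced thresholds (illustrative sketch only)."""
    eps = 1e-7
    # Evenly spaced thresholds; the end points sit just outside [0, 1]
    thresholds = ([-eps]
                  + [i / (num_thresholds - 1) for i in range(1, num_thresholds - 1)]
                  + [1.0 + eps])
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > t)
        points.append((fp / neg, tp / pos))  # (false positive rate, recall)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        width = x0 - x1  # the false positive rate shrinks as the threshold rises
        if method == "minoring":
            height = min(y0, y1)      # lower bound
        elif method == "majoring":
            height = max(y0, y1)      # upper bound
        else:
            height = (y0 + y1) / 2.0  # trapezoidal rule
        area += width * height
    return area


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fine = approx_roc_auc(labels, scores)                      # 200 thresholds
low = approx_roc_auc(labels, scores, 3, "minoring")        # coarse lower bound
high = approx_roc_auc(labels, scores, 3, "majoring")       # coarse upper bound
print(fine, low, high)
```

With only three thresholds the lower and upper bounds are far apart, which is the sense in which 'minoring' and 'majoring' quantify the approximation error.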

[tf.keras.metrics.AUC]

[https://runebook.dev/zh-CN/docs/tensorflow/keras/metrics/auc]

tf.metrics.auc

Commonly used in TensorFlow 1.x.

tf.metrics.auc(
labels, predictions, weights=None, num_thresholds=200, metrics_collections=None,
updates_collections=None, curve='ROC', name=None,
summation_method='trapezoidal', thresholds=None
)

[Module: tf.keras.metrics]

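When using tf.estimator, calling the Estimator's evaluate method passes mode = ModeKeys.EVAL to model_fn. In that case the model function must return a tf.estimator.EstimatorSpec containing the model's loss and, optionally, one or more metrics. Although returning metrics is optional, most custom Estimators return at least one. TensorFlow provides the tf.metrics module for computing common metrics.

Some commonly used metrics

These may apply only to binary classification.

The documentation states that both labels and predictions are cast to bool, so these functions only cover binary classification. It may be possible to one-hot encode the examples and have them work, but this is unconfirmed. [Tensorflow中多类分类的类精度和召回率？]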

accuracy(...): Calculates how often predictions matches labels.

The accuracy function creates two local variables, total and count that are used to compute the frequency with which predictions matches labels. This frequency is ultimately returned as accuracy: an idempotent operation that simply divides total by count.

auc(...): Computes the approximate AUC via a Riemann sum.

average_precision_at_k(...): Computes average precision@k of predictions with respect to sparse labels.

precision(...): Computes the precision of the predictions with respect to the labels. By contrast, tf.metrics.accuracy compares the predictions with the true values, i.e. the labels supplied by the input function, and requires labels and predictions to have the same shape.

precision_at_k(...): Computes precision@k of the predictions with respect to sparse labels.

recall(...): Computes the recall of the predictions with respect to the labels.

recall_at_k(...): Computes recall@k of the predictions with respect to sparse labels.

[Module: tf.metrics]

[评估]

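Initialization

These functions all create local variables, so initialize them with sess.run(tf.local_variables_initializer()) rather than tf.global_variables_initializer(). Skipping initialization can raise an error such as: Attempting to use uninitialized value total_confusion_matrix.

Arguments

1 If the output is a sequence of labels (as in an NER model), a mask is generally required. [Tensorflow：tensor变换]

2 For classification models:

2.1 When computing precision and recall, pred_ids must be in one-hot form, e.g.

labels = [[0, 1, 0],
[1, 0, 0],
[0, 0, 1]]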

[tensorflow – 如何正确使用tf.metrics.accuracy？]

[Tensorflow踩坑记之tf.metrics]

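note:

1 The pred_ids being compared must not be logits with negative values, otherwise an error is raised: [`predictions` contains negative values] # [Condition x >= 0 did not hold element-wise:] [x (Reshape_2:0) = ] [0 -6 3...].

2 If you insist on a non-one-hot form, be careful with the argmax axis. If the axis is omitted or set to 0, an input of shape (batch_size, num_labels) yields an output of shape (num_labels,) instead of the intended (batch_size,). Generally no error is raised when num_labels > batch_size, while num_labels < batch_size raises “(batch_size, num_labels) tf_metircs [`labels` out of bound] [Condition x < y did not hold element-wise:]”. Either way the result is wrong.

2.2 Computing acc and auc (the mechanism behind auc is unclear to me) needs no such conversion; pass the inputs directly.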
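The argmax-axis pitfall from note 2 is easy to see on a tiny example. The sketch below uses plain Python lists rather than tensors; argmax_axis is a hypothetical helper written for this illustration, not a TensorFlow function.

```python
def argmax_axis(matrix, axis):
    """Index of the largest value along the given axis of a 2-D list (sketch)."""
    if axis == 1:
        # one result per row, i.e. per example
        return [row.index(max(row)) for row in matrix]
    # axis 0: one result per column, i.e. per class
    return [col.index(max(col)) for col in zip(*matrix)]


# batch_size = 2, num_labels = 3
logits = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3]]

per_example = argmax_axis(logits, 1)  # length batch_size: the intended class ids
per_class = argmax_axis(logits, 0)    # length num_labels: the wrong thing to feed a metric
print(per_example, per_class)
```

Only the axis=1 result has one class id per example, which is what the metric functions expect as non-one-hot predictions.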

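Multi-class test

When computing precision and recall, pred_ids must be in one-hot form, e.g.

labels = [[0, 1, 0],
[1, 0, 0],
[0, 0, 1]]

Extensive testing shows that the computation is in effect a micro average, i.e. precision = recall = acc. The built-in functions applied to one-hot inputs therefore match the multi-class metrics described below with average='micro', and give the same value as tf.metrics.accuracy(labels=tf.argmax(labels, 1), predictions=tf.argmax(pred_ids, 1)).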
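Why the three numbers coincide can be checked by hand: when every row is one-hot, each wrong prediction produces exactly one false positive and one false negative. The following plain-Python sketch counts them on the data later used in Example 3; it only mirrors the arithmetic and does not call TensorFlow.

```python
label_ids = [[0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
pred_ids = [[0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

# Flatten to element-wise (label, prediction) pairs, as a binary metric would see them
pairs = [(y, p) for y_row, p_row in zip(label_ids, pred_ids)
         for y, p in zip(y_row, p_row)]
tp = sum(1 for y, p in pairs if y == 1 and p == 1)
fp = sum(1 for y, p in pairs if y == 0 and p == 1)
fn = sum(1 for y, p in pairs if y == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy on class indices (argmax of each one-hot row)
label_cls = [row.index(max(row)) for row in label_ids]
pred_cls = [row.index(max(row)) for row in pred_ids]
accuracy = sum(1 for y, p in zip(label_cls, pred_cls) if y == p) / len(label_cls)

print(precision, recall, accuracy)
```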

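Return values

Taking the return values of accuracy as an example:

accuracy: A Tensor representing the accuracy, the value of total divided by count. Evaluating accuracy does not update the metric with new inputs; it only returns the value computed from the two local variables (see Example 1).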
update_op: An operation that increments the total and count variables appropriately and whose value matches accuracy.
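The split between the value tensor and update_op can be mimicked in a few lines of plain Python. This is only a sketch of the semantics; StreamingAccuracy is a made-up class, not part of TensorFlow.

```python
class StreamingAccuracy:
    """Toy model of a streaming metric backed by two local variables."""

    def __init__(self):
        self.total = 0.0  # number of correct predictions seen so far
        self.count = 0.0  # number of predictions seen so far

    def value(self):
        # Plays the role of the `accuracy` tensor: reads state, never changes it
        return self.total / self.count if self.count > 0 else 0.0

    def update(self, labels, predictions):
        # Plays the role of `update_op`: accumulates, then returns the new value
        self.total += sum(1.0 for y, p in zip(labels, predictions) if y == p)
        self.count += float(len(labels))
        return self.value()


metric = StreamingAccuracy()
before = metric.value()                           # nothing accumulated yet
after_update = metric.update([3, 1, 5], [3, 2, 5])
after = metric.value()                            # same as the value returned by update
print(before, after_update, after)
```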

Multi-class metrics for Tensorflow: tf_metrics

precision(labels, predictions, num_classes, pos_indices=None, weights=None, average='micro'):
Args:
labels : Tensor of tf.int32 or tf.int64
The true labels, given as non-one-hot class indices of shape (batch,).
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.

recall(labels, predictions, num_classes, pos_indices=None, weights=None, average='micro')
f1(labels, predictions, num_classes, pos_indices=None, weights=None, average='micro')
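To make the three average modes concrete, here is a plain-Python sketch of precision under 'micro', 'macro' and 'weighted' averaging, following the descriptions in the docstring above. precision_sketch and its helpers are hypothetical names for this illustration and are not the tf_metrics implementation; treating an undefined per-class precision as 0 is an assumption.

```python
def safe_div(a, b):
    # Assumption: a class with no predictions contributes a precision of 0
    return a / b if b else 0.0


def per_class_counts(labels, predictions, num_classes):
    counts = []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        counts.append((tp, fp, fn))
    return counts


def precision_sketch(labels, predictions, num_classes, average="micro"):
    counts = per_class_counts(labels, predictions, num_classes)
    if average == "micro":
        # Pool the counts of all classes, then divide once
        return safe_div(sum(tp for tp, _, _ in counts),
                        sum(tp + fp for tp, fp, _ in counts))
    per_class = [safe_div(tp, tp + fp) for tp, fp, _ in counts]
    if average == "macro":
        # Unweighted mean over classes; ignores class imbalance
        return sum(per_class) / num_classes
    # 'weighted': mean weighted by the number of true labels of each class
    support = [tp + fn for tp, _, fn in counts]
    return safe_div(sum(p * s for p, s in zip(per_class, support)), sum(support))


labels = [3, 2, 0, 1, 1]
predictions = [3, 1, 0, 0, 0]
micro = precision_sketch(labels, predictions, 4, "micro")
macro = precision_sketch(labels, predictions, 4, "macro")
weighted = precision_sketch(labels, predictions, 4, "weighted")
print(micro, macro, weighted)
```

On this data the micro value equals plain accuracy, while the macro value is pulled down by the two classes that are never predicted correctly.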

If the input is in one-hot form, convert it to class indices first:

acc, acc_op = tf.metrics.accuracy(labels=tf.argmax(labels, 1), predictions=tf.argmax(logits, 1))

Examples

Example 1

import tensorflow as tf

label_ids = tf.constant([[3, 1, 5]])
pred_ids = tf.constant([[3, 2, 5]])
acc, acc_op = tf.metrics.accuracy(label_ids, pred_ids)
stream_vars = [i for i in tf.local_variables()]
print(stream_vars)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    print('[total, count]:', sess.run(stream_vars))
    print(acc.eval())  # returns the value from the two local variables only (not yet updated, so 0)
    print(acc_op.eval())
    print('[total, count]:', sess.run(stream_vars))
    print(acc.eval())  # returns the value from the two local variables only (now updated, so non-zero)

[<tf.Variable 'accuracy/total:0' shape=() dtype=float32_ref>, <tf.Variable 'accuracy/count:0' shape=() dtype=float32_ref>]

[total, count]: [0.0, 0.0]
0.0
0.6666667
[total, count]: [2.0, 3.0]
0.6666667

[tensorflow – 如何正确使用tf.metrics.accuracy？]

[深入理解TensorFlow中的tf.metrics算子]

[Tensorflow踩坑记之tf.metrics]

Example 2

# Compute evaluation metrics.
acc, acc_op = tf.metrics.accuracy(labels=tf.argmax(labels, 1), predictions=tf.argmax(logits,1))

Example 3: multi-class

import tensorflow as tf
import tf_metrics

label_ids = tf.constant([[0, 0, 0, 1],[0, 0, 1, 0],[1, 0, 0, 0],[0, 1, 0, 0],[0, 1, 0, 0]])
pred_ids = tf.constant([[0, 0, 0, 1],[0, 1, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0],[1, 0, 0, 0]])
num_labels = label_ids.shape[1]
label_arg_ids = tf.argmax(label_ids, 1)
pred_arg_ids = tf.argmax(pred_ids, 1)
# _, tp_op = tf.metrics.true_positives(label_ids, pred_ids)
# _, fp_op = tf.metrics.false_positives(label_ids, pred_ids)
_, acc_op = tf.metrics.precision(label_ids, pred_ids)
_, acc_op1 = tf.metrics.accuracy(label_arg_ids, pred_arg_ids)
_, pre_op = tf.metrics.precision(label_ids, pred_ids)
# _, pre_op1 = tf.metrics.precision(label_arg_ids, pred_arg_ids)
_, rec_op = tf.metrics.recall(label_ids, pred_ids)
# _, rec_op1 = tf.metrics.recall(label_arg_ids, pred_arg_ids)
# _, pre_op_ = tf_metrics.precision(label_ids, pred_ids, num_labels)
_, pre_op1_ = tf_metrics.precision(label_arg_ids, pred_arg_ids, num_labels, average='macro')
# _, rec_op_ = tf_metrics.recall(label_ids, pred_ids, num_labels)
_, rec_op1_ = tf_metrics.recall(label_arg_ids, pred_arg_ids, num_labels, average='macro')
_, f1_op1_ = tf_metrics.f1(label_arg_ids, pred_arg_ids, num_labels, average='macro')
stream_vars = [i for i in tf.local_variables()]
print(stream_vars)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    print(label_arg_ids.eval())
    print(pred_arg_ids.eval())
    # print(tp_op.eval())  # 2
    # print(fp_op.eval())  # 3
    print('acc_op:', acc_op.eval())
    print('acc_op1:', acc_op1.eval())
    print('pre_op:', pre_op.eval())
    # print('pre_op1:', pre_op1.eval())  # 1.0
    print('rec_op:', rec_op.eval())
    # print('rec_op1:', rec_op1.eval())  # 0.5
    # print(pre_op_.eval())  # 0.7
    print('pre_op1_:', pre_op1_.eval())
    # print(rec_op_.eval())  # 0.7
    print('rec_op1_:', rec_op1_.eval())
    print('f1_op1_:', f1_op1_.eval())

[3 2 0 1 1]
[3 1 0 0 0]
2.0
3.0
acc_op: 0.4
acc_op1: 0.4
pre_op: 0.4
pre_op1: 1.0
rec_op: 0.4
rec_op1: 0.5
0.7
pre_op1_: 0.33333334
0.7
rec_op1_: 0.5
f1_op1_: 0.375

average_precision_at_k example

In later TF versions, tf.metrics.average_precision_at_k replaces tf.metrics.sparse_average_precision_at_k.

import tensorflow as tf

y_true = tf.constant([[2], [1], [0], [3], [0]])
y_true = tf.cast(y_true, tf.int64)

y_pred = tf.constant([[0.1, 0.2, 0.6, 0.1],
                      [0.8, 0.05, 0.1, 0.05],
                      [0.3, 0.4, 0.1, 0.2],
                      [0.6, 0.25, 0.1, 0.05],
                      [0.1, 0.2, 0.6, 0.1]
                      ])

# m_ap is the update op; running it accumulates the batch and returns the metric
_, m_ap = tf.metrics.average_precision_at_k(y_true, y_pred, 3)
stream_vars = [i for i in tf.local_variables()]
tmp_rank = tf.nn.top_k(y_pred, 3)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    print("TF_MAP", sess.run(m_ap))
    print("STREAM_VARS", sess.run(stream_vars))
    print("TMP_RANK", sess.run(tmp_rank))

Output

TF_MAP 0.4333333333333333
STREAM_VARS [5.0, 2.1666666666666665]
TMP_RANK TopKV2(values=array([[0.6 , 0.2 , 0.1 ],
[0.8 , 0.1 , 0.05],
[0.4 , 0.3 , 0.2 ],
[0.6 , 0.25, 0.1 ],
[0.6 , 0.2 , 0.1 ]], dtype=float32),
indices=array([[2, 1, 0],
[0, 2, 1],
[1, 0, 3],
[0, 1, 2],
[2, 1, 0]], dtype=int32))

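The computation works as follows: the first label, 2, hits the top-1 position of [2 1 0], giving 1; the second label, 1, hits [0 2 1] at position 3, giving 1/3; similarly the third gives 1/2; the fourth is not found in its top 3, giving 0; the fifth gives 1/3. Averaging: (1 + 1/3 + 1/2 + 0 + 1/3)/5 = 13/30 ≈ 0.433.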
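The arithmetic above can be reproduced in plain Python. This sketch assumes exactly one true label per row, as in the example, so the average precision of a row reduces to 1/rank of the label within the top k; the helper names are made up and this is not TensorFlow's general implementation.

```python
def top_k_indices(scores, k):
    # Indices of the k largest scores; the stable sort keeps the lower index on ties
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def average_precision_single_label(label, scores, k):
    ranked = top_k_indices(scores, k)
    if label in ranked:
        return 1.0 / (ranked.index(label) + 1)  # 1 / rank of the hit
    return 0.0                                  # label not in the top k


y_true = [2, 1, 0, 3, 0]
y_pred = [[0.1, 0.2, 0.6, 0.1],
          [0.8, 0.05, 0.1, 0.05],
          [0.3, 0.4, 0.1, 0.2],
          [0.6, 0.25, 0.1, 0.05],
          [0.1, 0.2, 0.6, 0.1]]

per_row = [average_precision_single_label(y, s, 3) for y, s in zip(y_true, y_pred)]
mean_ap = sum(per_row) / len(per_row)
print(per_row, mean_ap)
```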

[搜索排序评估方法]

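precision_at_k example

Replacing average_precision_at_k with precision_at_k in the code above changes the logic: each hit counts as 1/k regardless of its position. The first label hits [2 1 0], giving 1/k = 1/3; the second label, 1, also hits, giving 1/3; similarly the third gives 1/3; the fourth misses, giving 0; the fifth gives 1/3. Averaging: (1/3 + 1/3 + 1/3 + 0 + 1/3)/5 = 4/15 ≈ 0.2667.

Output: TF_MAP 0.26666666666666666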
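For comparison, the same data can be pushed through a plain-Python precision@k. As before, this is only a sketch with one true label per row, and precision_at_k_sketch is a hypothetical helper rather than the TensorFlow function.

```python
def precision_at_k_sketch(label, scores, k):
    # Top-k class indices by descending score (stable on ties)
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    hits = sum(1 for i in ranked if i == label)
    return hits / k  # position inside the top k does not matter


y_true = [2, 1, 0, 3, 0]
y_pred = [[0.1, 0.2, 0.6, 0.1],
          [0.8, 0.05, 0.1, 0.05],
          [0.3, 0.4, 0.1, 0.2],
          [0.6, 0.25, 0.1, 0.05],
          [0.1, 0.2, 0.6, 0.1]]

per_row_p = [precision_at_k_sketch(y, s, 3) for y, s in zip(y_true, y_pred)]
mean_p = sum(per_row_p) / len(per_row_p)
print(per_row_p, mean_p)
```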

[Tensorflow踩坑记之tf.metrics]

-柚子皮-

Other methods and examples

Computing the accuracy of softmax outputs

import os

# Set before importing TensorFlow so that the log level takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf


def evaluation(sess, outputs, labels):
    # True where the label is the top-1 prediction
    correct = tf.nn.in_top_k(outputs, labels, 1)
    print(sess.run(correct))
    # Number of correct predictions
    return tf.reduce_sum(tf.cast(correct, tf.int32))


with tf.Graph().as_default():
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = evaluation(sess, [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.7, 0.1, 0.2]], [0, 1, 2])
    print(sess.run(a))
