[Thinking][Simulation] Scholomance Academy: 45th ICPC Regional Contest, Shenyang Site, Problem K

Problem Description

As a student of the Scholomance Academy, you are studying a course called Machine Learning. You are currently working on your course project: training a binary classifier.

A binary classifier is an algorithm that predicts the classes of instances, which may be positive ($+$) or negative ($-$). A typical binary classifier consists of a scoring function $S$ that gives a score for every instance and a threshold $\theta$ that determines the category. Specifically, if the score of an instance $S(x) \geq \theta$, then the instance $x$ is classified as positive; otherwise, it is classified as negative. Clearly, choosing different thresholds may yield different classifiers.

Of course, a binary classifier may have misclassification: it could either classify a positive instance as negative (false negative) or classify a negative instance as positive (false positive).

Given a dataset and a classifier, we may define the true positive rate ($TPR$) and the false positive rate ($FPR$) as follows:

$$TPR = \frac{\#TP}{\#TP + \#FN}, \quad FPR = \frac{\#FP}{\#TN + \#FP}$$

where $\#TP$ is the number of true positives in the dataset; $\#FP$, $\#TN$, $\#FN$ are defined likewise.
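The two rates can be checked by hand with a few lines of C++. The following is only a minimal sketch: the names `Instance`, `Rates` and `ratesAt` are hypothetical helpers introduced here for illustration, and the function assumes the data contains at least one instance of each class.

```cpp
#include <vector>

// One labelled instance: its actual class and its predicted score.
struct Instance {
    bool positive;
    long long score;
};

// Confusion-matrix counts and the two rates for one fixed threshold.
struct Rates {
    long long tp, fp, tn, fn;
    double tpr, fpr;
};

// Classify every instance with the rule "score >= theta means positive"
// and tally the outcome against the actual class.
// Assumption: at least one positive and one negative instance exist,
// so neither denominator is zero.
Rates ratesAt(const std::vector<Instance>& data, long long theta) {
    Rates r{0, 0, 0, 0, 0.0, 0.0};
    for (const Instance& x : data) {
        bool predictedPositive = x.score >= theta;
        if (x.positive) {
            (predictedPositive ? r.tp : r.fn)++;
        } else {
            (predictedPositive ? r.fp : r.tn)++;
        }
    }
    r.tpr = static_cast<double>(r.tp) / (r.tp + r.fn);
    r.fpr = static_cast<double>(r.fp) / (r.tn + r.fp);
    return r;
}
```

On the second sample with threshold 5, the positives 6 and 7 and the negative 5 are classified as positive, so this sketch reports 2 true positives, 1 false negative, 1 false positive and 2 true negatives.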

Now you have trained a scoring function, and you want to evaluate the performance of your classifier. The classifier may exhibit different TPR and FPR if we change the threshold $\theta$. Let $TPR(\theta)$ and $FPR(\theta)$ be the TPR and FPR when the threshold is $\theta$, and define the area under curve (AUC) as
$$AUC = \int_{0}^{1} \max_{\theta \in \mathbb{R}} \{TPR(\theta) \mid FPR(\theta) \leq r\} \, dr$$
where the integrand, called the receiver operating characteristic (ROC), is the maximum possible TPR given that $FPR \leq r$.
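To make the definition concrete, here is a deliberately naive sketch that evaluates it directly. It relies on one observation: with $N$ negative instances, FPR only takes the values $k/N$, so the integrand is constant on each interval $[k/N, (k+1)/N)$ and the integral becomes a finite sum. The names `Instance`, `maxTruePositives` and `aucBruteForce` are hypothetical, and the cubic running time makes this suitable only for checking tiny inputs by hand, not for $n$ up to $10^6$.

```cpp
#include <algorithm>
#include <vector>

struct Instance {
    bool positive;
    long long score;
};

// Largest number of true positives over all thresholds whose number of
// false positives does not exceed fpBudget.
long long maxTruePositives(const std::vector<Instance>& data, long long fpBudget) {
    // A threshold above every score yields 0 TP and 0 FP, hence best starts at 0.
    long long best = 0;
    // Only thresholds equal to an occurring score can change the counts.
    for (const Instance& cand : data) {
        long long tp = 0, fp = 0;
        for (const Instance& x : data) {
            if (x.score >= cand.score) {
                (x.positive ? tp : fp)++;
            }
        }
        if (fp <= fpBudget) best = std::max(best, tp);
    }
    return best;
}

// AUC as a sum of N rectangles of width 1/N and height maxTP(k)/P.
// Assumption: both classes are present, so P and N are non-zero.
double aucBruteForce(const std::vector<Instance>& data) {
    long long P = 0, N = 0;
    for (const Instance& x : data) (x.positive ? P : N)++;
    long long area = 0;
    for (long long k = 0; k < N; k++) area += maxTruePositives(data, k);
    return static_cast<double>(area) / (static_cast<double>(P) * N);
}
```

For the first sample the only non-zero rectangle is the one for $k = 1$, giving $1 / (1 \cdot 2) = 0.5$; for the second sample the rectangle heights are 2, 3, 3, giving $8/9$.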

Given the actual classes and predicted scores of the instances in a dataset, can you compute the AUC of your classifier?

For example, consider the third test data. If we set threshold $\theta = 30$, there are 3 true positives, 2 false positives, 2 true negatives, and 1 false negative; hence, $TPR(30) = 0.75$ and $FPR(30) = 0.5$. Also, as $\theta$ varies, we may plot the ROC curve and compute the AUC accordingly, as shown in Figure 1.

Input Description:

The first line contains a single integer $n$ ($2 \leq n \leq 10^6$), the number of instances in the dataset. Then follow $n$ lines, each line containing a character $c \in \{+, -\}$ and an integer $s$ ($1 \leq s \leq 10^9$), denoting the actual class and the predicted score of an instance. It is guaranteed that there is at least one instance of either class.

Output Description:

Print the AUC of your classifier within an absolute error of no more than $10^{-9}$.

Sample 1

Input

3
+ 2
- 3
- 1

Output

0.5

Sample 2

Input

6
+ 7
- 2
- 5
+ 4
- 2
+ 6

Output

0.888888888888889

Sample 3

Input

Output

0.5625

Explanation

ROC and AUC of the third sample data.

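Problem summary: The statement is extremely long and really tests one's patience. There is a classifier that labels each instance + or - according to a chosen threshold θ: an instance whose score is greater than or equal to θ is labelled +, and one whose score is less than θ is labelled -. We are given the scores of n instances together with their actual classes. Let FPR be the number of actually negative instances that the classifier labels + divided by the number of actually negative instances, and let TPR be the number of actually positive instances that the classifier labels + divided by the number of actually positive instances. Clearly FPR and TPR are functions of θ. Letting θ range over all real numbers yields a set of pairs (FPR(θ), TPR(θ)), that is, a scatter of points in the plane whose axes are FPR and TPR. Define f(r) as the maximum TPR among the points whose FPR is at most r, and compute the integral of f over [0, 1].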

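Analysis: Every point of the scatter with a non-zero coordinate is obtained by setting θ to one of the given scores (a threshold above the largest score only adds the point (0, 0)), so enumerating the scores yields all the points on the plot. From the definition, f is a piecewise function whose pieces are horizontal line segments, i.e. a step function, and f is non-decreasing. Computing the integral therefore amounts to summing rectangle areas: loop over the breakpoints in increasing order of FPR and accumulate width times height.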
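The rectangle sum can also be read as a pair count. This is an equivalent view derived from the step function, not the approach used in the code of this post: the rectangle contributed by one negative instance has height equal to the number of positives scoring strictly higher than it, because admitting that negative as a false positive forces θ at or below its score. Hence AUC equals the number of (positive, negative) pairs with the positive score strictly larger, divided by (#positives × #negatives). A minimal sketch, with hypothetical names `countOrderedPairs` and `aucByPairs`:

```cpp
#include <algorithm>
#include <vector>

// Number of (positive, negative) pairs in which the positive instance
// has a strictly larger score. Ties contribute nothing, matching the
// ">=" classification rule of the statement.
long long countOrderedPairs(std::vector<long long> pos,
                            const std::vector<long long>& neg) {
    std::sort(pos.begin(), pos.end());
    long long pairs = 0;
    for (long long s : neg) {
        // upper_bound finds the first positive score greater than s.
        pairs += pos.end() - std::upper_bound(pos.begin(), pos.end(), s);
    }
    return pairs;
}

// Assumption: both vectors are non-empty, as the statement guarantees.
double aucByPairs(const std::vector<long long>& pos,
                  const std::vector<long long>& neg) {
    return static_cast<double>(countOrderedPairs(pos, neg)) /
           (static_cast<double>(pos.size()) * neg.size());
}
```

With at most $10^6$ instances the pair count stays below $10^{12}$, so a 64-bit integer is enough; the sort plus binary searches cost O(n log n). On the two samples with visible input this gives 1/2 and 8/9.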

The code is as follows:

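#include <algorithm>
#include <cstdio>
#include <map>
using namespace std;

typedef long long ll;
const int N = 1e6 + 10;

ll pos[N], neg[N];  // scores of the positive / negative instances
map<ll, ll> best;   // best[#FP] = largest #TP seen for that #FP

int main()
{
    int n;
    if (scanf("%d", &n) != 1) return 0;
    int cntPos = 0, cntNeg = 0;
    char c[2];
    for (int i = 0; i < n; i++) {
        ll s;
        scanf("%1s%lld", c, &s);
        if (c[0] == '+') pos[cntPos++] = s;
        else neg[cntNeg++] = s;
    }
    sort(pos, pos + cntPos);
    sort(neg, neg + cntNeg);
    // The statement guarantees both classes exist; guard against division by zero anyway.
    if (cntPos == 0 || cntNeg == 0) {
        printf("%.9Lf\n", 0.0L);
        return 0;
    }
    // Use every score as a threshold: fp / tp count the instances with score >= threshold.
    for (int i = 0; i < cntNeg; i++) {
        ll fp = cntNeg - (lower_bound(neg, neg + cntNeg, neg[i]) - neg);
        ll tp = cntPos - (lower_bound(pos, pos + cntPos, neg[i]) - pos);
        best[fp] = max(best[fp], tp);
    }
    for (int i = 0; i < cntPos; i++) {
        ll fp = cntNeg - (lower_bound(neg, neg + cntNeg, pos[i]) - neg);
        ll tp = cntPos - (lower_bound(pos, pos + cntPos, pos[i]) - pos);
        best[fp] = max(best[fp], tp);
    }
    // Sum the rectangles of the step function from left to right.
    // best[0] is created with value 0 if no threshold gives zero false positives.
    ll xl = 0, y = best[0], area = 0;
    for (map<ll, ll>::iterator it = best.begin(); it != best.end(); it++) {
        ll xr = it->first;
        area += (xr - xl) * y;
        y = it->second;
        xl = xr;
    }
    printf("%.9Lf\n", (long double)area / cntPos / cntNeg);
    return 0;
}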