How to compute filtered MRR for the knowledge graph entity prediction task

What is filtered MRR?

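First, what is MRR? MRR stands for Mean Reciprocal Rank. In entity prediction, the model is given a head (or tail) entity and a relation and must predict the missing tail (or head) entity. Because the task is hard and the number of candidate entities is large, accuracy is a poor measure of model quality: it awards one point when the correct entity receives the highest score and zero otherwise. MRR is used instead because it is a smoother metric.
When computing MRR, a sample earns 1 point if the target entity has the highest score, 1/2 if it has the second highest, 1/3 if it has the third highest, and so on, that is, 1/rank. The mean over all samples is the overall MRR.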
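As a minimal sketch of the scoring rule above (the function name `mean_reciprocal_rank` and the sample ranks are made up for illustration), here is MRR next to plain accuracy on three queries:

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries; ranks are 1-based."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct entity for three queries: 1st, 2nd, 4th.
example_ranks = [1, 2, 4]

mrr = mean_reciprocal_rank(example_ranks)
# Accuracy only rewards rank 1, so near misses count for nothing.
accuracy = sum(r == 1 for r in example_ranks) / len(example_ranks)

print(f"MRR = {mrr:.4f}, accuracy = {accuracy:.4f}")
```

The second and third queries still contribute 1/2 and 1/4 to MRR, while accuracy treats them the same as a complete miss.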
So what is filtered MRR? The paper "Interaction Embeddings for Prediction and Explanation in Knowledge Graphs" puts it this way:

We also apply filter and raw settings. In filter setting, we filter all candidate triples in train, test or validation datasets before ranking, as they are not negative triples.

What does "candidate triples" mean here? Let's see what "Embedding Entities and Relations for Learning and Inference in Knowledge Bases" says:

we design a new evaluation setting where the predicted entities are automatically filtered according to “entity types” (entities that appear as the subjects/objects of a relation have the same type defined by that relation).

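So filtered MRR addresses a particular situation in knowledge graphs: for a given entity and relation, several entities can be correct answers. For example, if the head entity is "me" and the relation is "is good at", the tail entity could be any of several subjects or activities. Suppose I am good at Chinese, math, and English, and the label for the current sample is Chinese. When computing MRR for that sample, the scores for math and English must be removed from the set of entity scores before ranking.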
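The example above can be sketched in a few lines. The scores, the extra entities "physics" and "music", and the helper `rank_of_target` are all invented for illustration; ties are resolved in the target's favour, which is only one possible convention.

```python
def rank_of_target(scores, target, known_true=()):
    """1-based rank of `target` by descending score.

    Entities listed in `known_true` (other than the target itself) are
    ignored, which corresponds to the filtered setting. With an empty
    `known_true` this is the raw rank.
    """
    excluded = set(known_true) - {target}
    target_score = scores[target]
    better = sum(
        1 for entity, score in scores.items()
        if entity not in excluded and score > target_score
    )
    return better + 1


# Hypothetical model scores for the query (me, is good at, ?).
scores = {"math": 0.9, "English": 0.8, "Chinese": 0.7, "physics": 0.4, "music": 0.1}
true_tails = {"Chinese", "math", "English"}

raw_rank = rank_of_target(scores, "Chinese")
filtered_rank = rank_of_target(scores, "Chinese", known_true=true_tails)
print(raw_rank, filtered_rank)
```

In the raw setting the model is penalized because math and English outscore Chinese, even though both are correct answers. Filtering them out leaves Chinese in first place.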

SACN implementation

The following snippet, from the calc_mrr function, computes the filtered MRR for predicting the tail entity given a head entity and a relation.

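import numpy as np
import torch
from tqdm import tqdm

# Excerpt: all_triplets, test_triplets, embedding, w, num_entity, ranks_s
# and sort_and_rank come from the surrounding code.

# Extract the (head entity, relation) pair of every triple
head_relation_triplets = all_triplets[:, :2]
for test_triplet in tqdm(test_triplets):
    # test_triplet has shape (3,)
    # Split it into entities and relation
    subject = test_triplet[0]
    relation = test_triplet[1]
    object_ = test_triplet[2]
    # The (head entity, relation) pair of the target triple
    subject_relation = test_triplet[:2]
    # The row sum is between 0 and 2. A value of 2 marks a triple to be filtered,
    # because both its head entity and its relation equal those of the target triple
    delete_index = torch.sum(head_relation_triplets == subject_relation, dim=1)
    # nonzero returns the indices of all non-zero elements,
    # here the indices of all triples whose sum is 2
    delete_index = torch.nonzero(delete_index == 2).squeeze()
    # Collect the tail entities of those triples
    delete_entity_index = all_triplets[delete_index, 2].view(-1).numpy()
    # Remove those tail entities from the set of all entities
    perturb_entity_index = np.array(list(set(np.arange(num_entity)) - set(delete_entity_index)))
    perturb_entity_index = torch.from_numpy(perturb_entity_index)
    # Append the target entity again so that it can be scored
    perturb_entity_index = torch.cat((perturb_entity_index, object_.view(-1)))
    emb_ar = embedding[subject] * w[relation]
    emb_ar = emb_ar.view(-1, 1, 1)
    emb_c = embedding[perturb_entity_index]
    emb_c = emb_c.transpose(0, 1).unsqueeze(1)
    out_prod = torch.bmm(emb_ar, emb_c)
    score = torch.sum(out_prod, dim=0)
    score = torch.sigmoid(score)
    # The target was appended after the deletion, so it is always the last element
    target = torch.tensor(len(perturb_entity_index) - 1)
    ranks_s.append(sort_and_rank(score, target))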

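The SACN version matches the intuitive reading of the definition: from the full set of candidates, remove the triples that share the target's head entity and relation, then rank as usual.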

CompGCN implementation

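import torch

# Excerpt from a method: self, split and train_iter come from the surrounding class.
results = {}
for step, batch in enumerate(train_iter):
    sub, rel, obj, label = self.read_batch(batch, split)
    # [128, 14541]
    pred = self.model.forward(sub, rel, False)
    # [0, ..., 127]
    b_range = torch.arange(pred.size()[0], device=self.device)
    # Shape [128]: the predicted score of the correct entity for each sample
    target_pred = pred[b_range, obj]
    # torch.where(condition, x, y) -> Tensor
    # Returns a tensor whose elements are taken from x where the condition holds and from y elsewhere.
    # Effect: wherever label is 1, the score is set to -10000000, so every known true entity
    # other than the current obj gets a very low score and cannot rank near the top.
    # This removes them indirectly.
    # Note: recent PyTorch releases expect a boolean condition, so label.bool() may be needed here.
    pred = torch.where(label.byte(), -torch.ones_like(pred) * 10000000, pred)
    # Restore the score of obj
    pred[b_range, obj] = target_pred
    # argsort returns indices, not ranks.
    # pr = torch.argsort(pred, dim=1, descending=True) returns, for each row, the entity ids
    # ordered from highest to lowest score;
    # pr[i, 0] = 123 means entity 123 is ranked first for sample i.
    # torch.argsort(pr, dim=1, descending=False) then gives, for each entity id,
    # its position in that ordering, i.e. its 0-based rank.
    # ranks has shape [128]: the rank of the target entity for each sample
    ranks = 1 + torch.argsort(
        torch.argsort(pred, dim=1, descending=True),
        dim=1,
        descending=False,
    )[b_range, obj]
    ranks = ranks.float()
    results['mrr'] = torch.sum(1.0 / ranks).item() + results.get('mrr', 0.0)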
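The double argsort in the last step can be hard to read. The following pure-Python sketch reproduces the idea on a single row of scores. The helper `argsort` and the sample scores are illustrative stand-ins for `torch.argsort` and `pred`, not code from CompGCN.

```python
def argsort(values, descending=False):
    """Indices that would sort `values`, for a single 1-D list."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


# Hypothetical scores for entities 0..3.
scores = [0.2, 0.9, 0.5, 0.7]

# First argsort: entity ids ordered from best to worst score.
order = argsort(scores, descending=True)

# Second argsort: for each entity id, its position inside `order`.
positions = argsort(order)

# Positions are 0-based, so add 1 to obtain ranks.
ranks = [p + 1 for p in positions]

print(order)
print(ranks)
```

Here `order` answers "which entity sits at each position", while `ranks` answers "which position does each entity occupy". Indexing `ranks` with the target entity id is the single-row analogue of the `[b_range, obj]` lookup.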

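Personally I find this implementation more elegant: instead of deleting entities explicitly, it sets the scores of the other known true candidate triples to a very low value so that they cannot rank ahead of the target, which saves some computation.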
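To check that the two strategies agree, here is a simplified sketch that ranks one query both ways. The functions `rank_by_deletion` and `rank_by_masking` are hypothetical scalar paraphrases of the two approaches, not code from either repository, and the scores are invented. The sketch assumes no tied scores.

```python
def rank_by_deletion(scores, target, known_true):
    """Drop the other known answers, then rank the target among what is left."""
    kept = [e for e in range(len(scores)) if e not in known_true or e == target]
    return 1 + sum(1 for e in kept if scores[e] > scores[target])


def rank_by_masking(scores, target, known_true, low=-1e7):
    """Keep every entity but push the other known answers to a very low score."""
    masked = [
        low if (e in known_true and e != target) else s
        for e, s in enumerate(scores)
    ]
    order = sorted(range(len(masked)), key=lambda e: masked[e], reverse=True)
    return 1 + order.index(target)


# Hypothetical scores for entities 0..4; entities 0, 1 and 2 are all true answers.
scores = [0.7, 0.9, 0.8, 0.75, 0.1]
known_true = {0, 1, 2}

rank_pairs = [
    (rank_by_deletion(scores, t, known_true), rank_by_masking(scores, t, known_true))
    for t in sorted(known_true)
]
print(rank_pairs)
```

Both functions return the same rank for every target. The masking variant keeps the score vector at a fixed length, which is what makes it easy to apply to a whole batch at once.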