Paper Reproduction 1: BERTScore

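BERTScore is a method for scoring the similarity between a candidate sentence and a reference sentence, based on the contextual embeddings of their tokens.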

Open question: does using a single model layer rather than several layers make a large difference to the result?

The sent_encode function uses the tokenizer to encode a sentence into token IDs.
tokenizer.encode(
    sent,
    add_special_tokens=True,
    add_prefix_space=True,
    max_length=tokenizer.model_max_length,
    truncation=True,
)

The get_tokenizer and get_model functions load the tokenizer and the model that correspond to the given model name.
tokenizer = AutoTokenizer.from_pretrained(model_type, use_fast=use_fast)
model = AutoModel.from_pretrained(model_type)

The whole BERTScore similarity computation runs without any gradient updates. The model is put in evaluation mode with model.eval().

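The greedy_cos_idf function is the key function that computes P (precision), R (recall), and F.
The paper states that the similarity score between sentences is built from the cosine similarity of their token embeddings.
The standard cosine similarity is $\frac{x_{i}^{T}x_{j}}{\left\| x_{i} \right\|\left\| x_{j} \right\|}$.
The implementation pre-normalizes the embeddings, so the similarity score reduces to the inner product $x_{i}^{T}x_{j}$.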
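As a small check of the pre-normalization argument, the sketch below compares the cosine formula with the dot product of unit-length vectors. It uses plain Python lists instead of tensors, and the helper names (`cosine`, `normalize`, `dot`) and the two example vectors are made up for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


x_i = [3.0, 4.0]
x_j = [4.0, 3.0]

direct = cosine(x_i, x_j)                              # formula with norms
pre_normalized = dot(normalize(x_i), normalize(x_j))   # normalize first, then dot

print(direct, pre_normalized)
```

Both values agree up to floating-point error, which is why the implementation can normalize once and then use a single batched matrix product.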

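Greedy matching is used to maximize the similarity score: each token is matched to the token it is most similar to in the other sentence (that is, a max is taken over the row of the similarity table that corresponds to the token).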

In the code:
Normalization: A.div_(B) divides every value in A by B, in place.

# torch.norm(ref_embedding, dim=-1) reduces the shape from b*l*d to b*l.
# unsqueeze(-1) adds a trailing dimension to the tensor, turning b*l into b*l*1.
# div_(value) divides every value in the tensor by value.
ref_embedding.div_(torch.norm(ref_embedding, dim=-1).unsqueeze(-1))
hyp_embedding.div_(torch.norm(hyp_embedding, dim=-1).unsqueeze(-1))

Computing the similarity matrix with bmm:

# torch.bmm multiplies a b*m*h tensor by a b*h*n tensor and returns b*m*n,
# so sim ends up as a b*l*l matrix (candidate length by reference length).
sim = torch.bmm(hyp_embedding, ref_embedding.transpose(1, 2))
masks = torch.bmm(hyp_masks.unsqueeze(2).float(), ref_masks.unsqueeze(1).float())
masks = masks.float().to(sim.device)
sim = sim * masks
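The masking step can be illustrated without tensors. The sketch below builds the outer product of two 0/1 padding masks and multiplies it into a made-up similarity matrix for a single sentence pair; `outer` and `apply_mask` are hypothetical helpers, not functions from the BERTScore code.

```python
def outer(hyp_mask, ref_mask):
    """Outer product of two 0/1 masks: entry (i, j) is 1 only if both tokens are real."""
    return [[h * r for r in ref_mask] for h in hyp_mask]


def apply_mask(sim, mask):
    """Element-wise product, zeroing every similarity that involves padding."""
    return [
        [s * m for s, m in zip(sim_row, mask_row)]
        for sim_row, mask_row in zip(sim, mask)
    ]


hyp_mask = [1, 1, 0]     # third candidate position is padding
ref_mask = [1, 1, 1, 0]  # fourth reference position is padding

# Rows are candidate tokens, columns are reference tokens (values are made up).
sim = [
    [0.9, 0.2, 0.1, 0.5],
    [0.3, 0.8, 0.4, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]

mask = outer(hyp_mask, ref_mask)
masked_sim = apply_mask(sim, mask)
print(masked_sim)
```

After masking, the padded row and the padded column contain only zeros, so padding positions do not win the max in the matching step unless every real similarity is negative.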

Greedy matching, following the formulas in the paper. Note that the two max operations run over two different dimensions.

word_precision = sim.max(dim=2)[0]
word_recall = sim.max(dim=1)[0]
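To make the two dimensions concrete, here is a minimal sketch for one sentence pair with plain Python lists. The similarity values are invented, and rows are assumed to index candidate tokens while columns index reference tokens, mirroring the b*l_hyp*l_ref layout of sim.

```python
# Rows index candidate (hyp) tokens, columns index reference tokens.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

# Precision side: for each candidate token, take the best match over all
# reference tokens (max along each row, like sim.max(dim=2) on the tensor).
word_precision = [max(row) for row in sim]

# Recall side: for each reference token, take the best match over all
# candidate tokens (max along each column, like sim.max(dim=1) on the tensor).
word_recall = [max(col) for col in zip(*sim)]

print(word_precision)
print(word_recall)
```

The precision vector has one entry per candidate token and the recall vector one entry per reference token, so the two generally differ in length.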

IDF weighting:

hyp_idf.div_(hyp_idf.sum(dim=1, keepdim=True))
ref_idf.div_(ref_idf.sum(dim=1, keepdim=True))
precision_scale = hyp_idf.to(word_precision.device)
recall_scale = ref_idf.to(word_recall.device)
P = (word_precision * precision_scale).sum(dim=1)
R = (word_recall * recall_scale).sum(dim=1)
F = 2 * P * R / (P + R)
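A worked example of the weighting, as a sketch for a single sentence pair: the weights are normalized to sum to 1, the per-token scores are averaged with those weights, and F is the harmonic mean of P and R. The scores and IDF weights below are invented, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(word_scores, idf_weights):
    """Normalize the weights to sum to 1, then take the weighted sum of the scores."""
    total = sum(idf_weights)
    return sum(s * w / total for s, w in zip(word_scores, idf_weights))


word_precision = [0.9, 0.8]       # one score per candidate token
word_recall = [0.9, 0.8, 0.4]     # one score per reference token
hyp_idf = [1.0, 1.0]              # made-up IDF weights for the candidate
ref_idf = [1.0, 1.0, 2.0]         # made-up IDF weights for the reference

P = weighted_score(word_precision, hyp_idf)
R = weighted_score(word_recall, ref_idf)
F = 2 * P * R / (P + R)

print(P, R, F)
```

Here the rare reference token (weight 2.0) has the weakest match, so it pulls R down more than it would under uniform weights.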

IDF computation, with the function explained in detail:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Time    : 2022/12/18 19:07
# @Author  : YOURNAME
# @FileName: get_idf_dict.py
# @Software: PyCharm
from collections import Counter, defaultdict
from functools import partial
from itertools import chain
from math import log
from multiprocessing import Pool
from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained('bert-base-chinese')
arr = ['空间站应用有效载荷安全性', '实验柜通信交换协议', '空间站有效载荷安全性、可靠性和维修性']


def process(a, tokenizer=None):
    if tokenizer is not None:
        a = tokenizer.encode(
            a,
            add_special_tokens=True,
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
    return set(a)


def get_idf_dict(arr, tokenizer, nthreads=0):
    """
    Returns mapping from word piece index to its inverse document frequency.

    Args:
        - :param: `arr` (list of str) : sentences to process.
        - :param: `tokenizer` : a BERT tokenizer corresponds to `model`.
        - :param: `nthreads` (int) : number of CPU threads to use
    """
    idf_count = Counter()
    num_docs = len(arr)

    process_partial = partial(process, tokenizer=tokenizer)

    # if nthreads > 0:
    #     with Pool(nthreads) as p:
    #         idf_count.update(chain.from_iterable(p.map(process_partial, arr)))
    # else:
    # Counter.update adds a key with its count if the key is missing,
    # and otherwise increases the existing count.
    # tokenizer.encode() handles one text at a time; it cannot process
    # a whole list of str in a single call.
    idf_count.update(chain.from_iterable(map(process_partial, arr)))

    idf_dict = defaultdict(lambda: log((num_docs + 1) / (1)))
    # Update idf_dict token by token with log((num_docs + 1) / (c + 1)).
    # In the paper the formula has a leading minus sign; moving the sign
    # inside the log inverts the fraction, which gives the form used here.
    idf_dict.update(
        {idx: log((num_docs + 1) / (c + 1)) for (idx, c) in idf_count.items()}
    )
    return idf_dict


# nthreads has no effect here because the Pool branch above is commented out.
idf = get_idf_dict(arr, tokenizer, nthreads=4)
print(idf)
print(idf.keys())
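The same smoothing formula can be tried without downloading a tokenizer. The sketch below assumes the sentences are already encoded as lists of token IDs; `toy_idf_dict` and the ID values are hypothetical and only mirror the counting logic of get_idf_dict.

```python
from collections import Counter, defaultdict
from itertools import chain
from math import log


def toy_idf_dict(docs):
    """docs: list of token-ID lists. Returns a mapping token ID -> smoothed IDF."""
    num_docs = len(docs)
    # set(doc) makes each token count at most once per document.
    doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))
    # Tokens never seen fall back to the largest value, log(num_docs + 1).
    idf = defaultdict(lambda: log((num_docs + 1) / 1))
    idf.update({tok: log((num_docs + 1) / (c + 1)) for tok, c in doc_freq.items()})
    return idf


# Three made-up "sentences"; 101 and 102 stand in for the CLS and SEP IDs.
docs = [[101, 7, 8, 102], [101, 9, 102], [101, 7, 9, 9, 102]]
idf = toy_idf_dict(docs)

print(dict(idf))
```

Tokens that occur in every document, such as the two special tokens here, get log((N + 1) / (N + 1)) = 0, while a token seen in only one document gets the largest weight among the seen tokens.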

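The earlier experiments did not use IDF at all. In the code this option defaults to False, which means IDF_dict=[0,1,1,1,1,1…0]: CLS and SEP get weight 0, and every other token gets weight 1.
To use IDF, set IDF=TRUE and verbalizer=TRUE in the arguments of score.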
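The uniform weighting used when IDF is disabled can be sketched as a dictionary that returns 1 for every token except the two special tokens. The IDs 101 and 102 for CLS and SEP, the other token IDs, and the helper name `uniform_idf_dict` are assumptions for illustration; in practice the IDs would come from the tokenizer.

```python
from collections import defaultdict

CLS_ID, SEP_ID = 101, 102  # assumed special-token IDs, for illustration only


def uniform_idf_dict(cls_id, sep_id):
    """Weight 1.0 for every token, 0.0 for the CLS and SEP tokens."""
    idf = defaultdict(lambda: 1.0)
    idf[cls_id] = 0.0
    idf[sep_id] = 0.0
    return idf


idf = uniform_idf_dict(CLS_ID, SEP_ID)

token_ids = [101, 4958, 7313, 4991, 102]  # made-up encoded sentence
weights = [idf[t] for t in token_ids]
print(weights)
```

With these weights, P and R reduce to plain averages of the per-token maxima over the non-special tokens.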

The earlier results therefore indicate that, even without IDF, the similarity scores BERTScore computes from individual tokens are reasonably good.
