Table of Contents

  • What a tokenizer is
  • Tokenizers involved in BERT
    • BasicTokenizer
    • WordpieceTokenizer
    • FullTokenizer
    • PreTrainedTokenizer
  • Relationship diagram
  • Hands-on practice
  • How training works
  • Training your own Chinese tokenizer
  • Summary
  • References

What a tokenizer is

A tokenizer is simply a word segmenter. In BERT, however, it does not work quite like Chinese word segmentation as we usually understand it. The difference is not really the matching method (BERT basically uses greedy longest-match); the biggest difference lies in how a "word" is understood and defined.

For Chinese, the basic unit is the character. For English, BERT works with subwords: for example, "unwanted" is decomposed into ["un", "##want", "##ed"]. Think carefully about the advantages of this approach; it is the essence of the tokenizer.
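As a quick illustration of the subword idea, here is a small sketch (it assumes the transformers library is installed and can download the public bert-base-uncased vocabulary); words missing from the vocabulary are broken into pieces, with "##" marking a continuation piece:

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
for word in ["want", "unwanted", "unaffable"]:
    # The exact split depends on the vocabulary: frequent words stay whole,
    # rarer words are decomposed into known subword pieces.
    print(word, "->", tok.tokenize(word))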

Tokenizers involved in BERT

BasicTokenizer

The main class is BasicTokenizer. It performs basic preprocessing: Unicode conversion, text cleaning, splitting on punctuation, lower-casing, splitting Chinese characters, and stripping accents. It returns a list of words (for Chinese, a list of characters).

def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)
    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    text = self._tokenize_chinese_chars(text)
    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
        if self.do_lower_case:
            token = token.lower()
            token = self._run_strip_accents(token)
        split_tokens.extend(self._run_split_on_punc(token))
    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

BasicTokenizer is just the preprocessing step.
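To see what BasicTokenizer alone produces, here is a small sketch (the import path matches the older transformers releases used elsewhere in this article; newer releases move the module, so treat the path as an assumption):

from transformers.tokenization_bert import BasicTokenizer

basic = BasicTokenizer(do_lower_case=True)
# Punctuation is split off, Chinese characters become single-character tokens,
# and accents are stripped during lower-casing.
print(basic.tokenize("Héllo, World! 我爱北京"))
# roughly: ['hello', ',', 'world', '!', '我', '爱', '北', '京']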

WordpieceTokenizer

The other, more important one is WordpieceTokenizer, which splits tokens into word pieces based on the vocab.

def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer`.

    Returns:
      A list of wordpiece tokens.
    """
    text = convert_to_unicode(text)
    output_tokens = []
    for token in whitespace_tokenize(text):
        chars = list(token)
        if len(chars) > self.max_input_chars_per_word:
            output_tokens.append(self.unk_token)
            continue

        is_bad = False
        start = 0
        sub_tokens = []
        while start < len(chars):
            end = len(chars)
            cur_substr = None
            # Look for a subword; if none is found, slide `end` leftward.
            # Reading the code really is the clearest way to see this!
            while start < end:
                substr = "".join(chars[start:end])
                if start > 0:
                    substr = "##" + substr
                if substr in self.vocab:
                    cur_substr = substr
                    break
                end -= 1
            if cur_substr is None:
                is_bad = True
                break
            sub_tokens.append(cur_substr)
            start = end

        if is_bad:
            output_tokens.append(self.unk_token)
        else:
            output_tokens.extend(sub_tokens)
    return output_tokens
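To isolate the greedy longest-match-first idea, here is a self-contained toy version with a made-up mini vocabulary (the real WordpieceTokenizer above additionally handles unknown tokens and the max-length check more carefully):

def toy_wordpiece(token, vocab, prefix="##"):
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        # shrink `end` until the longest substring present in the vocab is found
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = prefix + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole token becomes unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(toy_wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']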

FullTokenizer

FullTokenizer basically chains BasicTokenizer and WordpieceTokenizer to split the text. It is used for preprocessing during BERT pre-training, and it essentially exposes only a tokenize method; it has no encode_plus or similar methods.
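For reference, this is roughly how FullTokenizer is used in the original google-research/bert repository (a sketch: it assumes that repo's tokenization.py is importable and that a local vocab.txt exists, neither of which is shown in this article):

import tokenization  # tokenization.py from the google-research/bert repo

full_tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
tokens = full_tokenizer.tokenize("unaffable weather today")  # BasicTokenizer + WordpieceTokenizer
ids = full_tokenizer.convert_tokens_to_ids(tokens)           # vocab lookup
print(tokens, ids)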

PreTrainedTokenizer

PreTrainedTokenizer is the base class in the transformers library. It defines many methods (convert_ids_to_tokens and so on). The concrete BertTokenizer, GPT2Tokenizer, etc. all inherit from PreTrainedTokenizer; the relationship diagram below shows the full picture.

Relationship diagram
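The inheritance relationship can also be verified directly in code (a minimal sketch; the exact base classes vary slightly across transformers versions):

from transformers import BertTokenizer, GPT2Tokenizer, PreTrainedTokenizer

print(issubclass(BertTokenizer, PreTrainedTokenizer))   # True
print(issubclass(GPT2Tokenizer, PreTrainedTokenizer))   # True
print([cls.__name__ for cls in BertTokenizer.__mro__])  # shows the full chain up to PreTrainedTokenizer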

Hands-on practice

from transformers.tokenization_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("Vocab size:", tokenizer.vocab_size)

text = "the game has gone!unaffable  I have a new GPU!"
tokens = tokenizer.tokenize(text)
print("English tokenization:", tokens)

text = "我爱北京天安门,吢吣"
tokens = tokenizer.tokenize(text)
print("Chinese tokenization:", tokens)

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token-to-id conversion:", input_ids)

sen_code = tokenizer.encode_plus("i like  you  much", "but not him")
print("Sentence-pair encode:", sen_code)
print("Decode:", tokenizer.decode(sen_code['input_ids']))

Output:

Vocab size: 30522
English tokenization: ['the', 'game', 'has', 'gone', '!', 'una', '##ffa', '##ble', 'i', 'have', 'a', 'new', 'gp', '##u', '!']
Chinese tokenization: ['我', '[UNK]', '北', '京', '天', '安', '[UNK]', ',', '[UNK]', '[UNK]']
Token-to-id conversion: [1855, 100, 1781, 1755, 1811, 1820, 100, 1989, 100, 100]
Sentence-pair encode: {'input_ids': [101, 1045, 2066, 2017, 2172, 102, 2021, 2025, 2032, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode: [CLS] i like you much [SEP] but not him [SEP]

Reading the code or running it yourself first makes the theory much easier to absorb. Hands-on practice is the key; it is where the ideas become concrete.

Of course, you can also experiment with BertWordPieceTokenizer on its own:

from tokenizers import BertWordPieceTokenizer  # note: this class lives in the tokenizers package, not transformers

# initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    vocab_file="vocab.txt",
    unk_token="[UNK]",
    sep_token="[SEP]",
    cls_token="[CLS]",
    pad_token="[PAD]",
    mask_token="[MASK]",
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True,
    wordpieces_prefix="##",
)

# sample sentence
sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

# tokenize the sample sentence
encoded_output = tokenizer.encode(sentence)
print(encoded_output)
print(encoded_output.tokens)

How training works

Training a tokenizer is essentially the process of building the vocab.
The BPE algorithm is also easy to understand: keep merging the most frequent pair into the vocabulary. Why the most frequent? Because it covers the largest share of the corpus.

Here is a BPE example.

Initial word counts:
('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Split every word into characters:
('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)

Merge the most frequent pair, 'u'+'g' -> 'ug':
('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

After further merges ('u'+'n' -> 'un', then 'h'+'ug' -> 'hug') the vocabulary is:
['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

and the original word counts end up represented as:
('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
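A minimal from-scratch sketch of this merge loop is shown below for illustration (real BPE trainers also record the merge rules in order, handle end-of-word markers, pre-tokenization, and so on):

from collections import Counter

def bpe_merges(word_counts, num_merges):
    # represent each word as a tuple of symbols, starting from single characters
    corpus = {tuple(w): c for w, c in word_counts.items()}
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merged = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        corpus = merged
    return corpus

# reproduces the example above: after 3 merges ('ug', 'un', 'hug')
print(bpe_merges({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, num_merges=3))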

Training your own Chinese tokenizer

def train_cn_tokenizer():
    # ! pip install tokenizers
    from pathlib import Path
    from tokenizers import ByteLevelBPETokenizer

    paths = [str(x) for x in Path("zho-cn_web_2015_10K").glob("**/*.txt")]

    # Initialize a tokenizer
    tokenizer = ByteLevelBPETokenizer()

    # Customize training
    tokenizer.train(
        files=paths,
        vocab_size=52_000,
        min_frequency=3,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )

    # Save files to disk
    tokenizer.save(".", "zh-tokenizer-train")
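Once trained, the saved files can be loaded back and used right away. A sketch, assuming the save call above writes zh-tokenizer-train-vocab.json and zh-tokenizer-train-merges.txt to the current directory (the exact file names depend on the tokenizers version):

from tokenizers import ByteLevelBPETokenizer

zh_tokenizer = ByteLevelBPETokenizer(
    "zh-tokenizer-train-vocab.json",
    "zh-tokenizer-train-merges.txt",
)
encoded = zh_tokenizer.encode("我爱北京天安门")
print(encoded.tokens)
print(encoded.ids)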

I strongly recommend building a vocab tailored to your own business data, together with a matching model.
The final result:

{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":127,"¿":128,"À":129,"Á":130,"Â":131,"Ã":132,"Ä":133,"Å":134,"Æ":135,
...

Summary

  1. Combine theory with practice: write and read the code carefully to understand it deeply.
  2. The essence of a tokenizer is segmentation: extract meaningful wordpieces while keeping the vocabulary as small as possible, i.e. describe an unlimited set of combinations with as few information units as possible.
  3. Get the inheritance relationships among these few classes straight.
  4. For the finer details, keep reading the original classes.
  5. A wordpiece is a smaller unit than a word. What does that buy us? Can it solve the OOV problem? Worth thinking about again.

References

  1. https://albertauyeung.github.io/2020/06/19/bert-tokenization.html
  2. https://spacy.io/usage/spacy-101
  3. https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer
  4. https://zhuanlan.zhihu.com/p/160813500
  5. https://github.com/google/sentencepiece
  6. https://huggingface.co/transformers/tokenizer_summary.html
  7. https://huggingface.co/blog/how-to-train
