[Memo] transformers tokenizer.tokenize and tokenizer.encode
from transformers import BertTokenizer, BertModel
# load the tokenizer of the Chinese whole-word-masking BERT checkpoint
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm')
text = '在此基础上,美国试图挑拨伊朗和伊拉克关系。'
tokenizer_out = tokenizer.tokenize(text)
print(tokenizer_out)
# ['在', '此', '基', '础', '上', ',', '美', '国', '试', '图', '挑', '拨', '伊', '朗', '和', '伊', '拉', '克', '关', '系', '。']
tokenize only splits the text into tokens; it does not add [CLS] and [SEP] at the start and end of the sentence.
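If you want ids for these tokens without the special tokens, convert_tokens_to_ids (a standard transformers tokenizer method) does the mapping; the ids in the comment are the encode() output below with the leading 101 and trailing 102 removed:
ids = tokenizer.convert_tokens_to_ids(tokenizer_out)
print(ids)
# [1762, 3634, 1825, 4794, 677, 8024, 5401, 1744, 6407, 1745, 2904, 2884, 823, 3306, 1469, 823, 2861, 1046, 1068, 5143, 511]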
tokenizer_encode = tokenizer.encode(text)
print(tokenizer_encode)
# [101, 1762, 3634, 1825, 4794, 677, 8024, 5401, 1744, 6407, 1745, 2904, 2884, 823, 3306, 1469, 823, 2861, 1046, 1068, 5143, 511, 102]
encode tokenizes first, then maps each token to its id using BERT's bundled vocab.txt, and additionally wraps the sequence in [CLS] and [SEP], whose ids in the vocabulary are 101 and 102 respectively. The result is exactly the input_ids.
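Both special-token ids can be checked directly on the tokenizer (standard transformers attributes):
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102
print(tokenizer.convert_ids_to_tokens([101, 102]))     # ['[CLS]', '[SEP]']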
data = tokenizer(text)
print(data)
{'input_ids': [101, 1762, 3634, 1825, 4794, 677, 8024, 5401, 1744, 6407, 1745, 2904, 2884, 823, 3306, 1469, 823, 2861, 1046, 1068, 5143, 511, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Calling the tokenizer directly returns a dictionary containing input_ids, token_type_ids, and attention_mask.
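This dictionary is exactly what the model's forward pass expects. A minimal sketch of running it through BertModel (return_tensors='pt' turns each field into a 1 x seq_len tensor; the 23 in the expected shape is just the length of the input_ids above):
import torch

model = BertModel.from_pretrained('hfl/chinese-bert-wwm')
inputs = tokenizer(text, return_tensors='pt')  # same three fields, now as tensors
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 23, 768])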
BERT structure
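The dump below is simply PyTorch's printout of the loaded model; since all 12 encoder layers are identical, only layer (0) is shown expanded:
print(model)  # model loaded in the snippet above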
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1)-(11): 11 more BertLayer blocks, identical to (0)
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)