d2l.Vocab(sentences, min_freq=5, reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>']): parameters explained

d2l.Vocab(sentences, min_freq=2, reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])
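sentences: the source sentences, given as a list of token lists. As an example, let sentences be the five sentences below. Note that I manually added two '<unk>' tokens to the first sentence and '<pad>', '<mask>', '<cls>', '<sep>' to the second, in order to test the third parameter.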
[['<unk>', '<unk>', 'the', 'ottoman', 'turkish', 'empire', 'entered', 'the', 'first', 'world', 'war', 'on', 'the', 'side', 'of', 'the', 'central', 'powers', 'on', '31', 'october', '1914'], ['<pad>', '<mask>', '<cls>', '<sep>', 'the', 'stalemate', 'of', 'trench', 'warfare', 'on', 'the', 'western', 'front', 'convinced', 'the', 'british', 'imperial', 'war', 'cabinet', 'that', 'an', 'attack', 'on', 'the', 'central', 'powers', 'elsewhere', ',', 'particularly', 'turkey', ',', 'could', 'be', 'the', 'best', 'way', 'of', 'winning', 'the', 'war'], ['from', 'february', '1915', 'this', 'took', 'the', 'form', 'of', 'naval', 'operations', 'aimed', 'at', 'forcing', 'a', 'passage', 'through', 'the', 'dardanelles', ',', 'but', 'after', 'several', 'setbacks', 'it', 'was', 'decided', 'that', 'a', 'land', 'campaign', 'was', 'also', 'necessary'],['to', 'that', 'end', ',', 'the', 'mediterranean', 'expeditionary', 'force', 'was', 'formed', 'under', 'the', 'command', 'of', 'general', 'ian', 'hamilton'],['three', 'amphibious', 'landings', 'were', 'planned', 'to', 'secure', 'the', 'gallipoli', 'peninsula', ',', 'which', 'would', 'allow', 'the', 'navy', 'to', 'attack', 'the', 'turkish', 'capital', 'constantinople', ',', 'in', 'the', 'hope', 'that', 'would', 'convince', 'the', 'turks', 'to', 'ask', 'for', 'an', 'armistice', '.']
]
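min_freq: tokens that occur fewer than min_freq times (here, fewer than 2) get no entry of their own; they are all treated as the same unknown token, '<unk>'.
reserved_tokens: tokens listed here are always kept in the vocabulary as separate entries, even if they occur fewer than min_freq times in sentences (or not at all).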
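To make the min_freq rule concrete, here is a minimal sketch that only counts token frequencies with collections.Counter. It does not call d2l; the tiny corpus and the names kept and dropped are made up for illustration.

```python
from collections import Counter

# Tiny hypothetical corpus (not the five sentences above), one token list per sentence.
sentences = [
    ['the', 'war', 'on', 'the', 'front'],
    ['the', 'navy', 'on', 'the', 'war'],
]
min_freq = 2

# Flatten the sentences and count how often each token occurs.
counter = Counter(token for line in sentences for token in line)

# Tokens seen at least min_freq times would keep their own vocabulary entry.
kept = sorted(t for t, c in counter.items() if c >= min_freq)
# Rarer tokens would get no entry, so looking them up falls back to '<unk>'.
dropped = sorted(t for t, c in counter.items() if c < min_freq)

print(kept)     # ['on', 'the', 'war']
print(dropped)  # ['front', 'navy']
```

Tokens in reserved_tokens are exempt from this filter: they are added to the vocabulary whatever their count.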

A complete example:
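The sketch below runs the first two sentences from above through a small stand-in class so that it works without d2l installed. SimpleVocab is a hypothetical name; its logic is my assumption of how the parameters behave as described in this post, not the library's actual code, and the real d2l.Vocab may assign different indices.

```python
from collections import Counter

class SimpleVocab:
    """Hypothetical, simplified stand-in for d2l.Vocab (not the library code)."""

    def __init__(self, tokens, min_freq=0, reserved_tokens=None):
        reserved_tokens = reserved_tokens or []
        # Flatten the list of token lists and count frequencies.
        counter = Counter(t for line in tokens for t in line)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        # Index 0 is '<unk>'; reserved tokens follow, whatever their frequency.
        self.idx_to_token = ['<unk>'] + list(reserved_tokens)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        for token, freq in self.token_freqs:
            if freq < min_freq:
                break  # sorted by frequency, so every remaining token is rarer
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # Accept a single token or a list of tokens; unknown tokens map to 0.
        if isinstance(tokens, (list, tuple)):
            return [self[t] for t in tokens]
        return self.token_to_idx.get(tokens, 0)

# The first two sentences from the post, including the manually added special tokens.
sentences = [
    ['<unk>', '<unk>', 'the', 'ottoman', 'turkish', 'empire', 'entered', 'the',
     'first', 'world', 'war', 'on', 'the', 'side', 'of', 'the', 'central',
     'powers', 'on', '31', 'october', '1914'],
    ['<pad>', '<mask>', '<cls>', '<sep>', 'the', 'stalemate', 'of', 'trench',
     'warfare', 'on', 'the', 'western', 'front', 'convinced', 'the', 'british',
     'imperial', 'war', 'cabinet', 'that', 'an', 'attack', 'on', 'the',
     'central', 'powers', 'elsewhere', ',', 'particularly', 'turkey', ',',
     'could', 'be', 'the', 'best', 'way', 'of', 'winning', 'the', 'war'],
]

vocab = SimpleVocab(sentences, min_freq=2,
                    reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])

print(len(vocab))                                   # 12
print(vocab[['<pad>', '<mask>', '<cls>', '<sep>']])  # [1, 2, 3, 4]
print(vocab['the'])                                 # 5
print(vocab['ottoman'])                             # 0, i.e. '<unk>'
```

Each reserved token occurs only once in these sentences, which is below min_freq=2, yet all four keep their own indices. The two '<unk>' strings in the first sentence do not create a second entry, because '<unk>' already sits at index 0. Words such as 'ottoman' that appear only once are mapped to index 0.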
