Language modeling tutorial in torchtext

import torchtext
from torchtext import data
import spacy
from spacy.symbols import ORTH
# my_tok = spacy.load('en_core_web_lg')
# def spacy_tok(x):
#     return [tok.text for tok in my_tok.tokenizer(x)]
spacy_tok = lambda x: x.split() # whitespace fallback, used because the spaCy model failed to download
TEXT = data.Field(lower=True, tokenize=spacy_tok)

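If we want to split "don't" into "do" and "n't", we need to add a special case to the spaCy tokenizer (this requires the my_tok object from the commented-out lines above):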

my_tok.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])
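When the spaCy model is unavailable and the whitespace fallback is in use, the same idea can be approximated in plain Python. The sketch below is only an illustration: SPECIAL_CASES and whitespace_tok_with_special_cases are hypothetical names, not part of spaCy or torchtext, and the logic handles exact token matches only.

```python
# Hypothetical helper: mimics a tokenizer special case on top of str.split().
# Not part of spaCy or torchtext; matches whole whitespace-separated tokens only.
SPECIAL_CASES = {"don't": ["do", "n't"]}

def whitespace_tok_with_special_cases(text, special_cases=SPECIAL_CASES):
    """Split on whitespace, then expand any token listed in special_cases."""
    tokens = []
    for tok in text.split():
        # Tokens without a special case are kept unchanged.
        tokens.extend(special_cases.get(tok, [tok]))
    return tokens

example_tokens = whitespace_tok_with_special_cases("i don't know")
print(example_tokens)  # ['i', 'do', "n't", 'know']
```

A function like this could be passed as the tokenize argument of data.Field in place of the plain lambda, since the Field only needs a callable that maps a string to a list of tokens.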

Download the WikiText2 dataset

from torchtext.datasets import WikiText2
train, valid, test = WikiText2.splits(TEXT) # loading custom datasets requires passing in the field, but nothing else.

The entire corpus is stored as a single example, so:

len(train) # 1

Next, build the vocabulary. This time we use precomputed embeddings: GloVe vectors (200 dimensions). Many other pretrained embeddings are also available.

TEXT.build_vocab(train, vectors="glove.6B.200d")
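To make the vocabulary step more concrete, here is a rough pure-Python sketch of what building a vocabulary and numericalizing tokens involves. It is not torchtext's implementation: build_vocab_sketch and numericalize are hypothetical helpers, the special tokens and the tie-breaking order are assumptions, and attaching the GloVe vectors is left out entirely.

```python
from collections import Counter

def build_vocab_sketch(tokens, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counts = Counter(tokens)
    # Most frequent tokens first; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending unknown tokens to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

itos, stoi = build_vocab_sketch("the cat sat on the mat".split())
print(itos)                                # ['<unk>', '<pad>', 'the', 'cat', 'mat', 'on', 'sat']
print(numericalize(["the", "dog"], stoi))  # [2, 0]
```

Passing vectors="glove.6B.200d" additionally associates each vocabulary entry with a pretrained 200-dimensional vector, which this sketch does not attempt to reproduce.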

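Use BPTTIterator to build the offset sequences. Note that the first dimension is the sequence position and the second is the batch; the target is the text shifted (offset) by one token.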

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=32,
    bptt_len=30, # this is where we specify the sequence length
    device=0,
    repeat=False)

b = next(iter(train_iter)); vars(b).keys()
dict_keys(['batch_size', 'dataset', 'train', 'text', 'target'])

>>> b.text[:5, :3]
Variable containing:
     9    953      0
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2

>>> b.target[:5, :3]
Variable containing:
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2
  3892      3  10780
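The offset between text and target can be illustrated without torchtext. The sketch below is a simplified approximation of BPTT-style batching on a flat list of token ids: bptt_batches is a hypothetical helper, it uses nested lists instead of tensors, and it simply drops trailing tokens that do not fill a column, which is an assumption rather than the library's exact behavior.

```python
def bptt_batches(ids, batch_size, bptt_len):
    """Yield (text, target) pairs laid out as [sequence position][batch column]."""
    # Assumption: tokens that do not fill a whole column are dropped.
    n_rows = len(ids) // batch_size
    # Each column holds one contiguous slice of the corpus.
    columns = [ids[j * n_rows:(j + 1) * n_rows] for j in range(batch_size)]
    # Transpose so the first dimension is the position in the sequence.
    rows = [[col[t] for col in columns] for t in range(n_rows)]
    for i in range(0, n_rows - 1, bptt_len):
        seq_len = min(bptt_len, n_rows - 1 - i)
        text = rows[i:i + seq_len]
        # The target is the same window shifted forward by one token.
        target = rows[i + 1:i + 1 + seq_len]
        yield text, target

batches = list(bptt_batches(list(range(12)), batch_size=2, bptt_len=3))
first_text, first_target = batches[0]
print(first_text)    # [[0, 6], [1, 7], [2, 8]]
print(first_target)  # [[1, 7], [2, 8], [3, 9]]
```

As in the output above, each row of the target equals the next row of the text, so the model at every position is trained to predict the token that follows.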

Reference:
http://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/
