Contents

torchtext preprocessing workflow
Defining the Field
Defining the Dataset
Building the vocab
Constructing the iterator
references

import torch.nn as nn
import torch.nn.functional as F
import torch
import torchtext
# torchtext.legacy is only available in torchtext 0.9 through 0.11
from torchtext.legacy.data import Field, NestedField, LabelField, TabularDataset, BucketIterator
# from torchtext import data
from torchtext import datasets
from torchtext import vocab
import spacy
import pkuseg

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

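torchtext preprocessing workflow

Define the Field: declare how the data should be processed.
- Field defines the processing requirements, i.e. the rules applied to raw text data.
Define the Dataset: load the dataset. At this point every sample is a word list that has already gone through the preprocessing declared by its Field.
- TabularDataset is the counterpart of PyTorch's Dataset. It has a number of text-processing operations built in and then stores the data. Compared with a hand-written Dataset, these built-in features save time on routine NLP text processing, although it is sometimes less flexible than a custom Dataset.
Build the vocab: build the vocabulary and the word vectors (word embeddings).
- Field().build_vocab()
Construct the iterator: build an iterator so the model can be trained in batches.
- BucketIterator is the counterpart of PyTorch's DataLoader and supports batched training.

Now for the actual code. The toy data looks like this: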

train_demo.jsonl

{"idx": "1", "text": "东5环海棠公社230-290平2居准现房98折优惠", "label": 0}
{"idx": "2", "text": "海淀区领秀新硅谷宽景大宅预计10月底开盘", "label": 0}
{"idx": "3", "text": "柴志坤：土地供应量不断从紧 地价难现07水平(图)", "label": 0}
{"idx": "4", "text": "融景城2010年中期将推2居80-90平16000起(图)", "label": 0}
{"idx": "5", "text": "燕京航城形象代言人正式签约李小璐", "label": 0}
{"idx": "6", "text": "朝阳苹果派10-70平地下室均价13000元", "label": 0}
{"idx": "7", "text": "密云檀府家园125平米3居在售均价7千全款98折", "label": 0}
{"idx": "8", "text": "昌平准现房湾流汇推出双拼及独栋别墅16500元起", "label": 0}
{"idx": "9", "text": "东5环海棠公社170平3居26000元别墅600万起", "label": 0}
{"idx": "10", "text": "华侨城46亿开建武汉巴登城", "label": 0}
{"idx": "11", "text": "金证顾问：过山车行情意味着什么", "label": 1}
{"idx": "12", "text": "美股评论：科技行业财测消失为哪般", "label": 1}
{"idx": "13", "text": "美国保级三缺一：看标普脸色", "label": 1}
{"idx": "14", "text": "三新股今日上市 赣锋锂业大涨194.2%居首", "label": 1}
{"idx": "15", "text": "欧洲三大股市上升 英国股市升0.08%", "label": 1}
{"idx": "16", "text": "港交所推出中移动等5股票期权标准组合交易", "label": 1}
{"idx": "17", "text": "申银万国：上调准备金符合预期 对股市影响不大", "label": 1}
{"idx": "18", "text": "奥巴马将考虑签署短期调高借债上限协议", "label": 1}
{"idx": "19", "text": "世界经济靠什么避免大灾难", "label": 1}
{"idx": "20", "text": "国都香港：预测光汇石油中报盈利增长82%", "label": 1}
{"idx": "21", "text": "中华女子学院：本科层次仅1专业招男生", "label": 2}
{"idx": "22", "text": "北京考生需注意：英语考试14：45禁入场", "label": 2}
{"idx": "23", "text": "教育部通报一起涉嫌虚假招生宣传案件", "label": 2}
{"idx": "24", "text": "北京9场校园高招咨询会周末举行", "label": 2}
{"idx": "25", "text": "2010年高考文综试题(浙江卷)", "label": 2}
{"idx": "26", "text": "广西2010年成考报名时间及办法", "label": 2}
{"idx": "27", "text": "四六级19日开考 有学生为得高分多次报考", "label": 2}
{"idx": "28", "text": "2011年成考期间网上监控有害信息", "label": 2}
{"idx": "29", "text": "贴近生活 计算房价折扣成了中考试题", "label": 2}
{"idx": "30", "text": "自考故事：不走寻常路 自考毕业1年当老板", "label": 2}
{"idx": "31", "text": "大学生应征忙 征兵要求素质逐步提高(图)", "label": 2}
{"idx": "32", "text": "两天价网站背后重重迷雾：做个网站究竟要多少钱", "label": 3}
{"idx": "33", "text": "出色卡片！佳能IXUS 115竟不到1400元", "label": 3}
{"idx": "34", "text": "山东旅游局联手携程邀网友答问题赢礼品", "label": 3}
{"idx": "35", "text": "多款Windows Phone7手机现身台北电脑展", "label": 3}
{"idx": "36", "text": "儿童摄影团购遭遇“选片圈套”", "label": 3}
{"idx": "37", "text": "卖点一览无余 超低价热销平板电视导购", "label": 3}
{"idx": "38", "text": "17英寸游戏强本 华硕X71独显现售7795", "label": 3}
{"idx": "39", "text": "德国眼镜熊患皮疹全身脱毛变裸体熊(组图)", "label": 3}
{"idx": "40", "text": "细节显品质 热门全能型实用本本推荐", "label": 3}
{"idx": "41", "text": "前程无忧周四大跌19.97%创52周新低", "label": 3}
{"idx": "42", "text": "82岁老太为学生做饭扫地44年获授港大荣誉院士", "label": 4}
{"idx": "43", "text": "妻子照顾植物人丈夫6年教会其吃饭(图)", "label": 4}
{"idx": "44", "text": "饭店因物价上涨无法定菜价 订年夜饭改收餐位费", "label": 4}
{"idx": "45", "text": "少年患进行性肌肉萎缩11年考上大学(图)", "label": 4}
{"idx": "46", "text": "百余名英国母亲参加快闪活动当众喂奶(图)", "label": 4}
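Each line of train_demo.jsonl is a standalone JSON object (the JSON Lines layout). As a rough illustration of what any loader has to do with such a file, here is a minimal standard-library sketch. The helper `read_jsonl` and the in-memory sample are hypothetical names for this example and are not part of torchtext.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

# Two records copied from the demo file, held in memory instead of on disk
sample = io.StringIO(
    '{"idx": "10", "text": "华侨城46亿开建武汉巴登城", "label": 0}\n'
    '{"idx": "19", "text": "世界经济靠什么避免大灾难", "label": 1}\n'
)
records = read_jsonl(sample)
for record in records:
    print(record["idx"], record["label"], record["text"])
```

Because there is no header row in this layout, every line is a data record.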

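Defining the Field

Torchtext takes a declarative approach to loading data: you tell torchtext what you want the data to look like, and torchtext handles the rest.

This declaration is expressed through a Field, which specifies how you want a piece of data to be processed:

data.Field(…)

Commonly used parameters:

tokenize: a function that turns a text str into a list of tokens.
lower: whether to lowercase the text in this field. Default: False.
include_lengths: whether to return a tuple of the padded minibatch and a list containing the length of each example, or just the padded minibatch. Default: False.
fix_length: a fixed length that all examples using this field are padded to (longer examples are truncated), or None for flexible sequence lengths. Default: None.
tokenizer_language: the language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
use_vocab: whether to use a Vocab object. Set it for fields that hold text to be tokenized; it is needed later to map tokens to word vectors.
sequential: whether the field holds sequential data. Set it for fields that hold text to be tokenized.
By default a Field expects its input to be a sequence of words and maps each word to an integer; this mapping is called the vocab. If a field is already numeric and does not need to be tokenized, pass use_vocab=False and sequential=False.
pad_token: the string token used as padding. Default: "<pad>".
unk_token: the string token used to represent OOV words. Default: "<unk>".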
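To make fix_length, pad_token, and include_lengths more concrete, the following is a minimal pure-Python sketch of the padding step. It is an illustration under assumptions, not torchtext's implementation: `pad_batch` is a hypothetical helper, and the token lists are made up for the example.

```python
def pad_batch(batch, fix_length=None, pad_token="<pad>"):
    """Pad (or truncate) tokenized examples to a common length and report true lengths."""
    max_len = fix_length if fix_length is not None else max(len(x) for x in batch)
    padded, lengths = [], []
    for tokens in batch:
        kept = tokens[:max_len]          # truncate examples that are too long
        lengths.append(len(kept))        # length before padding
        padded.append(kept + [pad_token] * (max_len - len(kept)))
    return padded, lengths

# Hypothetical tokenized examples
batch = [["华侨城", "46亿", "开建"], ["世界", "经济"]]
padded, lengths = pad_batch(batch, fix_length=4)
print(padded)
print(lengths)
```

Returning the lengths alongside the padded batch is what include_lengths=True corresponds to; downstream code can use them to ignore the padded positions.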

# Define the tokenizer function, using pkuseg from Peking University
seg = pkuseg.pkuseg()

def tokenizer(text):
    return seg.cut(text)

# word_field = Field(tokenize='spacy', lower=True, include_lengths=True, fix_length=32, tokenizer_language='en_core_web_sm')
# tokenizer_language only matters for tokenize='spacy'; a custom callable is used as-is
word_field = Field(tokenize=tokenizer, lower=True, include_lengths=True, fix_length=32, tokenizer_language='en_core_web_sm')
label_field = Field(sequential=False, use_vocab=False)

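Defining the Dataset

A Field knows what to do when it is given raw data. Next we need to tell each Field which data it should process. This is what Datasets are for.

The official description of TabularDataset: Defines a Dataset of columns stored in CSV, TSV, or JSON format.

# Map each JSON key to (attribute name on the Example, Field that processes it)
fields = {
    'text': ('text_word', word_field),
    'label': ('label', label_field)
}
# Note: a JSON Lines file has no header row, so skip_header=True drops the first record
# (train[5] below is the record with idx 7). Use skip_header=False to keep every record.
train, val = TabularDataset.splits(path='./data/THUCNews', train='train_demo.jsonl', validation='val_demo.jsonl', format='json', skip_header=True, fields=fields)
# train, val = TabularDataset.splits(path='./data/THUCNews', train='train_demo.tsv', validation='val_demo.tsv', format='tsv', skip_header=True, fields=fields)

print(train[5])
print(train[5].__dict__.keys())
print(train[5].text_word, train[5].label)
print()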

<torchtext.legacy.data.example.Example object at 0x7fe0581e6340>
dict_keys(['text_word', 'label'])
['密云', '檀府', '家园', '125', '平米', '3', '居', '在', '售均价', '7千', '全款', '98', '折'] 0
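The dict_keys output shows that the Example exposes text_word and label, the attribute names given in the fields mapping, and not the raw JSON keys. A rough sketch of that mapping idea in plain Python follows. `make_example` here is a hypothetical stand-in written for this illustration, and a character-level split replaces the pkuseg tokenizer so the snippet has no third-party dependencies.

```python
import json
from types import SimpleNamespace

def make_example(line, fields):
    """Build an object whose attributes follow a {json_key: (attr_name, preprocess)} mapping."""
    raw = json.loads(line)
    example = SimpleNamespace()
    for key, (attr_name, preprocess) in fields.items():
        setattr(example, attr_name, preprocess(raw[key]))
    return example

# Character-level split stands in for the real tokenizer; labels pass through unchanged
fields = {
    "text": ("text_word", list),
    "label": ("label", lambda value: value),
}
line = '{"idx": "10", "text": "华侨城46亿开建武汉巴登城", "label": 0}'
example = make_example(line, fields)
print(vars(example).keys())
print(example.text_word, example.label)
```

Keys that are absent from the mapping, such as idx, are simply not carried over to the example.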

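Building the vocab

The output above shows the example at index 5 (train[5]). It is an Example object, which bundles all the attributes of one row. The sentence has been tokenized, but the tokens have not been converted to numbers yet.

That is because we have not built the vocab; we do that in this step.

Torchtext can convert words to numbers, but it has to be told the full range of words it needs to handle.

'glove.840B.300d' is the name of one of the word-vector sets torchtext supports. When a set is requested by name for the first time, it is downloaded automatically and cached. The code below instead loads a local copy of the file and keeps its cache in ./data/glove/tmp_cache. Keep in mind that GloVe is trained mostly on English text, so most tokens from the Chinese headlines will not have a pretrained vector.

Word vectors supported by torchtext: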
charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d

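word_vectors = vocab.Vectors('./data/glove/glove.840B.300d.txt', './data/glove/tmp_cache')
print('glove.840B.300d shape:', word_vectors.vectors.shape)
# glove.840B.300d shape: torch.Size([2196017, 300])

The following line makes torchtext go through the data bound to word_field in the datasets passed in (here train and val), register the words in the vocabulary, and build the embedding matrix automatically.

min_freq is the minimum token frequency; tokens that occur fewer times than this are left out of the vocabulary.

# word_field.build_vocab(train, val, max_size=25000, vectors=word_vectors, min_freq=1, unk_init=torch.Tensor.normal_)
# Note: unk_init appears to take effect only when vectors is given by name; with a Vectors
# object, tokens missing from it keep that object's default (zero vectors, as the output further down shows).
word_field.build_vocab(train, val, max_size=25000, vectors=word_vectors, min_freq=1, unk_init=torch.nn.init.xavier_uniform)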
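As a rough picture of what building a vocabulary involves, here is a small pure-Python sketch: count tokens, drop those below min_freq, cap the size at max_size, and reserve the first indices for the special tokens. It imitates the general idea only; `build_vocab` below is a hypothetical standalone function, not torchtext's method, and the corpus is made up.

```python
from collections import Counter

def build_vocab(tokenized, min_freq=1, max_size=None, specials=("<unk>", "<pad>")):
    """Return (itos, stoi, freqs): specials first, then tokens by descending frequency."""
    freqs = Counter(token for tokens in tokenized for token in tokens)
    # Sort alphabetically first so that ties in frequency have a stable order
    ordered = sorted(freqs.items(), key=lambda item: item[0])
    ordered.sort(key=lambda item: item[1], reverse=True)
    itos = list(specials)
    for token, count in ordered:
        if count < min_freq:
            break  # everything after this point is rarer still
        if max_size is not None and len(itos) >= max_size + len(specials):
            break  # max_size counts regular tokens only, not the specials
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi, freqs

# Hypothetical tokenized corpus
corpus = [["北京", "高考"], ["北京", "成考", "报名"], ["高考", "北京"]]
itos, stoi, freqs = build_vocab(corpus, min_freq=2)
print(itos)

# Tokens outside the vocabulary fall back to the <unk> index
ids = [stoi.get(token, stoi["<unk>"]) for token in ["北京", "报名"]]
print(ids)
```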

At this point we can convert words to numbers, numbers back to words, and words to word vectors:

print('Vocab index 0 is <unk>:', word_field.vocab.itos[0])
print('Vocab index 1 is <pad>:', word_field.vocab.itos[1])
# Mapping from word to id
print('Index of a given word:', word_field.vocab.stoi['2010年'])
print('Token frequencies:', word_field.vocab.freqs)

# Word-vector matrix: word_field.vocab.vectors
print('Word-vector matrix shape---', word_field.vocab.vectors.shape)
word_vec = word_field.vocab.vectors[word_field.vocab.stoi['2010年']]
print(word_vec.shape)
print(word_vec)
print()
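The word-vector matrix has one row per vocabulary entry, indexed by the same ids as stoi. The sketch below shows that alignment with plain lists. It assumes that tokens without a pretrained vector receive zeros, and the 3-dimensional vectors are invented stand-ins for a real GloVe file; `build_embedding_matrix` is a hypothetical helper.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """One row per vocabulary entry; tokens without a pretrained vector get zeros."""
    return [list(pretrained.get(token, [0.0] * dim)) for token in itos]

# Hypothetical 3-dimensional vectors standing in for a real GloVe file
pretrained = {"2010": [0.1, 0.2, 0.3], "(": [0.4, 0.5, 0.6]}
itos = ["<unk>", "<pad>", "(", "高考", "2010"]
stoi = {token: index for index, token in enumerate(itos)}

matrix = build_embedding_matrix(itos, pretrained, dim=3)
print(len(matrix), len(matrix[0]))   # rows = vocabulary size, columns = vector size
print(matrix[stoi["2010"]])          # token with a pretrained vector
print(matrix[stoi["高考"]])          # token without one
```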

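Constructing the iterator

As with Datasets, torchtext ships with several built-in iterators. Here we use BucketIterator, which the official documentation describes as follows: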

Defines an iterator that batches examples of similar lengths together.
Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch.
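The idea in that description can be sketched in a few lines of plain Python: sort the examples by length, cut the sorted list into batches, and shuffle the order of the batches. This is a simplification written for illustration (the real BucketIterator is more elaborate), and `bucket_batches` and `padding_needed` are hypothetical helpers.

```python
import random

def bucket_batches(examples, batch_size, sort_key, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padding_needed(batches):
    """Count pad slots if every batch is padded to its own longest example."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

# Six dummy examples with lengths 2, 9, 3, 10, 2, 8
examples = [["a"] * n for n in (2, 9, 3, 10, 2, 8)]
bucketed = bucket_batches(examples, batch_size=2, sort_key=len)
naive = [examples[i:i + 2] for i in range(0, len(examples), 2)]
print("padding with bucketing:", padding_needed(bucketed))
print("padding without bucketing:", padding_needed(naive))
```

The saving only shows up when batches are padded to their own longest example. With fix_length=32, as in this article's word_field, every batch is padded to 32 tokens regardless of how it was bucketed.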

train_iter, dev_iter = BucketIterator.splits(
    (train, val),
    batch_sizes=(3, len(val)),
    sort_key=lambda x: len(x.text_word),
    sort_within_batch=True,
    repeat=False,
    shuffle=True,
    device=device
)

word_embeddings = word_field.vocab.vectors
# Keep the embedding layer on the same device as the batches
emb = nn.Embedding.from_pretrained(word_embeddings, freeze=False).to(device)

for batch, batch_data in enumerate(train_iter):
    # include_lengths=True makes text_word a (token ids, lengths) tuple
    text_word, y = batch_data.text_word, batch_data.label
    print(text_word)
    print(y)
    print(text_word[0].shape)   # [seq_len, batch_size]
    x = text_word[0].permute(1, 0)
    print(x)
    print(x.shape)   # [batch_size, seq_len]
    # embedding
    x = emb(x)
    print(x)
    print(x.shape)
    break

(tensor([[141, 171, 188],
        [266, 434,  29],
        [160,   2, 365],
        [ 10, 429, 347],
        [402, 126, 334],
        [ 55, 335, 393],
        [138, 105, 122],
        [ 67, 330,  95],
        [159, 178, 148],
        [261,  75,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1],
        [  1,   1,   1]]), tensor([10, 10,  9]))
tensor([3, 1, 1])
torch.Size([32, 3])
tensor([[141, 266, 160,  10, 402,  55, 138,  67, 159, 261,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1],
        [171, 434,   2, 429, 126, 335, 105, 330, 178,  75,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1],
        [188,  29, 365, 347, 334, 393, 122,  95, 148,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1]])
torch.Size([3, 32])
tensor([[[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [-0.0042,  0.3527, -0.3186,  ...,  0.0935,  0.0106,  0.1928],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]],
       grad_fn=<EmbeddingBackward>)
torch.Size([3, 32, 300])
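The printed batch can be used to check how the lengths tensor relates to the padding. The sketch below copies the first 12 of the 32 time steps from that output into plain lists, transposes them the way permute(1, 0) does for a 2-D tensor, and confirms that the number of non-pad ids in each sequence matches its reported length. It assumes index 1 is <pad>, as the vocab lookup earlier states.

```python
PAD_INDEX = 1  # index of <pad> in the vocabulary

# First 12 of the 32 time steps from the printed batch, laid out as [seq_len, batch_size]
time_major = [
    [141, 171, 188], [266, 434, 29], [160, 2, 365], [10, 429, 347],
    [402, 126, 334], [55, 335, 393], [138, 105, 122], [67, 330, 95],
    [159, 178, 148], [261, 75, 1], [1, 1, 1], [1, 1, 1],
]
lengths = [10, 10, 9]

# Same effect as permute(1, 0) on a 2-D tensor: rows become columns
batch_major = [list(column) for column in zip(*time_major)]

for sequence, length in zip(batch_major, lengths):
    non_pad = [token_id for token_id in sequence if token_id != PAD_INDEX]
    print(len(non_pad) == length, sequence[:length])
```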

references

https://www.jianshu.com/p/e5adb235399e

https://blog.csdn.net/SCPmaster/article/details/124425381
