PyTorch: Text Classification with TorchText

This tutorial shows how to use the text classification datasets in torchtext, including

- AG_NEWS,

- SogouNews,

- DBpedia,

- YelpReviewPolarity,

- YelpReviewFull,

- YahooAnswers,

- AmazonReviewPolarity,

- AmazonReviewFull

This example shows how to train a supervised learning algorithm for text classification using one of these TextClassification datasets.

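Load data with ngrams

A bag of ngrams feature is used to capture some partial information about the local word order. In practice, bi-grams or tri-grams, taken as word groups, are more useful than single words alone. For example: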

"load data with ngrams"

Bi-grams results: "load data", "data with", "with ngrams"

Tri-grams results: "load data with", "data with ngrams"

The TextClassification dataset supports the ngrams method. With ngrams set to 2, each example text in the dataset becomes a list of single words plus bi-gram strings.
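The feature list described above can be sketched in plain Python. The helper `bag_of_ngrams` below is a hypothetical stand-in written for illustration; it is not the torchtext implementation (torchtext provides its own `ngrams_iterator`, which this tutorial uses later).

```python
def bag_of_ngrams(tokens, n):
    """Return the single words followed by all 2..n-grams of a token list."""
    features = list(tokens)
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[start:start + size]))
    return features


tokens = "load data with ngrams".split()
bigram_features = bag_of_ngrams(tokens, 2)
trigram_features = bag_of_ngrams(tokens, 3)
print(bigram_features)
```

With `n=2` the list holds the four single words followed by the three bi-grams from the example above.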

import os

import torch
import torchtext
# text_classification belongs to older torchtext releases;
# newer releases restructured the dataset API.
from torchtext.datasets import text_classification

NGRAMS = 2

# Create the download directory if it does not exist yet.
if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

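Define the model

The model is composed of an EmbeddingBag layer and a linear layer (see the figure below). nn.EmbeddingBag computes the mean value of a "bag" of embeddings. The text entries here have different lengths. nn.EmbeddingBag requires no padding here, because the text lengths are stored as offsets.

In addition, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.

[Figure: ../_images/text_sentiment_ngrams_model.png]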
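A minimal sketch of how nn.EmbeddingBag pools entries of different lengths without padding. The sizes (10 ids, 4 dimensions) and the token ids are made up for illustration; the manual computation is only there to check that each output row is the mean of that entry's embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only: 10 vocabulary ids, 4-dim embeddings.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two text entries of different lengths, concatenated into one 1-D tensor:
# entry 0 -> ids [1, 2, 3], entry 1 -> ids [4, 5]
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # start index of each entry in `text`

pooled = bag(text, offsets)  # one row per entry

# The same result computed by hand: look up each id, then average per entry.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual, atol=1e-6))
```

The two entries never share a padded tensor; the offsets alone tell the layer where each bag starts.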

import torch.nn as nn

class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        # Uniform initialization in [-0.5, 0.5]; the bias starts at zero.
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        # text: 1-D tensor of concatenated token ids
        # offsets: start index of each entry inside text
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

Initiate an instance

The AG_NEWS dataset has four labels, so the number of classes is four.

1 : World

2 : Sports

3 : Business

4 : Sci/Tec

The vocab size is equal to the length of the vocab (including single words and ngrams). The number of classes is equal to the number of labels, which is four in the AG_NEWS case.
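To see why the vocab size counts ngrams as well as single words, here is a small sketch on a made-up two-line corpus. The dictionary built below is a toy stand-in, not torchtext's vocabulary object, and `unigrams_and_bigrams` is a hypothetical helper.

```python
from collections import Counter


def unigrams_and_bigrams(tokens):
    """Single words followed by adjacent word pairs."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


corpus = ["load data with ngrams", "load data fast"]  # made-up corpus

counts = Counter()
for line in corpus:
    counts.update(unigrams_and_bigrams(line.split()))

# Map every distinct feature (single word or bi-gram) to an integer id.
toy_vocab = {feature: idx for idx, feature in enumerate(sorted(counts))}

n_single_words = sum(1 for feature in toy_vocab if " " not in feature)
n_bigrams = len(toy_vocab) - n_single_words
print(len(toy_vocab), n_single_words, n_bigrams)
```

The embedding table needs one row per entry of this mapping, which is why the bi-grams enlarge the first dimension of the EmbeddingBag layer.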

VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

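Functions used to generate batches

Since the text entries have different lengths, a custom function generate_batch() is used to generate the data batches and offsets. The function is passed as collate_fn to torch.utils.data.DataLoader. The input to collate_fn is a list of batch_size entries, and collate_fn packs them into a mini-batch. Make sure that collate_fn is declared as a top-level function, so that it is available in each worker process.

The text entries in the original batch are packed into a list and concatenated into a single tensor as the input of nn.EmbeddingBag. The offsets tensor holds the delimiters, that is, the start index of each individual sequence in the text tensor. The label tensor holds the labels of the individual text entries.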
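The offset bookkeeping described above can be traced with plain Python lists before moving to tensors. The token ids below are made up; the sketch only shows that the running sum of the entry lengths yields each entry's start index and that those start indices are enough to recover the entries.

```python
from itertools import accumulate

# Three hypothetical text entries, already converted to token ids.
entries = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

lengths = [len(entry) for entry in entries]
# Each entry starts where the previous ones end: prepend 0, drop the last
# length, then take the running sum.
offsets = list(accumulate([0] + lengths[:-1]))
flat = [token for entry in entries for token in entry]

# Slicing `flat` at consecutive offsets recovers the original entries.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(offsets)
print(recovered == entries)
```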

def generate_batch(batch):
    # Each entry is a (label, token id tensor) pair.
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0) -> tensor([1., 3., 6.])
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label

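Define functions to train the model and evaluate results

PyTorch users are advised to use torch.utils.data.DataLoader, which makes it easy to load data in parallel (see the data loading tutorial). Here we use DataLoader to load the AG_NEWS dataset and send it to the model for training and validation.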
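As a small illustration of DataLoader with a custom collate_fn, the sketch below runs on a made-up list of (label, token id tensor) pairs rather than on AG_NEWS. The `collate` function is a hypothetical helper that mirrors the (text, offsets, label) packing used in this tutorial.

```python
import torch
from torch.utils.data import DataLoader

# A toy dataset: a plain list of (label, token_id_tensor) pairs with
# different lengths. The values are made up for illustration.
toy_data = [
    (0, torch.tensor([1, 2, 3])),
    (1, torch.tensor([4, 5])),
    (2, torch.tensor([6])),
    (3, torch.tensor([7, 8, 9, 10])),
]


def collate(batch):
    """Pack variable-length entries into (text, offsets, label)."""
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    # Start index of each entry: running sum of the preceding lengths.
    starts = torch.tensor([0] + [len(t) for t in text[:-1]])
    offsets = starts.cumsum(dim=0)
    return torch.cat(text), offsets, label


loader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=collate)
batches = list(loader)
for text, offsets, label in batches:
    print(text.tolist(), offsets.tolist(), label.tolist())
```

Each batch arrives as one flat text tensor plus the offsets and labels, ready to be passed to a model whose forward method takes (text, offsets).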

from torch.utils.data import DataLoader

def train_func(sub_train_):
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate once per epoch
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    total_loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            # Keep the batch loss separate from the running total so the
            # total is not overwritten on every iteration.
            batch_loss = criterion(output, cls)
            total_loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(data_), acc / len(data_)

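Split the dataset and run the model

Since the original AG_NEWS has no validation dataset, we split the training dataset into train/valid sets with a split ratio of 0.95 (train) and 0.05 (valid). Here we use the torch.utils.data.dataset.random_split function from the PyTorch core library.

The CrossEntropyLoss criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification problem with C classes. SGD implements stochastic gradient descent as the optimizer. The initial learning rate is set to 4.0. StepLR is used here to adjust the learning rate from epoch to epoch.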
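The two ideas in the paragraph above can be traced for a single sample in plain Python. This is an illustrative sketch with made-up logits, not the PyTorch implementation: `cross_entropy` composes a log-softmax with a negative log-likelihood, and `step_lr` evaluates the step decay rule (initial rate multiplied by gamma once every step_size epochs) with the values used in this tutorial (4.0, gamma 0.9, step size 1).

```python
import math


def log_softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll(log_probs, target):
    # Negative log-likelihood of the target class.
    return -log_probs[target]


def cross_entropy(logits, target):
    # Cross entropy as the composition of the two steps above.
    return nll(log_softmax(logits), target)


def step_lr(initial_lr, gamma, step_size, epoch):
    # Learning rate after `epoch` completed epochs under step decay.
    return initial_lr * gamma ** (epoch // step_size)


logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores for four classes
loss = cross_entropy(logits, target=0)

# Direct formula for comparison: -log(softmax(logits)[target]).
direct = -math.log(math.exp(logits[0]) / sum(math.exp(x) for x in logits))

schedule = [step_lr(4.0, 0.9, 1, epoch) for epoch in range(5)]
print(round(loss, 6) == round(direct, 6))
print([round(lr, 4) for lr in schedule])
```

With gamma 0.9 and a step size of 1, the rate shrinks by ten percent after every epoch, starting from 4.0.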

import time
from torch.utils.data.dataset import random_split

N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

# 95% of the training data for training, the remaining 5% for validation.
train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Running the model on a GPU gave the following output:

Epoch: 1 | time in 0 minutes, 11 seconds
        Loss: 0.0263(train)     |       Acc: 84.5%(train)
        Loss: 0.0001(valid)     |       Acc: 89.0%(valid)
Epoch: 2 | time in 0 minutes, 10 seconds
        Loss: 0.0119(train)     |       Acc: 93.6%(train)
        Loss: 0.0000(valid)     |       Acc: 89.6%(valid)
Epoch: 3 | time in 0 minutes, 9 seconds
        Loss: 0.0069(train)     |       Acc: 96.4%(train)
        Loss: 0.0000(valid)     |       Acc: 90.5%(valid)
Epoch: 4 | time in 0 minutes, 11 seconds
        Loss: 0.0038(train)     |       Acc: 98.2%(train)
        Loss: 0.0000(valid)     |       Acc: 90.4%(valid)
Epoch: 5 | time in 0 minutes, 11 seconds
        Loss: 0.0022(train)     |       Acc: 99.0%(train)
        Loss: 0.0000(valid)     |       Acc: 91.0%(valid)

Evaluate the model with the test dataset

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
        Loss: 0.0237(test)      |       Acc: 90.5%(test)

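Test on a random news item

Use the best model so far to classify a piece of golf news. The label information is listed in the ag_news_label dictionary in the code.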

from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        # Convert the tokens and their ngrams to vocabulary ids.
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        # A single entry, so the only offset is 0.
        output = model(text, torch.tensor([0]))
        # The class index is 0-based; the label keys start at 1.
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news
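The `+ 1` at the end of predict maps the 0-based position of the largest score onto the 1-based keys of the label dictionary. A small sketch with made-up scores, using a plain list in place of the model output; `label_from_scores` is a hypothetical helper:

```python
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}


def label_from_scores(scores):
    # argmax over a plain list: index of the largest score (0-based).
    best_index = max(range(len(scores)), key=scores.__getitem__)
    # Shift by one because the label dictionary starts at 1.
    return ag_news_label[best_index + 1]


scores = [-1.2, 3.4, 0.1, 0.7]  # made-up scores, one per class
prediction = label_from_scores(scores)
print("This is a %s news" % prediction)
```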
