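I only started learning deep learning this year and am essentially a beginner, so I hope to use the time off during the epidemic to study it properly. This first check-in covers Task01 and Task02. Task01 has three parts: linear regression, softmax and classification models, and the multilayer perceptron.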

Task01

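1. Linear regression

The problem to solve is how the price of a house (price) depends on two factors: its area (area) and its age (age).
Linear regression assumes that the output is a linear function of each input:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

that is, price = area × area weight + age × age weight + bias.

Each house is called a sample, its actual price is called the label, and the area and age used to predict the label are called features.

A loss function L measures the error between the predicted price and the true price; here the squared loss is used. The goal of training is to keep reducing the loss so that the predictions improve.

The optimizer is mini-batch gradient descent. First choose initial values for the model parameters, for example at random. Then update the parameters over many iterations so that each iteration is likely to reduce the loss. In each iteration, first sample uniformly at random a mini-batch made up of a fixed number of training samples, then compute the derivative (gradient) of the average loss on the mini-batch with respect to the model parameters, and finally subtract from the parameters the product of this gradient and a preset positive number (the learning rate).

Why mini-batch gradient descent: it improves on plain (full-batch) gradient descent. When the number of training samples is large, each gradient step over the entire training set takes too long. Mini-batch gradient descent splits the training set into many small subsets, so the parameters can be updated after each subset rather than after a full pass over the data.

The optimizer has two steps:
(i) initialize the model parameters, usually at random;
(ii) iterate over the data many times, updating each parameter by moving it in the direction of the negative gradient.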
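
The two steps above can be sketched without PyTorch. This is a minimal pure-Python illustration rather than the notebook's code: the function name minibatch_sgd, the toy one-feature data, and the hyperparameters (lr=0.1, batch_size=4, 200 epochs) are assumptions chosen for the example.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, num_epochs=200, seed=0):
    """Fit y ≈ w * x + b with mini-batch SGD on the squared loss (y_hat - y)^2 / 2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # (i) initialise the parameters
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)             # draw mini-batches in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # average gradient of the squared loss over the mini-batch
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # (ii) move each parameter against its gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(-10, 11)]    # 21 inputs in [-1, 1]
ys = [2.0 * x + 4.2 for x in xs]         # noiseless labels with true w = 2, b = 4.2
w, b = minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

Because the toy labels are noiseless, the recovered w and b should end up very close to the true values; with noisy labels they would only be approximate.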

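In addition, computations are written as vector operations instead of for loops, which greatly improves efficiency.

Implementing the linear model from scratch

First import the required packages and modules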

#import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random

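Generate a dataset of 1000 samples from the linear model

# number of input features: 2 (area and age)
num_inputs = 2
# number of samples: 1000
num_examples = 1000
# true weights w and bias b for the two input features
true_w = [2, -3.4]
true_b = 4.2
features = torch.randn(num_examples, num_inputs, dtype=torch.float32)
# compute the labels from the price formula
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# add Gaussian noise with standard deviation 0.01
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float32)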

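Notes:
① torch.randn(sizes, out=None) → Tensor
Returns a tensor of random numbers drawn from the standard normal distribution (mean 0, variance 1, i.e. Gaussian white noise). The shape of the tensor is given by sizes. Here it returns a 1000×2 float32 tensor whose two columns play the roles of area and age, stored in features.
② torch.tensor(data, dtype=None, device=None, requires_grad=False)
torch.tensor() copies the values in data (it does not share memory with data) and builds the matching torch.LongTensor, torch.FloatTensor, or torch.DoubleTensor from the original data type unless dtype is given.
③ numpy.random.normal(loc=0.0, scale=1.0, size=None)
Draws random samples from a normal (Gaussian) distribution. loc is the mean, i.e. the center of the distribution; scale is the standard deviation; size is the output shape and defaults to None. Real data never follows a linear relationship exactly, so Gaussian noise is added on top of the labels computed above.

Plot the generated data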

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);

Read the dataset

def data_iter(batch_size, features, labels):
    num_examples = len(features)  # number of rows in features, here 1000
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle the sample order
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)

① for i in range(0, num_examples, batch_size) lets i run from 0 up to num_examples in steps of batch_size. This splits the num_examples samples into ceil(num_examples / batch_size) mini-batches of batch_size samples each; only the last one may be smaller.
② .index_select(0, j) selects along dimension 0 (rows) the entries whose indices are listed in j.
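
To make the batch arithmetic concrete, here is a small pure-Python sketch of the same index slicing. The helper name batch_indices and the sizes (10 samples, batches of 4) are made up for illustration; they are not part of the notebook.

```python
import random

def batch_indices(num_examples, batch_size, seed=0):
    """Yield shuffled index lists, one per mini-batch."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # the slice end is capped so the last batch may be shorter
        yield indices[i:min(i + batch_size, num_examples)]

batches = list(batch_indices(num_examples=10, batch_size=4))
print([len(b) for b in batches])   # [4, 4, 2]
```

Every sample index appears in exactly one batch, so one pass over the generator is one epoch.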

Define the weights w and bias b

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)
# track gradients for both parameters
w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

Out: tensor([0.], requires_grad=True)

The linear regression model

def linreg(X, w, b):
    return torch.mm(X, w) + b

Define the loss function

def squared_loss(y_hat, y):
    # reshape y to the shape of y_hat before subtracting
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

Define the optimization function

def sgd(params, lr, batch_size):
    for param in params:
        # use .data to update param without gradient tracking
        param.data -= lr * param.grad / batch_size

Train the model

# hyperparameter initialization
lr = 0.03
num_epochs = 5
batch_size = 10  # assumed value: not defined in the original listing; 10 is the value used in the concise implementation later on
net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once
    # X holds the features and y the labels of one mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # compute the gradient of the mini-batch loss
        l.backward()
        # update the parameters with mini-batch stochastic gradient descent
        sgd([w, b], lr, batch_size)
        # reset the parameter gradients
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))

out:
epoch 1, loss 0.045923
epoch 2, loss 0.000187
epoch 3, loss 0.000055
epoch 4, loss 0.000055
epoch 5, loss 0.000055

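The learning rate is 0.03 and training runs for 5 epochs. In every epoch, each mini-batch goes through the same steps: compute the loss, backpropagate, and pass the parameters to the optimization function. The loss over the full training set is printed at the end of each epoch.

Print the trained weights and bias next to the true values

w, true_w, b, true_b

out: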
(tensor([[ 0.1838], [-0.4229]], requires_grad=True), [2, -3.4],
tensor([4.3392], requires_grad=True), 4.2)

Concise implementation of linear regression in PyTorch
Import the required packages and modules

import torch
from torch import nn
import numpy as np
torch.manual_seed(1)

print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')

Generate the dataset, same as before

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

Read the dataset: combine the features and labels into a dataset, then use DataLoader to draw mini-batches from it.

import torch.utils.data as Data

batch_size = 10

# combine features and labels of dataset
dataset = Data.TensorDataset(features, labels)

# put dataset into DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini batch size
    shuffle=True,               # shuffle the data
    num_workers=2,              # load data with 2 worker subprocesses
)

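Next, the model is built from PyTorch modules: first define a class for the linear network, then instantiate it.
I had not met these concepts before, so I looked up classes and instantiation in Python:
Python classes and instantiation
In short, a class can be seen as a template with methods attached. Creating an instance fills in the required attributes. Inside a method, self refers to the instance itself, so self is not passed explicitly when a method is called on an instance.

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()      # call the parent constructor
        # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
        self.linear = nn.Linear(n_feature, 1)

    def forward(self, x):
        y = self.linear(x)
        return y

net = LinearNet(num_inputs)
print(net)

This produces a single-layer linear network:
LinearNet( (linear): Linear(in_features=2, out_features=1, bias=True) )

Only one linear layer is initialized above, but multilayer networks can be built as well. nn.Sequential is used for this, and three ways of constructing it are shown.
Further reading: the Sequential model and its methods in Keras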

# ways to init a multilayer network
# method one
net = nn.Sequential(nn.Linear(num_inputs, 1))

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
    ('linear', nn.Linear(num_inputs, 1))
    # ......
]))

print(net)
print(net[0])

The resulting network is:
Sequential( (linear): Linear(in_features=2, out_features=1, bias=True))
Linear(in_features=2, out_features=1, bias=True)
The whole network has a single linear layer; net[0], the first and only layer, is the Linear module shown.

Initialize the model parameters w and b with the init module, which offers several initialization schemes; pass in the tensor to initialize together with the parameters of the chosen scheme.

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)  # or you can use `net[0].bias.data.fill_(0)` to modify it directly

Use the mean squared error loss from nn directly

loss = nn.MSELoss()

Use the stochastic gradient descent optimizer from torch.optim

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)`

Train for 3 epochs. The loop is the same as before, except that PyTorch now supplies the intermediate functions.

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset gradient, equal to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))

out:
epoch 1, loss: 0.000422
epoch 2, loss: 0.000075
epoch 3, loss: 0.000176

Compare the trained weights w and bias b with the true values

# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)

out:
[2, -3.4] tensor([[ 1.9996, -3.4005]])
4.2 tensor([4.1993])

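2. Softmax classification model

Softmax regression is also a single-layer neural network. Its output layer is fully connected: every output depends on all of the inputs.

Image classification with softmax can be understood as follows: the inputs x are the image pixels and the outputs o are the class scores. Each output is connected to every input through a weight w, and the output with the largest value is taken as the predicted class. With d inputs and q classes:

$$o_i = \sum_{j=1}^{d} x_j w_{ji} + b_i, \quad i = 1, \ldots, q$$

The softmax operator turns the outputs into a probability distribution whose values are positive and sum to 1:

$$\hat{y}_i = \frac{\exp(o_i)}{\sum_{k=1}^{q} \exp(o_k)}$$

For a mini-batch X, softmax regression in vector form is:

$$O = XW + b, \qquad \hat{Y} = \mathrm{softmax}(O)$$

The loss function is the cross-entropy loss:

$$H(y, \hat{y}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j$$

Model performance is evaluated with accuracy, where accuracy = number of correct predictions / total number of predictions.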
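
The three ingredients just described (softmax, cross-entropy, accuracy) fit in a few lines of plain Python. This is only a sketch on made-up scores for two samples and three classes; subtracting the maximum score before exponentiating is a common numerical-stability trick that the notebook's own softmax does not use.

```python
import math

def softmax(logits):
    """Map a list of scores to positive values that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def accuracy(batch_probs, labels):
    """Fraction of samples whose most probable class equals the label."""
    correct = sum(1 for p, y in zip(batch_probs, labels) if p.index(max(p)) == y)
    return correct / len(labels)

batch_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]   # hypothetical scores o
labels = [0, 2]
batch_probs = [softmax(o) for o in batch_logits]
print(accuracy(batch_probs, labels))                 # 0.5
```

The first sample is classified correctly (class 0 has the largest score), the second is not (class 1 wins but the label is 2), so the accuracy is 0.5 and the second sample contributes the larger cross-entropy.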

Below is some code for obtaining the Fashion-MNIST training set and reading the data:

Import the required packages.

# import needed package
%matplotlib inline
from IPython import display
import matplotlib.pyplot as plt

import torch
import torchvision
import torchvision.transforms as transforms
import time

import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)
print(torchvision.__version__)

Get the datasets mnist_train and mnist_test.

mnist_train = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=False, download=True, transform=transforms.ToTensor())

Store the ten class labels in text_labels.

# this function is saved in the d2lzh package for later use
def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

Read the data

# read the data
batch_size = 256
num_workers = 4
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)

Softmax from scratch
First import the required packages

import torch
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

Then load the training and test data

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')

Initialize the weights W and bias b in the same way as in the previous section.

num_inputs = 784
num_outputs = 10

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)

W.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

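Define the softmax operation from the formula, using broadcasting

def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # broadcasting is applied here

Likewise, define the softmax regression model from the formula

def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)

Define the cross-entropy loss function

def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))

Define and compute accuracy

def accuracy(y_hat, y):
    return (y_hat.argmax(dim=1) == y).float().mean().item()

Here (y_hat.argmax(dim=1) == y).float().mean().item() takes the column index of the largest value in each row of y_hat, compares it with y, converts the boolean result to float, averages it, and returns the mean as a plain Python number via .item().

# This function is saved in the d2lzh_pytorch package for later use. It will be improved step by step; its full implementation is described in the "Image Augmentation" section.
def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n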

Train the model:

num_epochs, lr = 5, 0.1

# this function is saved in the d2lzh_pytorch package for later use
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()

            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()

            l.backward()
            if optimizer is None:
                d2l.sgd(params, lr, batch_size)
            else:
                optimizer.step()

            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)

out：
epoch 1, loss 0.7851, train acc 0.750, test acc 0.791
epoch 2, loss 0.5704, train acc 0.814, test acc 0.810
epoch 3, loss 0.5258, train acc 0.825, test acc 0.819
epoch 4, loss 0.5014, train acc 0.832, test acc 0.824
epoch 5, loss 0.4865, train acc 0.836, test acc 0.827

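Finally, predict with the model:

X, y = next(iter(test_iter))  # Python 3 form of iter(test_iter).next()

true_labels = d2l.get_fashion_mnist_labels(y.numpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]

d2l.show_fashion_mnist(X[0:9], titles[0:9])

One batch is read from the test set and its first nine samples are displayed, each with its label and predicted class; the predictions shown are all correct.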

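Concise implementation of softmax
The concise implementation uses ready-made PyTorch modules for the intermediate functions, as with linear regression.
Load the packages and modules

# load the packages and modules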
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)

Load the training and test data

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')

As with linear regression, define a class for the softmax network, then instantiate it.

num_inputs = 784
num_outputs = 10

class LinearNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(LinearNet, self).__init__()
        self.linear = nn.Linear(num_inputs, num_outputs)

    def forward(self, x):  # shape of x: (batch, 1, 28, 28)
        y = self.linear(x.view(x.shape[0], -1))
        return y

# net = LinearNet(num_inputs, num_outputs)

class FlattenLayer(nn.Module):
    def __init__(self):
        super(FlattenLayer, self).__init__()

    def forward(self, x):  # shape of x: (batch, *, *, ...)
        return x.view(x.shape[0], -1)

from collections import OrderedDict
net = nn.Sequential(
    # FlattenLayer(),
    # LinearNet(num_inputs, num_outputs)
    OrderedDict([
        ('flatten', FlattenLayer()),
        ('linear', nn.Linear(num_inputs, num_outputs))])  # our own LinearNet(num_inputs, num_outputs) would also work here
)

Initialize w and b

init.normal_(net.linear.weight, mean=0, std=0.01)
init.constant_(net.linear.bias, val=0)

Define the loss function and the optimizer

loss = nn.CrossEntropyLoss()  # its prototype is shown below
# class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)  # its prototype is shown below
# class torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Train

num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

out：
epoch 1, loss 0.0031, train acc 0.751, test acc 0.795
epoch 2, loss 0.0022, train acc 0.813, test acc 0.809
epoch 3, loss 0.0021, train acc 0.825, test acc 0.806
epoch 4, loss 0.0020, train acc 0.833, test acc 0.813
epoch 5, loss 0.0019, train acc 0.837, test acc 0.822

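Note that in the first epoch the training accuracy is lower than the test accuracy. The training accuracy is accumulated over the whole epoch, starting from the untrained initial parameters, whereas the test accuracy is measured after that epoch's updates. As the number of epochs grows, the training accuracy gradually rises.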

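3. Multilayer perceptron

Unlike the single-layer networks above, a multilayer perceptron contains a hidden layer; in the illustrated example this layer has 5 hidden units.

Let the number of hidden units be h. The hidden layer's output is fed directly into the output layer. The input layer and hidden layer are fully connected, and so are the hidden layer and output layer. Stacking these two layers still yields an affine transformation, so a nonlinear transformation, the activation function, has to be introduced. There are many activation functions, including ReLU, Sigmoid, and tanh. The hidden layer here uses ReLU, defined as:

$$\mathrm{ReLU}(x) = \max(x, 0)$$

The output of the multilayer perceptron is:

$$H = \phi(X W_h + b_h), \qquad O = H W_o + b_o$$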
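
Why the activation function is needed can be checked numerically. The sketch below uses scalar "layers" with made-up weights (an assumption for illustration only): without an activation, two stacked affine maps equal a single affine map, while inserting ReLU breaks that equivalence.

```python
def relu(x):
    return max(x, 0.0)

def two_layer(x, w1, b1, w2, b2, activation=None):
    """Scalar two-layer network: hidden = w1*x + b1, output = w2*hidden + b2."""
    h = w1 * x + b1
    if activation is not None:
        h = activation(h)
    return w2 * h + b2

w1, b1, w2, b2 = 2.0, -1.0, 3.0, 0.5      # hypothetical parameters
inputs = (-1.0, 0.0, 1.0)

# no activation: identical to one affine map with weight w2*w1 and bias w2*b1 + b2
linear_out = [two_layer(x, w1, b1, w2, b2) for x in inputs]
collapsed = [(w2 * w1) * x + (w2 * b1 + b2) for x in inputs]

# with ReLU the outputs no longer lie on a straight line
relu_out = [two_layer(x, w1, b1, w2, b2, relu) for x in inputs]
print(linear_out, collapsed, relu_out)
```

For equally spaced inputs an affine function satisfies f(-1) + f(1) = 2 f(0); the ReLU version violates this, which is exactly the nonlinearity the hidden layer contributes.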

Multilayer perceptron from scratch
Import the required packages

import torch
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)

Load the training and test datasets

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size,root='/home/kesci/input/FashionMNIST2065')

Define the model parameters W and b

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)

params = [W1, b1, W2, b2]
for param in params:
    param.requires_grad_(requires_grad=True)

W1 has shape (num_inputs, num_hiddens) because it holds the weights between the input layer and the hidden layer: the two layers are fully connected, so there is one weight between every input unit and every hidden unit. The same reasoning applies to W2.

Next, define the activation function

def relu(X):
    return torch.max(input=X, other=torch.tensor(0.0))

Then define the network from the formula

def net(X):
    X = X.view((-1, num_inputs))
    H = relu(torch.matmul(X, W1) + b1)
    return torch.matmul(H, W2) + b2

Define the cross-entropy loss function

loss = torch.nn.CrossEntropyLoss()

Train the model

num_epochs, lr = 5, 100.0
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

out：
epoch 1, loss 0.0030, train acc 0.714, test acc 0.718
epoch 2, loss 0.0019, train acc 0.824, test acc 0.798
epoch 3, loss 0.0017, train acc 0.843, test acc 0.766
epoch 4, loss 0.0015, train acc 0.856, test acc 0.789
epoch 5, loss 0.0014, train acc 0.865, test acc 0.841
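
d2l.train_ch3 was already defined in the softmax section and is stored in d2l, so it is used directly here.

Multilayer perceptron in PyTorch
Again import the required packages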

import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

Initialize the model and its parameters by giving the number of units in each layer (784 input units, 10 output units, 256 hidden units) and passing the layers to Sequential.

num_inputs, num_outputs, num_hiddens = 784, 10, 256

net = nn.Sequential(
    d2l.FlattenLayer(),
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, num_outputs),
)

for params in net.parameters():
    init.normal_(params, mean=0, std=0.01)

This network with one hidden layer can be compared with the earlier single-layer network.
Single-layer network:

net = nn.Sequential(
    # FlattenLayer(),
    # LinearNet(num_inputs, num_outputs)
    OrderedDict([
        ('flatten', FlattenLayer()),
        ('linear', nn.Linear(num_inputs, num_outputs))])
)

Network with one hidden layer:

net = nn.Sequential(
    d2l.FlattenLayer(),
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),
    nn.Linear(num_hiddens, num_outputs),
)

Then train

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')
loss = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

Output of the training run:
epoch 1, loss 0.0031, train acc 0.701, test acc 0.774
epoch 2, loss 0.0019, train acc 0.821, test acc 0.806
epoch 3, loss 0.0017, train acc 0.841, test acc 0.805
epoch 4, loss 0.0015, train acc 0.855, test acc 0.834
epoch 5, loss 0.0014, train acc 0.866, test acc 0.840

Task02

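Task02 covers three topics: text preprocessing, language models, and the basics of recurrent neural networks.

1. Text preprocessing

The main steps of text preprocessing are:
1. Read in the text
2. Tokenize
3. Build a vocabulary that maps each token to a unique index
4. Convert the text from a sequence of tokens to a sequence of indices, ready to feed into a model
An English novel, H. G. Wells's The Time Machine, is used as an example to show the preprocessing steps.

First read in the text: import the packages, open the txt file to process, and store each line in lines. There are 3221 lines in total.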

import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # lowercase each line and replace every run of characters other than a-z with a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))

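out: # sentences 3221
strip() removes leading and trailing whitespace, including newlines, from each line;
lower() converts uppercase letters to lowercase;
the re module provides regular expressions. A regular expression is a special character sequence that makes it easy to check whether a string matches a pattern. What re.sub does is easiest to see from an example:

a = re.sub(r'(\d+)', 'hello', 'my number is 400 and door num is 200')

out: my number is hello and door num is hello
Here re.sub takes the text already processed by strip and lower and replaces every non-empty run of characters other than lowercase letters with a single space, which simplifies later processing.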
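
The cleaning step can be tried on a single line. The helper name clean_line and the sample string are made up for this sketch; the pattern is the one used in read_time_machine.

```python
import re

def clean_line(line):
    """Lowercase the line and collapse every run of non a-z characters into one space."""
    return re.sub('[^a-z]+', ' ', line.strip().lower())

cleaned = clean_line("  The Time Machine, by H. G. Wells [1898]\n")
print(repr(cleaned))            # 'the time machine by h g wells '
print(cleaned.split(' '))
```

The trailing "[1898]" becomes a trailing space, which is why splitting on ' ' leaves an empty string as the last token, the same effect visible in the tokenized output of the notebook.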

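Next, tokenize each sentence; the tokens of each sentence form a sequence of words.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]

out: [['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
The token argument selects the mode: 'word' splits each sentence into words and 'char' splits it into characters. The return value is a two-dimensional list whose first dimension is the sentence and whose second dimension is the tokens of that sentence. The second line of the text is empty, so only the first line shows any tokens.

Then build the vocabulary, which converts strings to numbers: each word is mapped to a unique index so that a model can process it. For example, the first sentence, "the time machine by h g wells", yields seven words after tokenization. The word "the" is mapped to index 1 (index 0 is taken by the unknown token), and it keeps that index whenever it appears in later sentences. The implementation is: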

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # maps each token to its frequency
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['', '', '', '']
        else:
            self.unk = 0
            self.idx_to_token += ['']
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a dict-like Counter recording how often each token occurs

The Vocab class implements the mapping between tokens and numbers in both directions: given an index it returns the token, and given a token it returns the index. count_corpus first flattens the two-dimensional list into a one-dimensional one, then Counter tracks how often each token occurs, which gives the token frequencies and removes duplicates at the same time. idx_to_token is the list the vocabulary maintains. Four special tokens can be added, pad, bos, eos, and unk, which respectively pad sentences of different lengths, mark the beginning and the end of a sentence, and stand for out-of-vocabulary words. __len__ returns the length of idx_to_token. __getitem__ maps tokens to indices and returns unk for a token that is not in the vocabulary. to_tokens maps indices back to tokens.

Here is an example:

vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])

out: [('', 0), ('the', 1), ('time', 2), ('machine', 3), ('by', 4), ('h', 5), ('g', 6), ('wells', 7), ('i', 8), ('traveller', 9)]
After instantiating Vocab, the indices of the first ten tokens can be inspected.

Likewise, convert lines 8 and 9 of the tokenized text (tokens[8] and tokens[9]) from word sequences to index sequences:

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])

out: words: ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', '']
indices: [1, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0]
words: ['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
indices: [20, 21, 22, 23, 24, 16, 25, 26, 27, 28, 29, 30]
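
The core idea of the vocabulary, stripped of the special-token handling, is sketched below. This is a simplified stand-in, not the Vocab class itself: the function build_vocab, the '<unk>' placeholder string, and the two toy sentences are assumptions made for the example.

```python
import collections

def build_vocab(tokenized_lines, unk_token='<unk>'):
    """Return (idx_to_token, token_to_idx); index 0 is reserved for unknown tokens."""
    counter = collections.Counter(tk for line in tokenized_lines for tk in line)
    # Counter keeps first-seen order, so indices follow order of first appearance
    idx_to_token = [unk_token] + [tk for tk in counter if tk != unk_token]
    token_to_idx = {tk: i for i, tk in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

toy_lines = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]
idx_to_token, token_to_idx = build_vocab(toy_lines)

def encode(line):
    # tokens missing from the vocabulary fall back to index 0
    return [token_to_idx.get(tk, 0) for tk in line]

print(idx_to_token)                          # ['<unk>', 'the', 'time', 'machine', 'traveller']
print(encode(['the', 'traveller', 'spoke']))  # [1, 4, 0]
```

A repeated word such as "the" always maps to the same index, and a word never seen during construction ("spoke") maps to the unknown index.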

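This way of tokenizing has a problem: it ignores the meaning of forms such as "shouldn't" and "Mr." and simply turns everything that is not a lowercase letter into a space. The first sentence already shows this: the author's name H. G. Wells is split into three tokens, h, g, and wells, which destroys the original meaning.
More elaborate rules can solve these problems, and there are ready-made tools for tokenization, such as spaCy and NLTK.
Tokenize text with both of them: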

text = "Mr. Chen doesn't agree with my suggestion."

spaCy：

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])

out: ['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']

NLTK：

from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))

out: ['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']

The necessary punctuation is kept and the meaning is preserved.

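2. Language models and datasets

A piece of natural-language text can be viewed as a discrete time series. Given a sequence of T words w_1, w_2, …, w_T, the goal of a language model is to judge whether the sequence is plausible, i.e. to compute its probability:

$$P(w_1, w_2, \ldots, w_T)$$

This section mainly uses n-grams.
Assuming the words w_1, w_2, …, w_T are generated one after another, the probability of a text of T words is:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

The conditional probabilities are estimated from relative frequencies in the corpus. The relations are as follows, where n is the total number of texts in the corpus, n(w_1) is the number of texts that start with w_1, and n(w_1, w_2) is the number that start with w_1 w_2:

$$\hat{P}(w_1) = \frac{n(w_1)}{n}, \qquad \hat{P}(w_2 \mid w_1) = \frac{n(w_1, w_2)}{n(w_1)}$$

As the sequence gets longer, the cost of computing and storing the probabilities of many words occurring together grows exponentially. n-grams simplify the model through the Markov assumption, which states that a word depends only on the n words before it (an n-th order Markov chain).
Markov models
An n-gram model is based on a Markov chain of order n-1, which simplifies the formula to:

$$P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1})$$

For n = 1, 2, 3 these are called the unigram, bigram, and trigram models.

n-grams also have drawbacks, such as an overly large parameter space and data sparsity.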
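
As a small illustration of a bigram model (n = 2), the sketch below estimates P(w_t | w_{t-1}) from token counts in a toy sentence. The function name bigram_probs and the sentence are made up, and the relative-frequency estimate count(a, b) / count(a) is the usual maximum-likelihood choice rather than something this section specifies.

```python
import collections

def bigram_probs(tokens):
    """Estimate P(b | a) as count(a followed by b) / count(a as a left context)."""
    context_counts = collections.Counter(tokens[:-1])
    pair_counts = collections.Counter(zip(tokens[:-1], tokens[1:]))
    return {(a, b): c / context_counts[a] for (a, b), c in pair_counts.items()}

toy_tokens = 'the time machine and the time traveller'.split()
probs = bigram_probs(toy_tokens)
print(probs[('the', 'time')])        # 1.0
print(probs[('time', 'machine')])    # 0.5
print(('machine', 'time') in probs)  # False
```

The last line shows the sparsity problem in miniature: any pair that never occurs in the corpus gets no probability mass at all, and the number of possible pairs grows with the square of the vocabulary size.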
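The concrete implementation follows.
First read the dataset, which consists of Jay Chou's lyrics:

with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[: 10000]

The text has 63282 characters in total (len counts characters, not lines). The first forty characters, whitespace included, are printed, and only the first 10000 characters are kept for the rest of the section.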

Next, build the character index:

idx_to_char = list(set(corpus_chars))  # remove duplicates to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # character-to-index mapping
vocab_size = len(char_to_idx)
print(vocab_size)

corpus_indices = [char_to_idx[char] for char in corpus_chars]  # convert every character to its index, giving a sequence of indices
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)

out：
1027
chars: 想要有直升机想要和你飞到宇宙去想要和
indices: [1022, 648, 1025, 366, 208, 792, 199, 1022, 648, 641, 607, 625, 26, 155, 130, 5, 199, 1022, 648, 641]
set() removes duplicate characters, and enumerate() pairs every distinct character with its position, which gives the character-to-index mapping. Each character is then converted to its index, yielding the index sequence corpus_indices.

Define the function load_data_jay_lyrics, which later sections call directly.

def load_data_jay_lyrics():
    with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size

This function bundles all the small steps above. It loads the lyrics from the txt file and returns the index sequence, the character-to-index mapping, the index-to-character mapping, and the vocabulary size.

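Datasets are usually large, so during training the time-series data has to be sampled, reading a random mini-batch of samples and labels each time. Unlike the data in the earlier sections, one sample of time-series data usually contains consecutive characters. Suppose the number of time steps is 5 and the sample is the 5 characters "想", "要", "有", "直", "升". Its label sequence is the next character of each of these characters in the training set, "要", "有", "直", "升", "机". That is, X = "想要有直升" and Y = "要有直升机".

If the sequence has length T and the number of time steps is n, there are T-n valid samples in total. These samples overlap heavily, so a more efficient sampling scheme is normally used. Two such schemes are random sampling and consecutive sampling.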
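
The count of T-n samples and the heavy overlap between them can be seen by enumerating every window of a short sequence. The helper all_windows is hypothetical; the eight characters come from the lyric excerpt printed earlier.

```python
def all_windows(seq, num_steps):
    """Every (sample, label) pair: the label is the sample shifted one step forward."""
    return [(seq[i:i + num_steps], seq[i + 1:i + num_steps + 1])
            for i in range(len(seq) - num_steps)]

seq = list('想要有直升机想要')          # T = 8 characters
pairs = all_windows(seq, num_steps=5)   # n = 5 time steps
print(len(pairs))                       # 3
for x, y in pairs:
    print(''.join(x), '->', ''.join(y))
```

Consecutive windows share four of their five characters, which is the redundancy that random and consecutive sampling avoid by stepping through the sequence in strides of num_steps.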

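Random sampling
Each call samples one random mini-batch from the data. batch_size is the number of samples per mini-batch and num_steps is the number of time steps in each sample. In random sampling every sample is a segment cut from an arbitrary position of the original sequence, so two consecutive mini-batches are not necessarily adjacent in the original sequence.

import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # subtract 1 because for a sequence of length n, X can contain at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping samples
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of each sample's first character
    random.shuffle(example_indices)

    def _data(i):
        # return the sequence of length num_steps starting at i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        # pick batch_size random samples each time
        batch_indices = example_indices[i: i + batch_size]  # first-character indices of the samples in this batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

len(corpus_indices) - 1 is used because the label is the sequence of next characters of the sample, so room must be left for the last label. The returned X is the randomly sampled batch and Y holds the corresponding labels.
Test it on the consecutive integers 0 to 29:

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')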

out：
X: tensor([[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
Y: tensor([[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18]])

X: tensor([[ 0, 1, 2, 3, 4, 5],
[18, 19, 20, 21, 22, 23]])
Y: tensor([[ 1, 2, 3, 4, 5, 6],
[19, 20, 21, 22, 23, 24]])

Consecutive sampling
In consecutive sampling, two consecutive mini-batches are adjacent in the original sequence.

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

Test it:

for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

out：
X: tensor([[ 0, 1, 2, 3, 4, 5],
[15, 16, 17, 18, 19, 20]])
Y: tensor([[ 1, 2, 3, 4, 5, 6],
[16, 17, 18, 19, 20, 21]])

X: tensor([[ 6, 7, 8, 9, 10, 11],
[21, 22, 23, 24, 25, 26]])
Y: tensor([[ 7, 8, 9, 10, 11, 12],
[22, 23, 24, 25, 26, 27]])
The two mini-batches in the output are adjacent in the original sequence.

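3. Recurrent neural networks

The goal of this section is to predict the next character of a sequence from the current input and the past inputs.

A recurrent neural network introduces a hidden variable H. With t denoting the current time step and t-1 the previous one, H_t is the value of H at time step t. As the network diagram shows, H_t is computed from X_t and H_{t-1}, so H_t can be regarded as a record of the sequence up to the current character, and it is used to predict the next character of the sequence.

From the diagram, the recurrent neural network is computed as:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$$
$$O_t = H_t W_{hq} + b_q$$

with shapes X_t (n, d); H_t (n, h); O_t (n, q); W_xh (d, h); W_hh (h, h); W_hq (h, q); b_h (1, h); b_q (1, q)

Recurrent neural network from scratch
First import the required packages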
import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
import d2l_jay9460 as d2l
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

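Characters are represented as vectors, using one-hot vectors.
Why use one-hot vectors? One-hot encoding

One-hot encoding converts a categorical variable into a form that machine learning algorithms can use easily. The vector represents one attribute as a feature vector in which only a single position is active (non-zero) and all others are 0, which makes the features sparse. Take the attribute "gender" with the two values "male" and "female": after one-hot encoding, "male" is encoded as "10" and "female" as "01". If an attribute has m discrete values, its one-hot representation is an m-dimensional vector. Each sample takes exactly one of the values, so the coordinate of that value is 1 and all other coordinates are 0.

The network's output is a prediction of the next character. Suppose the vocabulary size is N and every character corresponds to a unique index from 0 to N-1. The vector of a character is then a vector of length N: if the character's index is i, position i of the vector is 1 and all other positions are 0.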
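
A plain-list version of the encoding just described makes the definition easy to check. This sketch is independent of PyTorch; the function name one_hot_list and the tiny vocabulary size of 5 are assumptions for illustration.

```python
def one_hot_list(index, n_class):
    """Length-n_class vector with a single 1.0 at position `index`."""
    vec = [0.0] * n_class
    vec[index] = 1.0
    return vec

encoded = [one_hot_list(i, n_class=5) for i in [0, 2, 2]]
print(encoded[1])               # [0.0, 0.0, 1.0, 0.0, 0.0]
print([sum(v) for v in encoded])  # [1.0, 1.0, 1.0]
```

Every row sums to 1, and identical indices produce identical rows, matching the behaviour of the tensor-based one_hot function.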

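def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i, 0]] = 1
    return result

Here x is a one-dimensional tensor, x.shape[0] is its number of elements n, and each element is the index of one character. n_class is the vocabulary size and dtype is the type of the returned tensor. The function first creates an all-zero tensor of shape (n, n_class), then uses scatter_ along dimension 1 (the columns) to write 1 at the position given by each index, which produces the one-hot vectors.
view(-1, 1): reshapes x into a column of shape (n, 1); the -1 lets PyTorch infer the first dimension and the second dimension is fixed to 1.

Understanding the axis argument of PyTorch scatter_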

x = torch.tensor([0,1,1])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))

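out:
tensor([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.]])
torch.Size([3, 1027])
tensor([1., 1., 1.])
This shows the result of the one-hot test and its shape.

Each sampled mini-batch has shape (batch size, number of time steps). The function below converts such a mini-batch into several matrices of shape (batch size, vocabulary size); the number of matrices equals the number of time steps.

def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

X = torch.arange(10).view(2, 5)
inputs = to_onehot(X, vocab_size)
print(len(inputs), inputs[0].shape)

out: 5 torch.Size([2, 1027])
X[:, i]: the loop selects one column at a time. For i = 0 it selects the values [0, 5], which are then one-hot encoded.
Next, initialize the model parameters W_xh, W_hh, b_h, W_hq, and b_q.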

num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units is a hyperparameter
# num_outputs: q

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)

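Next define the rnn model. Following the formula, a loop carries out the computation of the recurrent neural network one time step at a time.

def rnn(inputs, state, params):
    # inputs and outputs are both num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

The function init_rnn_state initializes the hidden variable; its return value is a tuple.

def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

As the number of time steps grows, recurrent neural networks are prone to vanishing or exploding gradients, which can make the network almost impossible to train.
Vanishing and exploding gradients
Gradient clipping is used to deal with exploding gradients.
Suppose the gradients of all model parameters are concatenated into one vector g and the clipping threshold is θ. The clipped gradient

$$\min\left(\frac{\theta}{\|g\|}, 1\right) g$$

has an L2 norm of at most θ. The gradient clipping function is:

def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)

norm holds the norm of the gradient. It is initialized to 0, the squares of all gradient entries are summed into it, and the square root is taken at the end. If norm is larger than θ, every gradient is multiplied by θ/norm, which matches the formula.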
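
The same rule on a plain list of numbers shows the effect directly. This is a sketch with made-up gradient values; unlike grad_clipping above, which modifies the parameter gradients in place, the hypothetical helper clip_gradients returns a new list.

```python
import math

def clip_gradients(grads, theta):
    """Rescale grads so their L2 norm is at most theta; leave them unchanged otherwise."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > theta:
        scale = theta / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_gradients([3.0, 4.0], theta=1.0)    # norm 5 exceeds theta, so scale by 1/5
untouched = clip_gradients([0.3, 0.4], theta=1.0)  # norm 0.5 is below theta
print(clipped, untouched)
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way but with a bounded step.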

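Next define the prediction function, which predicts the next num_chars characters from a prefix (a string of several characters). The prefix is processed first, so that the hidden state H records the information in the prefix. While processing the prefix the model has already predicted the next character, so that predicted character can serve as the input at the next time step. This is repeated until num_chars characters have been predicted.

def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]   # output holds the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # use the output of the previous time step as the input of the current one
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # the next input is either the next prefix character or the current best prediction
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])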

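Perplexity is used to evaluate the language model. It is the exponential of the cross-entropy loss, so that:
In the best case, the model always predicts probability 1 for the label class, and the perplexity is 1;
In the worst case, the model always predicts probability 0 for the label class, and the perplexity is positive infinity;
In the baseline case, the model always predicts the same probability for every class, and the perplexity equals the number of classes.

Any useful model must therefore have a perplexity below the number of classes. In this example the perplexity must be smaller than the vocabulary size vocab_size.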
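
The best case and the baseline case can be verified with a few lines of Python. The helper perplexity below takes the probabilities a model assigned to the correct characters; the probability lists are invented, and 1027 is the vocabulary size printed earlier in these notes.

```python
import math

def perplexity(label_probs):
    """exp of the average negative log-probability assigned to the true labels."""
    avg_nll = -sum(math.log(p) for p in label_probs) / len(label_probs)
    return math.exp(avg_nll)

best = perplexity([1.0, 1.0, 1.0])        # always certain and right
baseline = perplexity([1 / 1027] * 4)     # uniform guess over 1027 characters
mixed = perplexity([0.5, 0.25])           # somewhere in between
print(best, baseline, mixed)
```

A model that has learned anything useful assigns more than 1/1027 to the correct character on average, which is why its perplexity falls below the vocabulary size.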

Define the training function. It supports both sampling methods for the time-series data, clips the gradients before updating the parameters, and evaluates the model with perplexity.

def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # with random sampling, initialize the hidden state before every mini-batch update
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # otherwise detach the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs: num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs: num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # after concatenation the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector of
            # shape (num_steps * batch_size,) so that it lines up with the rows of outputs
            y = torch.flatten(Y.T)
            # average classification error computed with the cross-entropy loss
            l = loss(outputs, y.long())

            # zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            d2l.sgd(params, lr, 1)  # the loss is already averaged, so the gradient is not averaged again
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                                        num_hiddens, vocab_size, device, idx_to_char, char_to_idx))

Now test it. The model can be trained with either random sampling or consecutive sampling, and every 50 epochs it writes a piece of lyrics with the current parameters.

num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
# Random sampling
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
# Consecutive sampling
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

Concise PyTorch implementation of the recurrent neural network
Use nn.RNN from PyTorch to build the recurrent layer:

rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)

Define a complete language model based on the recurrent neural network.

class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1)
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)
    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state

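Define the prediction function. The forward computation and the hidden state initialization differ from the from-scratch version above.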

def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                        char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output holds the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward pass no longer needs the parameters passed in
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
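The loop in the prediction function has two phases: while t is still inside the prefix, the next input is the known prefix character, and after that the model's own best guess is fed back in. The sketch below illustrates only this control flow. It assumes a hypothetical toy "model" (the lookup table next_char) in place of the network, and greedy_predict is an illustrative name, not part of the original code.

```python
# Hypothetical stand-in for the network: maps a character to its most likely successor
next_char = {'a': 'b', 'b': 'c', 'c': 'a'}

def greedy_predict(prefix, num_chars, model):
    output = [prefix[0]]
    for t in range(num_chars + len(prefix) - 1):
        # In the real function this call also updates the hidden state
        prediction = model[output[-1]]
        if t < len(prefix) - 1:
            output.append(prefix[t + 1])  # warm-up phase: feed the known prefix
        else:
            output.append(prediction)     # generation phase: feed back the best prediction
    return ''.join(output)

result = greedy_predict('ab', 4, next_char)
print(result)
```

The returned string always has len(prefix) + num_chars characters, which is why the loop runs num_chars + len(prefix) - 1 times after the first prefix character has been placed in output.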

Next, train with consecutive sampling.

def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # consecutive sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # Detach the hidden state from the computation graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else:
                    state.detach_()
            (output, state) = model(X, state)  # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())
            optimizer.zero_grad()
            l.backward()
            grad_clipping(model.parameters(), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(prefix, pred_len, model, vocab_size, device,
                                                idx_to_char, char_to_idx))

Test it:

# Build the language model from the nn.RNN layer defined above
model = RNNModel(rnn_layer, vocab_size).to(device)
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                              corpus_indices, idx_to_char, char_to_idx,
                              num_epochs, num_steps, lr, clipping_theta,
                              batch_size, pred_period, pred_len, prefixes)

Output:
epoch 50, perplexity 9.405654, time 0.52 sec

分开始一起三步四步望著天看星星一颗两颗三颗四颗连成线背著背默默许下心愿一枝杨柳你的那我在
不分开爱情你的手一人的老斑鸠腿短毛不多快使用双截棍哼哼哈兮快使用双截棍哼哼哈兮快使用双截棍
epoch 100, perplexity 1.255020, time 0.54 sec
分开我人了的屋我一定令它心仪的母斑鸠爱像一阵风吹完美主这样还人的太快就是学怕眼口让我碰恨这
不分开不想我多的脑袋有问题随便说说其实我早已经猜透看透不想多说只是我怕眼泪撑不住不懂你的黑色幽默
epoch 150, perplexity 1.064527, time 0.53 sec
分开我轻外的溪边默默在一心抽离有话不知不觉一场悲剧我对不起藤蔓植物的爬满了伯爵的坟墓古堡里
不分开不想不多的脑有教堂有你笑我有多烦恼没有你烦有有样别怪走快后悔没说你我不多难熬我想就
epoch 200, perplexity 1.033074, time 0.53 sec
分开我轻外的溪边默默在一心向昏的愿古无着我只能一个黑远这想太久这样我不要再是你打我妈妈
不分开你只会我一起睡著样娘子却只想你和汉堡我想要你的微笑每天都能看到我知道这里很美但家乡的你更美
epoch 250, perplexity 1.047890, time 0.68 sec
分开我轻多的漫却已在你人演想要再直你我想要这样牵着你的手不放开爱可不可以简简单单没有伤害你
不分开不想不多的假已无能为力再提起决定中断熟悉然后在这里不限日期然后将过去慢慢温习让我爱上
