文章目录

自述
有用的工具
代码出处
目录
代码 Load and trim data
- 类 class
- _ _ init _ _ 初始化实例变量
- for word in sentence.split(' ')
- addWord
- trim
- unicodetoAscii
- normalizeString
- readVocs
- filterPair & filterPairs
- trimRareWords

自述

我是编程小白，别看注册时间长，但从事的不是coding工作，为了学AI才开始自学Python。
平时就是照着书上敲敲代码，并没有深刻理解。现在想要研究chatbot了，才发现自己的coding水平急需加强，所以开这个系列记录自己一行行扣代码的过程。当然这不是从0开始的，只是把自己不理解的写出来，将来也可以作为资料备查。

最后还要重申一下，我没有系统学过编程，写这个系列就是想突破自己，各位大神请不吝赐教！

有用的工具

可以视觉化代码的网站https://pythontutor.com/visualize.html

代码出处

Pytorch的CHATBOT TUTORIAL

https://pytorch.org/tutorials/beginner/chatbot_tutorial.html?highlight=gpu%20training

一步步读懂Pytorch Chatbot Tutorial代码(一) - 加载和预处理数据
一步步读懂Pytorch Chatbot Tutorial代码(二) - 数据处理
一步步读懂Pytorch Chatbot Tutorial代码(三) - 创建字典
一步步读懂Pytorch Chatbot Tutorial代码(四) - 为模型准备数据
一步步读懂Pytorch Chatbot Tutorial代码(五) - 定义模型

代码 Load and trim data

我们的下一个任务是创建词汇表并将查询/响应句子对加载到内存中。

注意，我们处理的是单词的序列，它们没有隐式映射到离散的数值空间。因此，我们必须通过将数据集中遇到的每个唯一单词映射到一个索引值来创建一个索引。

为此，我们定义了一个Voc类，它保存从单词到索引的映射、索引到单词的反向映射、每个单词的计数和总单词计数。该类提供了向词汇表中添加单词(addWord)、在句子中添加所有单词(addSentence)和trimming不常见单词(trim)的方法。稍后会有更多关于trimming的内容。

# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence tokenclass Voc:def __init__(self, name):self.name = nameself.trimmed = Falseself.word2index = {}self.word2count = {}self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}self.num_words = 3  # Count SOS, EOS, PADdef addSentence(self, sentence):for word in sentence.split(' '):self.addWord(word)def addWord(self, word):if word not in self.word2index:self.word2index[word] = self.num_wordsself.word2count[word] = 1self.index2word[self.num_words] = wordself.num_words += 1else:self.word2count[word] += 1# Remove words below a certain count thresholddef trim(self, min_count):if self.trimmed:returnself.trimmed = Truekeep_words = []for k, v in self.word2count.items():if v >= min_count:keep_words.append(k)print('keep_words {} / {} = {:.4f}'.format(len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)))# Reinitialize dictionariesself.word2index = {}self.word2count = {}self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}self.num_words = 3 # Count default tokensfor word in keep_words:self.addWord(word)

现在我们可以组合我们的词汇和查询/回答句子对。在准备使用这些数据之前，我们必须执行一些预处理。

首先，我们必须使用unicodeToAscii将Unicode字符串转换为ASCII。接下来，我们应该将所有字母转换为小写字母，并修剪除基本标点符号(normalizeString)之外的所有非字母字符。最后，为了帮助训练收敛，我们将过滤掉长度大于MAX_LENGTH阈值的句子(filterPairs)。

MAX_LENGTH = 10  # Maximum sentence length to consider# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):return ''.join(c for c in unicodedata.normalize('NFD', s)if unicodedata.category(c) != 'Mn')# Lowercase, trim, and remove non-letter characters
def normalizeString(s):s = unicodeToAscii(s.lower().strip())s = re.sub(r"([.!?])", r" \1", s)s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)s = re.sub(r"\s+", r" ", s).strip()return s# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):print("Reading lines...")# Read the file and split into lineslines = open(datafile, encoding='utf-8').\read().strip().split('\n')# Split every line into pairs and normalizepairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]voc = Voc(corpus_name)return voc, pairs# Returns True iff both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):# Input sequences need to preserve the last word for EOS tokenreturn len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH# Filter pairs using filterPair condition
def filterPairs(pairs):return [pair for pair in pairs if filterPair(pair)]# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):print("Start preparing training data ...")voc, pairs = readVocs(datafile, corpus_name)print("Read {!s} sentence pairs".format(len(pairs)))pairs = filterPairs(pairs)print("Trimmed to {!s} sentence pairs".format(len(pairs)))print("Counting words...")for pair in pairs:voc.addSentence(pair[0])voc.addSentence(pair[1])print("Counted words:", voc.num_words)return voc, pairs# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:print(pair)

类 class

类是抽象的模板, 详细内容可以看廖雪峰的介绍 https://www.liaoxuefeng.com/wiki/1016959663602400/1017496031185408
也可以参考这个视频 https://www.bilibili.com/video/BV1N5411t7EH?from=search&seid=3094138607375985455&spm_id_from=333.337.0.0

第一段代码Visualize：

第二段代码Visualize：

_ _ init _ _ 初始化实例变量

__init__()的作用是初始化某个类的一个实例

        self.name = name  self.trimmed = Falseself.word2index = {}  #单词到索引的映射self.word2count = {}  #单词出现的次数#索引到单词的映射 index2wordself.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}  #  索引到单词映射self.num_words = 3  # Count SOS, EOS, PAD  计算token字符的数量，起始为3

for word in sentence.split(’ ')

根据空格来提取每个单词，举个栗子：

a='Good morning everyone!'
b=a.split(' ')
b['Good', 'morning', 'everyone!']

self.addWord(word) 原来还可以这样调用类中的其他函数!!

addWord

伪代码如下：
假如上面sentence中第一个单词（Good）不在word2index 字典中：
字典word2index = {Good : 3}
字典word2count= {Good: 1} #代表Good单词出现了1次
字典index2word = {3 : Good}
num_words 自+1 # 总共出现词的个数
或者： #单词出现在word2index字典中，
字典word2count 自+1

        if word not in self.word2index:self.word2index[word] = self.num_wordsself.word2count[word] = 1self.index2word[self.num_words] = wordself.num_words += 1else:self.word2count[word] += 1

trim

这段函数目的是删除低频词，即低于min_count的token。由于低频词删除后，单词和索引的映射还是旧的，所以后半段代码重新初始化字典，生成新的单词和索引的映射以及其他字典。

没搞懂这里为什么会出现这段代码。

        if self.trimmed:  returnself.trimmed = True

self.word2count.items() ：调用字典中的键值对，举个栗子：

a={'one':1,'two':2}
a.items()dict_items([('one', 1), ('two', 2)])

{:.4f} 取小数点后四位

a=3
b=4
print('a/b等于{:.4f}'.format(a/b))a/b等于0.7500

unicodetoAscii

unicodedata函数更多介绍参考https://blog.csdn.net/weixin_43866211/article/details/98384017

使用unicodeToAscii将Unicode字符串转换为ASCII. 举个栗子：（中文字符一般不需要进行这个操作）

def unicodeToAscii(s):return ''.join(c for c in unicodedata.normalize('NFD', s)if unicodedata.category(c) != 'Mn')
unicodeToAscii('Bonjour, ma chère.')'Bonjour, ma chere.'

normalizeString

所有大写字母改小写，删减空白以及非字母的字符

s.lower().strip() 大写字母改小写并按空格分隔

s.re.sub() 详细介绍参考 https://blog.csdn.net/jackandsnow/article/details/103885422

re.sub的函数原型为：re.sub(pattern, repl, string, count) 其中第二个函数是替换后的字符串；
第四个参数指替换个数。默认为0，表示每个匹配项都替换

其中的 r 是告诉编译器这是string，不要转译backslash

举个栗子：

其中 is 和 a 前面放了两个空格

s="This  is  a good question. I ##will $$$find [a] soluation for you ASAP!!! Do you have any more questions? "
s = unicodeToAscii(s.lower().strip())
s
Ouput:
'this  is  a good question. i ##will $$$find [a] soluation for you asap!!! do you have any more questions?'

在 . ! ? 三个符号前面加空格，后面的 \1 代表第一个加括号（bracketed）的group

s = re.sub(r"([.!?])", r" \1", s)
s
Output:
'this  is  a good question . i ##will $$$find [a] soluation for you asap ! ! ! do you have any more questions ?'

^ 代表取非a-zA-z.!?字符 + 代表多连续字符，将多余字符用空格代替

s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
s'this is a good question . i will find a soluation for you asap ! ! ! do you have any more questions ?'

将多余空格详细\s介绍参考 https://blog.csdn.net/weixin_40426830/article/details/108743258

s = re.sub(r"\s+", r" ", s).strip()
s'this is a good question . i will find a soluation for you asap ! ! ! do you have any more questions ?'

readVocs

\ 反斜杠后直接回车即可实现续行

strip() 方法用于移除字符串头尾指定的字符（默认为空格或换行符）或字符序列

split() 通过指定分隔符对字符串进行切片，如果参数 num 有指定值，则分隔 num+1 个子字符串

lines = open(datafile, encoding='utf-8').\read().strip().split('\n')
print(lines)
Output:
["Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.","Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.","Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?","You're asking me out.  That's so cute. What's your name again?\tForget it.","No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.","Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.",

normalizeString(s) for s in l.split('\t')] for l in lines 列表推导式层层分析如下：

可以看出这里将lines中的每段对白循环迭代给 l （由此我才发现，由于是电影台词的关系，问答对的句子并不是简单的一问一答。）

for l in lines: print lprint('\n')Output:Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again. Well, I thought we'd start with pronunciation, if that's okay with you.Well, I thought we'd start with pronunciation, if that's okay with you.  Not the hacking and gagging and spitting part.  Please.Not the hacking and gagging and spitting part.  Please.  Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?You're asking me out.  That's so cute. What's your name again? Forget it.

for s in l.split('\t') 结果出乎我的意料，竟然只有一对对话内容，而且重新运行后内容不一样。
下面我会。
\t 代表4个空格即TAB

for s in l.split('\t'):print(s)Output:Most people can't hear me with the whole orchestra playing.  You're good.
I don't have to take abuse from you.  I have other people dying to give it to me.

换个方式来解释上面这句：这里就符合我们的预期

lines[0]
"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you."
lines[0].split('\t')
['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',"Well, I thought we'd start with pronunciation, if that's okay with you."]

voc = Voc(corpus_name)中 corpus_name 即 Voc类中的self.name

filterPair & filterPairs

filterPairs : 判断问答对单词是否低于MAX_LENGTH的阈值

MAX_LENGTH = 10  # Maximum sentence length to consider
# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):# Input sequences need to preserve the last word for EOS tokenreturn len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

直观一点：

pairs[0][0]Output:
'can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .'pairs[0][0].split()Output:
['can','we','make','this','quick','?','roxanne','korrine','and','andrew','barrett','are','having','an','incredibly','horrendous','public','break','up','on','the','quad','.','again','.']len(pairs[0][0].split())<10Output:
False

filterPairs : 利用filterPair函数来筛选pairs里面的句子, 将所有长度低于10 的句子放入pair

# Filter pairs using filterPair condition
def filterPairs(pairs):return [pair for pair in pairs if filterPair(pair)]

之后根据上面两个函数，获得新的句子对Pairs , 总数64271。

pairs = filterPairs(pairs)
print("Read {!s} sentence pairs".format(len(pairs)))Output:
Read 64271 sentence pairs

其余函数和上面写的有重复，可自行理解。

最终得到结果与网页一致：

Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 64271 sentence pairs
Counting words...
Counted words: 18008pairs:
['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']

trimRareWords

另一个有助于在训练期间更快地实现衔接的策略是从我们的词汇表中删去很少使用的单词。减少特征空间也将降低模型必须学习近似的函数的难度。我们将分两步进行：

使用voc.trim函数修剪单词，使其低于MIN_COUNT阈值。
用修剪过的单词过滤掉对。

voc.trim 根据上面写的函数介绍，删掉低频词，即出现次数低于3次的词。

MIN_COUNT = 3    # Minimum word count threshold for trimmingdef trimRareWords(voc, pairs, MIN_COUNT):# Trim words used under the MIN_COUNT from the vocvoc.trim(MIN_COUNT)# Filter out pairs with trimmed wordskeep_pairs = []for pair in pairs:input_sentence = pair[0]output_sentence = pair[1]keep_input = Truekeep_output = True# Check input sentencefor word in input_sentence.split(' '):if word not in voc.word2index:keep_input = Falsebreak# Check output sentencefor word in output_sentence.split(' '):if word not in voc.word2index:keep_output = Falsebreak# Only keep pairs that do not contain trimmed word(s) in their input or output sentenceif keep_input and keep_output:keep_pairs.append(pair)print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))return keep_pairs# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

结果与网页一致

keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total

一步步读懂Pytorch Chatbot Tutorial代码(三) - 创建字典相关推荐

一步步读懂Pytorch Chatbot Tutorial代码(四) - 为模型准备数据
文章目录自述有用的工具代码出处目录头大代码及说明 Prepare Data for Models 重点关注 indexesFromSentence zeroPadding binaryMa ...
一步步读懂Pytorch Chatbot Tutorial代码(二) - 数据处理
文章目录自述代码出处目录代码 Create formatted data file (为了方便理解,把代码的顺序略微改一下, 此章节略长.) 1. `loadLines` 将文件的每一行拆分为 ...
加拿大11年级计算机课程代码,如何读懂加拿大高中课程代码？
原标题:如何读懂加拿大高中课程代码? 加拿大高中的课程(以安省为例)都是以标准的代码来选课的,这套代码也是申请大学时通用的.大学在录取学生的时候,对某些专业就要求高中对应的某些课程,比如数学按难易程度 ...
读懂Android中的代码混淆
本文为本人的一些实践总结,介绍一些混淆的知识和注意事项.希望可以帮助大家更好的学习和使用代码混淆. 什么是混淆关于混淆维基百科上该词条的解释为代码混淆(Obfuscated code)亦称花指令, ...
读懂 Android 中的代码混淆
在Android开发工作中,我们都或多或少接触过代码混淆.比如我们想要集成某个SDK,往往需要做一些排除混淆的操作. 本文为本人的一些实践总结,介绍一些混淆的知识和注意事项.希望可以帮助大家更好的学习 ...
android混淆成不可见字符,读懂 Android 中的代码混淆
在Android开发工作中,我们都或多或少接触过代码混淆.比如我们想要集成某个SDK,往往需要做一些排除混淆的操作. 本文为本人的一些实践总结,介绍一些混淆的知识和注意事项.希望可以帮助大家更好的学习 ...
如何快速读懂一个后端系统代码
如何快速看懂一个后端系统代码当一个java开发新手拿到一个系统代码并且没有人给你讲基本的需求与功能时,如何快速读懂代码了解功能是提高工作效率的必备技能.我作为一个参加工作一年的Java小菜是如何做的 ...
【观察】读懂维谛技术的三组关键词：新生与传承、迭代与进化、生态与赋能...
申耀的科技观察读懂科技,赢取未来! 硅谷精神教父.科技商业预言家凯文·凯利在<未来20年商业趋势与中国机遇>演讲中说:"你看那边是互联网+,我们知道要走到那边去,但是要摆脱我们 ...
一文读懂PyTorch张量基础（附代码）
作者:Matthew Mayo, KDnuggets 翻译:和中华校对:丁楠雅本文约1000字,建议阅读5分钟. 本文介绍了PyTorch Tensor最基础的知识以及如何跟Numpy的ndarr ...

一步步读懂Pytorch Chatbot Tutorial代码(三) - 创建字典