[NLP Beginner Series] Loading and Preprocessing Data: The Cornell Movie-Dialogs Corpus as an Example

Author: Yirong Chen from South China University of Technology
My CSDN Blog: https://blog.csdn.net/m0_37201243
My Homepage: http://www.yirongchen.com/

Dependencies:

  • Python: 3.6.9

Reference sites:

  • https://pytorch.org/
  • http://pytorch123.com/
  • http://pytorch123.com/FifthSection/Chatbot/
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math

Example 1: The Cornell Movie-Dialogs Corpus

The Cornell Movie-Dialogs Corpus is a rich dataset of movie-character dialogue:

  • 220,579 conversational exchanges between 10,292 pairs of movie characters
  • 9,035 characters from 617 movies
  • 304,713 total utterances

This dataset is large and diverse, with great variation in language formality, time period, sentiment, and so on. Our hope is that this diversity makes the model robust to many forms of inputs and queries.

1. Download the dataset

### Download the dataset
import os
import requests

print("Downloading the Cornell Movie-Dialogs Corpus dataset...")
data_url = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"
path = "./data/"
if not os.path.exists(path):
    os.makedirs(path)
res = requests.get(data_url)
with open("./data/cornell_movie_dialogs_corpus.zip", "wb") as fp:
    fp.write(res.content)
print("Cornell Movie-Dialogs Corpus dataset downloaded!")
Downloading the Cornell Movie-Dialogs Corpus dataset...
Cornell Movie-Dialogs Corpus dataset downloaded!
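If the notebook is re-run, the download above repeats from scratch and holds the whole zip in memory. The snippet below is an optional sketch, not part of the original post (zip_path and the chunk size are illustrative names): it skips an existing file and streams the response to disk instead.

# Optional sketch: skip re-downloading and stream the zip to disk in chunks
import os
import requests

zip_path = "./data/cornell_movie_dialogs_corpus.zip"
if os.path.exists(zip_path):
    print("Zip already exists, skipping download.")
else:
    with requests.get(data_url, stream=True) as res:
        res.raise_for_status()                      # fail early on HTTP errors
        with open(zip_path, "wb") as fp:
            for chunk in res.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
                fp.write(chunk)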

2. Unzip the dataset

import time
import zipfile

srcfile = "./data/cornell_movie_dialogs_corpus.zip"
file = zipfile.ZipFile(srcfile, 'r')
file.extractall(path)
print('Finished extracting cornell_movie_dialogs_corpus.zip!')
print("The Cornell Movie-Dialogs Corpus dataset consists of the following files:")
corpus_file_list = os.listdir("./data/cornell movie-dialogs corpus")
print(corpus_file_list)
Finished extracting cornell_movie_dialogs_corpus.zip!
The Cornell Movie-Dialogs Corpus dataset consists of the following files:
['formatted_movie_lines.txt', 'chameleons.pdf', '.DS_Store', 'README.txt', 'movie_conversations.txt', 'movie_lines.txt', 'raw_script_urls.txt', 'movie_characters_metadata.txt', 'movie_titles_metadata.txt']

3. Inspect a few lines from each file in the dataset

def printLines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

# corpus_name and corpus are reused below when building the formatted data file
corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join("data", corpus_name)

corpus_file_list = os.listdir("./data/cornell movie-dialogs corpus")
for file_name in corpus_file_list:
    file_dir = os.path.join("./data/cornell movie-dialogs corpus", file_name)
    print(file_dir, "- first 10 lines:")
    printLines(file_dir)

The output of this step is omitted from the blog post.

Note: movie_lines.txt is the key data file. In practice, when we come across a dataset we can usually find a description of it on its official site, at its source, or in the accompanying paper; at the very least we should know which files make up the dataset.
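For reference, each record in movie_lines.txt is a single line whose fields are separated by " +++$+++ ". Below is a minimal sketch of splitting one such record into the five fields used later; the sample line is illustrative.

# Illustrative only: split one raw movie_lines.txt record into its five fields
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = ["lineID", "characterID", "movieID", "character", "text"]
record = dict(zip(fields, sample.split(" +++$+++ ")))
print(record["lineID"], "->", record["text"])   # L1045 -> They do not!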

4. Create a formatted data file

The following functions make it easier to parse the raw movie_lines.txt data file.

  • loadLines: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
  • loadConversations: groups the line fields from loadLines into conversations based on movie_conversations.txt
  • extractSentencePairs: extracts pairs of sentences from the conversations
# Split each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines

# Group the line fields from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            lineIds = eval(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations

# Extract pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs

Note: The following code uses the functions defined above to create the formatted data file.

import csv
import codecs

# Define the path of the new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")
delimiter = '\t'
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize the lines dict, conversations list, and field IDs
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write the new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines from the file
print("\nSample lines from file:")
printLines(datafile)
Processing corpus...

Loading conversations...

Writing newly formatted file...

Sample lines from file:
b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
b"Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\n'

5. Load and clean the data

# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

We create a Voc class that stores a mapping from words to indexes, a reverse mapping from indexes to words, a count of each word, and the total word count. The class provides methods for adding a word to the vocabulary (addWord), adding all the words in a sentence (addSentence), and trimming infrequently seen words (trim). More data cleaning follows later.

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    # Add every word in a sentence to the vocabulary
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    # Add a single word to the vocabulary
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True
        keep_words = []
        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)
        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)))
        # Reinitialize the dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count default tokens
        for word in keep_words:
            self.addWord(word)
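As a quick sanity check, the class can be exercised on a toy sentence. This snippet is not in the original post; toy_voc is an illustrative name.

# Illustrative usage of the Voc class on a toy sentence
toy_voc = Voc("toy")
toy_voc.addSentence("hello world hello")
print(toy_voc.num_words)            # 5: PAD, SOS, EOS, "hello", "world"
print(toy_voc.word2index["hello"])  # 3, the first index after the default tokens
print(toy_voc.word2count["hello"])  # 2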

We use unicodeToAscii to convert Unicode strings to ASCII. Next, we convert all letters to lowercase and strip out all non-letter characters except basic punctuation (normalizeString). Finally, to help training converge, we filter out sentences longer than the MAX_LENGTH threshold (filterPairs).

# Turn a Unicode string into plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# normalizeString standardizes the text: lowercase, trim, and strip non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

MAX_LENGTH = 10  # Maximum sentence length to consider

# Initialize a Voc object and read the formatted pairs into a list
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True iff both sentences in pair 'p' are below the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for the EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Keep only the pairs that satisfy filterPair
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs

# Load/assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)

# Print some pairs for validation
print("\npairs:")
for pair in pairs[:10]:
    print(pair)
Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 63446 sentence pairs
Counting words...
Counted words: 17774

pairs:
['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', ' the real you . ']
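To make the effect of normalizeString concrete, here is a small illustrative check; the sample string is made up, not taken from the corpus.

# Illustrative: normalizeString lowercases, pads punctuation, and strips other characters
print(normalizeString("Aren't   you CLEVER?!"))   # -> "aren t you clever ? !"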

Another strategy that helps training converge faster is trimming rarely used words out of the vocabulary. Shrinking the feature space also makes the target function easier for the model to learn. We do this in two steps:

  • Trim words whose count falls below the MIN_COUNT threshold using the voc.trim function.
  • Filter out any pair that contains a trimmed word.
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used fewer than MIN_COUNT times from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check the input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check the output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break
        # Only keep pairs that contain no trimmed words in either sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)
    print("Trimmed from {} pairs to {}, {:.4f} of total".format(
        len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs

# Trim the voc and the pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)
keep_words 7706 / 17771 = 0.4336
Trimmed from 63446 pairs to 52456, 0.8268 of total
print("pairs类型:", type(pairs))
print("pairs的Size:", len(pairs))
print("pairs前10个元素:", pairs[0:10])
pairs类型: <class 'list'>
pairs的Size: 52456
pairs前10个元素: [['there .', 'where ?'], ['you have my word . as a gentleman', 'you re sweet .'], ['hi .', 'looks like things worked out tonight huh ?'], ['have fun tonight ?', 'tons'], ['well no . . .', 'then that s all you had to say .'], ['then that s all you had to say .', 'but'], ['but', 'you always been this selfish ?'], ['do you listen to this crap ?', 'what crap ?'], ['what good stuff ?', ' the real you . '], ['wow', 'let s go .']]

Note: In practice, in Python, data cleaning almost always ends with the data held as a list of samples, right before everything is converted into numbers:

[
[sample 1],
[sample 2],
[sample 3],
...,
[sample n],
]
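As a preview of that final "convert to numbers" step, each cleaned sentence can be mapped to a list of word indexes via voc.word2index, with EOS_token appended. This is a minimal sketch, not part of this post's pipeline; indexesFromSentence is an illustrative helper.

# Minimal sketch: map one cleaned pair to lists of word indexes
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

sample_pair = pairs[0]                    # e.g. ['there .', 'where ?']
print(indexesFromSentence(voc, sample_pair[0]))
print(indexesFromSentence(voc, sample_pair[1]))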

[About the Author] Chen Yirong is a PhD candidate at the Guangdong Engineering Technology Research Center for Human Body Data Science, School of Electronic and Information Engineering, South China University of Technology, and a reviewer for IEEE Access and IEEE Photonics Journal. He has twice won first prize in the Mathematical Contest in Modeling (MCM), as well as first prize in the 2017 China Undergraduate Mathematical Contest in Modeling (Guangdong Division) and first prize in the 2018 Guangdong Undergraduate Electronic Design Contest, among other awards. He led a 2017-2019 national undergraduate innovation training project that concluded with an "excellent" rating, took part in two Guangdong undergraduate science and technology innovation cultivation projects and one 2018-2019 national undergraduate innovation training project that concluded with a "good" rating, has published 4 SCI papers, holds 8 granted utility-model patents, and has 13 invention patent applications under review.
My Homepage
My Github
My CSDN Blog
My Linkedin
