A Simple English Question-Answering System (glove.6B Word Vectors)

1. Read the JSON file and convert it to a dict

# Calling readlines() on the file would return a list of raw text lines,
# so parse the file with json.load() to get a dict instead.
import json
with open('train1-v2.0.json', encoding='utf-8') as f:
    doc = json.load(f)

2. Collect all questions and answers into question_list and answer_list

question_list = []
answer_list = []
for data in doc['data']:
    for paragraphs in data['paragraphs']:
        for qas in paragraphs['qas']:
            question_list.append(qas['question'])
            if len(qas['answers']) == 0:
                # Unanswerable question: store a blank placeholder
                answer_list.append(' ')
            else:
                # Keep only the first answer so both lists stay aligned
                answer_list.append(qas['answers'][0]['text'])
print(question_list, answer_list)
# Assert that question_list and answer_list contain the same number of items
assert len(question_list) == len(answer_list)
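
As a minimal sketch of the nested structure this loop walks, the snippet below applies the same flattening to a tiny hand-made dictionary. The sample data and the helper name flatten_squad are assumptions for illustration; only the keys data, paragraphs, qas, question, answers, and text come from the code above. This variant keeps only the first answer per question, so the two lists always have the same length.

```python
def flatten_squad(doc):
    """Return parallel lists of questions and answers from a SQuAD-style dict."""
    questions, answers = [], []
    for article in doc['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                questions.append(qa['question'])
                # Unanswerable questions have an empty 'answers' list
                if qa['answers']:
                    answers.append(qa['answers'][0]['text'])
                else:
                    answers.append(' ')
    return questions, answers

# Hand-made sample, not taken from the real dataset
sample_doc = {'data': [{'paragraphs': [{'qas': [
    {'question': 'Who wrote it?', 'answers': [{'text': 'Alice'}]},
    {'question': 'Why?', 'answers': []},
]}]}]}

sample_questions, sample_answers = flatten_squad(sample_doc)
print(sample_questions)
print(sample_answers)
```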

3. Count the total number of words and the number of distinct words in question_list

# Tokenize with the word_tokenize function from the nltk library
from nltk.tokenize import word_tokenize
word_in_train = {}
for x in question_list:
    for x1 in word_tokenize(x.rstrip('?')):
        if x1 in word_in_train:
            word_in_train[x1] += 1
        else:
            word_in_train[x1] = 1
print(word_in_train)
word_num = sum(word_in_train.values())
word_different = len(word_in_train.keys())
print(word_num, word_different)
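
The same counts can be gathered with collections.Counter from the standard library. The sketch below is an alternative, not the code used above: it splits on whitespace instead of calling word_tokenize so that it runs without NLTK data, and the two sample questions are made up.

```python
from collections import Counter

# Made-up questions for illustration
sample_questions = ['When did it start?', 'When did it end?']

counts = Counter()
for question in sample_questions:
    # Drop the trailing question mark, then split on whitespace
    counts.update(question.rstrip('?').split())

total_words = sum(counts.values())   # total number of tokens
distinct_words = len(counts)         # number of different tokens
print(counts.most_common(3))
print(total_words, distinct_words)
```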

4. Visualization

import matplotlib.pyplot as plt
dictionary = dict(sorted(word_in_train.items(),key=lambda x:x[1],reverse=True))
x_axis = list(dictionary.keys())
y_axis = list(dictionary.values())
plt.plot(x_axis[:20],y_axis[:20])
plt.show()

5. Text preprocessing

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Build the stop-word list, keeping the question words that carry meaning here
stopwords = stopwords.words('english')
stopword = ['when','how','what','where']
stopwords = [word for word in stopwords if word not in stopword]
# Build the low-frequency word list (words that occur fewer than 2 times)
low_frequency_words = [word for word in word_in_train.keys() if word_in_train[word]<2]
# A set makes the membership test inside text_processing fast
delete_word = set(stopwords + low_frequency_words)

def text_processing(question):
    every_question_word = list()
    question = question.lower()
    for word in word_tokenize(question):
        # Strip punctuation inside a token, but keep a token that is itself punctuation
        word = ''.join(letters for letters in word if letters not in string.punctuation) if word not in string.punctuation else word
        # Map every number to a single placeholder token
        word = '#number' if word.isdigit() else word
        if word not in delete_word:
            every_question_word.append(lemmatizer.lemmatize(word))
    return every_question_word
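
To see what each preprocessing step does without downloading NLTK data, here is a simplified sketch. It is not the function above: it tokenizes with a regular expression, uses a tiny made-up stop-word set (SAMPLE_STOP_WORDS), and skips lemmatization and low-frequency filtering.

```python
import re

# Made-up stop-word set for illustration; the article uses NLTK's English list
SAMPLE_STOP_WORDS = {'did', 'the', 'a', 'in'}

def simple_preprocess(question):
    tokens = []
    # Runs of word characters become one token; any other non-space character stands alone
    for token in re.findall(r"\w+|[^\w\s]", question.lower()):
        if token.isdigit():
            token = '#number'  # collapse every number into one placeholder
        if token not in SAMPLE_STOP_WORDS:
            tokens.append(token)
    return tokens

result = simple_preprocess('When did the band release 2 albums?')
print(result)
```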

6. Store the preprocessed corpus questions in the list all_question_word and build the inverted index index_inverted

all_question_word = list()
index_inverted = dict()
# enumerate gives each question's position directly; question_list.index(question)
# would rescan the list and return the first match for duplicate questions
for index, question in enumerate(question_list):
    preprocessed_word = text_processing(question)
    all_question_word.append(preprocessed_word)
    for word in preprocessed_word:
        if word not in index_inverted:
            index_inverted.update({word: [index]})
        else:
            index_inverted[word].append(index)
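
A small sketch of how the inverted index narrows the search: each word maps to the positions of the questions that contain it, and a query only needs the union of the lists of its own words. The token lists below are made up, and defaultdict is used here as a shorter alternative to the explicit membership test above.

```python
from collections import defaultdict

# Made-up preprocessed questions
sample_tokens = [
    ['when', 'beyonce', 'start'],
    ['what', 'city', 'born'],
    ['when', 'city', 'founded'],
]

inverted = defaultdict(list)
for position, tokens in enumerate(sample_tokens):
    for token in set(tokens):  # set() avoids listing a question twice for a repeated word
        inverted[token].append(position)

query = ['when', 'unknownword']
candidates = set()
for token in query:
    # .get returns [] for unseen words and does not create an empty entry
    candidates.update(inverted.get(token, []))
print(sorted(candidates))
```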

7. Load the glove.6B word vectors

# The GloVe file is UTF-8 encoded, so pass the encoding explicitly instead of relying on the platform default
glove_word = list()
glove_vector = list()
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as glove:
    for line in glove:  # each line is a string: a word followed by its vector components
        lines = line.split()
        glove_word.append(lines[0])     # the word (str)
        glove_vector.append(lines[1:])  # the components (list of str)
# Note: the entries of glove_vector are still strings and are converted to float later
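
Storing the vectors in a dictionary keyed by word avoids searching a list for every lookup and converts the numbers to floats once, while loading. The sketch below assumes the same file layout (a word followed by space-separated numbers on each line) and reads from an in-memory sample with made-up 3-dimensional vectors instead of the real file; load_glove is a hypothetical helper name.

```python
import io
import numpy as np

def load_glove(lines):
    """Map each word to a float vector; `lines` is any iterable of text lines."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype='float32')
    return vectors

# Made-up 3-dimensional vectors standing in for glove.6B.100d.txt
sample_file = io.StringIO('when 0.1 0.2 0.3\nstart -0.5 0.0 0.5\n')
glove_sample = load_glove(sample_file)
print(glove_sample['start'])
```

With the real file, the same function could be given an open file object, for example the one returned by open('glove.6B.100d.txt', encoding='utf-8').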

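8. Preprocess the real question, then use the inverted index to filter the candidate questions into part_question_word. Represent each candidate as a GloVe vector in part_question_vector and compute its cosine similarity with real_question_vector, the GloVe vector of the real question. (Note: normalize the vectors with np.linalg.norm first; otherwise the dot product is not a cosine similarity and the ranking is wrong.) A PriorityQueue keeps the top 5 questions in unsorted order; sort them, then output the answers to the top 5 questions in order.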

import numpy as np
from queue import PriorityQueue

real_question = 'when did beyonce start becoming popular?'
real_question_word = text_processing(real_question)
filter_index = set()
for word in real_question_word:
    # Skip words that never occur in the corpus instead of raising KeyError
    filter_index.update(index_inverted.get(word, []))

# Word -> row lookup; glove_word.index(word) would scan the whole list on every call
glove_index = {word: i for i, word in enumerate(glove_word)}

part_question_vector = list()
part_question_word = list()
part_question_index = list()
for index in filter_index:
    one_question_word = all_question_word[index]
    one_question_vector = np.array([glove_vector[glove_index[word]] for word in one_question_word if word in glove_index]).astype('float')
    if len(one_question_vector) == 0:
        continue  # no word of this question has a GloVe vector
    part_question_vector.append(one_question_vector.mean(axis=0))
    part_question_word.append(one_question_word)
    part_question_index.append(index)
part_question_vector = np.array(part_question_vector)
# Scale every row to unit length so that a dot product equals the cosine similarity
part_question_vector = part_question_vector/np.linalg.norm(part_question_vector,axis=1,keepdims=True)
real_question_vector_matrix = np.array([glove_vector[glove_index[word]] for word in real_question_word if word in glove_index]).astype('float')
real_question_vector = real_question_vector_matrix.mean(axis=0)
real_question_vector = real_question_vector/np.linalg.norm(real_question_vector)
cosine_similarity = part_question_vector.dot(real_question_vector)

# Min-heap of (similarity, position): removing the smallest item keeps the 5 best
top_question = PriorityQueue()
for i in range(len(cosine_similarity)):
    top_question.put((cosine_similarity[i], i))
    if len(top_question.queue) > 5:
        top_question.get()
top_question = top_question.queue
sorted_top = sorted(top_question, key=lambda t: t[0], reverse=True)
sorted_top_question = [(similarity, part_question_word[i]) for similarity, i in sorted_top]
answers = [answer_list[part_question_index[i]] for similarity, i in sorted_top]
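
The ranking step can be checked on toy data. The sketch below uses made-up 2-dimensional vectors, normalizes every vector to unit length so that a dot product equals the cosine similarity, and selects the best matches with heapq.nlargest, a standard-library alternative to the PriorityQueue loop above.

```python
import heapq
import numpy as np

# Made-up vectors: candidate 0 points in the same direction as the query
candidates = np.array([[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]])
query = np.array([2.0, 0.0])

# Divide each vector by its length so that dot products are cosine similarities
unit_candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
similarities = unit_candidates.dot(unit_query)

# Positions of the 2 most similar candidates, best first
top2 = heapq.nlargest(2, range(len(similarities)), key=lambda i: similarities[i])
print(top2, similarities[top2])
```

Without the normalization, the raw dot products would be 2, 6, and 0, which ranks candidate 1 first even though candidate 0 has exactly the same direction as the query.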

Output

sorted_top_question
[(1.0, ['when', 'beyonce', 'start', 'popular', '?']), (0.9513611487104962, ['when', 'beyonce', 'first', 'make', 'time', '#number', 'list', '?']), (0.9496347309803, ['what', 'beyonce', 'mother', 'start', 'march', '#number', ',', '#number', '?']), (0.9444646040134128, ['beyonce', 'appeared', 'time', '#number', 'list', 'what', 'year', '?']), (0.9402425470001232, ['when', 'beyonce', 'first', 'child', '?'])]
answers
['in the late 1990s', '2013', 'Beyoncé Cosmetology Center at the Brooklyn Phoenix House', '2014', 'January 7, 2012']
