Preface

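Keeping notes on GitHub and having to look them up all the time is a hassle, so I am recording here the various pieces of code I have put together. Wherever I could find the original source I have linked it. Most of this is not my own original work, so please don't nitpick.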

Notes

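TextRank extractive summarization. Look up the theory yourself.
Original source of this code: http://blog.itpub.net/31562039/viewspace-2286669/
It works on English text and uses GloVe 100d word vectors. For Chinese you will need to adapt the code; my own version is at https://blog.csdn.net/ziyi9663/article/details/106996293 for reference.
Dataset download:
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/10/tennis_articles_v4.csv
I already had the GloVe word vectors locally; the original article also provides a download link.
Downloading the NLTK stopword and sentence-tokenizer data may require a proxy or VPN.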
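The GloVe file is plain text: each line holds a word followed by its vector components, separated by spaces. As a minimal sketch of how such lines turn into a lookup table and then into a sentence vector, the snippet below parses three made-up 3-dimensional lines from an in-memory string. The sample words, the numbers, and the helper names `load_embeddings` and `sentence_vector` are invented for illustration; the real file has 100 components per word. Unlike the script, this sketch divides by the exact word count rather than adding a small constant to the denominator.

```python
import io

# Made-up sample in GloVe text format: "word v1 v2 v3" on each line.
# The real glove.6B.100d.txt has 100 numbers per word; these values are invented.
SAMPLE = """tennis 0.1 0.2 0.3
match 0.3 0.2 0.1
player 0.5 0.5 0.5
"""


def load_embeddings(lines):
    """Parse GloVe-style lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = [float(x) for x in values[1:]]
    return embeddings


def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words count as zero vectors."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]


embeddings = load_embeddings(io.StringIO(SAMPLE))
vec = sentence_vector("tennis match", embeddings, dim=3)
print(vec)
```

Because an open file can be iterated line by line in the same way as `io.StringIO`, the same function should also accept a file opened with `open(path, encoding='utf-8')`.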

Talk is cheap, show me the code.

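# The code below targets Python 3.7. All required libraries are installed with pip; installing some of them may require a proxy or VPN. I tested it myself and it ran as-is without bugs.
# My preference is to put each comment below the code it describes.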
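import networkx
# A package for working with graph structures. It does not matter if you have never used it; look it up if you are interested.
import numpy as np
import pandas as pd
import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# Download the sentence-tokenizer and stopword data. This only needs to happen once; comment the calls out for later runs.
# (Recent NLTK releases may ask for 'punkt_tab' as well; the error message names the missing resource.)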
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('tennis_articles_v4.csv')
# Read the article data; the original article includes a download link.
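sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))
# Split each article into sentences and append them to the sentences list.
sentences = [y for x in sentences for y in x]
# Flatten the list.
# The source data is several articles. This code puts every sentence from every article into one list, and the summary is extracted from all of those sentences (articles) together.

word_embeddings = {}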
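GLOVE_DIR = 'glove.6B.100d.txt'
with open(GLOVE_DIR, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
# Load the word vectors.
# Each line of the file is: word, space, vector components, then a newline. Work through the code above yourself.

clean_sentences = pd.Series(sentences).str.replace('[^a-zA-Z]', ' ', regex=True)
# regex=True is stated explicitly because newer pandas versions treat the pattern as a literal string by default.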
clean_sentences = [s.lower() for s in clean_sentences]
# Clean the text: remove punctuation, digits and special characters, and lowercase everything.
stop_words = stopwords.words('english')
def remove_stopwords(words):
    sen = ' '.join([i for i in words if i not in stop_words])
    return sen
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Remove stopwords.
sentences_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 1e-2)
    else:
        v = np.zeros((100,))
    sentences_vectors.append(v)
# Look up the vector of every word in each sentence (from the GloVe file; each vector has 100 components),
# then average those vectors to get a single combined vector that serves as the sentence's feature vector.

similarity_matrix = np.zeros((len(clean_sentences), len(clean_sentences)))
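# Initialize the similarity matrix (all zeros).
for i in range(len(clean_sentences)):
    for j in range(len(clean_sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentences_vectors[i].reshape(1, -1), sentences_vectors[j].reshape(1, -1))[0, 0]
# Fill in the similarity matrix using cosine similarity. cosine_similarity returns a 1x1 array here, so [0, 0] picks out the scalar.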
nx_graph = networkx.from_numpy_array(similarity_matrix)
scores = networkx.pagerank(nx_graph)
# Convert the similarity matrix into a graph, then score each sentence with PageRank.
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
# Sort the sentences by score, highest first.
for i in range(10):
    print(ranked_sentences[i][1])
# Print the 10 highest-scoring sentences; these form the summary.
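
For readers who want to see what the similarity and ranking steps do without installing anything, here is a rough standard-library-only sketch of the same idea: build a cosine-similarity matrix with a zero diagonal, then score the nodes by power iteration. It is a simplified stand-in, not the networkx implementation. It assumes non-negative similarities, the damping factor of 0.85 is a commonly used value, and the toy 2-dimensional "sentence vectors" and the labels s0 to s3 are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either one has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities with a zero diagonal, as in the script."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def pagerank(weights, damping=0.85, iterations=100, tol=1e-9):
    """Simplified weighted PageRank by power iteration (non-negative weights assumed)."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_sum[i] > 0:
                    # Node i passes on its score in proportion to the edge weight.
                    incoming += scores[i] * weights[i][j] / out_sum[i]
                else:
                    # A node with no edges spreads its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Invented 2-d "sentence vectors": the first three point in similar directions,
# the last one is an outlier.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0]]
sentences = ["s0", "s1", "s2", "s3"]

scores = pagerank(similarity_matrix(vectors))
ranked = sorted(zip(scores, sentences), reverse=True)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

The outlier s3 ends up last because it is only weakly connected to the other three, which is the intuition behind TextRank: a sentence that resembles many other sentences collects a higher score.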
