Generating Chinese News Summaries with Word2vec and TextRank (with Python Code)

Reposted from:

# https://blog.csdn.net/qq_36910634/article/details/97764251
import numpy as np
import pandas as pd
import re, os, jieba
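from itertools import chain

# This article processes three news stories, all reporting on Yi Huiman taking
# office as chairman of the China Securities Regulatory Commission (CSRC). They
# broadly cover his views on reforming China's capital markets.
#
# Original pages of the documents (yes, they come from Baijiahao; please don't judge):
#
# https://baijiahao.baidu.com/s?id=1626615436040775944&wfr=spider&for=pc
# https://baijiahao.baidu.com/s?id=1626670136476331971&wfr=spider&for=pc
# http://finance.sina.com.cn/roll/2019-02-28/doc-ihrfqzka9798377.shtml

"""Step 1: split the documents into sentences"""

# Folder that contains the documents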
c_root = os.getcwd() + os.sep + "cnews" + os.sep
sentences_list = []
for file in os.listdir(c_root):
    # Use a context manager so each file is closed after it has been read.
    with open(c_root + file, 'r', encoding="utf8") as fp:
        for line in fp:
            if line.strip():
                # Split the line on [。！；？] to get sentences.
                line_split = re.split(r'[。！；？]', line.strip())
                # Drop stray punctuation marks and fragments of one character or less.
                line_split = [s.strip() for s in line_split
                              if s.strip() not in ['。', '！', '？', '；'] and len(s.strip()) > 1]
                sentences_list.append(line_split)
sentences_list = list(chain.from_iterable(sentences_list))
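print("The first 10 sentences are:\n")
print(sentences_list[:10])

"""Step 2: preprocess the text: remove stop words and non-Chinese characters, then segment into words"""

# Build the stop-word collection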
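# A set gives fast membership tests; the file is closed once it has been read.
with open('./stopwords.txt', encoding='UTF-8') as sf:
    stopwords = {line.strip() for line in sf}


# Segment a sentence into words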
def seg_depart(sentence):
    # Remove non-Chinese characters
    sentence = re.sub(r'[^\u4e00-\u9fa5]+', '', sentence)
    sentence_depart = jieba.cut(sentence.strip())
    word_list = []
    for word in sentence_depart:
        if word not in stopwords:
            word_list.append(word)
    # If the whole sentence is filtered out (e.g. '02-2717:56'), return []
    # so that the number of sentences stays unchanged.
    return word_list


sentence_word_list = []
for sentence in sentences_list:
    line_seg = seg_depart(sentence)
    sentence_word_list.append(line_seg)
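print("There are", len(sentences_list), 'sentences in total.\n')
print("The first 10 sentences after segmentation:\n", sentence_word_list[:10])

# Make sure the number of sentences is unchanged after preprocessing, so that the
# TextRank scores can later be mapped back to the original sentences for the summary.
if len(sentences_list) == len(sentence_word_list):
    print("\nThe number of sentences is unchanged after preprocessing!")

"""Step 3: prepare the word vectors"""

word_embeddings = {}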
# f = open('./sgns.financial.char', encoding='utf-8')
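with open('sgns.financial.word', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        # Skip the header line ("<vocabulary size> <dimension>", i.e. '467389 300'
        # in this file) and any blank line.
        if len(values) <= 2:
            continue
        # The first element is the word; the remaining elements are its vector
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = embedding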
print("Loaded " + str(len(word_embeddings)) + " words/characters in total.")

"""Step 4: look up the word embeddings and use the word average (WordAVG) as the sentence vector"""

sentence_vectors = []
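for i in sentence_word_list:
    if len(i) != 0:
        # Words missing from the dictionary get a 300-dimensional zero vector.
        # Average the vectors of all words in the sentence to get the sentence vector.
        v = sum([word_embeddings.get(w, np.zeros((300,))) for w in i]) / (len(i))
    else:
        # An empty sentence ([]) is represented by a 300-dimensional zero vector.
        v = np.zeros((300,))
    sentence_vectors.append(v)

"""Step 5: compute the cosine similarity between sentences to build the similarity matrix"""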
from sklearn.metrics.pairwise import cosine_similarity

sim_mat = np.zeros([len(sentences_list), len(sentences_list)])
for i in range(len(sentences_list)):
    for j in range(len(sentences_list)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 300),
                                              sentence_vectors[j].reshape(1, 300))[0, 0]
print("Shape of the sentence similarity matrix:", sim_mat.shape)

"""Step 6: iterate to get each sentence's TextRank score, then sort and extract the summary"""
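import networkx as nx

# Build a graph from the similarity matrix: sentences are the nodes and the
# sentence similarities are the edge weights (used as transition probabilities).
nx_graph = nx.from_numpy_array(sim_mat)
# Compute the TextRank score of every sentence
scores = nx.pagerank(nx_graph)
# Rank the original (unprocessed) sentences by their TextRank score
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences_list)), reverse=True)
# Take the 10 highest-scoring sentences as the summary (fewer if the corpus is shorter)
sn = 10
for i in range(min(sn, len(ranked_sentences))):
    print("Summary sentence " + str(i + 1) + ":\n\n", ranked_sentences[i][1], '\n')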
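
To make steps 5 and 6 more concrete, here is a minimal pure-Python sketch of what the script delegates to cosine_similarity and nx.pagerank: build a cosine-similarity graph, then run PageRank-style power iteration on it. It is an illustration, not the library implementation. The names cosine, textrank and toy_vectors are hypothetical, the damping factor 0.85 is assumed to match the usual PageRank default, and clipping negative similarities to zero is an assumption of this sketch that the original script does not make.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def textrank(vectors, damping=0.85, tol=1e-8, max_iter=200):
    """Score items by power iteration over a cosine-similarity graph."""
    n = len(vectors)
    # Edge weights: pairwise similarity with no self-loops.
    # Negative similarities are clipped to 0 (an assumption of this sketch).
    weights = [[max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes without outgoing weight is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if out_sums[i] == 0.0)
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * weights[i][j] / out_sums[i]
                           for i in range(n) if out_sums[i] > 0.0)
            new_scores.append((1.0 - damping) / n
                              + damping * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy "sentence vectors": the first three point in similar directions, the
# fourth is orthogonal to them, and the fifth stands for an empty sentence.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
toy_scores = textrank(toy_vectors)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

In this toy graph the three mutually similar vectors reinforce each other and rank ahead of the isolated and empty ones. That is why sentences echoing the main topic of the news stories tend to surface in the summary, while sentences filtered down to [] in step 2 only receive the baseline score.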
