Extracting Keywords from Sentences with TextRank

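In this task, words are the nodes of the graph, and the edges between words are determined by co-occurrence. Two words co-occur when they both appear inside a sliding window of a given size, and an edge is placed between every such pair. Take this sentence as an example: “淡黄的长裙，蓬松的头发牵着我的手看最新展出的油画” ("A pale-yellow long dress, fluffy hair, holding my hand to look at the newly exhibited oil paintings").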
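The co-occurrence rule can be sketched as follows. This is only an illustration: the function name `cooccurrence_edges` and the hand-made segmentation of the example sentence are assumptions, not part of the original post.

```python
def cooccurrence_edges(words, window_size):
    """Return the set of undirected edges between words that share a window."""
    edges = set()
    for i, word in enumerate(words):
        # Pair the word with the next (window_size - 1) words.
        for other in words[i + 1 : i + window_size]:
            if other != word:
                # frozenset makes the edge order-independent (undirected).
                edges.add(frozenset((word, other)))
    return edges


# Hypothetical segmentation of the example sentence, content words only.
words = ["淡黄", "长裙", "蓬松", "头发", "油画"]
edges = cooccurrence_edges(words, window_size=2)
```

With `window_size=2` only adjacent words are linked; a larger window adds edges between words that are further apart.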

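Whereas PageRank operates on an unweighted directed graph, the graph built here is unweighted and undirected, which is also what the original paper mainly uses for keyword extraction. For directed graphs, the paper orients edges by word order within the window: for “长裙” ("long dress"), the edge from “淡黄” ("pale yellow") is an incoming edge and the edge to “蓬松” ("fluffy") is an outgoing edge. The paper reports that the directed variants performed worse than the undirected graph.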
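To make the difference concrete, here is a small sketch that builds both adjacency matrices for three words. The segmentation and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical segmentation of the start of the example sentence.
words = ["淡黄", "长裙", "蓬松"]
index = {w: i for i, w in enumerate(words)}

# (earlier word, later word) pairs from a window of size 2.
pairs = [("淡黄", "长裙"), ("长裙", "蓬松")]

# Directed graph: the edge points from the earlier word to the later one.
directed = np.zeros((len(words), len(words)))
for earlier, later in pairs:
    directed[index[earlier], index[later]] = 1.0

# Undirected graph: keep an edge if it exists in either direction.
undirected = np.maximum(directed, directed.T)
```

In `directed`, the row of “长裙” holds its outgoing edge (to “蓬松”) and its column holds its incoming edge (from “淡黄”); `undirected` is symmetric, so that distinction disappears.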

Once the graph is built, all that remains is to iterate with the PageRank formula.
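For an unweighted graph, the update commonly used is S(V_i) = (1 - d) + d * Σ_j S(V_j) / deg(V_j), summed over the neighbours V_j of V_i, with damping factor d. The sketch below iterates that update on a toy graph; the function name `pagerank_scores`, the tolerance, and the step count are assumptions for illustration.

```python
import numpy as np


def pagerank_scores(adjacency, d=0.85, steps=200, tol=1e-6):
    """Iterate S = (1 - d) + d * M @ S, with M the column-normalized adjacency."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    # Divide each column by its sum; columns of isolated nodes stay zero.
    m = np.divide(adjacency, col_sums,
                  out=np.zeros_like(adjacency), where=col_sums != 0)
    scores = np.ones(adjacency.shape[0])
    for _ in range(steps):
        updated = (1 - d) + d * m @ scores
        # Stop once the scores barely change between iterations.
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Path graph a - b - c: the middle node has two neighbours.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
scores = pagerank_scores(adjacency)
```

The middle node ends up with the highest score, since both end nodes pass their whole score to it.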

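The edge weight in the TextRank formula applies to the summarization task, where it is the similarity between two sentences. For keyword extraction the graph here is unweighted.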
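For reference, the TextRank paper describes sentence similarity as the number of shared words divided by the sum of the logarithms of the two sentence lengths. A minimal sketch of that measure follows; the function name and the handling of very short sentences are assumptions.

```python
import math


def sentence_similarity(words_a, words_b):
    """Word overlap of two tokenized sentences, normalized by their log lengths."""
    if not words_a or not words_b:
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    # Two one-word sentences give log(1) + log(1) = 0; treat that as no
    # similarity rather than dividing by zero (an assumption of this sketch).
    if denominator <= 0:
        return 0.0
    return overlap / denominator


similarity = sentence_similarity(["a", "b", "c"], ["b", "c", "d"])
```

In the summarization setting, this value would serve as the weight of the edge between the two sentence nodes.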

In practice the results are quite good. The implementation below uses spaCy and NumPy.

from collections import OrderedDict

import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')


class TextRank4Keyword():
    """Extract keywords from text"""

    def __init__(self):
        self.d = 0.85  # damping coefficient, usually 0.85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 10  # maximum number of iterations
        self.node_weight = None  # keywords and their weights

    def set_stopwords(self, stopwords):
        """Set stop words"""
        for word in STOP_WORDS.union(set(stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True

    def sentence_segment(self, doc, candidate_pos, lower):
        """Keep only the words whose POS tag is in candidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Keep words with a candidate POS tag that are not stop words
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences

    def get_vocab(self, sentences):
        """Map each distinct token to an index"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab

    def get_token_pairs(self, window_size, sentences):
        """Build token pairs from the windows in each sentence"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i + 1, i + window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs

    def symmetrize(self, a):
        # Keep an edge if it exists in either direction. Using a + a.T would
        # give weight 2 to pairs seen in both orders, so the graph would no
        # longer be unweighted.
        return np.maximum(a, a.T)

    def get_matrix(self, vocab, token_pairs):
        """Get the column-normalized adjacency matrix"""
        # Build the matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1

        # Make the matrix symmetric (undirected graph)
        g = self.symmetrize(g)

        # Normalize the matrix by column. Columns that sum to 0 are left as
        # zeros; without `out`, those entries would be uninitialized.
        norm = np.sum(g, axis=0)
        g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm != 0)

        return g_norm

    def get_keywords(self, number=10):
        """Print the top `number` keywords"""
        node_weight = OrderedDict(
            sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        for i, (key, value) in enumerate(node_weight.items()):
            if i >= number:
                break
            # print(key + ' - ' + str(value))
            print(key)

    def analyze(self, text, candidate_pos=('NOUN', 'PROPN'),
                window_size=4, lower=False, stopwords=()):
        """Main function to analyze text"""

        # Set stop words
        self.set_stopwords(stopwords)

        # Parse the text with spaCy
        doc = nlp(text)

        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower)  # list of lists of words

        # Build the vocabulary
        vocab = self.get_vocab(sentences)

        # Get token pairs from the windows
        token_pairs = self.get_token_pairs(window_size, sentences)

        # Get the normalized matrix
        g = self.get_matrix(vocab, token_pairs)

        # Initialize the weights (PageRank values)
        pr = np.ones(len(vocab))

        # Iterate until the total score stops changing or steps run out
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1 - self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Store the weight of each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]

        self.node_weight = node_weight


text = """
no a good product very crooked this is not a good product not sturdy or striaght when assembled instruction where not easy to follow had neighbor help me he agreed this is not good product
"""
tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos=['NOUN', 'PROPN'], window_size=2, lower=False)
tr4w.get_keywords(5)
