RAKE算法

RAKE算法用来做关键词(key word)的提取，实际上提取的是关键的短语(phrase)，并且倾向于较长的短语，在英文中，关键词通常包括多个单词，但很少包含标点符号和停用词，例如and，the，of等，以及其他不包含语义信息的单词。

算法思想

RAKE算法首先使用标点符号（如半角的句号、问号、感叹号、逗号等）将一篇文档分成若干分句。
然后对于每一个分句，使用停用词作为分隔符将分句分为若干短语，这些短语作为最终提取出的关键词的候选词。
最后，每个短语可以再通过空格分为若干个单词，可以通过给每个单词赋予一个得分，通过累加得到每个短语的得分。

一个关键点在于将这个短语中每个单词的共现关系考虑进去。最终定义的公式是:
wordScore=wordDegree(w)wordFrequency(w)wordScore=\frac{wordDegree(w)}{wordFrequency(w)} wordScore=wordFrequency(w)wordDegree(w)
词共现矩阵的两种不同含义：

词共现矩阵的每一列的值即为该词的度deg（是一个网络中的概念，每与一个单词共现在一个短语中，度就加1，考虑该单词本身）,每个词在文本中出现的次数即为频率freq。
通过统计一个事先指定大小的窗口内的word共现次数，以word周边的共现词的次数做为当前word的vector。具体来说，我们通过从大量的语料文本中构建一个共现矩阵来定义word representation。

对于第一解释存在以下示例：

共词矩阵对角线上元素赋值为它自身在所有文章出现次数。循环遍历特征词列表，构建全部两个词之间的组合,再遍历每一篇文章的切词结果，如果该两个词在同一片文章中出现，则该两词的权重+1，再将其存入共词矩阵的对应位置中。

实现可以参照[4]或[7]。

对于第二种解释有以下示例：

I like deep learning.
I like NLP.
I enjoy flying.

假设窗口大小为2，其词共现矩阵如下表所示：

表1 词共现矩阵

count	I	like	enjoy	deep	learning	NLP	flying	.
I	0	2	1	0	0	0	0	0
like	2	0	0	1	0	1	0	0
enjoy	1	0	0	0	0	0	1	0
deep	0	1	0	0	1	0	0	0
learning	0	0	0	1	0	0	0	1
NLP	0	1	0	0	0	0	0	1
flying	0	0	1	0	0	0	0	1
.	0	0	0	0	1	1	1	0

其中（I，like)=2是因为窗口等于3，“I like deep”和“I like NLP”中各出现了一次，其他的依此类推。

Reference

[1] python 共现矩阵构建_这是一个死肥宅的博客-CSDN博客_python共现矩阵

[2] https://mp.weixin.qq.com/s/nX_idrMlI_cv9aufnHoRwQ

[3] 共现矩阵_贾世林jiashilin的博客-CSDN博客_共现矩阵

[4] https://github.com/zelandiya/RAKE-tutorial

[5] Rose, S. J. , et al. “Automatic Keyword Extraction from Individual Documents.” (2010).

[6] 英文关键词提取之RAKE算法_lizzy05的博客-CSDN博客_rake算法

[7] laserwave/keywords_extraction_rake: A python implementation of the Rapid Automatic Keyword Extraction (github.com)

附页

代码实现

import re
import operator
import argparse
import codecsdef isNumber(s):try:float(s) if '.' in s else int(s)return Trueexcept ValueError:return Falseclass Rake:def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath, minPhraseChar, maxPhraseLength):self.outputFilePath = outputFilePathself.minPhraseChar = minPhraseCharself.maxPhraseLength = maxPhraseLength# read documentsself.docs = []for document in codecs.open(inputFilePath, 'r', 'utf-8'):self.docs.append(document)# read stopwordsstopwords = []for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):stopwords.append(word.strip())stopwordsRegex = []for word in stopwords:regex = r'\b' + word + r'(?![\w-])'stopwordsRegex.append(regex)self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)def separateWords(self, text):splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')words = []for word in splitter.split(text):word = word.strip().lower()# leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrasesif len(word) > 0 and word != '' and not isNumber(word):words.append(word)return wordsdef calculatePhraseScore(self, phrases):# calculate wordFrequency and wordDegreewordFrequency = {}wordDegree = {}for phrase in phrases:wordList = self.separateWords(phrase)wordListLength = len(wordList)wordListDegree = wordListLength - 1for word in wordList:wordFrequency.setdefault(word, 0)wordFrequency[word] += 1wordDegree.setdefault(word, 0)wordDegree[word] += wordListDegreefor item in wordFrequency:wordDegree[item] = wordDegree[item] + wordFrequency[item]# calculate wordScore = wordDegree(w)/wordFrequency(w)wordScore = {}for item in wordFrequency:wordScore.setdefault(item, 0)wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]# calculate phraseScorephraseScore = {}for phrase in phrases:phraseScore.setdefault(phrase, 0)wordList = self.separateWords(phrase)candidateScore = 0for word in wordList:candidateScore += wordScore[word]phraseScore[phrase] = candidateScorereturn phraseScoredef execute(self):file = codecs.open(self.outputFilePath,'w','utf-8')for document in self.docs:# split a document into sentencessentenceDelimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')sentences = sentenceDelimiters.split(document)# generate all valid phrasesphrases = []for s in sentences:tmp = re.sub(self.stopwordsPattern, '|', s.strip())phrasesOfSentence = tmp.split("|")for phrase in phrasesOfSentence:phrase = phrase.strip().lower()if phrase != "" and len(phrase) >= self.minPhraseChar and len(phrase.split()) <= self.maxPhraseLength:phrases.append(phrase)# calculate phrase scorephraseScore = self.calculatePhraseScore(phrases)keywords = sorted(phraseScore.items(), key = operator.itemgetter(1), reverse=True)file.write(str(keywords[0:int(len(keywords)/3)]) + "\n")file.close()def readParamsFromCmd():parser = argparse.ArgumentParser(description = "This is a python implementation of rake(rapid automatic keyword extraction).")parser.add_argument('inputFilePath', help = 'The file path of input document(s). One line represents a document.')parser.add_argument('stopwordsFilePath', help = 'The file path of stopwords, each line represents a word.')parser.add_argument('-o', '--outputFilePath', help = 'The file path of output. (default output.txt in current dir).', default = 'output.txt')parser.add_argument('-m', '--minPhraseChar', type = int, help = 'The minimum number of characters of a phrase.(default 1)', default = 1)parser.add_argument('-a', '--maxPhraseLength', type = int, help = 'The maximum length of a phrase.(default 3)', default = 3)return parser.parse_args()params = readParamsFromCmd().__dict__rake = Rake(params['inputFilePath'], params['stopwordsFilePath'], params['outputFilePath'], params['minPhraseChar'], params['maxPhraseLength'])
rake.execute()

RAKE（快速自动关键字抽取）算法原理与实现相关推荐

YAKE!无监督关键字抽取算法解读
本周任务如下,接续上周的关键字抽取任务,前面一两周主要学习了RAKE.TF-IDF.TextRank算法,详细见https://blog.csdn.net/qq_45041871/article/de ...
RRT（快速随机搜索树）算法原理及代码实践
RRT算法简介 RRT 算法为一种递增式的路径规划算法,算法不断在搜索空间中随机生成采样点,如果该点位于无碰撞位置,则寻找搜索树中离该节点最近的结点为基准结点,由基准结点出发以一定步长朝着该随机结点进 ...
快速计算距离Annoy算法原理及Python使用
快速计算距离Annoy算法基本原理高维稀疏数据进行快速相似查找,可以采用learning to hash参考:Minhashing & LSH & Simhash 技术汇总,但高维 ...
公开课｜智能文本信息抽取算法的进阶与应用
一.什么是文本挖掘? 讨论文本挖掘之前,我们要先说一下数据挖掘的概念,因为文本挖掘是数据挖掘的一个分支.数据挖掘(Data Mining)指从大量的数据中通过算法搜索隐藏在其中信息的过程.而文本挖掘就 ...
【NLP基础】NLP关键字提取技术之LDA算法原理与实践
点击上方,选择星标或置顶,每天给你送干货! 阅读大概需要11分钟跟随小博主,每天进步一丢丢引文人们是如何从大量文本资料中便捷得浏览和获取信息?答案你肯定会说通过关键字.仔细想想,我们人类是怎么提 ...
论文研究 | 快速自动模糊 C-均值聚类彩色图像分割算法
前面分享了机器视觉缺陷检测应用方面的论文,今天想分享关于图像分割领域的研究.图像分割是计算机视觉和图像理解中最重要的研究课题之一,该论文提出了一种快速自动的 FCM 彩色图像分割算法,所提算法的计算复 ...
java 网页正文抽取算法_GitHub - hfut-dmic/ContentExtractor: 自动抽取网页正文的算法，用JAVA实现...
ContentExtractor ##简介 ContentExtractor是一个开源的网页正文抽取工具,用JAVA实现,具有非常高的抽取精度. ##算法 ContentExtractor的网页正文抽 ...
快速神经网络的训练算法LARS/LAMB工作原理 --UC Berkeley在读博士生尤洋
快速神经网络的训练算法 --UC Berkeley博士尤洋快速神经网络的训练算法 --UC Berkeley博士尤洋为什么深度学习很慢? 如果我们利用并行计算,怎样才能完美的利用大量的proces ...
今日头条算法原理（全文）【转】
原文地址:https://www.toutiao.com/i6511211182064402951/?tt_from=weixin_moments&utm_campaign=client_sh ...

RAKE（快速自动关键字抽取）算法原理与实现

RAKE算法

Reference

附页

RAKE（快速自动关键字抽取）算法原理与实现相关推荐

最新文章

热门文章