jieba Word Segmentation: The TF-IDF Algorithm (Part 2)

2021SC@SDUSC
Keyword extractor

class KeywordExtractor(object):

    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
        "this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
    ))

    def set_stop_words(self, stop_words_path):
        abs_path = _get_abs_path(stop_words_path)
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        content = open(abs_path, 'rb').read().decode('utf-8')
        for line in content.splitlines():
            self.stop_words.add(line)

    def extract_tags(self, *args, **kwargs):
        raise NotImplementedError

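KeywordExtractor defines a default stop-word set, STOP_WORDS, containing "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are", "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it", "this", "then", "at", "have", "all", "not", "one", "has" and "or". These are function words that carry little meaning on their own, so such tokens contribute nothing to later token analysis. The set_stop_words method reads a user-supplied stop-word file (for example stopwords.txt) and adds each of its lines to the stop-word set, so that those tokens are dropped as well.
There are two ways to apply stop words. In the first, the segmentation process itself is unchanged and the stop words are removed with a set difference when the result is printed. The second is to call extract_tags, which extracts feature words with the TF-IDF algorithm and removes stop words before extraction; the stop-word dictionary can be specified manually.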
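The first approach (segment as usual, then drop stop words from the output) can be sketched without jieba at all. This is a minimal illustration, not jieba's code: the helper names load_stop_words and remove_stop_words are hypothetical, the token list is made up, and unlike set_stop_words the sketch strips whitespace and skips empty lines.

```python
# Small subset of the STOP_WORDS set shown above, for illustration only.
STOP_WORDS = {"the", "of", "is", "and", "to", "in"}

def load_stop_words(lines, base=STOP_WORDS):
    """Return a copy of `base` extended with one stop word per line."""
    stop_words = set(base)
    for line in lines:
        word = line.strip()
        if word:  # assumption: ignore blank lines
            stop_words.add(word)
    return stop_words

def remove_stop_words(tokens, stop_words):
    """Keep token order, drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

# Pretend these tokens came out of the segmenter.
tokens = ["The", "median", "of", "the", "IDF", "values", "is", "used"]
stop_words = load_stop_words(["used", ""])
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['median', 'IDF', 'values']
```

A list comprehension is used instead of a literal set difference so that token order and repeated tokens survive, which matters if the filtered tokens are counted afterwards.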

class IDFLoader(object):

    def __init__(self, idf_path=None):
        self.path = ""
        self.idf_freq = {}
        self.median_idf = 0.0
        if idf_path:
            self.set_new_path(idf_path)

    def set_new_path(self, new_idf_path):
        if self.path != new_idf_path:
            self.path = new_idf_path
            content = open(new_idf_path, 'rb').read().decode('utf-8')
            self.idf_freq = {}
            for line in content.splitlines():
                word, freq = line.strip().split(' ')
                self.idf_freq[word] = float(freq)
            self.median_idf = sorted(
                self.idf_freq.values())[len(self.idf_freq) // 2]

    def get_idf(self):
        return self.idf_freq, self.median_idf

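Term frequency (TF) is the number of times a given word appears in a document. This count is usually normalized (typically by dividing it by the total number of words in the document) so that it is not biased toward long documents. The main idea of IDF (inverse document frequency) is that the fewer documents contain a term t, the larger its IDF, which indicates that the term discriminates well between categories. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient.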
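The two definitions above can be checked on a toy corpus. The following is a minimal sketch under stated assumptions: the documents are made-up, already-tokenized data, the function names are hypothetical, and the IDF is the plain log ratio described above with no smoothing (so the term must occur in at least one document).

```python
import math

# Toy corpus of already-tokenized documents (made-up data).
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry", "durian"],
]

def term_frequency(term, doc):
    """Occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

tf = term_frequency("apple", docs[0])                    # 2 / 3
idf_apple = inverse_document_frequency("apple", docs)    # log(3 / 2)
idf_durian = inverse_document_frequency("durian", docs)  # log(3 / 1)

print(tf * idf_apple)          # TF-IDF weight of "apple" in the first document
print(idf_durian > idf_apple)  # True: the rarer term has the larger IDF
```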
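IDFLoader parses idf.txt to obtain each word and its IDF value, and builds a dictionary with key = word and value = idf. It also stores the median of the IDF values, which extract_tags uses as the fallback for words that are missing from the dictionary.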
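The parsing step can be tried on an in-memory string instead of a file. This is a sketch, not the library code: parse_idf is a hypothetical name, the sample words and values are invented, and skipping blank lines is an added assumption. The median indexing mirrors the expression used in IDFLoader.

```python
def parse_idf(content):
    """Build {word: idf} from 'word idf' lines and pick a fallback IDF."""
    idf_freq = {}
    for line in content.splitlines():
        line = line.strip()
        if not line:  # assumption: tolerate blank lines
            continue
        word, freq = line.split(' ')
        idf_freq[word] = float(freq)
    # Same indexing as IDFLoader: the element at position len // 2 of the
    # sorted values (the upper of the two middle values for an even count).
    median_idf = sorted(idf_freq.values())[len(idf_freq) // 2]
    return idf_freq, median_idf

# Made-up data in the "word idf" format, one entry per line.
sample = "alpha 2.5\nbeta 9.0\ngamma 4.0\ndelta 12.5\n"
idf_freq, median_idf = parse_idf(sample)
print(idf_freq["gamma"])  # 4.0
print(median_idf)         # 9.0
```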
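extract_tags:
Extracts keywords from a sentence using the TF-IDF algorithm. Its parameters are:
- topK: how many top keywords to return. None returns all possible words.
- withWeight: if True, return a list of (word, weight) pairs; if False, return a list of words.
- allowPOS: the list of allowed POS tags, such as ['ns', 'n', 'vn', 'v', 'nr']. Words whose POS is not in this list are filtered out.
- withFlag: only takes effect when allowPOS is not empty. If True, the returned items are pair objects like those produced by posseg.cut; if False, plain words.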


class TFIDF(KeywordExtractor):

    def __init__(self, idf_path=None):
        self.tokenizer = jieba.dt
        self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def set_idf_path(self, idf_path):
        new_abs_path = _get_abs_path(idf_path)
        if not os.path.isfile(new_abs_path):
            raise Exception("jieba: file does not exist: " + new_abs_path)
        self.idf_loader.set_new_path(new_abs_path)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list,it will be filtered.
            - withFlag: only work with allowPOS is not empty.
                        if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        if allowPOS:
            allowPOS = frozenset(allowPOS)
            words = self.postokenizer.cut(sentence)
        else:
            words = self.tokenizer.cut(sentence)
        freq = {}
        for w in words:
            if allowPOS:
                if w.flag not in allowPOS:
                    continue
                elif not withFlag:
                    w = w.word
            wc = w.word if allowPOS and withFlag else w
            if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
                continue
            freq[w] = freq.get(w, 0.0) + 1.0
        total = sum(freq.values())
        for k in freq:
            kw = k.word if allowPOS and withFlag else k
            freq[k] *= self.idf_freq.get(kw, self.median_idf) / total

        if withWeight:
            tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(freq, key=freq.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags
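The scoring loop in extract_tags can be isolated from segmentation and POS filtering. The sketch below assumes the words are already segmented and leaves out allowPOS and withFlag; score_keywords is a hypothetical name, and the word list and IDF values are invented for illustration. Each surviving word gets the weight count / total * idf, with median_idf standing in for words absent from the IDF dictionary.

```python
from operator import itemgetter

def score_keywords(words, idf_freq, median_idf, stop_words, top_k=None):
    """Rank pre-segmented words by (count / total) * idf."""
    freq = {}
    for w in words:
        # Same filter as extract_tags: skip very short words and stop words.
        if len(w.strip()) < 2 or w.lower() in stop_words:
            continue
        freq[w] = freq.get(w, 0.0) + 1.0
    total = sum(freq.values())
    for k in freq:
        # Words missing from the IDF dictionary fall back to the median IDF.
        freq[k] *= idf_freq.get(k, median_idf) / total
    tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
    return tags[:top_k] if top_k else tags

# Made-up input: "is" is a stop word, "a" is too short, "tfidf" has no IDF entry.
words = ["jieba", "is", "a", "segmentation", "library", "jieba", "tfidf"]
idf_freq = {"jieba": 8.0, "library": 2.0, "segmentation": 6.0}
tags = score_keywords(words, idf_freq, median_idf=4.0,
                      stop_words={"is"}, top_k=3)
print([word for word, _ in tags])  # ['jieba', 'segmentation', 'tfidf']
```

Here "tfidf" outranks "library" even though both occur once, because the median fallback (4.0) is larger than the IDF listed for "library" (2.0).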
