Introduction to TF-IDF

TF-IDF is a common statistical method in NLP for evaluating how important a word is to one document within a collection or corpus, and it is often used to extract the features of a text, i.e. its keywords. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus.

In NLP, TF-IDF is computed as follows:

tf-idf(w) = tf(w) × idf(w)

where tf is the term frequency and idf is the inverse document frequency.

  • tf is the term frequency, i.e. how often a word occurs in a document. If a word appears i times in a document containing N words in total, then tf = i/N.
  • idf is the inverse document frequency. If the collection contains n documents and a word appears in k of them, then idf = log2(n/k).
  • Of course, the exact idf formula varies slightly from source to source. Some add 1 to the denominator k to prevent division by zero, and some add 1 to both the numerator and the denominator; this is a smoothing trick. In this article we stick with the original idf formula, because it matches the one gensim uses.
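The smoothing variants mentioned above are easy to sketch in a few lines of Python (the function names below are ours, purely for illustration):

```python
import math

# idf as used in this article: log2(n / k), where the collection has n
# documents and the word appears in k of them
def idf_raw(n, k):
    return math.log2(n / k)

# variant: add 1 to the denominator so a word that appears in no
# document (k = 0) does not cause a division by zero
def idf_denom_smoothed(n, k):
    return math.log2(n / (k + 1))

# variant: add 1 to both numerator and denominator (smoothing)
def idf_both_smoothed(n, k):
    return math.log2((n + 1) / (k + 1))

# a word that appears in 1 of 3 documents
print(idf_raw(3, 1))             # log2(3) ≈ 1.585
print(idf_denom_smoothed(3, 1))  # log2(3/2) ≈ 0.585
print(idf_both_smoothed(3, 1))   # log2(2) = 1.0
```

Note how the smoothed variants shrink the idf of rare words; for keyword extraction on a small corpus like ours, the raw formula keeps the distinctions sharpest.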

The Texts and Their Preprocessing

  We will use the following three example texts:

text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
Unqualified, the word football is understood to refer to whichever form of football is the most popular
in the regional context in which the word appears. Sports commonly called football in certain places
include association football (known as soccer in some countries); gridiron football (specifically American
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union);
and Gaelic football. These different variations of football are known as football codes.
"""
text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court,
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter)
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period
of play (overtime) is mandated.
"""
text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

These three passages introduce football, basketball, and volleyball respectively, and together they form our document collection.
  Next comes text preprocessing.
  First we strip the newlines from the text, then split it into sentences, tokenize, and remove the punctuation. The complete Python code follows; the input parameter is the article text:

import nltk
import string

#  Text preprocessing
#  Function: split the text into sentences and words, and remove punctuation
def get_token(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    print(len(sents))
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

print(get_token(text1))

Output:

4
['Football', 'is', 'a', 'family', 'of', 'team', 'sports', 'that', 'involve', 'to', 'varying', 'degrees', 'kicking', 'a', 'ball', 'to', 'score', 'a', 'goal', 'Unqualified', 'the', 'word', 'football', 'is', 'understood', 'to', 'refer', 'to', 'whichever', 'form', 'of', 'football', 'is', 'the', 'most', 'popular', 'in', 'the', 'regional', 'context', 'in', 'which', 'the', 'word', 'appears', 'Sports', 'commonly', 'called', 'football', 'in', 'certain', 'places', 'include', 'association', 'football', 'known', 'as', 'soccer', 'in', 'some', 'countries', 'gridiron', 'football', 'specifically', 'American', 'football', 'or', 'Canadian', 'football', 'Australian', 'rules', 'football', 'rugby', 'football', 'either', 'rugby', 'league', 'or', 'rugby', 'union', 'and', 'Gaelic', 'football', 'These', 'different', 'variations', 'of', 'football', 'are', 'known', 'as', 'football', 'codes']

Next, we remove the stopwords from the article and count the occurrences of each word. The complete Python code follows; the input parameter is the article text:

from nltk.corpus import stopwords  # stopwords
from collections import Counter

#  Remove stopwords from the raw text
#  Build a count dict, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_token(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]  # remove stopwords
    count = Counter(filtered)
    return count

print(make_count(text1))

Taking text1 as an example, the resulting count dict is:

Counter({'football': 12, 'rugby': 3, 'word': 2, 'known': 2, 'Football': 1, 'family': 1, 'team': 1, 'sports': 1, 'involve': 1, 'varying': 1, 'degrees': 1, 'kicking': 1, 'ball': 1, 'score': 1, 'goal': 1, 'Unqualified': 1, 'understood': 1, 'refer': 1, 'whichever': 1, 'form': 1, 'popular': 1, 'regional': 1, 'context': 1, 'appears': 1, 'Sports': 1, 'commonly': 1, 'called': 1, 'certain': 1, 'places': 1, 'include': 1, 'association': 1, 'soccer': 1, 'countries': 1, 'gridiron': 1, 'specifically': 1, 'American': 1, 'Canadian': 1, 'Australian': 1, 'rules': 1, 'either': 1, 'league': 1, 'union': 1, 'Gaelic': 1, 'These': 1, 'different': 1, 'variations': 1, 'codes': 1})

TF-IDF in Gensim

  After preprocessing, each of the three example texts yields its word counts. Below, we use the TF-IDF model already implemented in gensim to output the top three words by TF-IDF in each article, together with their tfidf values (note that gensim takes the filtered token lists, not the count dicts, as its input). The complete code is:

from nltk.corpus import stopwords
from gensim import corpora, models, matutils

#  training by gensim tfidf model
def get_words(text):
    tokens = get_token(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

#  get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
count_list = [count1, count2, count3]

#  training by tfidf model in gensim
dictionary = corpora.Dictionary(count_list)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in count_list]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

#  output
print('\nTraining by gensim tfidf model......\n')
for i, doc in enumerate(corpus_tfidf):
    print('Top words in document %d' % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # type=list
    for num, score in sorted_words[:3]:
        print('\tWord: %s, Tfidf: %s' % (new_dict[num], round(score, 5)))

Output:

Training by gensim tfidf model......

Top words in document 1
    Word: football, Tfidf: 0.84766
    Word: rugby, Tfidf: 0.21192
    Word: known, Tfidf: 0.14128
Top words in document 2
    Word: play, Tfidf: 0.29872
    Word: cm, Tfidf: 0.19915
    Word: diameter, Tfidf: 0.19915
Top words in document 3
    Word: net, Tfidf: 0.45775
    Word: teammate, Tfidf: 0.34331
    Word: across, Tfidf: 0.22888

The output matches our expectations fairly well: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.

Implementing TF-IDF by Hand

  With our understanding of TF-IDF from above, we can also implement it ourselves, which is the best way to learn an algorithm!
  Below is the author's TF-IDF implementation (continuing from the preprocessing code):

import math

#  compute tf
def tf(word, count):
    return count[word] / sum(count.values())

#  count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

#  compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

#  compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("\tWord: %s, TF-IDF: %s" % (word, round(score, 5)))

Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: word, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: one, TF-IDF: 0.03522
    Word: shooting, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: bat, TF-IDF: 0.05113

As we can see, our hand-rolled TF-IDF model extracts the same keywords as gensim. (For the basketball article, the second and third words differ only because those words share identical tfidf values, so the tie is broken arbitrarily.) But there is one problem: the computed tfidf values are not the same. Why is that?

The reason is that gensim normalizes the resulting tf-idf vector, converting it into a unit vector. So we need to add a normalization step to the code above:

import numpy as np

#  normalize the vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    sorted_words = unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("\tWord: %s, TF-IDF: %s" % (word, round(score, 5)))

Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: one, TF-IDF: 0.19915
    Word: shooting, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

The output now matches the results from gensim!
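As a quick sanity check on the normalization step: L2-normalizing any nonzero vector must yield a vector of length 1. The sketch below mirrors the unitvec function using only the standard library; the three values are the unnormalized top-3 scores from document 1 above, used purely for illustration (the per-word results will not match gensim's, which normalizes over the full vector, not just three components):

```python
import math

# toy tf-idf scores before normalization (illustrative values only)
raw = [('football', 0.30677), ('rugby', 0.07669), ('word', 0.05113)]

# L2-normalize: divide each score by the Euclidean norm of the vector
l2 = math.sqrt(sum(v * v for _, v in raw))
unit = [(w, v / l2) for w, v in raw]

# the normalized vector now has length 1 (up to floating point)
print(math.sqrt(sum(v * v for _, v in unit)))
```

This is exactly why rescaling by the L2 norm preserves the ranking of the words: every component is divided by the same positive constant.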

Full code:

import nltk
import string
import math
import numpy as np
from nltk.corpus import stopwords  # 停用词
from collections import Counter
from gensim import corpora, models, matutils

text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
Unqualified, the word football is understood to refer to whichever form of football is the most popular
in the regional context in which the word appears. Sports commonly called football in certain places
include association football (known as soccer in some countries); gridiron football (specifically American
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union);
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court,
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter)
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

#  Text preprocessing
#  Function: split the text into sentences and words, and remove punctuation
def get_token(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    print(len(sents))
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

print(get_token(text1))

#  Remove stopwords from the raw text
#  Build a count dict, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_token(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]  # remove stopwords
    count = Counter(filtered)
    return count

print(make_count(text1))

#  training by gensim tfidf model
def get_words(text):
    tokens = get_token(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

#  get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
count_list = [count1, count2, count3]

#  training by tfidf model in gensim
dictionary = corpora.Dictionary(count_list)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in count_list]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

#  output
print('\nTraining by gensim tfidf model......\n')
for i, doc in enumerate(corpus_tfidf):
    print('Top words in document %d' % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # type=list
    for num, score in sorted_words[:3]:
        print('\tWord: %s, Tfidf: %s' % (new_dict[num], round(score, 5)))

#  compute tf
def tf(word, count):
    return count[word] / sum(count.values())

#  count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

#  compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

#  compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

#  normalize the vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    sorted_words = unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("\tWord: %s, TF-IDF: %s" % (word, round(score, 5)))

Output:

4
['Football', 'is', 'a', 'family', 'of', 'team', 'sports', 'that', 'involve', 'to', 'varying', 'degrees', 'kicking', 'a', 'ball', 'to', 'score', 'a', 'goal', 'Unqualified', 'the', 'word', 'football', 'is', 'understood', 'to', 'refer', 'to', 'whichever', 'form', 'of', 'football', 'is', 'the', 'most', 'popular', 'in', 'the', 'regional', 'context', 'in', 'which', 'the', 'word', 'appears', 'Sports', 'commonly', 'called', 'football', 'in', 'certain', 'places', 'include', 'association', 'football', 'known', 'as', 'soccer', 'in', 'some', 'countries', 'gridiron', 'football', 'specifically', 'American', 'football', 'or', 'Canadian', 'football', 'Australian', 'rules', 'football', 'rugby', 'football', 'either', 'rugby', 'league', 'or', 'rugby', 'union', 'and', 'Gaelic', 'football', 'These', 'different', 'variations', 'of', 'football', 'are', 'known', 'as', 'football', 'codes']
4
Counter({'football': 12, 'rugby': 3, 'word': 2, 'known': 2, 'Football': 1, 'family': 1, 'team': 1, 'sports': 1, 'involve': 1, 'varying': 1, 'degrees': 1, 'kicking': 1, 'ball': 1, 'score': 1, 'goal': 1, 'Unqualified': 1, 'understood': 1, 'refer': 1, 'whichever': 1, 'form': 1, 'popular': 1, 'regional': 1, 'context': 1, 'appears': 1, 'Sports': 1, 'commonly': 1, 'called': 1, 'certain': 1, 'places': 1, 'include': 1, 'association': 1, 'soccer': 1, 'countries': 1, 'gridiron': 1, 'specifically': 1, 'American': 1, 'Canadian': 1, 'Australian': 1, 'rules': 1, 'either': 1, 'league': 1, 'union': 1, 'Gaelic': 1, 'These': 1, 'different': 1, 'variations': 1, 'codes': 1})
4
4
3

Training by gensim tfidf model......

Top words in document 1
    Word: football, Tfidf: 0.84766
    Word: rugby, Tfidf: 0.21192
    Word: known, Tfidf: 0.14128
Top words in document 2
    Word: play, Tfidf: 0.29872
    Word: cm, Tfidf: 0.19915
    Word: diameter, Tfidf: 0.19915
Top words in document 3
    Word: net, Tfidf: 0.45775
    Word: teammate, Tfidf: 0.34331
    Word: across, Tfidf: 0.22888
4
4
3
Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: one, TF-IDF: 0.19915
    Word: shooting, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888
