[Notes] Hand-coded TF-IDF
Note 1:
1. The code involves computing the angle between two vectors. The dot product (inner product) multiplies two vectors component by component and sums the results, yielding a scalar; the cross product (outer product / vector product) yields a vector perpendicular to the plane spanned by the two vectors (their normal vector). Reference (CSDN, includes determinant background): https://blog.csdn.net/nyist_yangguang/article/details/121801944?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522165258565416782395362014%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=165258565416782395362014&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~rank_v31_ecpm-1-121801944-null-null.nonecase&utm_term=%E5%86%85%E7%A7%AF&spm=1018.2226.3001.4450
2. set() does not preserve element order (hence the sorted() call in the code below).
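The dot-product and cross-product relations from Note 1 can be sketched in NumPy; the vectors here are made-up toy values, not anything from the TF-IDF code below:

```python
import numpy as np

# Dot product: multiply corresponding components and sum -> a scalar.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = a.dot(b)  # 1*4 + 2*5 + 3*6 = 32

# The angle between the vectors follows from cos(theta) = a.b / (|a| |b|),
# which is exactly how cosine similarity is defined later in the post.
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Cross product: a vector perpendicular to the plane spanned by a and b.
normal = np.cross(a, b)  # [-3, 6, -3]
```

Since `normal` is perpendicular to both inputs, its dot product with either `a` or `b` is zero.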
Note 2:
tf * idf gives a sentence vector for each document. To score a new sentence, we build its sentence vector and compute the angle between it and each of the original document vectors. After computing the new sentence's tf, we extend the original idf vector to cover the query's extra vocabulary, so that the query vector and the document matrix can be multiplied together.
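A minimal sketch of this shape bookkeeping, using made-up toy numbers (3 known vocab words, 2 docs, 2 unseen query words) rather than the real vectors from the code below:

```python
import numpy as np

# Toy setup: 3 known vocab words, 2 documents.
idf = np.array([[0.7], [1.4], [0.3]])                 # [n_vocab, 1]
tf_idf = np.array([[0.2, 0.5],
                   [0.0, 0.9],
                   [0.4, 0.1]])                       # [n_vocab, n_doc]

# A new query introduces 2 unseen words; pad both arrays with zero rows
# so the query vector and the document matrix share the vocabulary axis.
n_unknown = 2
idf_ext = np.concatenate((idf, np.zeros((n_unknown, 1))), axis=0)               # [5, 1]
tf_idf_ext = np.concatenate((tf_idf, np.zeros((n_unknown, 2))), axis=0)         # [5, 2]

# Query term counts over the extended vocabulary.
q_tf = np.array([[1.0], [0.0], [2.0], [1.0], [1.0]])  # [5, 1]
q_vec = q_tf * idf_ext                                # [5, 1] query sentence vector

# One dot product per document; zero rows mean unseen words contribute nothing.
scores = tf_idf_ext.T.dot(q_vec).ravel()              # shape [n_doc]
```

Because the padded idf rows are zero, unknown query words cannot influence the scores; they exist only to make the matrix shapes line up.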
import numpy as np
from collections import Counter
import itertools
from visual import show_tfidf  # this refers to visual.py in my [repo](https://github.com/MorvanZhou/NLP-Tutorials/)

docs = [
    "it it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    # "it is sunny today",
    # "I have a party today",
    # "it is a dog and that is a cat",
    # "there are dog and cat on the tree",
    # "I study hard this morning",
    # "today is a good day",
    # "tomorrow will be a good day",
    # "I like coffee, I like book and I like apple",
    # "I do not like it",
    # "I am kitty, I like bob",
    # "I do not care who like bob, but I like kitty",
    # "It is coffee time, bring your cup",
]
docs_words = [d.replace(",", "").split(" ") for d in docs]
vocab = sorted(set(itertools.chain(*docs_words)))
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}


def safe_log(x):
    mask = x != 0
    x[mask] = np.log(x[mask])
    return x


tf_methods = {
    "log": lambda x: np.log(1 + x),
    "augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
    "boolean": lambda x: np.minimum(x, 1),
    "log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
}
idf_methods = {
    "log": lambda x: 1 + np.log(len(docs) / (x + 1)),
    "prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x + 1))),
    "len_norm": lambda x: x / (np.sum(np.square(x)) + 1),
}


def get_tf(method="log"):
    # term frequency: how frequently a word appears in a doc
    _tf = np.zeros((len(vocab), len(docs)), dtype=np.float64)  # [n_vocab, n_doc]
    for i, d in enumerate(docs_words):
        counter = Counter(d)
        for v in counter.keys():
            _tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]

    weighted_tf = tf_methods.get(method, None)
    if weighted_tf is None:
        raise ValueError
    return weighted_tf(_tf)


def get_idf(method="log"):
    # inverse document frequency: a word that appears in more docs gets a lower idf (less informative)
    df = np.zeros((len(i2v), 1))
    for i in range(len(i2v)):
        d_count = 0
        for d in docs_words:
            d_count += 1 if i2v[i] in d else 0
        df[i, 0] = d_count

    idf_fn = idf_methods.get(method, None)
    if idf_fn is None:
        raise ValueError
    return idf_fn(df)


def cosine_similarity(q, _tf_idf):
    # note: the unit-normalization is commented out, so this actually
    # returns a raw dot product rather than a true cosine similarity
    # unit_q = q / np.sqrt(np.sum(np.square(q), axis=0, keepdims=True))
    # unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True))
    # similarity = unit_ds.T.dot(unit_q).ravel()
    similarity = _tf_idf.T.dot(q).ravel()
    return similarity


def docs_score(q, len_norm=False):
    q_words = q.replace(",", "").split(" ")

    # add unknown words
    unknown_v = 0
    for v in set(q_words):
        if v not in v2i:
            v2i[v] = len(v2i)
            i2v[len(v2i) - 1] = v
            unknown_v += 1
    if unknown_v > 0:
        _idf = np.concatenate((idf, np.zeros((unknown_v, 1), dtype=np.float64)), axis=0)
        _tf_idf = np.concatenate((tf_idf, np.zeros((unknown_v, tf_idf.shape[1]), dtype=np.float64)), axis=0)
    else:
        _idf, _tf_idf = idf, tf_idf
    counter = Counter(q_words)
    q_tf = np.zeros((len(_idf), 1), dtype=np.float64)  # [n_vocab, 1]
    for v in counter.keys():
        q_tf[v2i[v], 0] = counter[v]

    q_vec = q_tf * _idf  # [n_vocab, 1]

    q_scores = cosine_similarity(q_vec, _tf_idf)
    if len_norm:
        len_docs = [len(d) for d in docs_words]
        q_scores = q_scores / np.array(len_docs)
    return q_scores


def get_keywords(n=2):
    for c in range(3):
        col = tf_idf[:, c]
        idx = np.argsort(col)[-n:]
        print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))


tf = get_tf()  # [n_vocab, n_doc]
idf = get_idf() # [n_vocab, 1]
tf_idf = tf * idf # [n_vocab, n_doc]
print("tf shape(vocab in each doc): ", tf.shape)
print("\ntf samples:\n", tf[:2])
print("\nidf shape(vocab in all docs): ", idf.shape)
print("\nidf samples:\n", idf[:2])
print("\ntf_idf shape: ", tf_idf.shape)
print("\ntf_idf sample:\n", tf_idf[:2])

# test
get_keywords()
q = "I get a coffee cup"
scores = docs_score(q)
d_ids = scores.argsort()[-3:][::-1]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in d_ids]))

show_tfidf(tf_idf.T, [i2v[i] for i in range(tf_idf.shape[0])], "tfidf_matrix")
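As noted in cosine_similarity, the script scores documents with a raw dot product (the unit-normalization is commented out), which favors documents whose tf-idf vectors simply have larger magnitude. A toy sketch with made-up vectors shows how the two choices can rank documents differently:

```python
import numpy as np

# Hypothetical query and two document vectors (columns), not from the script above.
q = np.array([[1.0], [0.0], [1.0]])       # [n_vocab, 1]
docs_mat = np.array([[3.0, 0.5],
                     [3.0, 0.0],
                     [3.0, 0.5]])         # [n_vocab, n_doc]

# Raw dot product (what the script's cosine_similarity returns):
raw = docs_mat.T.dot(q).ravel()           # the "heavier" doc 0 wins

# Unit-normalized version (true cosine similarity):
unit_q = q / np.linalg.norm(q)
unit_d = docs_mat / np.sqrt(np.sum(docs_mat ** 2, axis=0, keepdims=True))
cos = unit_d.T.dot(unit_q).ravel()        # doc 1 points in exactly q's direction
```

Here the raw dot product ranks doc 0 first because its vector is large, while true cosine similarity ranks doc 1 first because its direction matches the query exactly; this is the trade-off the commented-out normalization (and the len_norm flag in docs_score) is about.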
tf shape(vocab in each doc):  (14, 3)

tf samples:
 [[0.40546511 0.69314718 0.69314718]
 [0.40546511 0.         0.        ]]

idf shape(vocab in all docs):  (14, 1)

idf samples:
 [[0.71231793]
 [1.40546511]]

tf_idf shape:  (14, 3)

tf_idf sample:
 [[0.28882007 0.49374116 0.49374116]
 [0.56986706 0.         0.        ]]
doc0, top2 keywords ['stay', 'it']
doc1, top2 keywords ['be', 'happy']
doc2, top2 keywords ['am', 'bob']
top 3 docs for 'I get a coffee cup':
['it it is a good day, I like to stay here', 'I am bob', 'I am happy to be here']
# visual.py
import os

import matplotlib.pyplot as plt
import numpy as np


def show_tfidf(tfidf, vocab, filename):
    # [n_doc, n_vocab]
    plt.imshow(tfidf, cmap="YlGn", vmin=tfidf.min(), vmax=tfidf.max())
    plt.xticks(np.arange(tfidf.shape[1]), vocab, fontsize=6, rotation=90)
    plt.yticks(np.arange(tfidf.shape[0]), np.arange(1, tfidf.shape[0] + 1), fontsize=6)
    plt.tight_layout()
    # create the output folder
    output_folder = './visual/results/'
    os.makedirs(output_folder, exist_ok=True)
    plt.savefig(os.path.join(output_folder, '%s.png' % filename), format="png", dpi=500)
    plt.show()