Computing English Text Similarity with Cosine Similarity in Python

Reference: https://blog.csdn.net/u012160689/article/details/15341303

# -*- coding:utf-8 -*-
# Cosine similarity measure, see http://blog.csdn.net/u012160689/article/details/15341303
import math
import re
import datetime
import time

text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."


def count_words(text):
    # Count word frequencies: split on spaces, keep letters only, lowercase
    counts = {}
    for word in text.split(' '):
        word = re.sub('[^a-zA-Z]', '', word).lower()
        if word != '':
            counts[word] = counts.get(word, 0) + 1
    return counts


def compute_cosine(text_a, text_b):
    # Find the words and their frequencies
    # (Python 3: dict.has_key() and iteritems() no longer exist)
    words1_dict = count_words(text_a)
    words2_dict = count_words(text_b)

    # Build the shared vocabulary: the words of text_a by descending
    # frequency, followed by the words that appear only in text_b
    dic1 = sorted(words1_dict.items(), key=lambda kv: kv[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda kv: kv[1], reverse=True)
    words_key = [word for word, _ in dic1]
    for word, _ in dic2:
        if word not in words_key:
            words_key.append(word)

    # Turn each text into a frequency vector over the vocabulary
    vect1 = [words1_dict.get(word, 0) for word in words_key]
    vect2 = [words2_dict.get(word, 0) for word in words_key]

    # Compute the cosine similarity
    dot = 0  # renamed from "sum" to avoid shadowing the built-in
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        dot += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(dot) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        # One of the texts contains no words
        result = 0.0
    return result


if __name__ == '__main__':
    # compute_cosine(text1, text2)
    # Time 10 runs with both datetime and time
    begin_time = datetime.datetime.now()
    begin = time.time()
    i = 0
    while i < 10:
        compute_cosine(text3, text4)
        i += 1
    end_time = datetime.datetime.now()
    end = time.time()
    print("datetime:", end_time - begin_time)
    print("time:", end - begin)
    # Two ways of rounding to two decimal places
    print(round(0.955, 2))
    print(float('%.2f' % 0.956))

Most of the examples online use TF-IDF, so I wrote a plain cosine-similarity version myself.
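For comparison, the same computation can be written more compactly with `collections.Counter`. This is only a sketch, not the author's code: the function names `word_counts` and `cosine_similarity` are made up for illustration, it assumes Python 3, and it returns the unrounded value instead of rounding to two decimals. It keeps the same tokenization as the script above (split on spaces, strip non-letters, lowercase) and computes the dot product of the two word-count vectors divided by the product of their lengths.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Split on spaces, strip non-letters, lowercase, and count the words."""
    words = (re.sub(r"[^a-zA-Z]", "", token).lower() for token in text.split(" "))
    return Counter(word for word in words if word)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text without any words has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# "a b b" -> (a=1, b=2), "b c" -> (b=1, c=1): 2 / (sqrt(5) * sqrt(2))
print(round(cosine_similarity("a b b", "b c"), 3))  # 0.632
```

Checking for a zero norm up front replaces the `try`/`except ZeroDivisionError` in the script; the two approaches behave the same for empty input.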
