Reference: https://blog.csdn.net/u012160689/article/details/15341303

# -*- coding:utf-8 -*-
# Cosine similarity measure: http://blog.csdn.net/u012160689/article/details/15341303
import math
import re
import datetime
import time

text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."


def compute_cosine(text_a, text_b):
    # Split into words and count word frequencies
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # Drop every non-letter character, then lowercase
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words1_dict:  # "in" replaces Python 2's has_key()
            words1_dict[word] += 1
        elif word != '':
            words1_dict[word] = 1
    for word in words2:
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            words2_dict[word] += 1
        elif word != '':
            words2_dict[word] = 1

    # Sort by frequency, highest first (items() replaces Python 2's iteritems())
    dic1 = sorted(words1_dict.items(), key=lambda item: item[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda item: item[1], reverse=True)

    # Build the word vectors over the merged vocabulary
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # append the word to the list
    for i in range(len(dic2)):
        if dic2[i][0] not in words_key:  # merge in words seen only in text_b
            words_key.append(dic2[i][0])
    vect1 = []
    vect2 = []
    for word in words_key:
        vect1.append(words1_dict.get(word, 0))
        vect2.append(words2_dict.get(word, 0))

    # Compute the cosine similarity
    dot_product = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        dot_product += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(dot_product) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        # One of the texts contains no words
        result = 0.0
    return result


if __name__ == '__main__':
    # compute_cosine(text1, text2)
    begin_time = datetime.datetime.now()
    begin = time.time()
    i = 0
    while i < 10:
        compute_cosine(text3, text4)
        i += 1
    end_time = datetime.datetime.now()
    end = time.time()
    print("datetime:", end_time - begin_time)
    print("time:", end - begin)
    print(round(0.955, 2))
    print(float('%.2f' % 0.956))

Everything I found online uses TF-IDF, so I wrote a plain cosine version myself.
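For comparison, the same raw term-frequency cosine can be written more compactly with `collections.Counter`. This is only a sketch, not the original author's code: the names `word_counts` and `cosine_similarity` are made up for illustration, the tokenization is assumed to match `compute_cosine` (split on spaces, keep letters only, lowercase), and the result is left unrounded.

```python
import math
import re
from collections import Counter


def word_counts(text):
    # Assumed tokenization: split on spaces, strip non-letters, lowercase
    cleaned = (re.sub('[^a-zA-Z]', '', w).lower() for w in text.split(' '))
    return Counter(w for w in cleaned if w)


def cosine_similarity(text_a, text_b):
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    # Only words present in both texts contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # At least one text has no words, so the similarity is defined as 0
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == '__main__':
    print(cosine_similarity("a a b", "a b"))
```

Iterating over the shared keys skips building the merged vocabulary list and the two explicit vectors, since any word missing from one text contributes zero to the dot product anyway.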
