(Featured) Guangdong University of Technology 2018 Real-Time Big Data Analysis: Shingling & MinHash Lab Report

I. Experiment Content

Use shingling and MinHash to analyze the Jaccard similarity of the following two passages:

(1) The TOEFL test is an English language assessment that is often required for admission by English-speaking universities and programs around the world. In addition to being accepted at more than 10,000 institutions in over 130 countries, including Australia, Canada, and the US, TOEFL scores help you get noticed by admissions officers who consider the TOEFL test a more accurate measure of your ability to succeed in a university setting.

(2) The TOEFL test is the most widely respected English-language test in the world, recognized by more than 10,000 colleges, universities and agencies in more than 130 countries, including Australia, Canada, the U.K. and the United States. Wherever you want to study, the TOEFL test can help you get there.

II. Experiment Design (Principles and Workflow)
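
As a brief sketch of the standard definitions this experiment relies on (a textbook formulation, not text from the original report; the symbols $S_k(d)$, $A$, $B$, and $\pi$ are notation introduced here):

```latex
% k-shingle set of a document d, viewed as a character string d_1 d_2 ... d_n
\[
S_k(d) = \{\, d_i d_{i+1} \cdots d_{i+k-1} \;:\; 1 \le i \le n-k+1 \,\}
\]

% Jaccard similarity of two shingle sets A = S_k(d_1) and B = S_k(d_2);
% the second form uses |A \cap B| = |A| + |B| - |A \cup B|
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|A| + |B| - |A \cup B|}{|A \cup B|}
\]

% MinHash property: for a uniformly random permutation \pi of the rows
% of the characteristic matrix (the elements of A \cup B),
\[
\Pr\bigl[\, \min \pi(A) = \min \pi(B) \,\bigr] = J(A, B)
\]
```

The workflow then follows directly: clean each text, build its k-shingle set, compute the Jaccard similarity exactly, and compare it with the value estimated from minhash values.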

III. Experiment Code and Data Records

1. Code

1.0 File structure diagram

1.1 KShingle.java

package com.devyy;

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Set;
import java.util.TreeSet;

public class KShingle {

    // Text 1
    protected static final String str1 = "The TOEFL test is an English language assessment that is often required for admission by "
            + "English-speaking universities and programs around the world. In addition to being accepted at "
            + "more than 10,000 institutions in over 130 countries, including Australia, Canada, and the US, "
            + "TOEFL scores help you get noticed by admissions officers who consider the TOEFL test a more "
            + "accurate measure of your ability to succeed in a university setting.";

    // Text 2
    protected static final String str2 = "The TOEFL test is the most widely respected English-language test in the world, recognized by "
            + "more than 10,000 colleges, universities and agencies in more than 130 countries, including "
            + "Australia, Canada, the U.K. and the United States. Wherever you want to study, the TOEFL "
            + "test can help you get there.";

    // Remove stop words, punctuation, and whitespace (tailored to this sample text only).
    // The patterns are plain substring matches, so "and " would also match at the end of a longer word,
    // and "\\D\\." deletes the non-digit character in front of each period together with the period.
    protected static String deleteWord(String str) {
        return str.replaceAll("and ", "")
                .replaceAll("by ", "")
                .replaceAll("the ", "")
                .replaceAll("of ", "")
                .replaceAll("with ", "")
                .replaceAll("\\)", "")
                .replaceAll("\\(", "")
                .replaceAll(",", "")
                .replaceAll("\\D\\.", "")
                .replaceAll("\\s", "");
    }

    // Split the string into its set of k-shingles.
    protected static Set<String> split(String str, int k) {
        // A TreeSet (rather than a HashSet) keeps the shingles sorted,
        // which lowers the cost of building the characteristic matrix in the MinHash step.
        Set<String> shingSet = new TreeSet<String>();
        for (int i = 0; i <= str.length() - k; i++) {
            shingSet.add(str.substring(i, i + k));
        }
        return shingSet;
    }

    // Print the Jaccard similarity of the two texts and return the union of their shingle sets.
    protected static Set<String> jaccard(int k) throws IOException {
        String replacedStr1 = deleteWord(str1);
        String replacedStr2 = deleteWord(str2);
        Set<String> set1 = split(replacedStr1, k);
        Set<String> set2 = split(replacedStr2, k);
        Set<String> allElementSet = new TreeSet<String>();
        allElementSet.addAll(set1);
        allElementSet.addAll(set2);
        // |A ∩ B| = |A| + |B| - |A ∪ B|
        double jaccardValue = (set1.size() + set2.size() - allElementSet.size()) * 1.0 / allElementSet.size();
        DecimalFormat df = new DecimalFormat("0.00");
        System.out.println("Similarity of the two texts using " + k + "-shingles: " + df.format(jaccardValue));
        return allElementSet;
    }
}
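
The inclusion-exclusion step in `jaccard` is easy to check by hand on a tiny input. The following is a minimal sketch, not part of the original project: the class `ShingleDemo` and its method names are hypothetical, and it skips the stop-word cleanup so the shingles are easy to enumerate.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; not part of the original com.devyy project.
final class ShingleDemo {

    // Collect every length-k substring of s into a sorted set.
    static Set<String> shingles(String s, int k) {
        Set<String> out = new TreeSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            out.add(s.substring(i, i + k));
        }
        return out;
    }

    // Jaccard similarity via inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new TreeSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // convention chosen here for two empty sets
        }
        return (a.size() + b.size() - union.size()) / (double) union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("abcd", 2); // {ab, bc, cd}
        Set<String> b = shingles("bcde", 2); // {bc, cd, de}
        System.out.println(jaccard(a, b));   // 2 shared shingles out of 4 distinct: prints 0.5
    }
}
```

With k = 2, "abcd" and "bcde" share the shingles `bc` and `cd` out of four distinct shingles in total, so the similarity is 2 / 4.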

1.2 Main.java

package com.devyy;

import java.io.IOException;
import java.util.Scanner;
import java.util.Set;

public class Main {

    public static void main(String[] args) throws IOException {
        System.out.println("----------- Jaccard similarity of the two texts via k-shingling -----------");
        System.out.println("Enter the value of k for k-shingling:");
        Scanner scann = new Scanner(System.in);
        int k = scann.nextInt();
        scann.close();
        // jaccard() prints the exact similarity and returns the union of both shingle sets.
        Set<String> set = KShingle.jaccard(k);
        System.out.println("----------- Jaccard similarity of the two texts via MinHash -----------");
        MinHash.minHashJaccard(k, set);
        System.out.println("----------- Minhash signatures using hash functions instead of row permutations -----------");
        MinHashSignature.signatureJaccard(set, k);
    }
}

1.3 MinHash.java

package com.devyy;

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Random;
import java.util.Set;

public class MinHash {

    // Build the k-shingle set of a text.
    protected static Set<String> getSet(int k, String str) throws IOException {
        String replacedStr = KShingle.deleteWord(str);
        return KShingle.split(replacedStr, k);
    }

    // Build the characteristic matrix.
    // Columns: 0 = shingle, 1 = present in text 1, 2 = present in text 2,
    // 3-4 = reserved for the two hash functions used later by the minhash signature.
    protected static String[][] characteristicMatrix(Set<String> set, Set<String> set1, Set<String> set2) {
        String[] all = set.toArray(new String[0]);
        String[] set1Array = set1.toArray(new String[0]);
        String[] set2Array = set2.toArray(new String[0]);
        String[][] matrix = new String[all.length][5];
        for (int i = 0; i < all.length; i++) {
            matrix[i][0] = all[i];
            for (int j = 1; j < matrix[i].length; j++) {
                matrix[i][j] = "0";
            }
        }
        // All three sets are sorted (TreeSet), so one merge-style pass per column is enough.
        int temp = 0;
        for (int j = 0; j < all.length && temp < set1Array.length; j++) {
            if (matrix[j][0].equals(set1Array[temp])) {
                matrix[j][1] = "1";
                temp++;
            }
        }
        temp = 0;
        for (int j = 0; j < all.length && temp < set2Array.length; j++) {
            if (matrix[j][0].equals(set2Array[temp])) {
                matrix[j][2] = "1";
                temp++;
            }
        }
        return matrix;
    }

    // Shuffle rows by swapping two randomly chosen rows, ten times.
    protected static String[][] rowMess(String[][] matrix) {
        Random r = new Random();
        for (int i = 0; i < 10; i++) {
            int rowNumber1 = r.nextInt(matrix.length);
            int rowNumber2 = r.nextInt(matrix.length);
            String[] temp = matrix[rowNumber2];
            matrix[rowNumber2] = matrix[rowNumber1];
            matrix[rowNumber1] = temp;
        }
        return matrix;
    }

    // Compute the similarity from the characteristic matrix.
    // Note: this counts the rows that hold a 1 in both columns and divides by the number of rows,
    // which is the exact Jaccard similarity; the row shuffle does not change that count.
    protected static double minHashJaccard(int k, Set<String> set) throws IOException {
        Set<String> set1 = getSet(k, KShingle.str1);
        Set<String> set2 = getSet(k, KShingle.str2);
        String[][] matrix = characteristicMatrix(set, set1, set2);
        matrix = rowMess(matrix);
        System.out.println("By definition: the probability that two sets have equal minhash values "
                + "after a random permutation of the rows equals their Jaccard similarity");
        int equalHashValue = 0;
        for (int i = 0; i < matrix.length; i++) {
            if (matrix[i][1].equals("1") && matrix[i][2].equals("1")) {
                equalHashValue++;
            }
        }
        System.out.println("Number of items in the union: " + set.size());
        System.out.println("Number of rows where both are 1 (the shingle appears in both texts): " + equalHashValue);
        double result = equalHashValue * 1.0 / set.size();
        DecimalFormat df = new DecimalFormat("0.00");
        System.out.println("Probability that the two sets have equal minhash values: P = " + equalHashValue + " / "
                + set.size() + " = " + df.format(result));
        System.out.println("So the Jaccard similarity of the two texts computed via MinHash is: " + df.format(result));
        return result; // the original returned the raw count instead of the ratio
    }
}
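
`minHashJaccard` counts the rows where both columns are 1, so its result is the exact Jaccard similarity rather than an estimate. To see the property quoted in its output at work, one can repeat the random permutation many times and count how often the two minhash values agree. The following is a minimal sketch under that assumption; `PermutationMinHashDemo`, its methods, and the row names are hypothetical and not taken from the original project.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical demo class; not part of the original com.devyy project.
final class PermutationMinHashDemo {

    // Minhash of a set under one row order: index of the first row that belongs to the set.
    static int minHash(List<String> permutedRows, Set<String> set) {
        for (int i = 0; i < permutedRows.size(); i++) {
            if (set.contains(permutedRows.get(i))) {
                return i;
            }
        }
        return -1; // the set contains none of the rows
    }

    // Fraction of random permutations under which both sets get the same minhash value.
    static double estimate(Set<String> a, Set<String> b, int trials, long seed) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        List<String> rows = new ArrayList<>(union);
        Collections.sort(rows); // fixed starting order so a given seed is reproducible
        Random rng = new Random(seed);
        int equal = 0;
        for (int t = 0; t < trials; t++) {
            Collections.shuffle(rows, rng); // one uniformly random permutation of the rows
            if (minHash(rows, a) == minHash(rows, b)) {
                equal++;
            }
        }
        return equal / (double) trials;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>();
        Set<String> b = new HashSet<>();
        for (int i = 0; i < 60; i++) a.add("r" + i);  // rows r0..r59
        for (int i = 30; i < 90; i++) b.add("r" + i); // rows r30..r89
        // Exact Jaccard similarity is 30 / 90; the estimate should land close to it.
        System.out.println("estimate = " + estimate(a, b, 2000, 42L));
    }
}
```

The estimate fluctuates around the exact value, and the spread shrinks as the number of permutations grows. Shuffling the full row list each time is expensive for large inputs, which is the motivation for replacing permutations with hash functions in the next listing.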

1.4 MinHashSignature.java

package com.devyy;

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Set;

public class MinHashSignature {

    protected static final int INF = 10000;

    // Build the characteristic matrix.
    protected static String[][] createCharacteristicMatrix(Set<String> set, int k) throws IOException {
        Set<String> set1 = MinHash.getSet(k, KShingle.str1);
        Set<String> set2 = MinHash.getSet(k, KShingle.str2);
        return MinHash.characteristicMatrix(set, set1, set2);
    }

    // Append the results of the hash functions h1(r) = (3r + 1) mod 7 and h2(r) = (5r + 1) mod 7
    // (r is the row index) in columns 3 and 4.
    protected static String[][] addHashFunction(String[][] matrix) {
        for (int i = 0; i < matrix.length; i++) {
            matrix[i][3] = Integer.toString((3 * i + 1) % 7);
            matrix[i][4] = Integer.toString((5 * i + 1) % 7);
        }
        return matrix;
    }

    // Compute the signature matrix: signatureMatrix[h][s] is the smallest value of hash function h
    // over the rows where set s has a 1.
    protected static int[][] signatureCount(String[][] matrix) {
        int[][] signatureMatrix = new int[][] { { INF, INF }, { INF, INF } };
        int i, j;
        for (i = 0; i < matrix.length; i++) {
            int h1 = Integer.parseInt(matrix[i][3]);
            int h2 = Integer.parseInt(matrix[i][4]);
            if (matrix[i][1].equals("1")) {
                signatureMatrix[0][0] = Math.min(signatureMatrix[0][0], h1);
                signatureMatrix[1][0] = Math.min(signatureMatrix[1][0], h2);
            }
            if (matrix[i][2].equals("1")) {
                signatureMatrix[0][1] = Math.min(signatureMatrix[0][1], h1);
                // The original stored the h1 value (column 3) here; h2 is column 4.
                signatureMatrix[1][1] = Math.min(signatureMatrix[1][1], h2);
            }
        }
        System.out.println("Resulting signature matrix:");
        System.out.println("     S1  S2");
        for (i = 0; i < signatureMatrix.length; i++) {
            System.out.print("h" + (i + 1) + "    ");
            for (j = 0; j < signatureMatrix[0].length; j++) {
                System.out.print(signatureMatrix[i][j] + "  ");
            }
            System.out.println();
        }
        return signatureMatrix;
    }

    // Estimate the Jaccard similarity as the fraction of signature rows on which S1 and S2 agree.
    protected static double signatureJaccard(Set<String> set, int k) throws IOException {
        int count = 0;
        String[][] matrix = createCharacteristicMatrix(set, k);
        matrix = addHashFunction(matrix);
        int[][] signatureMatrix = signatureCount(matrix);
        for (int i = 0; i < signatureMatrix.length; i++) {
            if (signatureMatrix[i][0] == signatureMatrix[i][1]) {
                count++;
            }
        }
        double result = count * 1.0 / signatureMatrix.length;
        DecimalFormat df = new DecimalFormat("0.00");
        System.out.println("So the estimated SIM(S1, S2) = " + df.format(result));
        return result;
    }
}
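
With only two hash functions, the estimate above can take just the values 0, 0.5, or 1, and a modulus of 7 maps many rows to the same hash value once the matrix has more than seven rows. A common refinement, sketched here as an assumption rather than as the author's method, is to use n hash functions of the form h(r) = (a·r + b) mod p with a prime p larger than the number of rows. The class `SignatureDemo`, the constant `P`, and the random choice of coefficients are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical demo class; not part of the original com.devyy project.
final class SignatureDemo {

    static final int P = 10007; // a prime; assumed to exceed the number of rows

    // Signature of one column: for each hash function h_j(r) = (a[j] * r + b[j]) mod P,
    // keep the smallest hash value over the rows r where the column holds a 1.
    static int[] signature(boolean[] column, int[] a, int[] b) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int r = 0; r < column.length; r++) {
            if (!column[r]) {
                continue;
            }
            for (int j = 0; j < a.length; j++) {
                int h = (int) (((long) a[j] * r + b[j]) % P);
                sig[j] = Math.min(sig[j], h);
            }
        }
        return sig;
    }

    // Fraction of signature positions on which the two columns agree.
    static double similarity(int[] s1, int[] s2) {
        int equal = 0;
        for (int j = 0; j < s1.length; j++) {
            if (s1[j] == s2[j]) {
                equal++;
            }
        }
        return equal / (double) s1.length;
    }

    // Draw n hash functions (a in [1, P-1], b in [0, P-1]) and compare the two signatures.
    static double estimate(boolean[] c1, boolean[] c2, int n, long seed) {
        Random rng = new Random(seed);
        int[] a = new int[n];
        int[] b = new int[n];
        for (int j = 0; j < n; j++) {
            a[j] = 1 + rng.nextInt(P - 1);
            b[j] = rng.nextInt(P);
        }
        return similarity(signature(c1, a, b), signature(c2, a, b));
    }

    public static void main(String[] args) {
        boolean[] c1 = new boolean[90];
        boolean[] c2 = new boolean[90];
        Arrays.fill(c1, 0, 60, true);  // set 1 occupies rows 0..59
        Arrays.fill(c2, 30, 90, true); // set 2 occupies rows 30..89
        // Exact Jaccard similarity is 30 / 90; the printed value is only an estimate of it.
        System.out.println("estimate = " + estimate(c1, c2, 200, 7L));
    }
}
```

Because a is nonzero and every row index is below P, each hash function maps distinct rows to distinct values, so it behaves like a permutation of the rows without the matrix ever being reordered. Linear hash functions are only approximately min-wise independent, so the estimate can carry some bias in addition to sampling noise.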

2. Result screenshots

(1) When k = 1:

(2) When k = 2:

(3) When k = 3:
