Java String Alignment: A Java Implementation of Optimal String Alignment

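A while back, I was using the Apache Commons Lang StringUtils implementation of Levenshtein distance. It implements a few well-known tricks to use less memory by holding on to just two arrays instead of allocating a huge n x m table for memoization. It also checks only a "stripe" of width 2 * k + 1, where k is the maximum number of edits.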
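The two tricks described above can be sketched for plain Levenshtein roughly as follows. This is a minimal illustration, not the Apache Commons Lang code: the class name `BandedLevenshtein`, the method `distance`, and the `INF` marker are hypothetical names chosen for this sketch, and returning -1 when the threshold is exceeded is an assumed convention.

```java
import java.util.Arrays;

class BandedLevenshtein {

    /** Levenshtein distance between s and t, or -1 if it is larger than k. */
    static int distance(String s, String t, int k) {
        int n = s.length();
        int m = t.length();
        // The length difference alone already needs that many edits.
        if (Math.abs(n - m) > k) {
            return -1;
        }
        final int INF = Integer.MAX_VALUE / 2; // "unreachable" marker, safe to add 1 to

        // Trick 1: keep only the previous and the current row of the table.
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        Arrays.fill(prev, INF);
        Arrays.fill(curr, INF);
        for (int i = 0; i <= Math.min(n, k); i++) {
            prev[i] = i; // row 0: i deletions
        }

        for (int j = 1; j <= m; j++) {
            // Trick 2: only visit the stripe of width 2 * k + 1 around the diagonal.
            int lo = Math.max(1, j - k);
            int hi = Math.min(n, j + k);
            // The cell just left of the stripe is either column 0 or out of reach.
            curr[lo - 1] = (lo == 1) ? j : INF;
            for (int i = lo; i <= hi; i++) {
                int substitute = prev[i - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                int fromPrevRow = prev[i] + 1;
                int fromCurrRow = curr[i - 1] + 1;
                curr[i] = Math.min(substitute, Math.min(fromPrevRow, fromCurrRow));
            }
            int[] tmp = prev; // swap rows to move on to the next one
            prev = curr;
            curr = tmp;
        }
        return prev[n] <= k ? prev[n] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 3));
        System.out.println(distance("kitten", "sitting", 2));
    }
}
```

Each row touches at most 2 * k + 1 cells, so the work is roughly proportional to k times the string length rather than n * m, and memory stays at two arrays.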

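In most practical uses of Levenshtein, you only care whether one string is within a small number of edits (1, 2, 3) of another. The stripe avoids most of the n * m computation that makes Levenshtein "expensive". We found that with k <= 3, Levenshtein with these tricks was faster than Jaro-Winkler distance, an approximate edit-distance calculation that was created partly to be a faster approximation (there were other reasons as well).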

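Unfortunately, the Apache Commons Lang implementation only calculates Levenshtein and not the possibly more useful Damerau-Levenshtein (DL) distance. Levenshtein defines the edit operations insert, delete, and substitute. The Damerau variant adds *transposition* to the list, which is very useful in most of the places where I use edit distance. Unfortunately, the simpler variant that is often implemented under this name, optimal string alignment (OSA), is not a true metric because it does not satisfy the triangle inequality, although plenty of applications are unaffected by this. As the Wikipedia page on Damerau-Levenshtein distance shows, OSA and DL distance are often confused. In practice, OSA is a simpler algorithm that requires less bookkeeping, so its runtime is probably slightly faster.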
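To make the difference concrete, here is a small full-table OSA sketch (no memory or stripe tricks) run on the example strings "CA", "AC", and "ABC". The class name `OsaTriangleDemo` and the method `osa` are hypothetical names used only for this illustration.

```java
class OsaTriangleDemo {

    /** Textbook full-table OSA distance: Levenshtein plus adjacent transposition. */
    static int osa(String a, String b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                // Transposition of two adjacent characters counts as one edit,
                // but the swapped pair can not be edited again afterwards.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(osa("CA", "AC"));  // 1: one transposition
        System.out.println(osa("AC", "ABC")); // 1: one insertion
        System.out.println(osa("CA", "ABC")); // 3
    }
}
```

Going from "CA" to "ABC" through "AC" costs 1 + 1 = 2, yet OSA reports 3 for the direct pair because it will not insert the "B" between two characters it has already transposed. That is the triangle inequality violation; the unrestricted DL distance for the same pair is 2.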

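I couldn't find any OSA or DL implementation that used the memory tricks and the "stripe" trick I saw in Apache Commons Lang, so I implemented my own OSA using those tricks. At some point I will also implement DL with the same tricks and see what the performance difference is.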

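Here is the OSA in Java. It is public domain; feel free to use it as you like. The unit tests (written in Groovy) follow the class. The only dependency is Guava, and it is used only for the Preconditions checks, `Shorts.checkedCast`, and the `@VisibleForTesting` documentation annotation, so you can easily remove the dependency if you like: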

package com.github.steveash.util;

import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;
import static com.google.common.primitives.Shorts.checkedCast;
import static java.lang.Math.abs;
import static java.lang.Math.max;

import java.util.Arrays;

import com.google.common.annotations.VisibleForTesting;

/**
 * Implementation of the OSA which is similar to the Damerau-Levenshtein in that it allows for transpositions to
 * count as a single edit distance, but is not a true metric and can over-estimate the cost because it disallows
 * substrings to be edited more than once.  See wikipedia for more discussion on OSA vs DL
 * <p/>
 * See Algorithms on Strings, Trees and Sequences by Dan Gusfield for more information.
 * <p/>
 * This also has a set of local buffer implementations to avoid allocating new buffers each time, which might be
 * a premature optimization
 * <p/>
 * @author Steve Ash
 */
public class OptimalStringAlignment {

    private static final int threadLocalBufferSize = 64;

    private static final ThreadLocal<short[]> costLocal = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    private static final ThreadLocal<short[]> back1Local = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    private static final ThreadLocal<short[]> back2Local = new ThreadLocal<short[]>() {
        @Override
        protected short[] initialValue() {
            return new short[threadLocalBufferSize];
        }
    };

    public static int editDistance(CharSequence s, CharSequence t, int threshold) {
        checkNotNull(s, "cannot measure null strings");
        checkNotNull(t, "cannot measure null strings");
        checkArgument(threshold >= 0, "Threshold must not be negative");
        checkArgument(s.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");
        checkArgument(t.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");

        if (s.length() + 1 > threadLocalBufferSize || t.length() + 1 > threadLocalBufferSize)
            return editDistanceWithNewBuffers(s, t, checkedCast(threshold));

        short[] cost = costLocal.get();
        short[] back1 = back1Local.get();
        short[] back2 = back2Local.get();
        return editDistanceWithBuffers(s, t, checkedCast(threshold), back2, back1, cost);
    }

    @VisibleForTesting
    static int editDistanceWithNewBuffers(CharSequence s, CharSequence t, short threshold) {
        int slen = s.length();
        short[] back1 = new short[slen + 1];    // "up 1" row in table
        short[] back2 = new short[slen + 1];    // "up 2" row in table
        short[] cost = new short[slen + 1];     // "current cost"

        return editDistanceWithBuffers(s, t, threshold, back2, back1, cost);
    }

    private static int editDistanceWithBuffers(CharSequence s, CharSequence t, short threshold,
            short[] back2, short[] back1, short[] cost) {

        short slen = (short) s.length();
        short tlen = (short) t.length();

        // if one string is empty, the edit distance is necessarily the length of the other
        if (slen == 0) {
            return tlen <= threshold ? tlen : -1;
        } else if (tlen == 0) {
            return slen <= threshold ? slen : -1;
        }

        // if lengths are different > k, then can't be within edit distance
        if (abs(slen - tlen) > threshold)
            return -1;

        if (slen > tlen) {
            // swap the two strings to consume less memory
            CharSequence tmp = s;
            s = t;
            t = tmp;
            slen = tlen;
            tlen = (short) t.length();
        }

        initMemoiseTables(threshold, back2, back1, cost, slen);

        for (short j = 1; j <= tlen; j++) {
            cost[0] = j; // j is the cost of inserting this many characters

            // stripe bounds
            int min = max(1, j - threshold);
            int max = min(slen, (short) (j + threshold));

            // at this iteration the left most entry is "too much" so reset it
            if (min > 1) {
                cost[min - 1] = Short.MAX_VALUE;
            }

            iterateOverStripe(s, t, j, cost, back1, back2, min, max);

            // swap our cost arrays to move on to the next "row"
            short[] tempCost = back2;
            back2 = back1;
            back1 = cost;
            cost = tempCost;
        }

        // after exit, the current cost is in back1
        // if back1[slen] > k then we exceeded, so return -1
        if (back1[slen] > threshold) {
            return -1;
        }
        return back1[slen];
    }

    private static void iterateOverStripe(CharSequence s, CharSequence t, short j,
            short[] cost, short[] back1, short[] back2, int min, int max) {

        // iterates over the stripe
        for (int i = min; i <= max; i++) {

            if (s.charAt(i - 1) == t.charAt(j - 1)) {
                cost[i] = back1[i - 1];
            } else {
                cost[i] = (short) (1 + min(cost[i - 1], back1[i], back1[i - 1]));
            }
            if (i >= 2 && j >= 2) {
                // possible transposition to check for
                if ((s.charAt(i - 2) == t.charAt(j - 1)) &&
                        s.charAt(i - 1) == t.charAt(j - 2)) {
                    cost[i] = min(cost[i], (short) (back2[i - 2] + 1));
                }
            }
        }
    }

    private static void initMemoiseTables(short threshold, short[] back2, short[] back1, short[] cost, short slen) {
        // initial "starting" values for inserting all the letters
        short boundary = (short) (min(slen, threshold) + 1);
        for (short i = 0; i < boundary; i++) {
            back1[i] = i;
            back2[i] = i;
        }
        // need to make sure that we don't read a default value when looking "up"
        Arrays.fill(back1, boundary, slen + 1, Short.MAX_VALUE);
        Arrays.fill(back2, boundary, slen + 1, Short.MAX_VALUE);
        Arrays.fill(cost, 0, slen + 1, Short.MAX_VALUE);
    }

    private static short min(short a, short b) {
        return (a <= b ? a : b);
    }

    private static short min(short a, short b, short c) {
        return min(a, min(b, c));
    }
}

import org.junit.Test

import static com.github.steveash.util.OptimalStringAlignment.editDistance

/**
 * @author Steve Ash
 */
class OptimalStringAlignmentTest {

    @Test
    public void shouldBeZeroForEqualStrings() throws Exception {
        assert 0 == editDistance("steve", "steve", 1)
        assert 0 == editDistance("steve", "steve", 0)
        assert 0 == editDistance("steve", "steve", 2)
        assert 0 == editDistance("steve", "steve", 100)

        assert 0 == editDistance("s", "s", 1)
        assert 0 == editDistance("s", "s", 0)
        assert 0 == editDistance("s", "s", 2)
        assert 0 == editDistance("s", "s", 100)

        assert 0 == editDistance("", "", 0)
        assert 0 == editDistance("", "", 1)
        assert 0 == editDistance("", "", 100)
    }

    @Test
    public void shouldBeOneForSingleOperation() throws Exception {
        def a = "steve";
        for (int i = 0; i < 5; i++) {
            assertOneOp(new StringBuilder(a).insert(i, 'f'), a)
            assertOneOp(new StringBuilder(a).deleteCharAt(i), a)
            def sb = new StringBuilder(a)
            sb.setCharAt(i, 'x' as char);
            assertOneOp(sb, a)

            if (i > 1) {
                sb = new StringBuilder(a)
                char t = sb.charAt(i - 1)
                sb.setCharAt(i - 1, sb.charAt(i))
                sb.setCharAt(i, t)
                println "comparing " + sb.toString() + " -> " + a
                assertOneOp(sb, a)
            }
        }
    }

    @Test
    public void shouldCountTransposeAsOne() throws Exception {
        assert 3 == editDistance("xxsteve", "steev", 4)
        assert 3 == editDistance("xxsteve", "steev", 3)
        assert 3 == editDistance("steev", "xxsteve", 4)
        assert 3 == editDistance("steev", "xxsteve", 3)
        assert -1 == editDistance("steev", "xxsteve", 2)

        assert 4 == editDistance("xxtseve", "steev", 4)
        assert 5 == editDistance("xxtsevezx", "steevxz", 5)
        assert 6 == editDistance("xxtsevezx", "steevxzpp", 6)
        assert 7 == editDistance("xxtsfevezx", "steevxzpp", 7)

        assert 4 == editDistance("xxtsf", "st", 7)
        assert 4 == editDistance("evezx", "eevxzpp", 7)
        assert 7 == editDistance("xxtsfevezx", "steevxzpp", 7)
    }

    @Test
    public void shouldCountLeadingCharacterTranspositionsAsOne() throws Exception {
        assert 1 == editDistance("rosa", "orsa", 2)
    }

    private void assertOneOp(CharSequence a, CharSequence b) {
        assert 1 == editDistance(a, b, 1)
        assert 1 == editDistance(b, a, 1)
        assert 1 == editDistance(a, b, 2)
        assert 1 == editDistance(b, a, 2)
    }

    @Test
    public void shouldShortCutWhenSpecialCase() throws Exception {
        assert 1 == editDistance("s", "", 1)
        assert 1 == editDistance("", "s", 1)

        assert -1 == editDistance("s", "", 0)
        assert -1 == editDistance("", "s", 0)
        assert -1 == editDistance("st", "", 1)
        assert -1 == editDistance("", "st", 1)
        assert -1 == editDistance("steve", "ste", 0)
        assert -1 == editDistance("ste", "steve", 0)
        assert -1 == editDistance("stev", "steve", 0)
        assert -1 == editDistance("ste", "steve", 1)
        assert -1 == editDistance("steve", "ste", 1)

        assert 1 == editDistance("steve", "stev", 1)
        assert 1 == editDistance("stev", "steve", 1)
    }
}

Reference: Java Implementation of Optimal String Alignment, from our JCG partner Steve Ash at the Many Cups of Coffee blog.

Translated from: https://www.javacodegeeks.com/2013/11/java-implementation-of-optimal-string-alignment.html
