It's a fact: people make typos or use alternate spellings all the time.

Whatever the cause, from a practical point of view, different variants of similar strings can pose challenges for software developers. Your application needs to be capable of handling these inevitable edge cases.

Take names, for example. I go by Peter in some places, Pete in others. Amongst other variants, my name can be represented by:

  • "Pete Gleeson"
  • "Peter J Gleeson"
  • "Mr P Gleeson"
  • "Gleeson, Peter"

And that's not to mention alternative spellings of my surname, such as "Gleason". That is a lot of variants for just one name, and it is not obvious how to match them against each other programmatically.

Luckily, there are solutions out there.

The generic name for these solutions is 'fuzzy string matching'. The 'fuzzy' refers to the fact that these solutions do not look for a perfect, position-by-position match when comparing two strings. Instead, they allow some degree of mismatch (or 'fuzziness').

There are solutions available in many different programming languages. Today, we'll explore some options available in Postgresql (or 'Postgres'), a widely used open source relational database with some seriously useful add-on features.

Setting up

First, make sure you have Postgres installed on your machine.

Then, create a new database in its own directory. You can call it anything you like; here, I called it 'fuzz-demo'. From the command line:

$ mkdir fuzz-demo && cd fuzz-demo
$ initdb .
$ pg_ctl -D . start
$ createdb fuzz-demo

For this demo, I used a table with details about artists in the Museum of Modern Art. You can download the artists.csv file from Kaggle.

Next, you can start psql (a terminal-based front end for Postgresql):

$ psql fuzz-demo

Now, create a table called artists:

CREATE TABLE artists (artist_id INT, name VARCHAR, nationality VARCHAR, gender VARCHAR, birth_year INT, death_year INT);

Finally, you can use Postgresql's COPY command to copy the contents of artists.csv into the table. COPY reads the file on the server side, so give it the absolute path to wherever you saved the file (a leading ~ is not expanded):

COPY artists FROM '/path/to/artists.csv' DELIMITER ',' CSV HEADER;
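
If the server process cannot read the file, psql's client-side \copy meta-command is a possible alternative. The line below is a sketch that assumes artists.csv sits in the directory from which psql was started; adjust the path to match your setup.

```sql
-- \copy is a psql meta-command: it reads the file on the client side,
-- must be written on a single line, and needs no trailing semicolon.
\copy artists FROM 'artists.csv' DELIMITER ',' CSV HEADER
```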

If everything has worked so far, you should be able to start querying the artists table.

SELECT * FROM artists LIMIT 10;

Wildcard filters

Say you remember that an artist's first name is Barbara, but cannot quite remember her surname. It began with 'Hep...', but you're not sure how it ended.

Here, you can use a LIKE filter with SQL's wildcard character %. This symbol stands in for any number of unspecified characters, including none.

SELECT *
FROM artists
WHERE name LIKE 'Barbara%'
AND name LIKE '%Hep%';

The first part of the filter finds artists whose name begins with 'Barbara', and ends in any combination of characters.

The second part of the filter finds artists whose name can begin and end with any combination of characters, but must contain the letters 'Hep' in that order.
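
As a small variation on the query above, the sketch below uses ILIKE, Postgres' case-insensitive form of LIKE, and folds both conditions into one pattern. It assumes the surname comes after the first name in the name column, which is how the earlier query treats it.

```sql
-- ILIKE ignores letter case, so 'barbara' also matches 'Barbara'.
-- The pattern reads: starts with 'barbara', then contains 'hep' somewhere later.
SELECT *
FROM artists
WHERE name ILIKE 'barbara%hep%';
```

Where % matches any run of characters, the underscore _ matches exactly one character, which helps when only a single letter is in doubt.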

But what if you are unsure of the spelling of either name? Filters and wildcards will only get you so far.

Using trigrams

Luckily, Postgres has a helpful extension with the catchy name pg_trgm. You can enable it from psql using the command below:

CREATE EXTENSION pg_trgm;

This extension brings with it some helpful functions for fuzzy string matching. The underlying principle is the use of trigrams (which sound like something out of Harry Potter).

Trigrams are formed by breaking a string into groups of three consecutive characters. Before doing so, pg_trgm lowercases each word and pads it with two spaces at the start and one at the end. For example, the string "hello" would be represented by the following set of trigrams:

  • "  h", " he", "hel", "ell", "llo", "lo "
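
You can inspect the trigrams for any string yourself with the show_trgm function, which ships with pg_trgm. A quick check, assuming the extension has been enabled as shown earlier:

```sql
-- show_trgm returns the trigram set as a text array.
-- The word is padded with two leading spaces and one trailing space.
SELECT show_trgm('hello');
-- {"  h"," he",ell,hel,llo,"lo "}
```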

By comparing how much the sets of trigrams for two strings overlap, it is possible to estimate how similar the strings are on a scale between 0 and 1. This allows for fuzzy matching, by setting a similarity threshold above which strings are considered to match.

SELECT *
FROM artists
WHERE SIMILARITY(name, 'Claud Monay') > 0.4;
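
To get a feel for the scores behind such a filter, it can help to look at them directly. The sketch below first compares two literal strings, then repeats the filter with the score shown as a column; the alias score is just an illustrative name.

```sql
-- 'hello' and 'hallo' share 3 trigrams ("  h", "llo", "lo ") out of
-- 9 distinct trigrams in total, so the score is about 0.33.
SELECT SIMILARITY('hello', 'hallo');

-- The same filter as before, with the score visible and the best matches first.
SELECT name, SIMILARITY(name, 'Claud Monay') AS score
FROM artists
WHERE SIMILARITY(name, 'Claud Monay') > 0.4
ORDER BY score DESC;
```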

Perhaps you want to see the top five matches?

SELECT *
FROM artists
ORDER BY SIMILARITY(name,'Lee Casner') DESC
LIMIT 5;

The % operator is shorthand for this kind of comparison: it returns true when the similarity of its two arguments exceeds the current threshold, which is 0.3 by default. You can use it to fuzzy match names against a potential match:

SELECT *
FROM artists
WHERE name % 'Andrey Deran';
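
The threshold used by % can be inspected and changed per session. The sketch below assumes Postgres 9.6 or later, where the setting is called pg_trgm.similarity_threshold (older versions use the show_limit and set_limit functions instead). The value 0.5 and the index name are arbitrary choices for illustration.

```sql
-- Show the threshold currently used by the % operator (0.3 unless changed).
SHOW pg_trgm.similarity_threshold;

-- Raise it for the current session to make % stricter.
SET pg_trgm.similarity_threshold = 0.5;

-- Optional: a trigram index can let % filters avoid scanning the whole table.
-- The index name here is hypothetical.
CREATE INDEX artists_name_trgm_idx ON artists USING gin (name gin_trgm_ops);
```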

Perhaps you only have an idea of one part of the name. The % operator lets you compare against elements of an array, so you can match against any part of the name. The next query uses Postgres' STRING_TO_ARRAY function to split the artists' full names into arrays of separate names.

SELECT *
FROM artists
WHERE 'Cadinsky' % ANY(STRING_TO_ARRAY(name, ' '));

Phonetic algorithms

Another approach to fuzzy string matching comes from a group of algorithms called phonetic algorithms.

These are algorithms which use sets of rules to represent a string using a short code. The code contains the key information about how the string should sound if read aloud. By comparing these shortened codes, it is possible to fuzzy match strings which are spelled differently but sound alike.

Postgres comes with an extension that lets you make use of some of these algorithms. You can enable it with the following command:

CREATE EXTENSION fuzzystrmatch;

One example is an algorithm called Soundex. Its origins go back over 100 years - it was first patented in 1918 and was used in the 20th century for analysing US census data.

Soundex works by converting strings into four-character codes (a letter followed by three digits) which describe how they sound. For example, the Soundex representations of 'flower' and 'flour' are both F460.
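
You can check this directly once fuzzystrmatch is enabled. The sketch below also uses DIFFERENCE, another function from the same extension, which counts how many of the four code positions agree (0 to 4). The column aliases are illustrative.

```sql
SELECT SOUNDEX('flower') AS flower_code,
       SOUNDEX('flour') AS flour_code,
       DIFFERENCE('flower', 'flour') AS matching_positions;
-- flower_code and flour_code are both F460, so matching_positions is 4
```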

The query below finds the record which sounds like the name 'Damian Hurst'.

SELECT *
FROM artists
WHERE nationality IN ('American', 'British')
AND SOUNDEX(name) = SOUNDEX('Damian Hurst');

Another algorithm is one called metaphone. This works on a similar basis to Soundex, in that it converts strings into a code representation using a set of rules.

The metaphone algorithm will return codes of different lengths (unlike Soundex, which always returns four characters). You can pass an argument to the METAPHONE function indicating the maximum length of the code you want it to return.

SELECT artist_id, name, METAPHONE(name, 10)
FROM artists
WHERE nationality = 'American'
LIMIT 5;

Because both metaphone and Soundex return strings as outputs, you can use them in other fuzzy string matching functions. This combined approach can yield powerful results. The example below finds the five closest matches for the name Si Tomlee.

SELECT *
FROM artists
WHERE nationality = 'American'
ORDER BY SIMILARITY(METAPHONE(name, 10), METAPHONE('Si Tomlee', 10)) DESC
LIMIT 5;

Here, a trigram-only approach would not have helped much, as there is little overlap between 'Cy Twombly' and 'Si Tomlee'. In fact, these only have a SIMILARITY score of 0.05, even though they sound similar when read aloud.
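
To see the difference for yourself, the sketch below puts the raw trigram score, the two metaphone codes, and the trigram score of those codes side by side. It assumes both pg_trgm and fuzzystrmatch are enabled; the column aliases are illustrative.

```sql
SELECT SIMILARITY('Cy Twombly', 'Si Tomlee') AS trigram_score,
       METAPHONE('Cy Twombly', 10) AS twombly_code,
       METAPHONE('Si Tomlee', 10) AS tomlee_code,
       SIMILARITY(METAPHONE('Cy Twombly', 10),
                  METAPHONE('Si Tomlee', 10)) AS metaphone_score;
```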

Due to their historical origins, neither of these algorithms works well with names or words of non-English language origin. However, there are more internationally-focused versions.

One example is the double metaphone algorithm. This uses a more sophisticated set of rules for producing metaphones. It can provide alternative encodings for English and non-English origin strings.

As an example, see the query below. It compares the double metaphone outputs for different spellings of Spanish artist Joan Miró:

SELECT 'Joan Miró' AS name, DMETAPHONE('Joan Miró'), DMETAPHONE_ALT('Joan Miró')
UNION SELECT 'Juan Mero' AS name, DMETAPHONE('Juan Mero'), DMETAPHONE_ALT('Juan Mero');

Going the distance

Finally, another approach to fuzzy string matching in Postgres is to calculate the 'distance' between strings. There are several ways to do this. Postgres provides functionality to calculate the Levenshtein distance.

At a high level, the Levenshtein distance between two strings is the minimum number of edits required to transform one string into the other. Edits are considered at the character level, and can include:

  • substitutions,
  • deletions, and
  • insertions

For example, the Levenshtein distance between the words 'bigger' and 'better' is 3, because you can transform 'bigger' into 'better' by replacing the three letters 'igg' with 'ett', one substitution per letter.

Meanwhile, the Levenshtein distance between 'biggest' and 'best' is also 3, because you can transform 'biggest' into 'best' by deleting the letters 'igg'.
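
Both worked examples can be checked with the LEVENSHTEIN function from the fuzzystrmatch extension enabled earlier. The column aliases below are illustrative.

```sql
SELECT LEVENSHTEIN('bigger', 'better') AS bigger_better,
       LEVENSHTEIN('biggest', 'best') AS biggest_best;
-- both columns are 3
```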

See below for a query which finds the artists with the smallest Levenshtein distances from the name 'Freda Kallo'.

SELECT *, LEVENSHTEIN(name, 'Freda Kallo')
FROM artists
ORDER BY LEVENSHTEIN(name, 'Freda Kallo') ASC
LIMIT 5;
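
Instead of taking a fixed number of rows, you could keep only the names within a chosen number of edits. The sketch below is one way to do that; the cut-off of 3 is an arbitrary assumption, and LOWER is applied to both sides because LEVENSHTEIN counts a difference in letter case as an edit.

```sql
-- Keep only names within 3 edits of the search term, ignoring letter case.
SELECT name,
       LEVENSHTEIN(LOWER(name), LOWER('Freda Kallo')) AS distance
FROM artists
WHERE LEVENSHTEIN(LOWER(name), LOWER('Freda Kallo')) <= 3
ORDER BY distance ASC;
```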

Thanks for reading!

Hopefully this overview of fuzzy string matching in Postgresql has given you some new insights and ideas for your next project.

There are of course other methods for fuzzy string matching not covered here, and in other programming languages.

For example, if you use Python, take a look at the fuzzywuzzy package. Or if you prefer R, you can use the inbuilt agrep() function, or try out the stringdist package.

Translated from: https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/