by Vikash Singh

Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

When developers work with text, they often need to clean it up first. Sometimes it’s by replacing keywords. Like replacing “Javascript” with “JavaScript”. Other times, we just want to find out whether “JavaScript” was mentioned in a document.

Data cleaning tasks like these are standard for most Data Science projects dealing with text.

Data Science starts with data cleaning.

I had a very similar task to work on recently. I work as a Data Scientist at Belong.co and Natural Language Processing is half of my work.

When I trained a Word2Vec model on our document corpus, it started returning synonyms as similar terms. For example, “Javascripting” came up as a similar term to “JavaScript”.

To resolve this, I wrote a regular expression (Regex) to replace all known synonyms with standardized names. The Regex replaced “Javascripting” with “JavaScript”, which solved one problem but created another.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The above quote is from this stack-exchange question and it came true for me.

It turns out that Regex is fast if the number of keywords to be searched and replaced is in the hundreds. But my corpus had over 20K keywords and 3 million documents.
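To make the scaling problem concrete, here is a minimal sketch of a regex-based replacement of this kind. It is not the code I ran in production: the `synonyms` map and the `standardize` helper are hypothetical names used only for illustration. Every keyword becomes one more alternative in the pattern, so there is more work to do at each position of the text as the keyword list grows.

```python
import re

# Hypothetical synonym map: variant -> standardized name.
synonyms = {
    "Javascripting": "JavaScript",
    "Javascript": "JavaScript",
    "java script": "JavaScript",
}

# One big alternation over every variant, longest first so that
# longer variants are preferred over their shorter prefixes.
variants = sorted(synonyms, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b")


def standardize(text):
    """Replace every known variant with its standardized name."""
    return pattern.sub(lambda m: synonyms[m.group(1)], text)


cleaned = standardize("I do Javascripting and some Javascript too.")
print(cleaned)
```

With three variants this is instant. With tens of thousands of alternatives in the pattern, the same call becomes the bottleneck.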

When I benchmarked my Regex code, I found it was going to take 5 days to complete one run.

The natural solution was to run it in parallel. But that won’t help when we reach 10s of millions of documents and 100s of thousands of keywords. There had to be a better way! And I started looking for it…

I asked around in my office and on Stack Overflow, and a couple of suggestions came up. Vinay Pandey, Suresh Lakshmanan and Stack Overflow pointed me towards the beautiful Aho-Corasick algorithm and the Trie data structure approach. I looked for existing solutions but couldn’t find much.

So I wrote my own implementation and FlashText was born.

Before we get into what FlashText is and how it works, let’s have a look at how it performs for search:

The chart shown above is a comparison of compiled Regex against FlashText for 1 document. As the number of keywords increases, the time taken by Regex grows almost linearly. Yet with FlashText it doesn’t matter.

FlashText reduced our run time from 5 days to 15 minutes!!

This is FlashText timing for replace:

Code used for the benchmark shown above is linked here, and results are linked here.

So what is FlashText?

FlashText is a Python library that I open sourced on GitHub. It is efficient at both extracting keywords and replacing them.

To use FlashText, you first have to pass it a list of keywords. This list will be used internally to build a Trie dictionary. Then you pass a string to it and tell it whether you want to perform a replace or a search.

For replace it will create a new string with replaced keywords. For search it will return a list of keywords found in the string. This will all happen in one pass over the input string.

Here is what one happy user had to say about the library:

Why is FlashText so fast?

Let’s try and understand this part with an example. Say we have a sentence with 3 words, `I like Python`, and a corpus with 4 words, `{Python, Java, J2ee, Ruby}`.

If we take each word from the corpus and check if it is present in the sentence, it will take 4 tries.

is 'Python' in sentence? is 'Java' in sentence?...

If the corpus had n words it would have taken n loops. Also, each search step `is <word> in sentence?` takes its own time. This is roughly what happens in a Regex match.
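As a rough illustration of this first approach (a sketch only, using the example sentence and corpus from above), the loop runs once per corpus word, and every iteration scans the sentence again:

```python
sentence = "I like Python"
corpus = ["Python", "Java", "J2ee", "Ruby"]

# One substring scan of the sentence per corpus word: n words, n scans.
found = [word for word in corpus if word in sentence]
print(found)
```

The amount of work is driven by the size of the corpus, which is exactly the part that keeps growing.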

There is another approach, which is the reverse of the first one. For each word in the sentence, check if it is present in the corpus.

is 'I' in corpus? is 'like' in corpus? is 'Python' in corpus?

If the sentence had m words it would have taken m loops. In this case the time it takes depends only on the number of words in the sentence. And this step, `is <word> in corpus?`, can be made fast using a dictionary lookup.
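A minimal sketch of this second approach, again with the example data from above. Storing the corpus in a Python `set` is my choice for the illustration; it gives hash-based membership checks, which is the "dictionary lookup" idea:

```python
sentence = "I like Python"
corpus = {"Python", "Java", "J2ee", "Ruby"}  # set: hash-based lookups

# One lookup per word of the sentence: m words, m lookups.
found = [word for word in sentence.split() if word in corpus]
print(found)
```

Adding more words to `corpus` does not add iterations here. Only a longer sentence does.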

The FlashText algorithm is based on the second approach. It is inspired by the Aho-Corasick algorithm and the Trie data structure.

The way it works is: first, a Trie dictionary is created with the corpus. It will look somewhat like this:

Start and EOT (End Of Term) represent word boundaries like space, period and new_line. A keyword will only match if it has word boundaries on both sides of it. This will prevent matching apple in pineapple.
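Such a Trie dictionary can be sketched with nested Python dicts. This is a simplified illustration and not the library's internal code: the `EOT` marker key and the `build_trie` helper are hypothetical names. Each character maps to a child dict, and a node that ends a keyword stores that keyword under the marker key.

```python
EOT = "_end_"  # hypothetical marker key meaning "a keyword ends here"


def build_trie(keywords):
    """Build a character trie as nested dicts."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword:
            # Reuse the child node if it exists, otherwise create it.
            node = node.setdefault(char, {})
        node[EOT] = keyword
    return trie


trie = build_trie(["Python", "Java", "J2ee", "Ruby"])
print(sorted(trie))       # first characters of all keywords
print(sorted(trie["J"]))  # "Java" and "J2ee" share the "J" branch
```

Keywords with a common prefix share nodes, so `Java` and `J2ee` both hang off the single `J` branch.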

Next we will take an input string I like Python and search it character by character.

Step 1: is <start>I<EOT> in dictionary? No
Step 2: is <start>like<EOT> in dictionary? No
Step 3: is <start>Python<EOT> in dictionary? Yes

Since this is a character-by-character match, we could easily skip <start>like<EOT> at <start>l because l is not connected to start. This makes skipping missing words really fast.

The FlashText algorithm only went over each character of the input string ‘I like Python’. The dictionary could have very well had a million keywords, with no impact on the runtime. This is the true power of FlashText algorithm.
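The search described above can be sketched as follows. This is a simplified illustration written for this post, not the library's actual implementation: `EOT`, `WORD_CHARS`, `build_trie` and `extract_keywords` are hypothetical names, and word boundaries are approximated as "any character that is not a letter, digit or underscore".

```python
import string

EOT = "_end_"  # hypothetical marker key meaning "a keyword ends here"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(keywords):
    """Build a character trie as nested dicts."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword:
            node = node.setdefault(char, {})
        node[EOT] = keyword
    return trie


def extract_keywords(text, trie):
    """Return keywords found in text, matching only on word boundaries."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A keyword may only begin right after a word boundary.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j, match, match_end = trie, i, None, None
        # Walk the trie for as long as the text keeps matching.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword also ends at a word boundary.
            if EOT in node and (j == n or text[j] not in WORD_CHARS):
                match, match_end = node[EOT], j
        if match is not None:
            found.append(match)
            i = match_end
        else:
            i += 1
    return found


trie = build_trie(["Python", "Java", "J2ee", "Ruby", "apple"])
print(extract_keywords("I like Python", trie))
print(extract_keywords("I ate a pineapple", trie))
```

For `like`, the walk stops at the very first character because `l` is not a child of the root. The `apple` inside `pineapple` is never tried, because that position does not follow a word boundary.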

So when should you use FlashText?

Simple answer: when the number of keywords is greater than 500.

Complicated answer: Regex can search for terms based on special characters like `^`, `$`, `*`, `\d` and `.`, which are not supported in FlashText.

So it’s no good if you want to match partial words like `word\dvec`. But it is excellent for extracting complete words like `word2vec`.
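A small sketch of that difference, using Python's built-in `re` module for the pattern and a plain set lookup as a stand-in for exact keyword matching (the sample text is made up):

```python
import re

text = "We tried word2vec and word3vec."

# A regex pattern describes a family of strings: \d matches any digit.
pattern_hits = re.findall(r"word\dvec", text)

# An exact keyword lookup only knows the literal terms it was given.
keywords = {"word2vec"}
tokens = [token.strip(".") for token in text.split()]
keyword_hits = [token for token in tokens if token in keywords]

print(pattern_hits)
print(keyword_hits)
```

The pattern picks up both variants, while the keyword lookup finds only the term that was listed explicitly.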

FlashText for finding keywords

FlashText for replacing keywords

Instead of extracting keywords you can also replace keywords in sentences. We use this as a data cleaning step in our data processing pipeline.
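The idea of replacing in a single pass can be sketched like this. Again, this is a simplified illustration and not the library's code: `EOT`, `WORD_CHARS`, `build_trie` and `replace_keywords` are hypothetical names, and the mapping from variants to clean names is made up.

```python
import string

EOT = "_end_"  # hypothetical marker key meaning "a keyword ends here"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(mapping):
    """Trie over the variants; each end node stores the clean name."""
    trie = {}
    for variant, clean_name in mapping.items():
        node = trie
        for char in variant:
            node = node.setdefault(char, {})
        node[EOT] = clean_name
    return trie


def replace_keywords(text, trie):
    """Build a new string with every matched variant replaced."""
    out = []
    i, n = 0, len(text)
    while i < n:
        at_boundary = i == 0 or text[i - 1] not in WORD_CHARS
        node, j, match, match_end = trie, i, None, None
        while at_boundary and j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Keep the longest variant that ends at a word boundary.
            if EOT in node and (j == n or text[j] not in WORD_CHARS):
                match, match_end = node[EOT], j
        if match is not None:
            out.append(match)     # emit the clean name
            i = match_end         # and skip past the matched variant
        else:
            out.append(text[i])   # copy the character unchanged
            i += 1
    return "".join(out)


trie = build_trie({"Javascript": "JavaScript", "Javascripting": "JavaScript"})
cleaned = replace_keywords("Javascripting is fun; Javascript too.", trie)
print(cleaned)
```

Text that matches no keyword is copied through unchanged, so the output differs from the input only where a variant was found.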

If you know someone who works with text data, entity recognition, natural language processing, or Word2vec, please consider sharing this blog with them.

This library has been really useful for us, and I am sure it would be useful for others too.

So long, and thanks for all the claps.

Original article: https://www.freecodecamp.org/news/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f/
