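Multi-Pattern Matching in Python: the AC Automaton

Goal: learn the AC automaton and multi-pattern matching.

Requirement: implement it in pure Python wherever possible, so that the code is easier to extend.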

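I. What is an AC automaton?

The AC automaton (Aho-Corasick automaton) is a well-known multi-pattern matching algorithm that originated at Bell Labs in 1975. To learn it, we first need to know what a Trie, or dictionary tree, is. A Trie, also called a word-lookup tree or key tree, is a tree structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engines often use it for word-frequency statistics over text.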
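To make the Trie description concrete, here is a minimal sketch of a Trie used for word-frequency counting. The names `TrieNode`, `Trie`, `insert` and `count` are illustrative choices for this example, not part of any library discussed in this article.

```python
class TrieNode:
    """One node of the trie: a dict of children plus a word counter."""

    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.count = 0       # how many inserted words end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk down from the root, creating nodes as needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def count(self, word):
        """Return how many times `word` was inserted (0 if never)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


trie = Trie()
for w in ["he", "she", "he", "hers"]:
    trie.insert(w)
print(trie.count("he"), trie.count("her"), trie.count("hers"))
```

Words that share a prefix ("he", "hers") share the nodes for that prefix, which is what makes the structure compact and what the AC automaton later builds on.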

(Quoted from Baidu Baike)

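II. What is the AC automaton used for?

A common example: given n words and an article of m characters, find how many of the words occur in the article. Understanding the AC automaton requires some background in the pattern tree (Trie) and the KMP pattern-matching algorithm. The algorithm has three steps: build a Trie, build the failure pointers, and run the matching pass.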
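The three steps can be sketched compactly as follows. This is a minimal illustration under my own assumptions (states stored as list indices, children stored in dicts, and the hypothetical function names `build_automaton` and `find_words`); it is not the code of any package mentioned in this article.

```python
from collections import deque


def build_automaton(words):
    # Step 1: build the trie. State 0 is the root.
    goto = [{}]      # goto[state][char] -> next state
    out = [[]]       # out[state] -> words recognised when reaching state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: build the failure pointers breadth-first.
    # Children of the root keep fail = 0 (the root).
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_words(words, text):
    # Step 3: scan the text once, following failure pointers on a mismatch.
    goto, fail, out = build_automaton(words)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found


print(len(find_words(["he", "she", "his", "hers"], "ushers")))
```

The answer to the "how many words occur" question is the size of the returned set. Merging the output lists along the failure pointers during construction is one design choice; another is to walk the failure chain at match time instead.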

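If you know the KMP algorithm, you know what its next function (also called the shift or fail function) is for. In KMP, two pointers i and j express that A[i-j+1..i] is exactly equal to B[1..j]. That is, i keeps increasing, j changes along with it, and j is such that the string of length j ending at A[i] matches the first j characters of B. When A[i+1] ≠ B[j+1], KMP adjusts j (decreasing it) so that A[i-j+1..i] still matches B[1..j] and the new B[j+1] matches A[i+1]; the next function records exactly where j should move to. The failure pointer of the AC automaton serves the same purpose: while the text is being matched along the Trie, if the current node has no child for the next character, matching continues from the node that the current node's failure pointer points to.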
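The role of the next function is easier to see in code. The sketch below uses 0-based indices (the paragraph above uses 1-based ones), and the names `build_next` and `kmp_search` are chosen for this example only.

```python
def build_next(pattern):
    """nxt[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    nxt = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = nxt[j - 1]          # fall back to a shorter border
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    nxt = build_next(pattern)
    hits = []
    j = 0                            # number of pattern characters matched
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = nxt[j - 1]           # i never moves back; only j shrinks
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = nxt[j - 1]           # keep going to find overlapping hits
    return hits


print(build_next("abab"))
print(kmp_search("abababa", "aba"))
```

In the AC automaton the same fallback happens on a Trie instead of a single pattern: the failure pointer of a node plays the part of `nxt[j - 1]`.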

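III. Installing an AC automaton package for Python

Anyone who has installed this package has probably run into all sorts of pitfalls.

1. Installing with pip

Project page: https://pypi.org/project/pyahocorasick/. Source downloads: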

  • GitHub: https://github.com/WojciechMula/pyahocorasick/
  • Pypi: https://pypi.python.org/pypi/pyahocorasick/
  • Conda-Forge: https://github.com/conda-forge/pyahocorasick-feedstock/

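Install with pip install pyahocorasick (Python 3). Those who have tried it will find that the package needs a C compiler: if no C compiler is installed on the machine, the installation fails. pip install ahocorasick (Python 2) cannot be installed either. The full error output: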

pip install pyahocorasick
Collecting pyahocorasick
  Using cached https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz
Building wheels for collected packages: pyahocorasick
  Running setup.py bdist_wheel for pyahocorasick ... error
  Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-wheel-rbzdosp6 --python-tag cp37:
  running bdist_wheel
  running build
  running build_ext
  building 'ahocorasick' extension
  creating build
  creating build/temp.macosx-10.7-x86_64-3.7
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
  xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  Failed building wheel for pyahocorasick
  Running setup.py clean for pyahocorasick
Failed to build pyahocorasick
Installing collected packages: pyahocorasick
  Running setup.py install for pyahocorasick ... error
  Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile:
  running install
  running build
  running build_ext
  building 'ahocorasick' extension
  creating build
  creating build/temp.macosx-10.7-x86_64-3.7
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
  xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/

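Downloading the source directly from GitHub does not help either: calling ahocorasick.Automaton() raises an error. So what can be done?

I tried installing ahocorasick-python instead. Project page: https://pypi.org/project/ahocorasick-python/; the source code is on GitHub.

It turned out to work on Mac/Linux but not on Windows 10, which was frustrating, because the demo environment runs Windows 10.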

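2. Solutions

Searching online turned up three main options:

(1) Install a C compiler after all (in the log above, the xcrun error means that the macOS command line developer tools are missing);

(2) Use esmre in place of ahocorasick to implement AC-automaton multi-pattern matching in Python;

(3) Rewrite ahocorasick yourself, that is, a pure-Python ahocorasick for fast keyword matching.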
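Before choosing between these options, it can help to check whether a usable C compiler is present. The helper below is a small sketch using only the standard library. The name `find_working_c_compiler` and the candidate list are assumptions made for this example; it only covers compilers that accept `--version` (cc, gcc, clang), not MSVC on Windows.

```python
import shutil
import subprocess


def find_working_c_compiler(candidates=("cc", "gcc", "clang")):
    """Return the first candidate that is on PATH and actually runs,
    or None if there is none."""
    for name in candidates:
        path = shutil.which(name)
        if path is None:
            continue
        try:
            # Being on PATH is not enough: in the log above gcc was found
            # but failed at run time, so the compiler is invoked for real.
            result = subprocess.run(
                [path, "--version"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=10,
            )
        except (OSError, subprocess.SubprocessError):
            continue
        if result.returncode == 0:
            return name
    return None


compiler = find_working_c_compiler()
if compiler is None:
    print("No working C compiler found; consider options (2) or (3).")
else:
    print("Working C compiler found:", compiler)
```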

IV. Python code for ahocorasick

1. Python 2 code:

# python2
# coding=utf-8

KIND = 16  # every byte is split into two 4-bit halves, so a node has 16 branches


class Node():
    static = 0  # total number of nodes created

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] is None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

    def build_automation(self):
        # Breadth-first: a node's fail pointer is derived from its parent's.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None:
                    continue
                if parent == self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None:
                        child.fail = self.root
                self.queue.append(child)

    def matchOne(self, string):
        # Return (True, word) for the first pattern found, else (False, None).
        p = self.root
        for char in string:
            index = self.getIndex(char)
            while p.next[index] is None and p != self.root:
                p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            # Follow the fail pointers as well, so that a pattern ending here
            # as a suffix of the current path is not missed.
            node = p
            while node is not None:
                if node.end:
                    return True, node.word
                node = node.fail
        return False, None


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        # Encode to bytes, then emit the low and the high 4 bits of each byte.
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte / 16)
        return ac_string

    def insert(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: insert type not unicode')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: matchOne type not unicode')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret is not None:
            # Reassemble the bytes from the 4-bit halves, then decode.
            s = ''
            for i in range(len(ret) / 2):
                tmp = chr(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16)
                s += tmp
            ret = s.decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert(u'丁亚光')
    ac.insert(u'好吃的')
    ac.insert(u'好玩的')
    ac.build_automation()
    print(ac.matchOne(u'hi,丁亚光在干啥'))
    print(ac.matchOne(u'ab'))
    print(ac.matchOne(u'不能吃饭啊'))
    print(ac.matchOne(u'饭很好吃,有很多好好的吃的,'))
    print(ac.matchOne(u'有很多好玩的'))


if __name__ == '__main__':
    main()

Output:

(True, u'\u4e01\u4e9a\u5149')
(False, None)
(False, None)
(False, None)
(True, u'\u597d\u73a9\u7684')

Many readers are probably used to Python 3 by now, so here is my modified version (the changes are mainly to the encoding handling).

2. Python 3

# python3
# coding=utf-8

KIND = 16  # every byte is split into two 4-bit halves, so a node has 16 branches


class Node():
    static = 0  # total number of nodes created

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] is None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

    def build_automation(self):
        # Breadth-first: a node's fail pointer is derived from its parent's.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None:
                    continue
                if parent == self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None:
                        child.fail = self.root
                self.queue.append(child)

    def matchOne(self, string):
        # Return (True, word) for the first pattern found, else (False, None).
        p = self.root
        for char in string:
            index = self.getIndex(char)
            while p.next[index] is None and p != self.root:
                p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            # Follow the fail pointers as well, so that a pattern ending here
            # as a suffix of the current path is not missed.
            node = p
            while node is not None:
                if node.end:
                    return True, node.word
                node = node.fail
        return False, None


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        # Encode to bytes, then emit the low and the high 4 bits of each byte.
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte // 16)
        return ac_string

    def insert(self, string):
        if not isinstance(string, str):
            raise TypeError('UnicodeAcAutomation:: insert type not str')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if not isinstance(string, str):
            raise TypeError('UnicodeAcAutomation:: matchOne type not str')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret is not None:
            # Reassemble the bytes from the 4-bit halves, then decode.
            raw = bytes(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16
                        for i in range(len(ret) // 2))
            ret = raw.decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert('丁亚光')
    ac.insert('好吃的')
    ac.insert('好玩的')
    ac.build_automation()
    print(ac.matchOne('hi,丁亚光在干啥'))
    print(ac.matchOne('ab'))
    print(ac.matchOne('不能吃饭啊'))
    print(ac.matchOne('饭很好吃,有很多好好的吃的,'))
    print(ac.matchOne('有很多好玩的'))


if __name__ == '__main__':
    main()

Output:

(True, '丁亚光')
(False, None)
(False, None)
(False, None)
(True, '好玩的')
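Both listings set KIND = 16 because getAcString splits every encoded byte into its low and high 4 bits, so each Trie node needs only 16 child slots instead of 256. The round trip can be illustrated on its own; the function names `to_nibbles` and `from_nibbles` below are made up for this sketch and do not appear in the code above.

```python
def to_nibbles(text, encoding="utf-8"):
    """Encode text to bytes and emit (low 4 bits, high 4 bits) per byte."""
    out = []
    for byte in text.encode(encoding):
        out.append(byte % 16)    # low half
        out.append(byte // 16)   # high half
    return out


def from_nibbles(nibbles, encoding="utf-8"):
    """Inverse of to_nibbles: pair up the halves and decode the bytes."""
    data = bytes(lo + hi * 16 for lo, hi in zip(nibbles[0::2], nibbles[1::2]))
    return data.decode(encoding)


print(to_nibbles("A"))                        # 'A' is byte 0x41
print(from_nibbles(to_nibbles("好玩的")))
```

The price of the small alphabet is that every pattern and every text becomes twice as long in Trie steps. A match could also, in principle, start in the middle of a byte, which is a limitation of this encoding to keep in mind.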

Summary: there are many other ways to write your own ahocorasick, for example by adapting the source of ahocorasick-python. The core of the ahocorasick-python source is shown below.

# coding:utf-8
# write by zhou
# revised by zw


class Node(object):
    """Abstraction of a node."""

    def __init__(self, str='', is_root=False):
        self._next_p = {}
        self.fail = None
        self.is_root = is_root
        self.str = str
        self.parent = None

    def __iter__(self):
        return iter(self._next_p.keys())

    def __getitem__(self, item):
        return self._next_p[item]

    def __setitem__(self, key, value):
        _u = self._next_p.setdefault(key, value)
        _u.parent = self

    def __repr__(self):
        return "<Node object '%s' at %s>" % \
               (self.str, object.__repr__(self)[1:-1].split('at')[-1])

    def __str__(self):
        return self.__repr__()


class AhoCorasick(object):
    """AC automaton object."""

    def __init__(self, *words):
        self.words_set = set(words)
        self.words = list(self.words_set)
        self.words.sort(key=lambda x: len(x))
        self._root = Node(is_root=True)
        self._node_meta = {}
        self._node_all = [(0, self._root)]
        _a = {}
        for word in self.words:
            for w in word:
                _a.setdefault(w, set())
                _a[w].add(word)

        def node_append(keyword):
            assert len(keyword) > 0
            _ = self._root
            for _i, k in enumerate(keyword):
                node = Node(k)
                if k in _:
                    pass
                else:
                    _[k] = node
                    self._node_all.append((_i+1, _[k]))
                self._node_meta.setdefault(id(_[k]), set())
                if _i >= 1:
                    for _j in _a[k]:
                        if keyword[:_i+1].endswith(_j):
                            self._node_meta[id(_[k])].add((_j, len(_j)))
                _ = _[k]
            else:
                if _ != self._root:
                    self._node_meta[id(_)].add((keyword, len(keyword)))

        for word in self.words:
            node_append(word)
        self._node_all.sort(key=lambda x: x[0])
        self._make()

    def _make(self):
        """
        Build the AC tree.
        :return:
        """
        for _level, node in self._node_all:
            if node == self._root or _level <= 1:
                node.fail = self._root
            else:
                _node = node.parent.fail
                while True:
                    if node.str in _node:
                        node.fail = _node[node.str]
                        break
                    else:
                        if _node == self._root:
                            node.fail = self._root
                            break
                        else:
                            _node = _node.fail

    def search(self, content, with_index=False):
        result = set()
        node = self._root
        index = 0
        for i in content:
            while 1:
                if i not in node:
                    if node == self._root:
                        break
                    else:
                        node = node.fail
                else:
                    for keyword, keyword_len in self._node_meta.get(id(node[i]), set()):
                        if not with_index:
                            result.add(keyword)
                        else:
                            result.add((keyword, (index - keyword_len + 1, index + 1)))
                    node = node[i]
                    break
            index += 1
        return result


if __name__ == '__main__':
    ac = AhoCorasick("abc", 'abe', 'acdabd', 'bdf', 'df', 'f', 'ac', 'cd', 'cda')
    print(ac.search('acdabdf', True))

Output:

{('cd', (1, 3)), ('acdabd', (0, 6)), ('df', (5, 7)), ('f', (6, 7)), ('bdf', (4, 7)), ('cda', (1, 4)), ('ac', (0, 2))}

