Crawling a whole website for advertising sensitive words with Python_4 ways to implement sensitive-word filtering in Python

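In some everyday settings, sensitive words show up that should not be there, and we usually mask them with *, for example: 尼玛 -> **. Insults and politically sensitive words should not appear in public places, so we need some means of masking them. Below I introduce a few simple approaches to masking sensitive words.

(I have tried to present the profanity as images wherever possible, otherwise the article could not be published.)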

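Method 1: replace filtering

replace is the simplest form of string substitution: when a string may contain sensitive words, we call the string's replace method to substitute * for each of them.

Drawback:

It works when the text is short and the word list is small, but every word costs another pass over the text, so it becomes inefficient as either grows.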

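import datetime

now = datetime.datetime.now()
# Placeholder sentence and word; the original sample values are not present in the text
speak = "what a badword"
filter_sentence = speak.replace("badword", "*")
print(filter_sentence, " | ", now)

For several sensitive words, put them in a list and replace them one by one:

dirty = ["badword", "darn"]
for i in dirty:
    speak = speak.replace(i, '*')
print(speak, " | ", now)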
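The example in the introduction masks a two-character word with two asterisks, whereas the loop above replaces each word with a single '*'. A minimal sketch of length-preserving masking with str.replace follows; the function name mask_words and the sample words are made up for illustration.

```python
def mask_words(text, words, mask_char="*"):
    """Replace every occurrence of each word with mask characters of the same length."""
    # Longer words first, so a short word cannot split a longer one that contains it
    for word in sorted(words, key=len, reverse=True):
        if word:  # str.replace with an empty string would insert the mask everywhere
            text = text.replace(word, mask_char * len(word))
    return text


print(mask_words("what a darn badword", ["darn", "badword"]))
# prints: what a **** *******
```

This still scans the text once per word, so the drawback described above is unchanged.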

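Method 2: regular-expression filtering

Regular expressions are a decent matching tool: almost every everyday lookup uses them, and our crawlers use them all the time too. Here we mainly rely on "|", which matches any one of several alternative target strings. A simple example: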

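import re


def sentence_filter(keywords, text):
    # Escape each keyword so regex metacharacters in it are matched literally
    pattern = "|".join(re.escape(k) for k in keywords)
    return re.sub(pattern, "***", text)


# Placeholder data; the original sample values are not present in the text
dirty = ["badword", "darn"]
speak = "what a badword, darn it"
print(sentence_filter(dirty, speak))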
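Two details of the "|" approach are easy to miss: alternation tries the alternatives from left to right, so a short keyword listed first can win over a longer one that starts the same way, and a keyword containing a metacharacter such as "." is treated as a pattern unless it is escaped. A small sketch under those assumptions follows; build_pattern and mask are hypothetical helper names, not part of the original article.

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern from literal keywords."""
    # Longest first: alternation takes the first alternative that matches at a position
    ordered = sorted(set(keywords), key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))


def mask(pattern, text, mask_char="*"):
    # The callback receives each match object and returns its replacement
    return pattern.sub(lambda m: mask_char * len(m.group(0)), text)


pattern = build_pattern(["bad", "badword", "a.b"])
print(mask(pattern, "a badword, a.b and axb"))
# prints: a *******, *** and axb
```

Compiling the pattern once also avoids rebuilding it for every sentence that is filtered.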

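Method 3: DFA filtering

DFA stands for Deterministic Finite Automaton. The basic idea is to find sensitive words through state transitions, so that all sensitive words can be detected in a single scan of the text being checked. (See the code comments for the implementation.)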

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time: 2020/4/15 11:40
# @Software: PyCharm
# article_add: https://www.cnblogs.com/JentZhang/p/12718092.html
__author__ = "JentZhang"

import json

MinMatchType = 1  # shortest-match rule
MaxMatchType = 2  # longest-match rule


class DFAUtils(object):
    """
    DFA algorithm
    """

    def __init__(self, word_warehouse):
        """
        Initialize the algorithm
        :param word_warehouse: the lexicon (list of sensitive words)
        """
        # Lexicon, stored as nested dicts
        self.root = dict()
        # Meaningless characters to skip during detection (these should eventually be maintained in one place, such as a database or other storage)
        self.skip_root = [' ', '&', '!', '！', '@', '#', '$', '￥', '*', '^', '%', '?', '？', '', "《", '》']
        # Load the lexicon
        for word in word_warehouse:
            self.add_word(word)

    def add_word(self, word):
        """
        Add a word to the lexicon
        :param word: the word to add
        :return: None
        """
        now_node = self.root
        word_count = len(word)
        for i in range(word_count):
            char_str = word[i]
            if char_str in now_node:
                # Key already exists: descend into it for the next iteration
                now_node = now_node[char_str]
            else:
                # Key does not exist: create a new dict node
                new_node = dict()
                new_node['is_end'] = False
                now_node[char_str] = new_node
                now_node = new_node
            if i == word_count - 1:
                # Last character: mark the end of a word, without clearing marks set by other words
                now_node['is_end'] = True

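    def check_match_word(self, txt, begin_index, match_type=MinMatchType):
        """
        Check whether a sensitive word starts at begin_index
        :param txt: text to check
        :param begin_index: start index in txt, passed in by get_match_word
        :param match_type: matching rule, 1 = shortest match, 2 = longest match
        :return: length of the matched span (skipped characters included), or 0 if there is none
        """
        flag = False  # whether a complete word has been matched
        match_flag_length = 0  # number of matched lexicon characters
        now_map = self.root
        tmp_flag = 0  # characters consumed so far, skipped characters included
        matched_length = 0  # value of tmp_flag at the most recent complete word
        for i in range(begin_index, len(txt)):
            word = txt[i]
            # Skip special characters, but only after the start of a word has been found
            if word in self.skip_root and now_map is not self.root:
                tmp_flag += 1
                continue
            # Follow the transition for this character
            now_map = now_map.get(word)
            if now_map:  # transition exists: check whether a word ends here
                match_flag_length += 1
                tmp_flag += 1
                if now_map.get("is_end"):
                    # End-of-word flag is set
                    flag = True
                    matched_length = tmp_flag
                    # Shortest match returns here; longest match keeps looking
                    if match_type == MinMatchType:
                        break
            else:  # no transition: stop
                break
        if matched_length < 2 or not flag:  # only spans of at least two characters count as a word
            matched_length = 0
        return matched_length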

    def get_match_word(self, txt, match_type=MinMatchType):
        """
        Collect the sensitive words found in the text
        :param txt: text to check
        :param match_type: matching rule, 1 = shortest match, 2 = longest match
        :return: list of matched words
        """
        matched_word_list = list()
        i = 0
        while i < len(txt):
            length = self.check_match_word(txt, i, match_type)
            if length > 0:
                matched_word_list.append(txt[i:i + length])
                i += length  # continue after the matched span
            else:
                i += 1
        return matched_word_list

    def is_contain(self, txt, match_type=MinMatchType):
        """
        Check whether the text contains a sensitive word
        :param txt: text to check
        :param match_type: matching rule, 1 = shortest match, 2 = longest match
        :return: True if a sensitive word is found, otherwise False
        """
        for i in range(len(txt)):
            if self.check_match_word(txt, i, match_type) > 0:
                return True  # stop at the first hit
        return False

    def replace_match_word(self, txt, replace_char='*', match_type=MinMatchType):
        """
        Replace matched words
        :param txt: text to check
        :param replace_char: replacement character; a match is replaced character by character, e.g. text "you big badword" with sensitive word "badword" and replacement * gives "you big *******"
        :param match_type: matching rule, 1 = shortest match, 2 = longest match
        :return: the text with sensitive words replaced
        """
        word_set = self.get_match_word(txt, match_type)
        # Replace each detected word; if none was detected, the text is returned unchanged
        for word in word_set:
            txt = txt.replace(word, len(word) * replace_char)
        return txt

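if __name__ == '__main__':
    # Placeholder lexicon and message; the original sample values are not present in the text
    word_warehouse = ["badword", "darn"]
    dfa = DFAUtils(word_warehouse=word_warehouse)
    print('Lexicon structure:', json.dumps(dfa.root, ensure_ascii=False))
    # Text to check
    msg = "what a b@adword, darn it"
    print('Contains a sensitive word:', dfa.is_contain(msg))
    print('Matched words:', dfa.get_match_word(msg))
    print('After replacement:', dfa.replace_match_word(msg))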
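The class above stores the lexicon as nested dictionaries, where each dictionary is a state and each character key is a transition. As a standalone illustration of that structure (not the author's class), here is a stripped-down sketch without skip characters or match types; build_dfa and find_words are hypothetical names.

```python
END = "is_end"  # same end-of-word flag name as in the class above


def build_dfa(words):
    """Build the nested-dict state table: one dict per state, one key per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {END: False})
        if word:
            node[END] = True  # the state reached after the last character accepts
    return root


def find_words(root, text):
    """Return (start, word) for the shortest sensitive word beginning at each position."""
    found = []
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if not node:  # no transition for this character
                break
            if node[END]:
                found.append((start, text[start:pos + 1]))
                break
    return found


dfa = build_dfa(["darn", "dark"])
print(dfa)
print(find_words(dfa, "a dark and darn day"))
# second line printed: [(2, 'dark'), (11, 'darn')]
```

Note that "darn" and "dark" share the states for "d", "a" and "r"; only the last transition differs, which is why a large lexicon does not slow the lookup at each character.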

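Method 4: AC automaton (Aho-Corasick)

The AC automaton requires some background knowledge: the Trie. (Briefly: also called a prefix tree or dictionary tree, it is a structure for fast string processing that lets you look up information about a set of strings quickly.)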
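For readers who have not met a Trie before, a minimal sketch may help. The class and method names below (Trie, insert, contains, starts_with) are chosen for illustration and do not come from the article.

```python
class Trie:
    """A minimal prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, s):
        """Follow s character by character; return the node reached, or None."""
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["tea", "ten", "to"]:
    trie.insert(w)
print(trie.contains("ten"), trie.contains("te"), trie.starts_with("te"))
# prints: True False True
```

Words with a common prefix share nodes, so a lookup costs one step per character regardless of how many words are stored.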


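An AC automaton adds a fail pointer to each node of the Trie. When matching fails at the current node, the search moves to the node the fail pointer points to, so it can keep matching straight through the text without backtracking.

I won't go into the detailed matching mechanism here; for more on the AC automaton, see this article:

https://www.jb51.net/article/128711.htm

In Python, the ahocorasick module (the pyahocorasick package) gives a quick implementation: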

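# python3 -m pip install pyahocorasick
import ahocorasick


def build_actree(wordlist):
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree


if __name__ == '__main__':
    # Placeholder data; the original sample values are not present in the text
    wordlist = ["badword", "darn"]
    sent = "what a badword, darn it"
    actree = build_actree(wordlist=wordlist)
    sent_cp = sent
    # iter() yields (end_index, value); value is the (index, word) tuple stored above
    for i in actree.iter(sent):
        sent_cp = sent_cp.replace(i[1][1], "**")
        print("Blocked word:", i[1][1])
    print("Result:", sent_cp)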

Of course, we can also write an AC automaton by hand; for reference:

from collections import deque


class TrieNode(object):
    __slots__ = ['value', 'next', 'fail', 'emit']

    def __init__(self, value):
        self.value = value
        self.next = dict()  # child nodes keyed by character
        self.fail = None  # fail pointer
        self.emit = None  # set of words that end at this node


class AhoCorasic(object):
    __slots__ = ['_root']

    def __init__(self, words):
        self._root = AhoCorasic._build_trie(words)

    @staticmethod
    def _build_trie(words):
        assert isinstance(words, list) and words
        root = TrieNode('root')
        # Build the trie
        for word in words:
            node = root
            for c in word:
                if c not in node.next:
                    node.next[c] = TrieNode(c)
                node = node.next[c]
            if not node.emit:
                node.emit = {word}
            else:
                node.emit.add(word)
        # Breadth-first traversal to set the fail pointers
        queue = deque([(root, None)])
        while queue:
            curr, parent = queue.popleft()
            for sub in curr.next.values():  # Python 3: values() replaces itervalues()
                queue.append((sub, curr))
            if parent is None:
                continue
            elif parent is root:
                curr.fail = root
            else:
                fail = parent.fail
                while fail and curr.value not in fail.next:
                    fail = fail.fail
                if fail:
                    curr.fail = fail.next[curr.value]
                else:
                    curr.fail = root
            # Inherit words ending at the fail target, so a word that is a suffix of the current path is reported too
            if curr.fail.emit:
                curr.emit = (curr.emit or set()) | curr.fail.emit
        return root

    def search(self, s):
        seq_list = []
        node = self._root
        for i, c in enumerate(s):
            matched = True
            while c not in node.next:
                if not node.fail:
                    # Back at the root with no transition: this character starts no word
                    matched = False
                    node = self._root
                    break
                node = node.fail
            if not matched:
                continue
            node = node.next[c]
            if node.emit:
                for word in node.emit:
                    from_index = i + 1 - len(word)
                    seq_list.append((from_index, word))
                # Restart from the root after a hit
                node = self._root
        return seq_list


if __name__ == '__main__':
    aho = AhoCorasic(['foo', 'bar'])
    print(aho.search('barfoothefoobarman'))
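The search method above returns (from_index, word) tuples rather than a filtered string. One possible way to turn such a list into masked text is sketched below; mask_matches is a hypothetical helper, and the match list is written out by hand in the shape search returns for the sample string.

```python
def mask_matches(text, matches, mask_char="*"):
    """Mask every (from_index, word) span in text; overlapping spans are fine."""
    chars = list(text)
    for from_index, word in matches:
        for k in range(from_index, from_index + len(word)):
            chars[k] = mask_char
    return "".join(chars)


# Matches for 'barfoothefoobarman' with the words 'foo' and 'bar'
matches = [(0, "bar"), (3, "foo"), (9, "foo"), (12, "bar")]
print(mask_matches("barfoothefoobarman", matches))
# prints: ******the******man
```

Masking by position avoids a second scan with str.replace and only touches the spans the automaton actually reported.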

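Those are the four ways to implement sensitive-word filtering in Python. The first two are fairly simple; the last two are more algorithmic, and the code becomes easy to follow once you understand how the algorithms work. (DFA is a commonly used filtering technique, so it is worth learning.)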



Date: 2020-09-09
