Sensitive-Word Filtering and Anti-Spam Text: Background and Resources (bookmarks welcome)

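First, a few sensitive-word lexicons:

1. funNLP

2. chat-censorship
Data related to the investigation of censorship in chat clients. This repository contains keyword blacklists and lists of other content, such as URLs or images, used to trigger censorship in applications used in China (apps include Weibo, WeChat, Line, and Skype).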

3. Sensitive-word lists compiled from the web, with a Java implementation

See GitHub.

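Algorithms for sensitive-word filtering:

1. Use a sensitive-word filtering system.
Content review is carried out on a review platform. The site's moderation system is preloaded with a keyword lexicon, expands the phrases through permutation and combination, and groups the entries by sensitivity. The system either blocks users from posting sensitive words or directly deletes posted content that contains them. Content containing less sensitive words is not deleted immediately after posting; it goes to a human reviewer for a second review.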
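The tiered handling described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LEXICON` contents, the two tier names, and the `moderate` function are hypothetical and are not taken from any particular review platform.

```python
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # publish, but queue for a human reviewer
    BLOCK = "block"    # refuse or delete immediately


# Hypothetical lexicon: keyword -> sensitivity tier ("high" or "low").
LEXICON = {
    "badword": "high",
    "casino": "high",
    "discount": "low",
    "free gift": "low",
}


def moderate(text: str) -> Tuple[Action, List[str]]:
    """Return the action to take and the keywords that triggered it."""
    lowered = text.lower()
    # Plain substring scan; a real system would use a trie-based matcher.
    hits = [word for word in LEXICON if word in lowered]
    if any(LEXICON[word] == "high" for word in hits):
        return Action.BLOCK, hits
    if hits:
        return Action.REVIEW, hits
    return Action.ALLOW, hits


if __name__ == "__main__":
    samples = ["Hello there", "Big DISCOUNT today", "Visit our casino for a discount"]
    for sample in samples:
        action, hits = moderate(sample)
        print(sample, "->", action.value, hits)
```

The substring scan is enough to show the control flow. With a large lexicon, the matching step would be replaced by one of the automaton-based matchers discussed next.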
Aho-Corasick (AC) automaton algorithm (principle)

# Python implementation
# -*- coding: utf-8 -*-
import time

time1 = time.time()


# Aho-Corasick (AC) automaton
class node(object):
    def __init__(self):
        self.next = {}       # child nodes, keyed by character
        self.fail = None     # failure pointer
        self.isWord = False  # True if a sensitive word ends at this node
        self.word = ""       # the word that ends at this node


class ac_automation(object):
    def __init__(self):
        self.root = node()

    # Add a sensitive word to the trie (call make_fail() after the last word)
    def addword(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Build the failure pointers with a breadth-first traversal
    def make_fail(self):
        temp_que = [self.root]
        while len(temp_que) != 0:
            temp = temp_que.pop(0)
            for key, value in temp.next.items():
                if temp is self.root:
                    value.fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            # Point at the matching child, not at p.fail
                            value.fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        value.fail = self.root
                temp_que.append(value)

    # Find all sensitive words that occur in content
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0
        while currentposition < len(content):
            word = content[currentposition]
            # Follow failure pointers until a transition exists or we hit the root
            while word not in p.next and p is not self.root:
                p = p.fail
            if word in p.next:
                p = p.next[word]
            else:
                p = self.root
            # Report every word ending here, including shorter suffix matches
            hit = p
            while hit is not self.root:
                if hit.isWord:
                    result.append(hit.word)
                hit = hit.fail
            currentposition += 1
        return result

    # Load the sensitive-word list, one word per line
    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                keyword = str(keyword).strip()
                if keyword:
                    self.addword(keyword)
        # The failure pointers must be built before search() is used
        self.make_fail()

    # Replace every sensitive word with asterisks
    def words_replace(self, text):
        """
        :param text: input text
        :return: text with sensitive words masked
        """
        result = list(set(self.search(text)))
        for x in result:
            text = text.replace(x, '*' * len(x))
        return text


if __name__ == '__main__':
    ah = ac_automation()
    path = 'keywords.txt'
    ah.parse(path)
    text1 = input('Enter text: ')
    # text1="shabi操草草得到大大苏打"
    text2 = ah.words_replace(text1)
    print(text2)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

DFA algorithm (principle)

# Python implementation
# -*- coding: utf-8 -*-
import time

time1 = time.time()


# DFA algorithm
class DFAFilter():
    def __init__(self):
        self.keyword_chains = {}  # nested dicts: character -> next state
        self.delimit = '\x00'     # marker key meaning "a keyword ends here"

    # Add one keyword to the nested-dictionary state table
    def add(self, keyword):
        keyword = keyword.lower()
        chars = keyword.strip()
        if not chars:
            return
        level = self.keyword_chains
        for i in range(len(chars)):
            if chars[i] in level:
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                # Create the remaining chain of states for this keyword
                for j in range(i, len(chars)):
                    level[chars[j]] = {}
                    last_level, last_char = level, chars[j]
                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: 0}
                break
        if i == len(chars) - 1:
            level[self.delimit] = 0

    # Load the keyword list, one keyword per line
    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                self.add(str(keyword).strip())

    # Replace each matched keyword with repl; the message is lowercased first
    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        # A keyword ends here: mask it and skip past it
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                # Message ended in the middle of a partial match
                ret.append(message[start])
            start += 1
        return ''.join(ret)


if __name__ == "__main__":
    gfw = DFAFilter()
    path = "keywords.txt"
    gfw.parse(path)
    text = input("Enter text: ")
    # text="新疆骚乱苹果新品发布会雞八，操你妈逼的大傻逼你个哈哈哈胡爱思"
    result = gfw.filter(text)
    # print(text)
    print(result)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')
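To make the data structure easier to picture, here is a minimal sketch of the kind of nested-dictionary state table a DFA filter like the one above builds. It is a simplified stand-in rather than the code above: `build_table`, `contains_keyword`, and the sample words are hypothetical names chosen for illustration.

```python
import json

END = "\x00"  # end-of-keyword marker, playing the same role as `delimit`


def build_table(words):
    """Build a nested-dict state table: each character maps to the next state."""
    table = {}
    for word in words:
        level = table
        for char in word.lower().strip():
            level = level.setdefault(char, {})
        if level is not table:  # ignore empty words
            level[END] = 0
    return table


def contains_keyword(table, text):
    """Return True if any keyword occurs anywhere in text (case-insensitive)."""
    text = text.lower()
    for start in range(len(text)):
        level = table
        for char in text[start:]:
            # The marker itself is never a valid transition
            if char == END or char not in level:
                break
            level = level[char]
            if END in level:
                return True
    return False


if __name__ == "__main__":
    table = build_table(["spam", "span", "so"])
    # Shared prefixes ("spa") are stored once; END marks where a keyword stops.
    print(json.dumps(table, indent=2))
    print(contains_keyword(table, "No SPAM here"))
    print(contains_keyword(table, "clean text"))
```

Because keywords that share a prefix share the same chain of dictionaries, the cost of checking one starting position depends on the keyword length rather than on the size of the lexicon.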

3. TTMP, an algorithm devised by a community member (principle, code)

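Building an anti-spam mechanism:

We often run into spam: junk email in our inboxes, zombie followers on Sina Weibo, the endless ad posts on forums, and so on. Some people keep probing a site's loopholes and rules and use bots to post such ads for profit. Anti-spam mainly means using technical means to filter and screen data: data judged unacceptable is removed, and information the system finds suspicious is flagged and categorized. Anti-spam and manual review complement each other.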
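As a rough illustration of the "remove what is clearly bad, flag what is suspicious" idea, the sketch below scores a message with a few hand-written rules and sorts it into three buckets. The rules, weights, thresholds, and function names are all assumptions made for this example; production systems typically combine many more signals, often with a trained classifier.

```python
import re

# Hypothetical signals; real systems tune these against labeled data.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
REPEAT_RE = re.compile(r"(.)\1{4,}")    # same character 5 or more times in a row
CONTACT_RE = re.compile(r"\b\d{7,}\b")  # long digit runs, e.g. phone or chat IDs


def spam_score(text: str) -> int:
    """Add up simple rule-based signals; a higher score is more suspicious."""
    score = 2 * len(URL_RE.findall(text))
    if REPEAT_RE.search(text):
        score += 1
    if CONTACT_RE.search(text):
        score += 2
    return score


def triage(text: str, reject_at: int = 4, review_at: int = 2) -> str:
    """Map a score to 'reject', 'review' (flag for a human), or 'pass'."""
    score = spam_score(text)
    if score >= reject_at:
        return "reject"
    if score >= review_at:
        return "review"
    return "pass"


if __name__ == "__main__":
    messages = [
        "See you at lunch tomorrow",
        "Cheap watches http://example.com",
        "Buy now!!!!!! http://a.example http://b.example call 13800000000",
    ]
    for message in messages:
        print(triage(message), "<-", message)
```

Keeping a middle "review" bucket mirrors the point above that automated anti-spam and manual review work together: the rules handle the obvious cases, and people look at the borderline ones.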
Let's look at a few examples first:

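Facebook's anti-spam practice
Zhihu anti-cheating: spam text recognition
An overview of text anti-spam at Huajiao Live
[NLP text classification] A collection of text classification algorithms, from beginner to expert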

GitHub projects related to sensitive words:

1. ToolGood.Words

2. text-antispam

3. textfilter

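A collection of high-quality Chinese NLP resources:
It includes language detection, location and carrier lookup for Chinese and international mobile and landline numbers, gender inference from names, mobile number extraction, ID card number extraction, email extraction, BERT-related resources, and more.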
https://github.com/fighting41love/funNLP

Open it and you will find the treasure you need!