Chatbot: Text Tokenization

1. Prepare the dictionary and stop words

1.1 Prepare the dictionary
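As a minimal sketch of what preparing the dictionary can look like: jieba's `load_userdict` reads a plain-text file with one entry per line, where each line holds the word, an optional frequency, and an optional part-of-speech tag, separated by spaces and in that order. The entries, the `kc` tag, the helper name `write_user_dict`, and the file name below are all made up for illustration; the article reads the real path from `config.keywords_path`.

```python
import os
import tempfile

# Hypothetical entries: (word, part-of-speech tag). The tag "kc" is invented
# for this example; the frequency column is omitted, which jieba allows.
entries = [("人工智能", "kc"), ("python", "kc"), ("聊天机器人", "kc")]


def write_user_dict(entries, path):
    """Write one 'word tag' pair per line, separated by a single space."""
    with open(path, "w", encoding="utf-8") as f:
        for word, tag in entries:
            f.write(f"{word} {tag}\n")


# Hypothetical location used only for this demo.
dict_path = os.path.join(tempfile.mkdtemp(), "keywords.txt")
write_user_dict(entries, dict_path)

# Read the file back to show the on-disk layout.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

A file written this way can then be passed to `jieba.load_userdict(dict_path)`.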

1.2 Prepare the stop words

# Read one stop word per line; the file is assumed to be UTF-8 encoded
with open(config.stopwords_path, encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)
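To make the role of the stop-word set concrete, here is a small self-contained sketch that loads a stop-word file and filters a token list with it. The helper names `load_stopwords` and `remove_stopwords`, the sample file contents, and the UTF-8 encoding are assumptions for illustration, not part of the original project.

```python
import os
import tempfile


def load_stopwords(path):
    """Return the non-empty lines of a UTF-8 text file as a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical stop-word file containing a blank line and trailing whitespace.
path = os.path.join(tempfile.mkdtemp(), "stopwords.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("的\n了 \n\n吗\n")

stopwords = load_stopwords(path)
tokens = ["你", "吃", "了", "饭", "吗"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

A set is used so that each membership check stays constant-time even when the stop-word list is long.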

For a detailed walkthrough of jieba segmentation code, see this blog post: https://blog.csdn.net/weixin_44799217/article/details/115244829

2. Prepare a method that splits a sentence into single characters

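import re
import string

# Module-level constants, the same ones defined in cut_sentence.py in section 3
letters = string.ascii_lowercase  # characters treated as English letters
filters = [",", "-", ".", " "]  # punctuation and spaces to drop


def cut_by_word(sentence):
    # Split Chinese into single characters; keep each English word whole
    sentence = re.sub(r"\s+", " ", sentence)  # collapse whitespace runs (spaces, tabs) into one space
    sentence = sentence.strip()
    result = []
    temp = ""  # buffer for the English word currently being read
    for word in sentence:
        if word.lower() in letters:
            temp += word.lower()
        else:
            if temp != "":  # a non-letter ends the buffered English word
                result.append(temp)
                temp = ""
            # Compare the raw character: " ".strip() is "" and would never match " " in filters
            if word in filters:  # punctuation or space
                continue
            else:  # a single character
                result.append(word)
    if temp != "":  # flush an English word left at the end of the sentence
        result.append(temp)
    return result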
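The character loop above can also be written as a single regular expression. The sketch below is an alternative technique, not the article's method: it matches either a run of ASCII letters or any single character that is not whitespace, an ASCII letter, or one of the three filtered punctuation marks. The name `cut_by_word_regex` is hypothetical, and the behavior is intended to match the loop for ordinary Chinese and English input rather than for every Unicode edge case.

```python
import re

# A run of ASCII letters, or one character that is not whitespace,
# an ASCII letter, or a filtered punctuation mark (",", "-", ".").
TOKEN_RE = re.compile(r"[A-Za-z]+|[^\sA-Za-z,\-.]")


def cut_by_word_regex(sentence):
    """Split into lowercased English words and single non-letter characters."""
    return [
        tok.lower() if tok.isascii() and tok.isalpha() else tok
        for tok in TOKEN_RE.findall(sentence)
    ]


tokens = cut_by_word_regex("Python 是一门语言, very好.")
print(tokens)
```

The regex version is shorter, while the explicit loop is easier to extend when new character classes need special handling.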

3. Wrap up the tokenization method

Create a cut_sentence.py file under lib to build the tokenization method.

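import logging
import re
import string

import jieba
import jieba.posseg as psg  # segmentation that also returns part-of-speech tags

import config

# Silence jieba's log output
jieba.setLogLevel(logging.INFO)  # removes the four lines of red text printed at startup
# Load the user dictionary
jieba.load_userdict(config.keywords_path)
# Single-character splitting: English letters
letters = string.ascii_lowercase
# Single-character splitting: punctuation to drop
filters = [",", "-", ".", " "]
# Stop words, one per line; the file is assumed to be UTF-8 encoded
with open(config.stopwords_path, encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)


def cut(sentence, by_word=False, use_stopwords=False, with_sg=False):
    """
    :param sentence: str, the sentence to segment
    :param by_word: whether to split into single characters
    :param use_stopwords: whether to filter out stop words
    :param with_sg: whether to return part-of-speech tags
    :return: list of words, or list of (word, flag) tuples when with_sg is True
    """
    assert not (by_word and with_sg), "Part-of-speech tags are unavailable when splitting by single character"
    if by_word:
        return _cut_by_word(sentence)
    ret = psg.lcut(sentence)  # list of pair objects with .word and .flag
    if use_stopwords:
        # Keep the pair objects here so the .word access below still works
        ret = [i for i in ret if i.word not in stopwords]
    if with_sg:
        return [(i.word, i.flag) for i in ret]
    return [i.word for i in ret]


def _cut_by_word(sentence):
    # Split Chinese into single characters; keep each English word whole
    sentence = re.sub(r"\s+", " ", sentence)  # collapse whitespace runs into one space
    sentence = sentence.strip()
    result = []
    temp = ""  # buffer for the English word currently being read
    for word in sentence:
        if word.lower() in letters:
            temp += word.lower()
        else:
            if temp != "":  # a non-letter ends the buffered English word
                result.append(temp)
                temp = ""
            # Compare the raw character: " ".strip() is "" and would never match " " in filters
            if word in filters:  # punctuation or space
                continue
            else:  # a single character
                result.append(word)
    if temp != "":  # flush an English word left at the end of the sentence
        result.append(temp)
    return result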
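The module above reads two paths from `config`, which the article does not show. A minimal sketch of such a config.py could look like the following; the `corpus` directory and both file names are hypothetical placeholders, and only the attribute names `keywords_path` and `stopwords_path` come from the code above.

```python
# config.py (sketch): paths consumed by lib/cut_sentence.py
import os

# The project root is assumed to be the directory containing this file.
try:
    BASE_DIR = os.path.dirname(os.path.abspath(__file__))
except NameError:  # __file__ is undefined, e.g. in an interactive session
    BASE_DIR = os.getcwd()

# Hypothetical locations; point these at your own files.
keywords_path = os.path.join(BASE_DIR, "corpus", "keywords.txt")
stopwords_path = os.path.join(BASE_DIR, "corpus", "stopwords.txt")
```

Keeping the paths in one module means cut_sentence.py does not need to change when the corpus files move.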
