Jieba Word Segmentation: Removing Stop Words

My current project needs its data segmented with jieba, with the stop words removed along the way. Here is the code:

import jieba
import os


def fun(filepath):
    # Walk the folder tree and return a list of all file paths
    arr = []
    for root, dirs, files in os.walk(filepath):
        for fn in files:
            arr.append(os.path.join(root, fn))
    return arr


# Build the stop word list
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


# Remove stop words from a segmented sentence
def movestopwords(sentence):
    stopwords = stopwordslist('D:/2181729/stop_words.txt')  # path of the stop word file
    santi_words = [x for x in sentence if len(x) > 1 and x not in stopwords]
    return santi_words


def segmentor(text):
    words = jieba.cut(text, cut_all=False)
    return words


# stopwords = {}.fromkeys(['的', '包括', '等', '是', '《', '》', '（', '）', '.', '、', '。'])
stopwords = stopwordslist('D:/2181729/stop_words.txt')
filepath = r'D:/2181729/data'
filelist = fun(filepath)  # get the file list
count = 0
print(len(filelist))
# f1 = open('D:/2181729/nerfcdata/1.txt', 'a+')
for file in filelist:
    text = ""  # start from an empty buffer for each file
    with open(file, encoding='UTF-8') as f:
        for line in f:
            segs = jieba.cut(line, cut_all=False)
            for seg in segs:
                if seg not in stopwords:
                    text += seg
    # Segment the filtered text again and join the words with '/'
    words = segmentor(text)
    # print('/'.join(words))
    count += 1
    output = '/'.join(words)
    # Name the output file after the input file
    out_path = 'D:/2181729/nerfcdata/' + os.path.basename(file)
    with open(out_path, 'w', encoding='UTF-8') as f1:
        print(output)
        f1.write(output)
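
One possible variation on the filtering step, as a minimal sketch rather than the method used above: load the stop words into a set so that each membership test is fast, and filter the token stream directly instead of joining the text back together and segmenting it a second time. The helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the demo uses a hand-written token list standing in for the output of `jieba.cut`. Unlike the loop above, this sketch also drops whitespace-only tokens such as newlines.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are not blank and not in the stop word set."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Demo with a temporary stop word file and a hand-written token list.
# In the real pipeline the tokens would come from jieba.cut(line).
with tempfile.TemporaryDirectory() as tmp:
    stop_path = os.path.join(tmp, "stop_words.txt")
    with open(stop_path, "w", encoding="utf-8") as f:
        f.write("的\n是\n。\n")
    stopwords = load_stopwords(stop_path)

tokens = ["结巴", "是", "常用", "的", "分词", "工具", "。", "\n"]
filtered = remove_stopwords(tokens, stopwords)
print("/".join(filtered))
```

Because `remove_stopwords` accepts any iterable of strings, the generator returned by `jieba.cut(line, cut_all=False)` could be passed to it directly.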
