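贪心NLP: jieba word segmentation, stop-word filtering, word normalization, and the bag-of-words model

Word segmentation with jieba (结巴).

jieba is one of the most widely used Chinese word segmentation tools.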

import jieba

# cut_all=False selects precise mode (the default)
seg_list = jieba.cut('中南财经政法大学在茶山刘', cut_all=False)
print('/'.join(seg_list))
# jieba's dictionary has no entry for 茶山刘, so add it
jieba.add_word('茶山刘')
seg_list = jieba.cut('中南财经政法大学在茶山刘', cut_all=False)
print('/'.join(seg_list))

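Output: not preserved in the original post. After add_word, the second call should keep 茶山刘 as a single token.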
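Why adding a dictionary entry changes the segmentation can be illustrated with a much simpler segmenter. The sketch below uses forward maximum matching, which is not jieba's actual algorithm, and a hypothetical three-word toy dictionary (`toy_dict`) that stands in for jieba's real one.

```python
def fmm_segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, not jieba's real one.
toy_dict = {'中南财经政法大学', '在', '茶山'}
sentence = '中南财经政法大学在茶山刘'

before = fmm_segment(sentence, toy_dict)
toy_dict.add('茶山刘')  # plays the role of jieba.add_word
after = fmm_segment(sentence, toy_dict)

print('/'.join(before))
print('/'.join(after))
```

With the toy dictionary, 茶山刘 is split until the new entry is added, after which the longest match wins and it stays in one piece.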

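Stop-word filtering

Words that occur extremely often or extremely rarely contribute little to text analysis, so they are usually filtered out during preprocessing. In English, classic stop words include "the", "an", and so on.

Method 1: define your own stop words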

# Method 1: build your own stop-word list
stop_words = ["the", "an", "is", "there"]
# Usage: assume word_list holds the words of the text
word_list = ["we", "are", "the", "students"]
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)

Output:

['we', 'are', 'students']

Method 2: use an existing stop-word list

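# Method 2: use an existing stop-word list
# (requires a one-time nltk.download("stopwords"))
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# Number of stop words in the list
print(len(stop_words))
# Print the first 10 to take a look
print(stop_words[:10])
# word_list is the list defined in Method 1
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)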

Output (the count may differ between versions of the NLTK stop-word data):

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
['students']
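Both lists above are fixed in advance. The frequency criterion mentioned at the start of this section can also be applied directly to a corpus. The following is a minimal sketch of that idea; the function name `filter_by_frequency` and the thresholds `max_doc_ratio` and `min_count` are illustrative assumptions, not values from the course.

```python
from collections import Counter


def filter_by_frequency(docs, max_doc_ratio=0.8, min_count=2):
    """Drop words that appear in too many documents or too few times overall."""
    total_counts = Counter(word for doc in docs for word in doc)
    doc_counts = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    keep = {
        word
        for word in total_counts
        if total_counts[word] >= min_count
        and doc_counts[word] / n_docs <= max_doc_ratio
    }
    return [[word for word in doc if word in keep] for doc in docs]


# Hypothetical, already tokenized documents.
docs = [
    ["the", "students", "like", "the", "course"],
    ["the", "course", "covers", "segmentation"],
    ["the", "students", "practice", "segmentation"],
]
filtered = filter_by_frequency(docs)
print(filtered)
```

Here "the" is removed because it occurs in every document, and words seen only once are removed as too rare. Sensible thresholds depend on the corpus size and the task.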

Word normalization (stemming)

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
test_strs = ['caresses', 'flies', 'dies', 'mules', 'denied',
             'died', 'agreed', 'owned', 'humbled', 'sized',
             'meeting', 'stating', 'siezing', 'itemization',
             'sensational', 'traditional', 'reference', 'colonizer',
             'plotted']
singles = [stemmer.stem(word) for word in test_strs]
print(' '.join(singles))

Output:

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot
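To give a feel for how a stemmer produces results like these, here is a sketch of just the first rule group (step 1a) of the original Porter algorithm, which handles plural suffixes. The function name `porter_step_1a` is made up for illustration. The full algorithm has several more steps, and NLTK's default mode appears to add a few special cases (the output above maps dies to die, where this rule alone would give di), so the sketch will not match NLTK on every word.

```python
def porter_step_1a(word):
    """Step 1a of the original Porter algorithm: plural suffixes only."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried longest suffix first, which is why "caress" is left alone instead of losing its final "s".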

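Bag-of-words vectors: converting text into vectors.

Models take numeric vectors as input, so text has to be converted into vectors first.

Counting word occurrences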

# Method 1: bag-of-words model (raw word counts)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())

Output:

['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
[[0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 1 0 1 0 0 2 0 0 1 0 0 1 0 1 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 2 0 0 2 0 1]]
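The count matrix above can be reproduced by hand, which makes it clearer what the vectorizer does. The sketch below uses only the standard library. Its `tokenize` helper is an approximation of CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters), assumed here for illustration.

```python
import re
from collections import Counter

corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def count_vector(tokens, vocabulary):
    # One count per vocabulary term, in vocabulary order.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


tokenized = [tokenize(doc) for doc in corpus]
# The vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
vectors = [count_vector(doc, vocabulary) for doc in tokenized]

print(vocabulary)
for row in vectors:
    print(row)
```

Each column corresponds to one vocabulary term and each row to one sentence, so word order within a sentence is lost. That loss is what the name "bag of words" refers to.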

The tf-idf method: weighting each count by how rare the term is across the corpus

# Method 2: tf-idf weighting (reuses corpus from the count example)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())

Output:

['actually', 'and', 'beijing', 'but', 'car', 'denied', 'from', 'going', 'he', 'in', 'is', 'lied', 'lost', 'mike', 'my', 'phone', 'request', 'shanghai', 'the', 'to', 'was']
[[0.         0.         0.39379499 0.         0.         0.
  0.39379499 0.39379499 0.26372909 0.         0.39379499 0.
  0.         0.         0.         0.         0.         0.39379499
  0.         0.39379499 0.        ]
 [0.35819397 0.         0.         0.35819397 0.         0.35819397
  0.         0.         0.47977335 0.         0.         0.35819397
  0.         0.         0.35819397 0.         0.35819397 0.
  0.         0.         0.        ]
 [0.         0.26726124 0.         0.         0.26726124 0.
  0.         0.         0.         0.26726124 0.         0.
  0.26726124 0.26726124 0.         0.53452248 0.         0.
  0.53452248 0.         0.26726124]]
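The numbers above can be checked by hand. With smooth_idf=False, scikit-learn computes idf(t) = ln(N / df(t)) + 1, multiplies it by the raw term count, and then scales each row to unit Euclidean length (the default norm). The sketch below recomputes the matrix under those settings with the standard library only. The tokenization regex is an assumed approximation of the vectorizer's default.

```python
import math
import re
from collections import Counter

corpus = [
    'He is going from Beijing to Shanghai.',
    'He denied my request, but he actually lied.',
    'Mike lost the phone, and phone was in the car.',
]

# Lowercase and keep tokens of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in docs for tok in set(doc))
# Unsmoothed idf, matching smooth_idf=False.
idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocabulary}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 normalization
    return [v / norm for v in raw]


tfidf = [tfidf_row(doc) for doc in docs]

i_beijing = vocabulary.index("beijing")
i_he = vocabulary.index("he")
print(round(tfidf[0][i_beijing], 8), round(tfidf[0][i_he], 8))
```

In the first sentence, "he" gets a lower weight than "beijing" because it also appears in the second sentence, so its idf is smaller. The two printed values should agree with the first row of the matrix above up to rounding.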
