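Chinese Word Segmentation and Stop-Word Removal

The three segmentation modes of jieba

Precise mode: splits the text into exact, non-overlapping words, with no redundant tokens.
Full mode: scans out every possible word in the text, so the output contains redundancy.
Search-engine mode: builds on precise mode and splits long words a second time.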
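To make the difference between the modes concrete, here is a toy sketch in plain Python. It is not jieba's implementation: the dictionary TOY_DICT and the helpers full_mode and search_mode are hypothetical, single characters are ignored, and precise-mode output is simply passed in by hand.

```python
# Toy dictionary (hypothetical); jieba ships a far larger one with word frequencies.
TOY_DICT = {"今天", "今天天气", "天天", "天气", "太阳"}

def full_mode(text, vocab):
    """List every multi-character dictionary word found in text, by start position."""
    found = []
    for start in range(len(text)):
        for end in range(start + 2, len(text) + 1):
            if text[start:end] in vocab:
                found.append(text[start:end])
    return found

def search_mode(words, vocab):
    """Re-split long words: emit dictionary 2-grams and 3-grams, then the word itself."""
    out = []
    for w in words:
        for n in (2, 3):
            if len(w) > n:
                for i in range(len(w) - n + 1):
                    if w[i:i + n] in vocab:
                        out.append(w[i:i + n])
        out.append(w)
    return out

full_result = full_mode("今天天气", TOY_DICT)
# The input list stands in for a precise-mode result.
search_result = search_mode(["今天天气", "太阳"], TOY_DICT)
print(full_result)    # ['今天', '今天天气', '天天', '天气']
print(search_result)  # ['今天', '天天', '天气', '今天天气', '太阳']
```

Full mode reports overlapping candidates, while search-engine mode keeps the long word and adds the shorter dictionary words found inside it.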

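An overview of the jieba API

jieba.cut(s): precise mode; returns an iterable (a generator).
jieba.cut(s, cut_all=True): full mode; yields every possible word in the text s, as a generator.
jieba.cut_for_search(s): search-engine mode, suited to search indexing; returns a generator.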

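jieba.lcut(s): precise mode; returns a list. Recommended.
jieba.lcut(s, cut_all=True): full mode; returns a list. Recommended.
jieba.lcut_for_search(s): search-engine mode; returns a list. Recommended.

jieba.add_word(w): adds the new word w to the segmentation dictionary.
jieba.del_word(w): removes the word w from the segmentation dictionary.

Let's try it out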

import jieba
s='今天天气好冷，快出太阳'
jieba.lcut(s)

['今天天气', '好', '冷', '，', '快出', '太阳']

# Full mode: the output contains redundant words
jieba.lcut(s,cut_all=True)

['今天', '今天天气', '天天', '天气', '好', '冷', '', '', '快', '出', '太阳']

# Search-engine mode: sits between the two modes above
jieba.lcut_for_search(s)

['今天', '天天', '天气', '今天天气', '好', '冷', '，', '快出', '太阳']

Using a custom dictionary

jieba.lcut('牛气哄哄大法师在种菜')

['牛气', '哄哄', '大法师', '在', '种菜']

Adding a word manually

# Define 牛气哄哄 as a single word
jieba.add_word("牛气哄哄")

jieba.lcut('牛气哄哄大法师在种菜')

['牛气哄哄', '大法师', '在', '种菜']

Segmenting with a dictionary loaded from a file

jieba.lcut('菠萝菠萝蜜芝麻开门嘛咪嘛咪轰')

['菠萝', '菠萝蜜', '芝麻开门', '嘛', '咪', '嘛', '咪', '轰']

Loading the file

jieba.load_userdict(r'C:\Users\Dell\Desktop\aaa.txt')

List the phrases in the file, one per line

jieba.lcut('菠萝菠萝蜜芝麻开门嘛咪嘛咪轰')

['菠萝菠萝蜜', '芝麻开门', '嘛咪嘛咪轰']

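Using Sogou cell dictionaries

http://pinyin.sogou.com/dict/

Browse by category or search by keyword to find and download the dictionaries you need.
Use a conversion tool (such as 深蓝词库转换 or 奥创词库转换) to convert them to txt format, then load the resulting dictionary in your program.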

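Removing stop words

Remove stop words after segmentation (crude).
Remove stop words with the extract_tags function.

1. Removing stop words after segmentation

Basic steps:

Read in the stop-word list file
Segment the text as usual
Remove the stop words from the segmentation result

newlist = [word for word in wordlist if word not in stopwords]

Limitation: a stop word is only removed if the segmenter splits it out as a separate token.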

ss='菠萝菠萝蜜芝麻开门的嘛咪嘛咪轰烦大哒'
jieba.lcut(ss)

['菠萝菠萝蜜', '芝麻开门', '的', '嘛咪嘛咪轰', '烦']

wordlist=jieba.lcut(ss)
newlist=[word for word in wordlist if word not in ['烦','的']]
newlist

['菠萝菠萝蜜', '芝麻开门', '嘛咪嘛咪轰']
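The limitation of this method can be seen without running a segmenter at all. In the sketch below the two token lists are written by hand to stand in for segmenter output (they are hypothetical, as is the helper remove_stopwords): filtering works by exact match, so a stop word that stays glued to a neighbouring token is not removed.

```python
stopwords = {'的', '烦'}   # a set gives constant-time membership tests

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not exact matches for a stop word."""
    return [t for t in tokens if t not in stopwords]

# Hand-written token lists standing in for segmenter output (hypothetical).
split_out = ['芝麻开门', '的', '嘛咪嘛咪轰']   # '的' is its own token
merged = ['芝麻开门的', '嘛咪嘛咪轰']          # '的' is glued to its neighbour

clean_split = remove_stopwords(split_out, stopwords)
clean_merged = remove_stopwords(merged, stopwords)
print(clean_split)    # ['芝麻开门', '嘛咪嘛咪轰']
print(clean_merged)   # ['芝麻开门的', '嘛咪嘛咪轰']
```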

Loading a stop-word list

import pandas as pd
stopwords=pd.read_table('F:\\HMM\\stopwords.txt',names=['words'],encoding='utf-8')
stopwords.head()

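Removing the words in the sentence that appear in the stop-word list

stopset = set(stopwords['words'])  # build the lookup set once, not once per word
newlist = [word for word in jieba.cut(ss) if word not in stopset]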
print(newlist)

['菠萝菠萝蜜', '芝麻开门', '嘛咪嘛咪轰', '烦']

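You can also read the stop words straight into a list, which is more efficient

with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]  # strip the trailing newline from each entry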
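Reading the file by hand takes a little care: each line ends with a newline, and blank lines or duplicates may be present. One way to handle this is sketched below. The helper load_stopwords and the demo file contents are hypothetical; a set is used so that each lookup is a hash test and not a scan of the whole list.

```python
import os
import tempfile

def load_stopwords(path, encoding='utf-8'):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}

# Demo with a throwaway file; the contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'stopwords.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('的\n\n烦\n的\n')
    stopset = load_stopwords(demo_path)

tokens = ['菠萝菠萝蜜', '芝麻开门', '的', '嘛咪嘛咪轰', '烦']
filtered = [t for t in tokens if t not in stopset]
print(filtered)   # ['菠萝菠萝蜜', '芝麻开门', '嘛咪嘛咪轰']
```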

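2. Removing stop words with the extract_tags function

Characteristics: extracts keywords using the TF-IDF algorithm and drops stop words before extraction; you can specify your own stop-word dictionary.

jieba.analyse.set_stop_words()

The argument is the path to the file that lists the stop words to remove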

import jieba.analyse as ana
ana.set_stop_words('F:\\HMM\\stopwords.txt')
sentence='大数据专业的同学棒棒哒！'
ana.extract_tags(sentence)

['棒棒', '同学', '专业', '数据']

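Part-of-speech tagging

posseg.cut(): returns the segmentation result with a part-of-speech tag attached to each word
The tags follow an ICTCLAS-compatible tag set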

import jieba.posseg as psg
sentence='大数据专业的同学棒棒哒！'
psg.lcut(sentence)

[pair('大', 'a'),
pair('数据', 'n'),
pair('专业', 'n'),
pair('的', 'uj'),
pair('同学', 'n'),
pair('棒棒', 'n'),
pair('哒', 'zg'),
pair('！', 'x')]