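Summary of Chinese Natural Language Preprocessing

Summary of Chinese Text Preprocessing

1. Preparing the text data

2. Converting between full-width and half-width characters

3. Converting Chinese numerals to Arabic digits

4. Converting uppercase letters to lowercase

5. Removing emoticons (keeping only Chinese, English letters, and digits)

6. Removing all other characters (keeping only Chinese)

7. Chinese word segmentation

8. Converting between Traditional and Simplified Chinese

9. Filtering Chinese stopwords

10. Writing the cleaned data to a CSV file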


1. Preparing the text data

(1) Use an existing corpus.

(2) Build your own corpus with a web crawler (tools such as beautifulsoup can help).


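# Read every file in filelist and return the list of texts and the list of labels.
# Each line is expected to be tab-separated: the first field is the label, the last field is the text.
def filelist_contents_labels(filelist):
    contents = []
    labels = []
    for file in filelist:
        with open(file, "r", encoding="utf-8") as f:
            for row in f.read().splitlines():
                sentence = row.split('\t')
                contents.append(sentence[-1])
                # Label 'other' maps to 0, every other label maps to 1
                if sentence[0] == 'other':
                    labels.append(0)
                else:
                    labels.append(1)
    return contents, labels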
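To make the expected input format concrete, here is a small sketch that writes a two-line tab-separated file and parses it the same way as the function above. The sample sentences, the `spam` label, and the helper name `read_labeled_lines` are assumptions made up for illustration; the only things taken from the code above are that the first field is the label (`other` maps to 0) and the last field is the text.

```python
import os
import tempfile


def read_labeled_lines(path):
    """Parse 'label<TAB>text' lines; 'other' maps to 0, any other label to 1."""
    contents, labels = [], []
    with open(path, "r", encoding="utf-8") as f:
        for row in f.read().splitlines():
            fields = row.split("\t")
            contents.append(fields[-1])
            labels.append(0 if fields[0] == "other" else 1)
    return contents, labels


# Hypothetical sample data, written to a temporary file.
sample = "other\t今天天气不错\nspam\t点击领取优惠券\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

contents, labels = read_labeled_lines(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(contents, labels)
```

Note that a line without any tab is still accepted: the whole line becomes the text and the label becomes 1, which may or may not be what you want for malformed rows.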

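2. Converting between full-width and half-width characters

In natural language processing, inconsistent use of full-width and half-width characters leads to inconsistent information extraction, so the text needs to be normalized to one form. Chinese characters are always full-width; only English letters, digits, and symbols come in both full-width and half-width forms. A letter or digit that occupies the width of one Chinese character is full-width, and one that occupies half that width is half-width. Punctuation marks also differ between Chinese and English input modes and between full-width and half-width forms.

Regular case (excluding the space): full-width characters have Unicode code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E), and half-width characters have code points 33 to 126 (hexadecimal 0x21 to 0x7E). The two ranges differ by a constant offset of 65248 (0xFEE0).

Special case: the space. The full-width space is 12288 (0x3000) and the half-width space is 32 (0x20).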

# Full-width to half-width
def full_to_half(sentence):      # input is a single sentence
    change_sentence = ""
    for word in sentence:
        inside_code = ord(word)
        if inside_code == 12288:    # full-width space: convert directly
            inside_code = 32
        elif inside_code >= 65281 and inside_code <= 65374:  # other full-width characters: subtract the offset
            inside_code -= 65248
        change_sentence += chr(inside_code)
    return change_sentence

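ord() is the inverse of chr(): it takes a single character (a string of length 1) and returns its Unicode code point as an integer, which for ASCII characters equals the ASCII value. Passing a string whose length is not 1 raises a TypeError. In Python 2 the counterpart for Unicode objects was unichr(); in Python 3, which the code here needs because it calls chr() on code points above 255, chr() covers the whole Unicode range.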
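A few calls make the ord()/chr() relationship concrete. This is only an illustrative sketch for Python 3; the characters chosen are arbitrary.

```python
# ord() maps a single character to its Unicode code point; chr() is the inverse.
print(ord("A"))             # 65
print(ord("中"))            # 20013
print(chr(20013))           # 中
print(hex(ord("\u3000")))   # full-width space -> 0x3000

# A string that is not exactly one character long raises TypeError.
try:
    ord("AB")
except TypeError as exc:
    print("TypeError:", exc)
```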

# Half-width to full-width
def half_to_full(sentence):      # input is a single sentence
    change_sentence = ""
    for word in sentence:
        inside_code = ord(word)
        if inside_code == 32:    # half-width space: convert directly
            inside_code = 12288
        elif inside_code >= 33 and inside_code <= 126:  # other half-width characters: add the offset
            inside_code += 65248
        change_sentence += chr(inside_code)
    return change_sentence
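As an alternative to the two per-character loops above, the same mapping can be expressed as translation tables for `str.translate`. This is a minimal sketch, not the approach used in this post; the names `FULL_TO_HALF`, `HALF_TO_FULL`, `to_half_width`, and `to_full_width` are made up for illustration.

```python
# Constant offset between a full-width character and its half-width form.
OFFSET = 0xFF01 - 0x21  # 65248

# Map every full-width code point in 0xFF01..0xFF5E to its half-width form,
# plus the special case of the full-width space (0x3000 -> 0x20).
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5E + 1)}
FULL_TO_HALF[0x3000] = 0x20

# Invert the table for the opposite direction.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def to_half_width(text):
    """Convert full-width letters, digits, symbols, and spaces to half-width."""
    return text.translate(FULL_TO_HALF)


def to_full_width(text):
    """Convert printable ASCII characters and spaces to full-width."""
    return text.translate(HALF_TO_FULL)


print(to_full_width("NLP 2024!"))
print(to_half_width(to_full_width("NLP 2024!")))
```

Characters outside the two ranges, such as Chinese characters, are absent from the tables and pass through unchanged.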

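3. Converting Chinese numerals to Arabic digits

# Convert Chinese numeral characters to Arabic digits.
# The replacement is character by character, so positional units such as 十 and 百 are left untouched.
def big2small_num(sentence):
    numlist = {"一": "1", "二": "2", "三": "3", "四": "4", "五": "5",
               "六": "6", "七": "7", "八": "8", "九": "9", "零": "0"}
    for item in numlist:
        sentence = sentence.replace(item, numlist[item])
    return sentence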
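The same digit-by-digit replacement can be written with a translation table, which also makes its limits easy to see. This is a sketch under the same assumption as the function above (a plain one-to-one character mapping); the names `DIGIT_TABLE` and `chinese_digits_to_arabic` are hypothetical.

```python
# One-to-one table from Chinese numeral characters to ASCII digits.
DIGIT_TABLE = str.maketrans("零一二三四五六七八九", "0123456789")


def chinese_digits_to_arabic(text):
    """Replace each Chinese digit character with the matching ASCII digit."""
    return text.translate(DIGIT_TABLE)


print(chinese_digits_to_arabic("二零二三年"))  # 2023年
print(chinese_digits_to_arabic("三十五"))      # 3十5 (units like 十 are not interpreted)
print(chinese_digits_to_arabic("一些问题"))    # 1些问题 (digits inside ordinary words change too)
```

Because the mapping ignores context, it works well for digit strings such as years or phone numbers, and less well for numbers written with 十/百/千 or for words that merely contain a numeral character.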

4. Converting uppercase letters to lowercase

# Convert uppercase letters to lowercase
def upper2lower(sentence):
    new_sentence = sentence.lower()
    return new_sentence

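5. Removing emoticons (keeping only Chinese, English letters, and digits)

This step uses regular expressions: bracketed emoticon tags such as [微笑] are removed first, then every character that is not Chinese, an English letter, or a digit.

import re

# Remove emoticons from the text (keep only Chinese, English letters, and digits)
def clear_character(sentence):
    pattern1 = r'\[.*?\]'    # bracketed emoticon tags
    # Anything that is not Chinese, an English letter, or a digit.
    # (A caret is only needed once, right after the opening bracket.)
    pattern2 = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')
    line1 = re.sub(pattern1, '', sentence)
    line2 = re.sub(pattern2, '', line1)
    new_sentence = ''.join(line2.split())  # remove whitespace
    return new_sentence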
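Here is a short, self-contained sketch of the two-step cleanup applied to a sample sentence. The function name `keep_chinese_english_digits` and the sample text are made up for illustration; the patterns follow the description above.

```python
import re

BRACKET_TAG = re.compile(r"\[.*?\]")                    # e.g. [微笑], [doge]
NOT_ALLOWED = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]")  # not Chinese, letter, or digit


def keep_chinese_english_digits(text):
    """Drop bracketed emoticon tags, then everything except Chinese, letters, digits."""
    text = BRACKET_TAG.sub("", text)
    return NOT_ALLOWED.sub("", text)


print(keep_chinese_english_digits("今天好开心[微笑]\U0001F600 good day 666！"))
# 今天好开心goodday666
```

The order matters: if the second pattern ran first it would strip the brackets but keep the tag name, leaving 微笑 in the text.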

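6. Removing all other characters (keeping only Chinese)

import re

# Remove letters, digits, emoticons, and all other characters.
# This function shares its name with the one in section 5; rename one of them if both live in the same module.
def clear_character(sentence):
    pattern1 = r'[a-zA-Z0-9]'
    pattern2 = r'\[.*?\]'
    pattern3 = re.compile(r'[^\s1234567890:：\u4e00-\u9fa5]+')
    pattern4 = r'[’!"#$%&\'()*+,\-./:：;<=>?@\[\\\]^_`{|}~]+'
    line1 = re.sub(pattern1, '', sentence)   # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)      # remove bracketed emoticon tags
    line3 = re.sub(pattern3, '', line2)      # remove other characters
    line4 = re.sub(pattern4, '', line3)      # remove leftover colons (half-width and full-width) and other symbols
    new_sentence = ''.join(line4.split())    # remove whitespace
    return new_sentence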
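If the goal is strictly "keep only Chinese", the four patterns above can be collapsed into two. The following is a simplified sketch rather than the function used in this post, and it assumes that "Chinese" means the range \u4e00-\u9fa5 used throughout this article (the commonly used CJK ideographs; rarer extension characters fall outside it). The name `keep_chinese_only` is hypothetical.

```python
import re

BRACKET_TAG = re.compile(r"\[.*?\]")          # bracketed emoticon tags such as [赞]
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")  # any run of non-Chinese characters


def keep_chinese_only(text):
    """Remove emoticon tags first (their names may be Chinese), then all non-Chinese."""
    text = BRACKET_TAG.sub("", text)
    return NON_CHINESE.sub("", text)


print(keep_chinese_only("回复：iPhone 15 真不错[赞]！！"))  # 回复真不错
```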

7. Chinese word segmentation

This article uses jieba for word segmentation.
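Since no segmentation code is shown here, the following is a minimal sketch of how jieba might be wrapped. `jieba.lcut` is jieba's list-returning segmentation call (precise mode by default); the wrapper name `segment` and its `cut` parameter are assumptions added so the helper can be tried with any tokenizer, and jieba itself has to be installed separately (`pip install jieba`).

```python
def segment(sentence, cut=None):
    """Split a sentence into a list of non-blank tokens.

    cut is any callable that maps a string to a list of tokens. When it is
    omitted, jieba.lcut is used, which requires the third-party jieba package.
    """
    if cut is None:
        import jieba  # imported lazily so the helper can be loaded without jieba
        cut = jieba.lcut
    # Drop tokens that consist only of whitespace.
    return [token for token in cut(sentence) if token.strip()]


# With jieba installed:
#     segment("我爱自然语言处理")
# Without jieba, any callable works, e.g. a character-level split:
print(segment("我 爱NLP", cut=list))  # ['我', '爱', 'N', 'L', 'P']
```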

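8. Converting between Traditional and Simplified Chinese

# Assumes the standalone langconv.py module (and the zh_wiki.py table it depends on) is on the import path
from langconv import Converter

# Convert to Simplified Chinese
def Traditional2Simplified(sentence):
    sentence = Converter('zh-hans').convert(sentence)
    return sentence

# Convert to Traditional Chinese
def Simplified2Traditional(sentence):
    sentence = Converter('zh-hant').convert(sentence)
    return sentence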

9. Filtering Chinese stopwords

import jieba

# Remove stopwords and return the list of filtered texts
def clean_stopwords(contents):
    contents_list = []
    # Load the stopword list (one word per line) into a set for fast lookup
    with open('data/stopwords.txt', encoding="utf-8") as f:
        stopwords_list = set(line.rstrip() for line in f)
    for row in contents:      # filter each text in turn
        words_list = jieba.lcut(row)
        words = [w for w in words_list if w not in stopwords_list]
        sentence = ''.join(words)   # rejoin the remaining words into a sentence
        contents_list.append(sentence)
    return contents_list
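The filtering step itself does not depend on jieba or on a file, so it can be tried in isolation. In this sketch the stopword set and the pre-segmented tokens are made-up sample data, and `filter_stopwords` is a hypothetical helper name.

```python
def filter_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]


# Hypothetical stopword set and an already segmented sentence.
stopwords = {"的", "了", "是"}
tokens = ["今天", "的", "天气", "是", "晴朗", "的"]

kept = filter_stopwords(tokens, stopwords)
print(kept)           # ['今天', '天气', '晴朗']
print("".join(kept))  # 今天天气晴朗
```

Using a set rather than a list for the stopwords keeps each membership test constant-time, which matters once the stopword file has thousands of entries.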

10. Writing the cleaned data to a CSV file

import pandas as pd

# Write the cleaned texts and labels to a .csv file
def after_clean2csv(contents, labels):  # inputs are the list of texts and the list of labels
    columns = ['contents', 'labels']
    save_file = pd.DataFrame(columns=columns, data=list(zip(contents, labels)))
    save_file.to_csv('data/clean_data.csv', index=False, encoding="utf-8")
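If pandas is not available, the same two-column layout can be produced with the standard library. This is an alternative sketch using the `csv` module, not the pandas approach shown above; the helper name `rows_to_csv` and the sample rows are assumptions, and it writes to an in-memory buffer instead of `data/clean_data.csv` so it can be run anywhere.

```python
import csv
import io


def rows_to_csv(contents, labels):
    """Serialize (content, label) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["contents", "labels"])
    writer.writerows(zip(contents, labels))
    return buffer.getvalue()


# Hypothetical cleaned data; the second text contains a comma on purpose.
csv_text = rows_to_csv(["今天天气不错", "点击,领取优惠券"], [0, 1])
print(csv_text)
```

The writer quotes any field that contains a comma, so texts with punctuation left in them still round-trip correctly when the file is read back.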
