Extracting the Sogou Chinese corpus with Python

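The Sogou Chinese corpus can be downloaded from: https://download.csdn.net/download/kinas2u/1277550
The downloaded archive contains many subfolders (under ./SogouC_mini_20061102/Sample), and each subfolder holds many txt corpus files. I want to merge all of them into a single txt file whose content has already been word-segmented.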
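
As a minimal sketch of the folder layout described above, the function below gathers every txt file that sits one level below a root folder. The name `collect_txt_files` and the `.txt` extension filter are illustrative assumptions; the script in this post simply takes every file found in each subfolder.

```python
import os


def collect_txt_files(root):
    """Return sorted paths of .txt files in the immediate subfolders of root."""
    paths = []
    for folder in sorted(os.listdir(root)):
        sub = os.path.join(root, folder)
        if not os.path.isdir(sub):
            continue  # skip stray files that sit next to the category folders
        for name in sorted(os.listdir(sub)):
            if name.lower().endswith('.txt'):
                paths.append(os.path.join(sub, name))
    return paths
```

For this corpus the call would be `collect_txt_files('./SogouC_mini_20061102/Sample')`.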

The processing script is below.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
import os

import jieba

stop_words_file = './SogouC_mini_20061102/stop_words.txt'
data_dir = './SogouC_mini_20061102/Sample'
# data_dir = './train'
output_file = 'sogou_seg.txt'

# Load the stop words (GB18030-encoded, one per line) into a set for fast lookup.
with io.open(stop_words_file, 'r', encoding='gb18030') as f:
    stop_words = set(line.strip() for line in f)

# Collect the category subfolders under data_dir, skipping stray files.
d_s = []
for folder in os.listdir(data_dir):
    d = os.path.join(data_dir, folder)
    if not os.path.isdir(d):
        continue
    d_s.append(d)

# Collect every corpus file inside each category subfolder.
data_files = []
for folder_cls in d_s:
    for txt_file in os.listdir(folder_cls):
        data_files.append(os.path.join(folder_cls, txt_file))

# Segment each line with jieba, drop stop words and whitespace-only tokens,
# and write the remaining words to one UTF-8 file, each followed by a space.
# Mode 'w' overwrites any sogou_seg.txt left over from an earlier run.
with io.open(output_file, 'w', encoding='utf-8') as output:
    for data_file in data_files:
        with io.open(data_file, 'r', encoding='gb18030') as content:
            for line in content:
                for word in jieba.cut(line):
                    if word.strip() and word not in stop_words:
                        output.write(word + u' ')
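
The per-line filtering step of the script can be isolated into a small function, which makes it easy to check without the corpus. This is only a sketch: `filter_tokens` is a hypothetical helper, and the hand-written token list stands in for the output of `jieba.cut`, which the script itself uses. Unlike the script, it joins the words with single spaces and leaves no trailing space.

```python
def filter_tokens(tokens, stop_words):
    """Join tokens with spaces, dropping stop words and whitespace-only tokens."""
    kept = [t for t in tokens if t.strip() and t not in stop_words]
    return ' '.join(kept)


# Stand-in for jieba.cut(line): an already segmented line, including the
# whitespace tokens that a segmenter may emit.
tokens = ['我', '的', ' ', '语料库', '\n']
stop_words = {'的'}
print(filter_tokens(tokens, stop_words))  # prints: 我 语料库
```

In the script, the equivalent call would be `filter_tokens(jieba.cut(line), stop_words)`.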

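The Chinese stop-word list used by the script (stop_words.txt) can be downloaded from https://download.csdn.net/download/majinlei121/10733352
The output file is sogou_seg.txt (about 309 KB); it contains the segmented words separated by spaces.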
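
A quick way to sanity-check the output is to count its most frequent words. The sketch below assumes the file is UTF-8 text with words separated by whitespace, as the script writes it; `top_words` is an illustrative name, not part of the original post.

```python
import io
from collections import Counter


def top_words(path, n=10):
    """Return the n most common whitespace-separated words in a UTF-8 file."""
    counts = Counter()
    with io.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())
    return counts.most_common(n)
```

Calling `top_words('sogou_seg.txt')` returns a list of `(word, count)` pairs. Stop words appearing near the top of that list would suggest the stop-word filter did not take effect.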
