Batch Word-Frequency Analysis of Text Files with Python 3.7

I found the source code on GitHub and modified it myself; I am recording it here.

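The script segments the documents shown in the figure into words and counts word frequencies, then saves the resulting spreadsheets (CSV files that open in Excel) and the segmented text files into a result folder.
Texts to be segmented:

Final generated documents: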

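Batch file processing function:
It mainly uses the os module to name the newly generated files and to process every input file in one run.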

import fnmatch
import os

header_filed = []  # one label per processed file, e.g. '2014年'


def word_frequency_analysis(path):
    files = os.listdir(path)  # list of all entry names found in path
    result_dir = os.path.abspath(os.path.join(path, 'result'))  # path of the result folder
    csv_all = os.path.abspath(os.path.join(result_dir, 'csv_all.csv'))
    if not os.path.exists(result_dir):
        os.mkdir(result_dir)  # create the result folder if it does not exist
    for filename in files:
        if not fnmatch.fnmatch(filename, '*.txt'):
            continue
        txt_path = os.path.join(path, filename)
        with open(txt_path, 'r') as f:  # read with the platform default encoding
            txt_content = f.read()
        field_name = filename[:-4] + '年'  # e.g. '2014年', '2015年'
        header_filed.append(field_name)
        filename_fulltext = filename[:-4] + '_all.txt'
        filename_counter = filename[:-4] + '_tj.csv'
        # filename_key = filename[:-4] + '_hy_tj.csv'
        txt_to_all = os.path.join(result_dir, filename_fulltext)
        txt_to_counter = os.path.join(result_dir, filename_counter)
        # txt_to_key = os.path.join(result_dir, filename_key)
        text_cutted = jiebaCutText(txt_content)
        text_cleared = clearText(text_cutted)
        text_counted = countwords(text_cleared, txt_to_counter)
        with open(txt_to_all, 'w') as newfile:  # save the cleaned, segmented text
            newfile.write(text_cleared)
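The file-naming part of the loop can be tried on its own. The following is a minimal sketch rather than the author's code: `plan_outputs` is a hypothetical helper, the input files are created in a temporary directory, and `os.path.splitext` is swapped in for `filename[:-4]` so the extension length is not hard-coded.

```python
import fnmatch
import os
import tempfile


def plan_outputs(path):
    """Map each .txt file in `path` to its two output paths (hypothetical helper)."""
    result_dir = os.path.join(path, 'result')
    os.makedirs(result_dir, exist_ok=True)  # no error if the folder already exists
    plan = {}
    for filename in sorted(os.listdir(path)):
        if not fnmatch.fnmatch(filename, '*.txt'):
            continue  # skips the result folder and any non-txt file
        stem, _ext = os.path.splitext(filename)
        plan[filename] = (
            os.path.join(result_dir, stem + '_all.txt'),  # segmented text
            os.path.join(result_dir, stem + '_tj.csv'),   # word counts
        )
    return plan


with tempfile.TemporaryDirectory() as demo_dir:
    # Create three demo inputs; only the two .txt files should be picked up.
    for name in ('2014.txt', '2015.txt', 'notes.md'):
        with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
            f.write('demo')
    demo_plan = plan_outputs(demo_dir)
    planned_names = {
        source: tuple(os.path.basename(p) for p in targets)
        for source, targets in demo_plan.items()
    }
    print(planned_names)
```

Building the plan first and processing afterwards keeps the naming rules in one place, which makes them easy to check before any file is written.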

Word segmentation function:
It mainly uses the third-party library jieba to segment the text.

import jieba


def jiebaCutText(text):
    seg_list = jieba.cut(text, cut_all=False)  # accurate mode; returns a generator
    liststr = '/'.join(seg_list)
    return liststr  # the result still contains punctuation marks
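Because `jieba.cut` returns a generator, its tokens can be consumed only once, which is why the function joins them into a string right away. The sketch below illustrates that behaviour without requiring jieba: `fake_cut` is a hypothetical stand-in and the tokens are hand-written, not real jieba output.

```python
def fake_cut(tokens):
    """Stand-in for jieba.cut: yields the given tokens lazily (hypothetical)."""
    yield from tokens


# Hand-picked tokens for illustration only, NOT actual jieba output.
sample_tokens = ['韶关', '市', '，', '经济', '发展']

seg = fake_cut(sample_tokens)
joined = '/'.join(seg)   # consumes the generator
leftover = list(seg)     # the generator is now exhausted

print(joined)
print(leftover)
print(joined.split('/') == sample_tokens)
```

If the tokens are needed more than once, jieba also offers `jieba.lcut`, which returns a list instead of a generator.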

Removing punctuation and single-character words:
Words that meet the requirements are collected into a list.

def clearText(text):
    mywordlist = []
    for myword in text.split('/'):
        # keep tokens longer than one character that contain Chinese characters
        if len(myword.strip()) > 1 and contain_zh(myword.strip()):
            mywordlist.append(myword.strip())
    return '/'.join(mywordlist)
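To see which tokens survive this filter, here is a small self-contained sketch. It is a compact variant, not the author's function: `clear_tokens` and `ZH` are illustrative names, and the character range `\u4e00-\u9fa5` (the basic CJK Unified Ideographs block) is an assumption about what should count as a Chinese character.

```python
import re

# Assumed range: the basic CJK Unified Ideographs block.
ZH = re.compile('[\u4e00-\u9fa5]')


def clear_tokens(joined):
    """Keep tokens longer than one character that contain a Chinese character."""
    tokens = (t.strip() for t in joined.split('/'))
    return '/'.join(t for t in tokens if len(t) > 1 and ZH.search(t))


cleaned = clear_tokens('韶关/市/，/经济/ 发展 /2014/GDP/增长')
print(cleaned)
```

The single-character token, the punctuation mark, the number and the Latin abbreviation are all dropped, and surrounding whitespace is stripped from the tokens that remain.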

Checking whether a string contains Chinese characters:
It uses the re module to test whether the characters are Chinese.

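import re


def contain_zh(word):
    zh = re.compile(u'[\u4e00-\u9fa5]+')  # basic CJK Unified Ideographs range
    match = zh.search(word)
    return match  # a match object (truthy) if found, otherwise None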
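Note that `search` only asks whether a token contains at least one Chinese character, so mixed tokens also pass. The sketch below contrasts this with `fullmatch`, which requires every character to be Chinese. The names `contains_zh` and `all_zh` are illustrative, and the range `\u4e00-\u9fa5` is assumed to be the intended one.

```python
import re

# Assumed range: the basic CJK Unified Ideographs block.
ZH_RUN = re.compile('[\u4e00-\u9fa5]+')


def contains_zh(word):
    """True if at least one Chinese character appears anywhere in word."""
    return ZH_RUN.search(word) is not None


def all_zh(word):
    """True only if every character in word is a Chinese character."""
    return ZH_RUN.fullmatch(word) is not None


checks = {w: (contains_zh(w), all_zh(w)) for w in ['经济', 'A股', '2014', '，']}
for word, result in checks.items():
    print(word, result)
```

Which of the two is appropriate depends on whether mixed tokens such as 'A股' should be counted.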

Word-frequency counting function:
It uses a dictionary and the collections module.

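import csv
from collections import OrderedDict


def countwords(text, counter_file):
    count_dict = dict()
    for item in text.split('/'):
        if item in count_dict:
            count_dict[item] += 1
        else:
            count_dict[item] = 1
    # sort by count in ascending order (least frequent words first)
    d_sorted_by_value = OrderedDict(sorted(count_dict.items(), key=lambda x: x[1]))
    # utf-8-sig writes a BOM so Excel detects the encoding;
    # newline='' prevents blank lines in the generated file
    with open(counter_file, 'w', encoding='utf-8-sig', newline='') as f:
        # f.write(codecs.BOM_UTF8)
        w = csv.writer(f)
        w.writerows(d_sorted_by_value.items())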
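The same counting step can be written more compactly with `collections.Counter`. This is an alternative sketch, not the author's implementation: `count_tokens` is a hypothetical helper, the rows are written to an in-memory buffer instead of a file, and `most_common()` sorts in descending order, whereas the function above sorts ascending.

```python
import csv
import io
from collections import Counter


def count_tokens(joined):
    """Count '/'-separated tokens, ignoring empty strings (hypothetical helper)."""
    return Counter(t for t in joined.split('/') if t)


counts = count_tokens('经济/发展/经济/增长/经济/发展')
rows = counts.most_common()  # (word, count) pairs, most frequent first

# Write to an in-memory buffer; a real run would open a file with
# encoding='utf-8-sig' and newline='' as in the function above.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Putting the most frequent words first is usually more convenient when the CSV is opened in a spreadsheet, since the interesting rows appear at the top.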

Complete code:

import csv
import fnmatch
import os
import re
from collections import OrderedDict
import jieba

header_filed = []  # one label per processed file, e.g. '2014年'


def word_frequency_analysis(path):
    files = os.listdir(path)  # list of all entry names found in path
    result_dir = os.path.abspath(os.path.join(path, 'result'))  # path of the result folder
    csv_all = os.path.abspath(os.path.join(result_dir, 'csv_all.csv'))
    if not os.path.exists(result_dir):
        os.mkdir(result_dir)  # create the result folder if it does not exist
    for filename in files:
        if not fnmatch.fnmatch(filename, '*.txt'):
            continue
        txt_path = os.path.join(path, filename)
        with open(txt_path, 'r') as f:  # read with the platform default encoding
            txt_content = f.read()
        field_name = filename[:-4] + '年'  # e.g. '2014年', '2015年'
        header_filed.append(field_name)
        filename_fulltext = filename[:-4] + '_all.txt'
        filename_counter = filename[:-4] + '_tj.csv'
        # filename_key = filename[:-4] + '_hy_tj.csv'
        txt_to_all = os.path.join(result_dir, filename_fulltext)
        txt_to_counter = os.path.join(result_dir, filename_counter)
        # txt_to_key = os.path.join(result_dir, filename_key)
        text_cutted = jiebaCutText(txt_content)
        text_cleared = clearText(text_cutted)
        text_counted = countwords(text_cleared, txt_to_counter)
        with open(txt_to_all, 'w') as newfile:  # save the cleaned, segmented text
            newfile.write(text_cleared)


def jiebaCutText(text):
    seg_list = jieba.cut(text, cut_all=False)  # accurate mode; returns a generator
    liststr = '/'.join(seg_list)
    return liststr  # the result still contains punctuation marks


def clearText(text):
    mywordlist = []
    for myword in text.split('/'):
        # keep tokens longer than one character that contain Chinese characters
        if len(myword.strip()) > 1 and contain_zh(myword.strip()):
            mywordlist.append(myword.strip())
    return '/'.join(mywordlist)


def contain_zh(word):
    zh = re.compile(u'[\u4e00-\u9fa5]+')  # basic CJK Unified Ideographs range
    match = zh.search(word)
    return match  # a match object (truthy) if found, otherwise None


def countwords(text, counter_file):
    count_dict = dict()
    for item in text.split('/'):
        if item in count_dict:
            count_dict[item] += 1
        else:
            count_dict[item] = 1
    # sort by count in ascending order (least frequent words first)
    d_sorted_by_value = OrderedDict(sorted(count_dict.items(), key=lambda x: x[1]))
    # newline='' prevents blank lines in the generated file
    with open(counter_file, 'w', encoding='utf-8-sig', newline='') as f:
        # f.write(codecs.BOM_UTF8)
        w = csv.writer(f)
        w.writerows(d_sorted_by_value.items())


if __name__ == '__main__':
    path = 'E:/Programe/PySeg/jieba-wordcloud-demo-master/基础数据/韶关（分年度）'
    word_frequency_analysis(path)
