Text Word-Frequency Counting in Python (from Song Tian's course)

Example 10: Text Word-Frequency Counting

Source texts

English text: Hamlet

https://python123.io/resources/pye/hamlet.txt

Chinese text: Romance of the Three Kingdoms (《三国演义》)

https://python123.io/resources/pye/threekingdoms.txt

Code (Hamlet):

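#CalHamlet1.py
def getText():
    # Read the whole file and convert it to lowercase
    txt=open("hamlet.txt","r").read()
    txt=txt.lower()
    # Replace punctuation with spaces so that adjacent words stay separated
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}.~’‘':
        txt=txt.replace(ch," ")
    return txt
hamletTxt=getText()
words=hamletTxt.split()
counts={}
for word in words:
    counts[word]=counts.get(word,0)+1
items=list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
# Print the ten most frequent words with their counts
for i in range(10):
    word,count=items[i]
    print("{0:<10}{1:>5}".format(word,count))
# Note: the text file must be in the same folder as the script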

Line-by-line analysis:

#CalHamlet1.py
def getText():
   txt=open("hamlet.txt","r").read()
Opens the file to be processed and reads its entire contents.
   txt=txt.lower()
Converts every uppercase letter in the text to lowercase.
   for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_{|}.~’‘':
       txt=txt.replace(ch," ")
   return txt
Replaces every special character in the text with a space and returns the result.
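As a side note, the same cleanup can be done in one pass with str.translate instead of a loop of replace calls. The sketch below is an alternative, not the course's code: the function name normalize is made up for illustration, and it uses string.punctuation (ASCII punctuation only, so curly quotes such as ’ and ‘ would need to be added separately).

```python
import string

def normalize(text):
    """Lowercase text and replace ASCII punctuation with spaces."""
    # Map each punctuation character to a single space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

sample = "To be, or not to be: that is the question."
cleaned = normalize(sample)
print(cleaned.split())
```

Replacing with a space rather than an empty string matters: with an empty string, text such as "be,or" would collapse into the single word "beor".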
hamletTxt=getText()
Calls getText and assigns the returned txt to hamletTxt.
words=hamletTxt.split()
Splits hamletTxt on whitespace into a list of words and assigns the list to words.
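A small illustration of why split() is called without arguments here: it splits on any run of whitespace (spaces, tabs, newlines) and never yields empty strings, whereas split(" ") does. The sample strings are made up for the demonstration.

```python
text = "to  be\nor not"          # two spaces and a newline between words
words = text.split()             # splits on any run of whitespace
naive = "to  be".split(" ")      # splits on every single space

print(words)   # ['to', 'be', 'or', 'not']
print(naive)   # ['to', '', 'be']
```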
counts={}
Creates an empty dictionary.
for word in words:
   counts[word]=counts.get(word,0)+1
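counts.get(word,0) returns the current count if word is already a key and the default 0 if it is not; adding 1 and storing the result back accumulates the number of occurrences.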
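The counting idiom can be tried on a short made-up sentence. The comparison with collections.Counter is an addition for illustration; the course code uses a plain dictionary.

```python
from collections import Counter

words = "the king and the queen and the prince".split()

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no KeyError occurs
    counts[word] = counts.get(word, 0) + 1

print(counts)
# Counter from the standard library produces the same tallies
print(dict(Counter(words)) == counts)
```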
items=list(counts.items())
Converts the dictionary into a list; each key-value pair becomes a tuple in the list.
items.sort(key=lambda x:x[1],reverse=True)
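key takes a function that is called on each list element to produce the value used for comparison.
Here it is a lambda (an anonymous function). Its parameter x receives one list element at a time, which
here is a (word, count) tuple; x is only a parameter name, and any other name would work. x[1] is the
second element of the tuple, so the list is sorted by count. reverse=True sorts in descending order
(largest first); reverse=False, the default, sorts in ascending order.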
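The effect of the key function can be seen on a small hand-made list. The second sort, which breaks ties alphabetically, is an optional variation and not part of the course code.

```python
items = [("queen", 1), ("the", 3), ("king", 1), ("and", 2)]

# Sort by count, largest first; ties keep their original relative order
by_count = sorted(items, key=lambda x: x[1], reverse=True)
print(by_count)

# Variation: descending count, then alphabetical order for equal counts
by_count_then_word = sorted(items, key=lambda x: (-x[1], x[0]))
print(by_count_then_word)
```

Python's sort is stable, which is why "queen" stays ahead of "king" in the first result.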
for i in range(10):
   word,count=items[i]
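   print("{0:<10}{1:>5}".format(word,count))
Loops over the first ten entries and prints each word left-aligned in a 10-character column, followed by its count right-aligned in a 5-character column.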
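A minimal sketch of the output step, using made-up counts. It iterates over the slice items[:10] instead of range(10); this is a small variation that avoids an IndexError when the text contains fewer than ten distinct words.

```python
items = [("the", 3), ("and", 2), ("king", 1)]   # illustrative data only

lines = []
for word, count in items[:10]:      # the slice is safe even for short lists
    # {0:<10} left-aligns the word in 10 columns,
    # {1:>5} right-aligns the count in 5 columns
    lines.append("{0:<10}{1:>5}".format(word, count))

print("\n".join(lines))
```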

Code (Romance of the Three Kingdoms):

# CalThreeKingdomsV1.py
import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
# Skip single-character tokens and merge different names for the same character
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# Drop frequent words that are not character names
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
# The output of this code is not the final result; the code can be refined further
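One possible refinement is to move the name merging out of the if/elif chain and into a lookup table, so new aliases can be added without touching the loop. The sketch below is an assumption-laden illustration, not the course's code: ALIASES, EXCLUDES, and count_names are hypothetical names, and the token list is hand-written to stand in for the output of jieba.lcut so that the example runs without jieba or the text file.

```python
# Hypothetical alias table: each alternative name maps to one canonical name
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens, skipping single characters and excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1 or token in excludes:
            continue
        name = aliases.get(token, token)   # fall back to the token itself
        counts[name] = counts.get(name, 0) + 1
    return counts

# Hand-written tokens standing in for jieba.lcut(txt)
tokens = ["却说", "玄德", "与", "孔明", "曰", "诸葛亮", "丞相", "曹操", "将军"]
counts = count_names(tokens)
print(counts)
```

Filtering excluded words inside the loop gives the same result as deleting them afterwards, as long as no alias maps onto an excluded word.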

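Going further:

-Character appearance counts for other classic novels
-Government work reports, research papers, news reports
-The counts can then be used to generate a word cloud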