Sentiment Analysis of the Harry Potter Series with Python

Preparing the Data

Each novel is stored in a single txt file, and we want to analyse it chapter by chapter (the first element of the resulting list holds the text of Chapter 1, the second element holds Chapter 2, and so on). Organizing the data this way calls for regular expressions.

For example, let us first look at how chapters appear in 01-Harry Potter and the Sorcerer's Stone.txt by opening the file.

Searching through the text, we find that every chapter heading follows a regular pattern:

[Chapter][space][integer][newline \n][English title, possibly containing spaces][newline \n]

Let us get familiar with regular expressions first and design a pattern based on this structure to extract the chapter headings.

import re
import nltk

raw_text = open("data/01-Harry Potter and the Sorcerer's Stone.txt").read()

pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
re.findall(pattern, raw_text)

['Chapter 1\nThe Boy Who Lived\n',
 'Chapter 2\nThe Vanishing Glass\n',
 'Chapter 3\nThe Letters From No One\n',
 'Chapter 4\nThe Keeper Of The Keys\n',
 'Chapter 5\nDiagon Alley\n',
 'Chapter 7\nThe Sorting Hat\n',
 'Chapter 8\nThe Potions Master\n',
 'Chapter 9\nThe Midnight Duel\n',
 'Chapter 10\nHalloween\n',
 'Chapter 11\nQuidditch\n',
 'Chapter 12\nThe Mirror Of Erised\n',
 'Chapter 13\nNicholas Flamel\n',
 'Chapter 14\nNorbert the Norwegian Ridgeback\n',
 'Chapter 15\nThe Forbidden Forest\n',
 'Chapter 16\nThrough the Trapdoor\n',
 'Chapter 17\nThe Man With Two Faces\n']

Notice that Chapter 6 is missing from the matches above: its title ("The Journey from Platform Nine and Three-Quarters") contains a hyphen, which the [a-zA-Z ]+ part of the pattern does not allow. Having familiarized ourselves with the regular expression, we now want to be a bit more precise. I prepared a test text whose chapter headings look like those in the actual novels, only much shorter and easier to follow. The test data contains only 5 chapters, so after splitting, the list should have length 5; its first element holds the text of Chapter 1, the second element holds Chapter 2, and so on.

import re

test = """Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,

Chapter 2\nThe Vanishing Glass\nFor a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.

Chapter 3\nThe Letters From No One\nThe traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.

Chapter 4\nThe Keeper Of The Keys\nHe didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.

Chapter 5\nDiagon Alley\nIt was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. """

# Get the list of chapter contents (the first element is the text of Chapter 1, the second is Chapter 2, and so on)
# To keep the list length equal to the expected number of chapters, a condition filters out any empty strings
chapter_contents = [c for c in re.split(r'Chapter \d+\n[a-zA-Z ]+\n', test) if c]
chapter_contents

['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,\n ',
 'For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.\n ',
 'The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.\n ',
 'He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin. \n ',
 'It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. ']

We now have a list of chapter contents, and its length is 5 as expected.

This means we can move on to the real text analysis.

Data Analysis: Comparing Chapter Counts

import os
import re
import matplotlib.pyplot as plt

colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: novel names (drop the common prefix and the ".txt" suffix)
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]

# y-axis: number of chapters
chapter_nums = []
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapter_contents = [c for c in re.split(pattern, raw_text) if c]
    chapter_nums.append(len(chapter_contents))

# Set the figure size
plt.figure(figsize=(20, 10))
# Title of the plot, font size, bold
plt.title('Chapter Number of Harry Potter', fontsize=25, weight='bold')
# Draw a colored bar chart
plt.bar(harry_potter_names, chapter_nums, color=colors)
# Font size and rotation of the x-axis tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Chapter Number', rotation=25, fontsize=20, weight='bold')
plt.show()

From the chart we can see that the last four books in the series have more chapters (this analysis is not particularly useful in itself; it is mainly practice). Keep in mind that the pattern only matches titles made up of letters and spaces, so chapters whose titles contain other characters are not split out and the counts may be slightly underestimated.

Richness of Vocabulary

The measure used here is the ratio of total words to distinct words. If a 100-word passage never repeats a word, the ratio is 100/100 = 1.

If a passage of the same length uses only 20 distinct words, the ratio is 100/20 = 5 (so a lower value means a more varied vocabulary).
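As a quick illustration of this ratio, here is a minimal sketch on a toy sentence (it assumes NLTK's tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

from nltk import word_tokenize

toy = "the cat sat on the mat and the dog sat by the cat"
tokens = [w.lower() for w in word_tokenize(toy)]
# 13 tokens in total but only 8 distinct words, so the ratio is 13/8 = 1.625
print(len(tokens) / len(set(tokens)))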

import os
import re
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

plt.style.use('fivethirtyeight')

colors = ['#78C850', '#A8A878', '#F08030', '#C03028', '#6890F0', '#A890F0', '#A040A0']

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: novel names
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]

# y-axis: richness of vocabulary (total tokens / distinct stems)
richness_of_words = []
stemmer = SnowballStemmer("english")
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    words = word_tokenize(raw_text)
    words = [stemmer.stem(w.lower()) for w in words]
    wordset = set(words)
    richness = len(words) / len(wordset)
    richness_of_words.append(richness)

# Set the figure size
plt.figure(figsize=(20, 10))
# Title of the plot, font size, bold
plt.title('The Richness of Word in Harry Potter', fontsize=25, weight='bold')
# Draw a colored bar chart
plt.bar(harry_potter_names, richness_of_words, color=colors)
# Font size and rotation of the x-axis tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Richness of Words', rotation=25, fontsize=20, weight='bold')
plt.show()

Sentiment Analysis

To track how sentiment develops across the Harry Potter series we use VADER, available as the ready-made vaderSentiment library. Its polarity_scores function returns:

neg: negative score

neu: neutral score

pos: positive score

compound: the overall sentiment score, normalized to lie between -1 (most negative) and +1 (most positive)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
test = 'i am so sorry'
analyzer.polarity_scores(test)

{'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}

import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: chapter index across the whole series
chapter_indexes = []
# y-axis: sentiment score of each chapter
compounds = []

analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # Compute the sentiment score of each chapter (average compound over its sentences)
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# Set the figure size
plt.figure(figsize=(20, 10))
# Title of the plot, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
# Draw the line chart
plt.plot(chapter_indexes, compounds, color='#A040A0')
# Font size and rotation of the x-axis tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
plt.show()

The curve is rather jagged, so to smooth out the fluctuations we define a moving-average helper function.

import numpy as np
import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Smoothing function: simple moving average via convolution
def movingaverage(value_series, window_size):
    window = np.ones(int(window_size)) / float(window_size)
    return np.convolve(value_series, window, 'same')

harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]

# x-axis: chapter index across the whole series
chapter_indexes = []
# y-axis: sentiment score of each chapter
compounds = []

analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/" + harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # Compute the sentiment score of each chapter (average compound over its sentences)
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        compounds.append(compound / len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index += 1

# Set the figure size
plt.figure(figsize=(20, 10))
# Title of the plot, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
# Draw the line charts: raw chapter scores in red, the moving average as a dotted black line
plt.plot(chapter_indexes, compounds, color='red')
# Plot the smoothed series against the same chapter indexes so the two curves line up
plt.plot(chapter_indexes, movingaverage(compounds, 10), color='black', linestyle=':')
# Font size and rotation of the x-axis tick labels
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# Axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
plt.show()
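One caveat about the 'same' mode of np.convolve used in movingaverage: near both ends of the series the window runs past the available data and the missing values are treated as zeros, so the dotted curve is pulled toward zero for the very first and last chapters. A minimal sketch of that edge effect:

import numpy as np

# Averaging a constant series of ones with a window of 3:
# the interior stays at 1.0, but the two edge points drop to 2/3
# because the window partly overlaps implicit zeros.
print(np.convolve(np.ones(6), np.ones(3) / 3, 'same'))
# [0.6667 1. 1. 1. 1. 0.6667]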
