Data Analysis of the 2021 Spring Festival Film Season

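Goal: analyze the seven films released on the same day, Lunar New Year's Day (February 12, 2021). The analysis looks at what drove their box office up or down, which factors correlate with box office, and what the reviews say about the three films with the highest total box office. It ends with conclusions and suggestions for film production.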

Analysis flowchart:

Scraping Douban reviews of Hi, Mom (你好，李焕英)

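Inspecting Douban's URL structure shows that the review pages of different films differ only in the film's subject ID, and that consecutive pages differ only in the start parameter: the first page uses start=0&limit=20, the next page uses start=20&limit=20, and so on. Once this structure is known, the reviews can be scraped page by page. As shown in the figure (using Hi, Mom and Detective Chinatown 3 (唐探3) as examples):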
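The paging rule can be captured in a small helper. This is a minimal sketch rather than code from the original post: the name build_comment_urls and its parameters are hypothetical, while the URL template and the subject ID 34841067 are the ones this post uses for Hi, Mom.

```python
# Hypothetical helper: build the review-page URLs for one Douban film.
COMMENT_URL = ('https://movie.douban.com/subject/{subject_id}/comments'
               '?start={start}&limit={limit}&status=P&sort=new_score')


def build_comment_urls(subject_id, pages, limit=20):
    """Return one URL per page; page i starts at offset i * limit."""
    return [
        COMMENT_URL.format(subject_id=subject_id, start=i * limit, limit=limit)
        for i in range(pages)
    ]


urls = build_comment_urls('34841067', pages=3)
for url in urls:
    print(url)
```

Scraping another film would then only require passing a different subject ID.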

Taking the Douban reviews of Hi, Mom as the example:

# Goal: scrape the Douban short reviews of Hi, Mom
# MovieName = '你好，李焕英'
import time

import requests
from lxml import etree

base_url = 'https://movie.douban.com/subject/34841067/comments?start={}&limit=20&status=P&sort=new_score'

# Send a browser User-Agent so the request is less likely to be blocked
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'}


# Scrape the review text
def Spider_Comment():
    my_comment_list = []
    for i in range(0, 10):
        url_all = base_url.format(i * 20)
        response = requests.get(url=url_all, headers=headers)
        # Pause between requests so the IP does not get banned
        time.sleep(10)
        # Decode the response body
        result_str = response.content.decode('utf-8')
        # Parse it into a tree that XPath can query
        html = etree.HTML(result_str)
        # Extract the review text with XPath
        review_contents = html.xpath('//span[@class="short"]/text()')
        # extend (rather than append) keeps one flat list of reviews
        my_comment_list.extend(review_contents)
    return my_comment_list


result_comment = Spider_Comment()
result_comment = str(result_comment)
# Save to a file; the with block closes the file automatically
with open(r'D:\Data_analysis\program_movie\review_li3.text', 'w', encoding='utf-8') as f:
    f.write(result_comment)

To scrape the time of each review as well, locate the corresponding tag in the page hierarchy and select it with its relative XPath.

# Scrape the review times
def Spider_Comment_time():
    comment_time_list = []
    for i in range(0, 10):
        url_all = base_url.format(i * 20)
        response = requests.get(url=url_all, headers=headers)
        time.sleep(10)
        result_str = response.content.decode('utf-8')
        html = etree.HTML(result_str)
        # Each page holds up to 20 reviews and XPath positions start at 1,
        # so the range must run to 21 to include the 20th review
        for j in range(1, 21):
            # Relative XPath of the review time; the value sits in the title attribute
            review_time = html.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[3]/@title'.format(j))
            comment_time_list.extend(review_time)
    return comment_time_list
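Selecting the time relative to each review item, instead of by position on the page, keeps every time paired with its review. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree as a stand-in for lxml. The markup and the class names comment-item and comment-time are simplified assumptions for illustration, not a copy of the real Douban page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Douban page is larger and may differ.
SAMPLE = """
<div id="comments">
  <div class="comment-item">
    <span class="comment-time" title="2021-02-12 10:15:00">2021-02-12</span>
    <span class="short">Review text A</span>
  </div>
  <div class="comment-item">
    <span class="short">Review text B</span>
  </div>
</div>
"""


def extract_reviews(markup):
    """Return (time, text) pairs; time is None when an item has no time tag."""
    root = ET.fromstring(markup.strip())
    pairs = []
    for item in root.findall("./div[@class='comment-item']"):
        # Paths starting with "." are resolved relative to the current item
        time_tag = item.find(".//span[@class='comment-time']")
        text_tag = item.find(".//span[@class='short']")
        review_time = time_tag.get('title') if time_tag is not None else None
        review_text = text_tag.text if text_tag is not None else None
        pairs.append((review_time, review_text))
    return pairs


pairs = extract_reviews(SAMPLE)
print(pairs)
```

A review without a time tag yields None instead of silently shifting the later times up by one position.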

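However, the scraped review times are saved in Python as a single str, so when the file is imported into Excel all the data ends up in one cell. The content therefore has to be split on commas first.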

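import pandas as pd

with open(r'D:\Data_analysis\program_movie\review_li5.text', 'r', encoding='utf-8') as f:
    file = f.read()
# split() returns a new list, so the result has to be kept;
# strip the brackets and quotes left over from str(list)
times = [t.strip(" []'\"") for t in file.split(',')]
print(times)
# Plain text written under an .xlsx name is not a valid workbook,
# so write a real one with pandas (requires the openpyxl package)
pd.Series(times).to_excel(r'D:\Data_analysis\program_movie\影评时间.xlsx', index=False, header=False)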
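An alternative to splitting on commas by hand is to turn the saved string back into a Python list and write one value per row. The sketch below assumes the saved text is a flat list literal, as produced by str(list). The sample values and the column name review_time are made up for illustration.

```python
import ast
import csv
import io

# Hypothetical sample of what str(list) looks like once written to disk
saved_text = "['2021-02-12 10:15:00', '2021-02-12 11:02:31', '2021-02-13 09:40:12']"


def text_to_rows(text):
    """Parse the list literal and return one single-column row per value."""
    values = ast.literal_eval(text)
    return [[value] for value in values]


rows = text_to_rows(saved_text)

# An in-memory buffer stands in for a file opened with
# open(path, 'w', newline='', encoding='utf-8-sig')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['review_time'])
writer.writerows(rows)
print(buffer.getvalue())
```

Excel opens a CSV written this way with each time in its own cell.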

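Next, we build a word cloud from the review text.
The reviews contain many words that carry no meaning, such as "了", "的" and "是", so the text first has to be cleaned with a stop word list.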

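import pandas as pd
import jieba

# 'D:\Data_analysis\program_movie\stopwords.txt' is a local stop word file
with open(r'D:\Data_analysis\program_movie\stopwords.txt', encoding='utf-8') as f:
    stopwords = set(i.strip() for i in f.readlines())
# i.strip(): removes leading and trailing whitespace, including the newline
# .readlines(): reads the whole file and returns a list with one string per line
# stopwords   (in Jupyter, this line displays the loaded stop words)

# Read the data (file names are those of the original post; point both at the same film)
comment = pd.read_csv(r'D:\Data_analysis\program_movie\review_xiaoshuojia3.text', header=None)
# comment

content = ""
for message in comment.values.ravel():
    if isinstance(message, str):  # skip empty (NaN) cells
        content += message  # collect all the text in content
# content

# Filter token by token, not character by character,
# so multi-character stop words are removed as well
words = [w for w in jieba.lcut(content) if w.strip() and w not in stopwords]
after_text = " ".join(words)
print(after_text)

# Save the result to a txt file; the with block closes it automatically
with open(r'D:\Data_analysis\program_movie\review_李焕英.txt', 'w+', encoding='utf-8') as f:
    f.write(after_text)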
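The stop word step can be tried in isolation on a handful of tokens. This is a minimal sketch with a hypothetical function name, remove_stopwords. The tokens are hard-coded to stand in for the output of a segmenter such as jieba.lcut, so the example runs without jieba.

```python
def remove_stopwords(tokens, stopwords):
    """Keep tokens that are non-empty and not in the stop word set."""
    stopword_set = set(stopwords)  # set lookups are fast even for long lists
    return [t for t in tokens if t.strip() and t not in stopword_set]


# Hypothetical tokens, as a word segmenter might return them
tokens = ['电影', '的', '故事', '是', '真的', '感人', '了']
stopwords = ['的', '了', '是']

filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Filtering whole tokens keeps a word such as 真的 intact, whereas filtering character by character would strip the 的 out of it.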

Result after processing:

The word cloud can now be generated.

from stylecloud import gen_stylecloud
import jieba
from wordcloud import STOPWORDS


def jieba_cloud(file_name, icon):
    # Read the cleaned reviews from file_name (the original ignored this parameter)
    with open(file_name, 'r', encoding='utf-8') as f:
        word_list = jieba.cut(f.read())
        result = " ".join(word_list)  # join the tokens with spaces
    # Choose the icon that shapes the word cloud
    icon_name = ''
    if icon == "first_pic":
        icon_name = ''
    elif icon == "second_pic":
        icon_name = "fas fa-taxi"
    elif icon == "third_pic":
        icon_name = "fas fa-heart"
    elif icon == "forth_pic":
        icon_name = "fas fa-bolt"
    elif icon == "fifth_pic":
        icon_name = "fas fa-thumbs-up"
    picture = str(icon) + '.png'
    if icon_name:
        gen_stylecloud(text=result,
                       size=1024,  # size of the stylecloud (width and height)
                       icon_name=icon_name,
                       font_path=r'C:\Windows\Fonts\msyhbd.ttc',
                       max_font_size=200,
                       max_words=5000,
                       stopwords=True,  # boolean: filter out common stop words
                       custom_stopwords=STOPWORDS,
                       output_name=picture)
    else:
        # No icon chosen: fall back to the library's default shape
        gen_stylecloud(text=result, font_path=r'C:\Windows\Fonts\msyhbd.ttc', output_name=picture)
    return picture


# Main
if __name__ == '__main__':
    review_file = r'D:\Data_analysis\program_movie\review_李焕英.txt'
    jieba_cloud(review_file, "first_pic")
    jieba_cloud(review_file, "second_pic")
    jieba_cloud(review_file, "third_pic")
    jieba_cloud(review_file, "forth_pic")
    jieba_cloud(review_file, "fifth_pic")
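A word cloud sizes each word by how often it occurs, so counting the tokens first is a quick way to check what the picture will emphasize. The sketch below uses only the standard library. The function name top_words and the sample text are hypothetical.

```python
from collections import Counter


def top_words(text, n=3):
    """Count space-separated tokens and return the n most frequent."""
    return Counter(text.split()).most_common(n)


# Hypothetical cleaned review text, tokens separated by spaces
sample_text = "感人 妈妈 感人 喜剧 妈妈 感人"
top = top_words(sample_text)
print(top)
```

If a meaningless word dominates the counts, it probably belongs in the stop word file.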

Word cloud for Hi, Mom (你好，李焕英):

Word cloud for Detective Chinatown 3 (唐探3):

Word cloud for A Writer's Odyssey (刺杀小说家):

Next, scrape Maoyan's real-time box office data:

import os
import time
import datetime

import requests


class Box_Office(object):
    def __init__(self):
        # Target URL
        self.url = 'https://piaofang.maoyan.com/dashboard-ajax?orderType=0&uuid=173d6dd20a2c8-0559692f1032d2-393e5b09-1fa400-173d6dd20a2c8&riskLevel=71&optimusCode=10'
        # Request headers
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36",
            "Referer": "https://piaofang.maoyan.com/dashboard/movie"
        }

    def Spider_Boxoffice(self):
        '''
        Main loop: fetch the data and print the final result
        :return:
        '''
        while True:
            # The screen is only cleared when this file is run from a DOS (cmd) prompt
            os.system('cls')
            result_json = self.get_parse()
            if not result_json:
                break
            results = self.parse(result_json)
            # Get the server time
            calendar = result_json['calendar']['serverTimestamp']
            # Parse date and time together so the +8h shift also rolls the date over midnight
            t = (datetime.datetime.strptime(calendar.split('.')[0], "%Y-%m-%dT%H:%M:%S")
                 + datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S")
            print("Beijing time:", t)
            x_line = '-' * 155
            # Total box office
            total_box = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['num']
            # Unit of the total box office
            total_box_unit = result_json['movieList']['data']['nationBoxInfo']['nationBoxSplitUnit']['unit']
            print(f"Total box office today: {total_box} {total_box_unit}", end=f'\n{x_line}\n')
            print('Film'.ljust(14), 'Box office'.ljust(11), 'Box share'.ljust(13), 'Seat rate'.ljust(11),
                  'Avg viewers'.ljust(11), 'Screenings'.ljust(12), 'Screen share'.ljust(12),
                  'Total'.ljust(11), 'Days', sep='\t', end=f'\n{x_line}\n')
            for result in results:
                print(
                    result['movieName'][:10].ljust(9),  # film title
                    result['boxSplitUnit'][:8].rjust(10),  # box office
                    result['boxRate'][:8].rjust(13),  # box office share
                    result['avgSeatView'][:8].rjust(13),  # average seat occupancy per screening
                    result['avgShowView'][:8].rjust(13),  # average viewers per screening
                    result['showCount'][:8].rjust(13),  # number of screenings
                    result['showCountRate'][:8].rjust(13),  # share of screenings
                    result['sumBoxDesc'][:8].rjust(13),  # cumulative box office
                    result['releaseInfo'][:8].rjust(13),  # release info (days since release)
                    sep='\t', end='\n\n'
                )
                # break  # with break commented out, every film is printed; otherwise only the top film
            time.sleep(4)

    def get_parse(self):
        '''
        Fetch the page; return the JSON on success, otherwise None
        :return:
        '''
        try:
            response = requests.get(self.url, headers=self.headers)
            if response.status_code == 200:
                # print("success!")
                return response.json()
        except requests.ConnectionError as e:
            print("ERROR:", e)
        return None

    def parse(self, result_json):
        '''
        Extract the data
        :return:
        '''
        if result_json:
            movies = result_json['movieList']['data']['list']
            # movies = [{},{},{}]
            # average seat occupancy, average viewers, box office share, film title,
            # release info (days since release), screenings, share of screenings,
            # box office, cumulative box office
            ticks = ['avgSeatView', 'avgShowView', 'boxRate', 'movieName',
                     'releaseInfo', 'showCount', 'showCountRate', 'boxSplitUnit', 'sumBoxDesc']
            for movie in movies:
                self.box_office = {}
                for tick in ticks:
                    # The number and its unit are stored separately and need to be joined
                    if tick == 'boxSplitUnit':
                        movie[tick] = ''.join([str(i) for i in movie[tick].values()])
                    # Nested dictionary
                    if tick == 'movieName' or tick == 'releaseInfo':
                        movie[tick] = movie['movieInfo'][tick]
                    if movie[tick] == '':
                        movie[tick] = 'no data'
                    self.box_office[tick] = str(movie[tick])
                yield self.box_office


if __name__ == '__main__':
    pf = Box_Office()
    pf.Spider_Boxoffice()
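The script shifts the server timestamp by eight hours to display Beijing time. The sketch below does the same conversion with timezone-aware datetime objects, which also moves the date forward when the shift crosses midnight. It assumes the timestamp is UTC in the form 2021-02-12T17:30:00.000Z, which is inferred from how the script splits the string. The function name to_beijing is hypothetical.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))


def to_beijing(server_timestamp):
    """Convert a UTC timestamp like '2021-02-12T17:30:00.000Z' to Beijing time."""
    # Drop the fractional seconds and the trailing Z before parsing
    utc_naive = datetime.strptime(server_timestamp.split('.')[0], '%Y-%m-%dT%H:%M:%S')
    utc_time = utc_naive.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(BEIJING).strftime('%Y-%m-%d %H:%M:%S')


converted = to_beijing('2021-02-12T17:30:00.000Z')
print(converted)
```

For this input the result falls on February 13, since 17:30 UTC is 01:30 the next day in Beijing.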

With all the required data collected, the next step is to visualize it in Tableau, Power BI, or Excel.
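Tableau, Power BI, and Excel can all import CSV files, so one way to hand the scraped rows over is to write them out with the csv module. This is a sketch under assumptions: the field names are taken from the box office scraper, the two rows are made-up placeholders rather than real figures, and rows_to_csv is a hypothetical helper.

```python
import csv
import io

FIELDS = ['movieName', 'boxSplitUnit', 'boxRate', 'showCountRate', 'sumBoxDesc']

# Made-up placeholder rows, not real box office figures
rows = [
    {'movieName': 'Film A', 'boxSplitUnit': '1.2亿', 'boxRate': '45.0%',
     'showCountRate': '38.0%', 'sumBoxDesc': '10.5亿'},
    {'movieName': 'Film B', 'boxSplitUnit': '0.8亿', 'boxRate': '30.0%',
     'showCountRate': '35.0%', 'sumBoxDesc': '8.1亿'},
]


def rows_to_csv(rows, fields=FIELDS):
    """Serialize the rows to CSV text, ignoring keys not listed in fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

When saving to disk for Excel, opening the file with encoding='utf-8-sig' and newline='' usually lets Chinese film titles display correctly.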

Link to the full data analysis report:
https://blog.csdn.net/WastonWu/article/details/114787887
