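Scraping Maoyan Movie Reviews

[Goals]
(1) Scrape the movie titles and ratings from the Maoyan chart and visualize them with a simple plot.
(2) Scrape the reviews of 《你好，李焕英》 (Hi, Mom) and display them as a word cloud.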

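Step 1: Understand the anti-scraping mechanisms:
1. Sending too many requests gets your IP address banned for 24 hours.
2. The User-Agent has to be changed frequently.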
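
One common way to keep the request rate down is to pause for a randomized interval between requests. The sketch below is an illustration only and is not part of the original spider: the function name `polite_delay`, its default bounds, and the injectable `sleep` parameter are all assumptions, and the site's real rate limit is not known.

```python
import random
import time


def polite_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the number of seconds chosen.

    The default bounds are illustrative guesses, not values taken from the site.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    for url in ["https://maoyan.com/board", "https://maoyan.com/films/1299372"]:
        # A real spider would request `url` here before pausing.
        waited = polite_delay(0.01, 0.02)
        print("waited {:.3f}s after {}".format(waited, url))
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.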

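Step 2: How to get around them:
1. Use proxy IPs (site: https://h.shenlongip.com/index/index.html; registering gives you 500 free IPs).
2. Bring in fake-useragent and pick a User-Agent at random for each request.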
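
The two measures above can be sketched as picking a proxy and a User-Agent at random for every request. In the spider itself, fake-useragent's `UserAgent().random` supplies the User-Agent; the version below uses small hand-written pools so that it runs without extra packages. `PROXY_POOL`, `USER_AGENTS`, and `build_request_kwargs` are hypothetical names, and the proxy addresses are reserved documentation addresses, not working proxies.

```python
import random

# Placeholder pools (assumptions for illustration). Real proxy addresses would
# come from your proxy provider.
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request_kwargs(rng=random):
    """Return headers and proxies in the shape that requests.get() accepts."""
    ip = rng.choice(PROXY_POOL)
    # Assumes a plain HTTP proxy, so the proxy URL uses http:// for both keys.
    proxy_url = "http://{}".format(ip)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": proxy_url, "https": proxy_url},
    }


if __name__ == "__main__":
    print(build_request_kwargs())
```

The result could then be passed along as `requests.get(url, **build_request_kwargs())`.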

Step 3: Determine the URLs
(1) Maoyan chart URL:

https://maoyan.com/board

(2) 《你好，李焕英》 page URL:

https://maoyan.com/films/1299372

Step 4:
Part 1: Get the movie titles and ratings from the chart

1. Analyze the page source

Use a regular expression:

<dd>.*?<a href=.*? title="(.*?)" class="image-link".*?<p class="score"><i class="integer">(.*?)</i><i class="fraction">(.*?)</i>
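
To see what the pattern above captures, it can be run against a small hand-written fragment. The sample HTML below is an assumption about the shape of the markup, not a saved copy of the Maoyan page, and `parse_board` is a hypothetical helper name.

```python
import re

# The pattern from the article; re.S lets '.' match the newlines between tags.
BOARD_REGEX = (
    r'<dd>.*?<a href=.*? title="(.*?)" class="image-link".*?'
    r'<p class="score"><i class="integer">(.*?)</i><i class="fraction">(.*?)</i>'
)

# Hand-written sample (assumed markup shape).
SAMPLE_HTML = """
<dd>
  <a href="/films/1" title="Film A" class="image-link"></a>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
<dd>
  <a href="/films/2" title="Film B" class="image-link"></a>
  <p class="score"><i class="integer">8.</i><i class="fraction">7</i></p>
</dd>
"""


def parse_board(html):
    """Return a list of (title, rating) pairs, with the rating as a float."""
    rows = re.findall(BOARD_REGEX, html, re.S)
    return [(title, float(integer + fraction)) for title, integer, fraction in rows]


print(parse_board(SAMPLE_HTML))  # [('Film A', 9.5), ('Film B', 8.7)]
```

The three capture groups are the title, the integer part of the score, and the fractional part; joining the last two and calling `float()` gives the numeric rating.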

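Implementation:

def parse_html_one(self, one_url):
    # Capture groups: title, integer part of the score, fractional part of the score
    one_regex = '<dd>.*?<a href=.*? title="(.*?)" class="image-link".*?<p class="score"><i class="integer">(.*?)</i><i class="fraction">(.*?)</i>'
    one_html = self.get_html(url=one_url)
    r_list = self.re_func(one_regex, one_html)
    list01 = []  # movie titles
    list02 = []  # ratings as floats
    j = 1
    for i in r_list:
        list01.append(i[0])
        # Join the two score parts and convert with float(); this avoids eval() on scraped text
        list02.append(float(i[1] + i[2]))
        print("排行第 {} 名：".format(j), i[0], '  ' + i[1] + i[2])
        j = j + 1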

Output: [image: printed ranking]

Show the top four movies and their ratings as a bar chart:
Implementation:

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.xlabel('电影名称')
plt.ylabel('评分')
plt.bar(list01[0:4], list02[0:4])
plt.title('热门电影排行')
plt.savefig('g:spider/猫眼电影/排行分析/谭——排行.png', dpi=300)
plt.show()

Bar chart: [image]

Part 2: Scrape the reviews of 《你好，李焕英》 and display them as a word cloud.
1. Analyze the page source:

As a regular expression:

<div class="comment-content">(.*?)</div>
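
A minimal sketch of applying this pattern, again against a hand-written fragment rather than the real page. Stripping whitespace and decoding HTML entities are extra clean-up steps that the original spider does not perform; `extract_comments` is a hypothetical name.

```python
import html
import re

COMMENT_REGEX = r'<div class="comment-content">(.*?)</div>'

# Hand-written sample (assumed markup shape).
SAMPLE_PAGE = """
<div class="comment-content">Funny and moving &amp; worth a rewatch.</div>
<div class="comment-content">
  Cried at the ending.
</div>
"""


def extract_comments(page):
    """Return review texts with HTML entities decoded and outer whitespace removed."""
    raw = re.findall(COMMENT_REGEX, page, re.S)
    return [html.unescape(text).strip() for text in raw]


print(extract_comments(SAMPLE_PAGE))
# ['Funny and moving & worth a rewatch.', 'Cried at the ending.']
```

The `re.S` flag matters here because a review can span several lines inside its `div`.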

Save the extracted reviews to a txt file.
Read the txt file back and display its contents as a word cloud.
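
The save and read-back steps, together with the word-frequency count that runs before the word cloud is drawn, can be sketched with the standard library alone. `save_comments` and `top_words` are hypothetical names. Whitespace splitting stands in for jieba here so the example runs without extra packages; for Chinese text, `jieba.lcut` could be passed as `tokenize` instead.

```python
import os
import tempfile
from collections import Counter


def save_comments(comments, path):
    """Write one review per line; the with block closes and flushes the file."""
    with open(path, "w", encoding="utf-8") as f:
        for comment in comments:
            f.write(comment + "\n")


def top_words(path, n=5, tokenize=str.split):
    """Read the file back and return the n most frequent tokens longer than one character."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    counts = Counter(word for word in tokenize(text) if len(word) > 1)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["moving story", "funny and moving", "a moving funny film"]
    path = os.path.join(tempfile.mkdtemp(), "comments.txt")
    save_comments(sample, path)
    print(top_words(path, n=2))  # [('moving', 3), ('funny', 2)]
```

Closing the file before reading it back matters: data still sitting in the write buffer would otherwise be missing from the text that gets analysed.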

def word_coud(self, s, name_list, text):
    # Purpose: generate the word cloud
    cut_text = jieba.cut(text)
    result = " ".join(cut_text)  # space-separated tokens, so WordCloud can split Chinese text into words
    font = r'C:\Windows\Fonts\simfang.ttf'
    wc = WordCloud(collocations=False, font_path=font, width=800, height=800, margin=2, scale=20, max_words=30,
                   background_color='white').generate(result.lower())  # use the segmented text, not the raw text
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
    wc.to_file('g:spider/猫眼电影/排行分析/谭——{}.png'.format(name_list))

Word cloud: [image]

Full code:

# Third-party packages to install: requests, fake-useragent, wordcloud, jieba, matplotlib
# (re, time and random ship with Python). Install each one from cmd with: pip install <package>
import re
import requests
import time
import random
from fake_useragent import UserAgent  # supplies random User-Agent strings
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud


class Catfilmspidr:
    def __init__(self):
        # Initialization
        self.paihang_url = 'https://maoyan.com/board'
        self.list_url = 'https://maoyan.com/films/1299372'

    def get_html(self, url):
        # Purpose: request a page
        ip = '222.93.74.8:63325'
        header = {'User-Agent': UserAgent().random}
        html = requests.get(url=url, proxies={'http': 'http://{}'.format(ip), 'https': 'https://{}'.format(ip)},
                            headers=header).content.decode('utf-8')
        return html

    def re_func(self, regex, html):
        # Purpose: parse a page; re.S lets '.' match newlines
        pattern = re.compile(regex, re.S)
        r_list = pattern.findall(html)
        return r_list

    def parse_html_one(self, one_url):
        one_regex = '<dd>.*?<a href=.*? title="(.*?)" class="image-link".*?<p class="score"><i class="integer">(.*?)</i><i class="fraction">(.*?)</i>'
        one_html = self.get_html(url=one_url)
        r_list = self.re_func(one_regex, one_html)
        list01 = []  # movie titles
        list02 = []  # ratings as floats
        j = 1
        for i in r_list:
            list01.append(i[0])
            list02.append(float(i[1] + i[2]))  # float() avoids eval() on scraped text
            print("排行第 {} 名：".format(j), i[0], '  ' + i[1] + i[2])
            j = j + 1
        # Purpose: draw the bar chart
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.xlabel('电影名称')
        plt.ylabel('评分')
        plt.bar(list01[0:4], list02[0:4])
        plt.title('热门电影排行')
        plt.savefig('g:spider/猫眼电影/排行分析/谭——排行.png', dpi=300)
        plt.show()

    def parse_html_two(self, two_url):
        # Purpose: extract the page content
        name_regex = '<h1 class="name">(.*?)</h1>'  # regex for the movie title
        comment_regex = '<div class="comment-content">(.*?)</div>'  # regex for the reviews
        two_html = self.get_html(url=two_url)
        name_list = self.re_func(name_regex, two_html)  # movie title (a list)
        comment_list = self.re_func(comment_regex, two_html)  # reviews (a list)
        file_name = 'g:spider/猫眼电影/排行分析/谭——{}.txt'.format(name_list[0])
        print(name_list)
        # Save the reviews; the with block closes the file before it is read back
        with open(file_name, 'w', encoding='utf-8') as f:
            for i in comment_list:
                f.write(i)
                f.write('\n')
        print('{}抓取成功'.format(name_list[0]))
        self.word_parse(file_name, name_list[0])  # word-frequency analysis

    def word_parse(self, file_name, name_list):
        # Purpose: word-frequency analysis; see 《Python语言程序设计基础》, p. 171
        with open(file_name, "r", encoding='utf-8') as f:
            text = f.read()
        counts = {}
        words = jieba.cut(text)
        for word in words:
            if len(word) == 1:
                continue
            else:
                counts[word] = counts.get(word, 0) + 1
        items = list(counts.items())
        items.sort(key=lambda x: x[1], reverse=True)
        s = '  '
        for word, count in items[:5]:  # top five words (fewer if the text is short)
            s = word + ',' + s
            print('{}:{}'.format(word, count))
        print(s)
        self.word_coud(s, name_list, text)  # generate the word cloud and save the image

    def word_coud(self, s, name_list, text):
        # Purpose: generate the word cloud
        cut_text = jieba.cut(text)
        result = " ".join(cut_text)  # space-separated tokens, so WordCloud can split Chinese text into words
        font = r'C:\Windows\Fonts\simfang.ttf'
        wc = WordCloud(collocations=False, font_path=font, width=800, height=800, margin=2, scale=20, max_words=30,
                       background_color='white').generate(result.lower())  # use the segmented text
        plt.imshow(wc)
        plt.axis("off")
        plt.show()
        wc.to_file('g:spider/猫眼电影/排行分析/谭——{}.png'.format(name_list))

    def run_spider(self):
        self.parse_html_one(self.paihang_url)
        self.parse_html_two(self.list_url)


if __name__ == '__main__':
    start_time = time.time()
    spider = Catfilmspidr()
    spider.run_spider()
    end_time = time.time()
    a = end_time - start_time
    print('执行时间为：{0:.2f}'.format(a))
