Python crawler: scraping the first n pages of reviews for the top n movies in Douban's rating ranking

Target site

https://movie.douban.com/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=20&page_start=0
(Douban Movies > Explore (选电影) > Douban High Score (豆瓣高分) > sorted by rating)

Basic approach

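1. Send the request and get back a Response object. It is best to imitate the request headers of Google Chrome (the headers in the code below) and to pause between requests, so the site's anti-scraping checks are less likely to trigger. Put plainly, imitate how a person browses.
2. Parse the response with BeautifulSoup (Beautiful Soup?? No idea why it got that name). Pass it res.text, i.e. the body as text. Printing the parsed soup shows the whole HTML as a string, but its type is not str; it is a bs4.BeautifulSoup object. That is the charm of the library: from Python you can drill down level by level to the div you want, much as you would with CSS selectors.
3. Search the soup for what you need: walk down div by div to the element with the right attributes and take its content. (Read the page's HTML, or print the soup from the previous step, to find it.)
4. Print the result or save it to a file.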
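
Steps 2 and 3 can be tried without touching the network. The following is a minimal sketch, not the code used later in this post: SAMPLE_HTML is a made-up fragment that only imitates the class names (comment-info, short) used later, the comment-item wrapper div is an assumption, and parse_comments is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment; the class names imitate a comment list,
# but nothing here is copied from the live site.
SAMPLE_HTML = """
<div class="comment-item">
  <span class="comment-info"><a href="#">alice</a></span>
  <span class="short">Great film.</span>
</div>
<div class="comment-item">
  <span class="comment-info"><a href="#">bob</a></span>
  <span class="short">Too long.</span>
</div>
"""


def parse_comments(html):
    """Return {user_name: comment_text} extracted from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")  # a bs4.BeautifulSoup, not a str
    comments = {}
    # CSS-selector style lookup: one iteration per comment block
    for item in soup.select("div.comment-item"):
        user = item.select_one("span.comment-info a")
        text = item.select_one("span.short")
        if user is None or text is None:
            continue  # skip blocks that lack either part
        comments[user.get_text(strip=True)] = text.get_text(strip=True)
    return comments


if __name__ == "__main__":
    print(parse_comments(SAMPLE_HTML))
```

Selecting each comment block first and then looking inside it keeps a user name paired with its own comment, even if one block is missing a field.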

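After analyzing the page, I found that pulling the top three movies out of the HTML the traditional way is inconvenient. Since the data is JSON, it is better to read it from the Preview tab of the XHR request in the browser's developer tools, where the structure is clear at a glance.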

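To get at the content shown in the XHR preview, proceed as follows.
First, look up the request URL in the XHR's Headers tab:

res = requests.get(url, headers=headers, timeout=20)  # assume we now have the Response object
Then convert res into a JSON object:
js = res.json()  # only then can we look up the title and url we need by key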
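
To see the key lookups on their own, here is a small offline sketch. SAMPLE_BODY is a made-up body that only imitates the shape used in this post (a "subjects" list whose items have "title" and "url"), and top_movies is a hypothetical helper name.

```python
import json

# Made-up body with the same shape as the XHR response: a "subjects" list
# whose items carry "title" and "url". Titles and URLs are placeholders.
SAMPLE_BODY = """
{"subjects": [
  {"title": "Movie A", "url": "https://example.com/subject/1/"},
  {"title": "Movie B", "url": "https://example.com/subject/2/"},
  {"title": "Movie C", "url": "https://example.com/subject/3/"}
]}
"""


def top_movies(payload, num):
    """Map title -> url for the first num entries of payload["subjects"]."""
    subjects = payload.get("subjects", [])[:num]
    return {item["title"]: item["url"] for item in subjects}


# json.loads turns the text into a dict, as res.json() does for a response body.
payload = json.loads(SAMPLE_BODY)

if __name__ == "__main__":
    print(top_movies(payload, 2))
```

Slicing with [:num] means that asking for more movies than the response contains returns what is available instead of raising an IndexError.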

Full code:

import os
import time

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

# URL taken from the XHR request
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=20&page_start=0'
res = requests.get(url, headers=headers, timeout=20)
# print(res.status_code)
js = res.json()  # parse the JSON body into a dict; the Response object itself cannot be indexed by key


def topCinema(num):
    """Return {title: url} for the num highest-rated movies."""
    top_info = js['subjects'][:num]
    top_cinema = {}
    for info in top_info:
        top_cinema[info['title']] = info['url']
    return top_cinema

# print(topCinema(4))


def getComment(movieUrl, pageNum):
    """Scrape page pageNum (1-based) of a movie's reviews as {user: comment}."""
    start = (pageNum - 1) * 20
    # every query parameter is separated by '&'
    url = movieUrl + 'comments?' + 'start=' + str(start) + '&limit=20&status=P&sort=new_score'
    res = requests.get(url, headers=headers, timeout=20)
    soup = BeautifulSoup(res.text, 'html.parser')
    comment_list = soup.find_all('span', class_='short')
    user = soup.find_all('span', class_='comment-info')
    cinema_comment = {}
    # zip stops at the shorter list, so a mismatch cannot raise IndexError
    for u, c in zip(user, comment_list):
        cinema_comment[u.a.get_text()] = c.get_text()
    return cinema_comment

# print(getComment('https://movie.douban.com/subject/1292052/', 1))

# Scrape the first two pages of reviews for the top 3 movies
# (to scrape more pages, just change the arguments)
top3 = topCinema(3)
top3_comment = {}
for name in top3:
    top3_comment[name] = {}
    for i in range(1, 3):
        # update() merges the pages; plain assignment would keep only the last page
        top3_comment[name].update(getComment(top3[name], i))
        time.sleep(1)  # pause between requests
# print(top3_comment)

# Save locally
os.makedirs('./comment', exist_ok=True)  # make sure the output directory exists
with open('./comment/top3_comment.txt', 'w', encoding='utf-8') as f:
    f.write(str(top3_comment))
    print('Saved successfully')
with open('./comment/top3_comment.txt', 'r', encoding='utf-8') as r:
    print(r.read())
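
The page-to-offset arithmetic in getComment can also be checked without any network access. This is a minimal sketch under stated assumptions: comment_page_url is a hypothetical helper (not part of the code above), and the query parameters are the ones used in this post. It builds the query with urllib.parse.urlencode from the standard library, which joins the parameters with '&'.

```python
from urllib.parse import urlencode


def comment_page_url(movie_url, page_num, page_size=20):
    """Build the reviews URL for a 1-based page number.

    Hypothetical helper: page 1 starts at offset 0, page 2 at page_size, etc.
    """
    if page_num < 1:
        raise ValueError("page_num starts at 1")
    query = urlencode({
        "start": (page_num - 1) * page_size,
        "limit": page_size,
        "status": "P",
        "sort": "new_score",
    })
    # Normalise the trailing slash so both '.../1292052' and '.../1292052/' work.
    return movie_url.rstrip("/") + "/comments?" + query


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(comment_page_url("https://movie.douban.com/subject/1292052/", page))
```

Because the query string comes from a dict, adding or removing a parameter cannot leave a separator out.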

Results