Python Web Scraper: A Show-Tracking Script for 人人影视 (zimuzu.tv)

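Analyzing the network requests
- Search page
- Resource page
- 5.22 update
- Getting the Baidu Cloud, eD2k, and other links
Implementation
- Required libraries
- Search page
- 5.22 update: parsing the search-page links
- Getting the download-page redirect link
- Getting the download links
- 5.22 update: pretty-printed JSON output with readable Chinese
Results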

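I recently noticed that hunting down resources for the shows I follow is tedious, and switching back and forth between web pages eats a lot of time, so I decided to write a script to do it for me. This afternoon I spent some time on a first step: scraping resources from 人人影视.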

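Analyzing the network requests

Search page

Open the browser's developer tools, switch to the Network tab, and refresh the page, as shown in the screenshot: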

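The request highlighted in the screenshot is the search endpoint: 'http://www.zimuzu.tv/search/index?keyword=西部世界&search_type=resource'
It takes just two parameters: keyword (the search term) and search_type (the type of search).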
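
As a rough illustration of the two parameters, the search URL can be assembled with the standard library instead of string formatting. This is only a sketch: the helper name build_search_url is made up here, and the only value of search_type seen in the captured request is resource.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the captured request above.
SEARCH_ENDPOINT = "http://www.zimuzu.tv/search/index"


def build_search_url(keyword, search_type="resource"):
    """Return the search URL with both query parameters percent-encoded."""
    # urlencode encodes non-ASCII text such as Chinese titles as UTF-8.
    query = urlencode({"keyword": keyword, "search_type": search_type})
    return f"{SEARCH_ENDPOINT}?{query}"


url = build_search_url("西部世界")
# Decode the query string again to check that nothing was lost.
params = parse_qs(urlsplit(url).query)
print(url)
print(params)
```

Letting urlencode do the escaping avoids problems with titles that contain spaces or characters such as & in the keyword.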

Resource page

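After clicking through to the resource page for 西部世界 (Westworld), what do we want next? The download links, of course.

That's right: the blue resource download page. Capture the traffic once more.

The request named tv that the arrow points to in the screenshot is the endpoint:

Request URL: http://www.zimuzu.tv/resource/index_json/rid/33701/channel/tv
Comparing several pages shows that only the 33701 part changes, and it is the ID at the end of the Westworld resource URL (http://www.zimuzu.tv/resource/33701).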
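
The relationship between the two URLs can be sketched as follows. The function names resource_id and index_json_url are illustrative only, and the sketch assumes the resource URL always ends in a numeric ID, which is what the pages I compared suggest.

```python
import re

# URL shape taken from the captured request; only the rid part varies.
INDEX_JSON_TEMPLATE = "http://www.zimuzu.tv/resource/index_json/rid/{rid}/channel/tv"


def resource_id(resource_url):
    """Pull the trailing numeric ID out of a resource page URL."""
    match = re.search(r"/resource/(\d+)/?$", resource_url)
    if match is None:
        raise ValueError(f"no numeric resource ID in {resource_url!r}")
    return match.group(1)


def index_json_url(resource_url):
    """Build the tv-channel endpoint URL for a resource page URL."""
    return INDEX_JSON_TEMPLATE.format(rid=resource_id(resource_url))


api_url = index_json_url("http://www.zimuzu.tv/resource/33701")
print(api_url)
```

Raising an error on an unexpected URL makes a broken link obvious early, instead of requesting a malformed endpoint.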

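The response looks like JSON but is not well-formed, as shown in the screenshot:

Hovering over the link that jumps to the download page shows http://zmz003.com/v5ta03. Searching the response for v5ta03 finds it, as shown in the screenshot: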
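
A minimal sketch of pulling that link out of the response text is shown here. The sample body is invented: the real payload is only visible in the screenshot, so the variable name index_info, the key resource_content, and the exact escaping are assumptions about its shape (a JavaScript-style assignment wrapping backslash-escaped HTML).

```python
import re

# Hypothetical response body; the real one may differ in names and layout.
SAMPLE_RESPONSE = (
    r'var index_info={"resource_content":'
    r'"<h3><a href=\"http:\/\/zmz003.com\/v5ta03\" target=\"_blank\">download<\/a><\/h3>"}'
)


def extract_jump_link(body):
    """Return the first <h3> link in the response, or None if there is none."""
    # Keep everything after the first '=' (the assumed assignment prefix).
    _, _, payload = body.partition("=")
    # Remove the backslash escaping so the embedded HTML reads as plain text.
    unescaped = payload.replace("\\", "")
    match = re.search(r'<h3><a href="(.*?)" target', unescaped)
    return match.group(1) if match else None


link = extract_jump_link(SAMPLE_RESPONSE)
print(link)
```

Returning None instead of indexing into an empty findall result lets the caller decide what to do when a page has no jump link.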

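5.22 update

When I used the script today it raised an error: some resources could not be fetched. Another round of traffic analysis showed that resources come in two kinds, TV series and movies.
The endpoint for movies uses the movie channel, for example:

Request URL: http://www.zimuzu.tv/resource/index_json/rid/22376/channel/movie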
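
One way to express the two cases is a small lookup from the resource type shown on the search page to the channel name. The names CHANNEL_BY_TYPE and channel_url are made up for this sketch, and falling back to movie for any other type is an assumption, not something the site documents.

```python
# Resource type label (as shown on the search page) -> channel segment of the URL.
CHANNEL_BY_TYPE = {"电视剧": "tv", "电影": "movie"}


def channel_url(rid, resource_type):
    """Build the index_json endpoint URL for a resource ID and its type label."""
    # Unknown labels fall back to the movie channel (an assumption).
    channel = CHANNEL_BY_TYPE.get(resource_type, "movie")
    return "http://www.zimuzu.tv/resource/index_json/rid/%s/channel/%s" % (rid, channel)


tv_url = channel_url("33701", "电视剧")
movie_url = channel_url("22376", "电影")
print(tv_url)
print(movie_url)
```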

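Getting the Baidu Cloud, eD2k, and other links

The rest is easy: the download page is static, so a little scraping experience is all it takes. See the screenshot: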
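
To illustrate what scraping a static page for these links involves, here is a sketch that collects every link and groups it by kind. It deliberately uses the standard-library HTMLParser rather than lxml so that it runs without extra packages, and the sample markup and link values are placeholders, not the real page structure.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def classify(href):
    """Label a link by its scheme or host."""
    if href.startswith("ed2k://"):
        return "ed2k"
    if href.startswith("magnet:"):
        return "magnet"
    if "pan.baidu.com" in href:
        return "bdy"
    return "other"


def collect_links(page_html):
    """Return a dict mapping link kind to the list of links of that kind."""
    parser = LinkCollector()
    parser.feed(page_html)
    grouped = {}
    for href in parser.links:
        grouped.setdefault(classify(href), []).append(href)
    return grouped


# Placeholder markup; the real download page is laid out differently.
SAMPLE_PAGE = """
<ul class="down-links">
  <li><a href="https://pan.baidu.com/s/EXAMPLE">Baidu Cloud</a></li>
  <li><a href="ed2k://|file|example.mp4|1|0123456789ABCDEF0123456789ABCDEF|/">eD2k</a></li>
  <li><a href="magnet:?xt=urn:btih:EXAMPLE">magnet</a></li>
</ul>
"""

grouped = collect_links(SAMPLE_PAGE)
print(grouped)
```

Classifying by scheme is less precise than a targeted XPath, but it keeps working if the page layout shifts slightly.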

Implementation

Required libraries (requests and lxml are third-party; re and json ship with Python)

import requests
from lxml import html
import re
import json

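Search page

# Fetch the search results page
def get_html(keywd, url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    }  # add a cookie here if you want to send one
    full_url = url % keywd  # url is a template with a %s placeholder for the keyword
    # Send the dict as HTTP headers; passing it as params would put it in the query string
    text = requests.get(full_url, headers=headers).content.decode('utf8')
    return text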

5.22 update: parsing the search-page links

def get_movielink(text):
    tree = html.fromstring(text)
    ctree = tree.xpath('//div[@class="clearfix search-item"]')
    link = []
    for item in ctree:
        kind = item.xpath('em/text()')[0]  # resource type, e.g. 电影
        href = item.xpath('div[2]/div/a/@href')[0]  # link to the resource page
        print(kind, item.xpath('div[2]/div/a/strong/text()')[0], ':', href)
        link.append((href, kind))
    return link  # list of tuples: (resource link, resource type)

Getting the download-page redirect link

def get_downloadlink(link, type_link):
    # link: resource path from get_movielink; type_link: its resource type
    rid = link.split('/')[-1]
    if type_link == '电视剧':
        from_url = 'http://www.zimuzu.tv/resource/index_json/rid/%s/channel/tv' % rid
    else:
        from_url = 'http://www.zimuzu.tv/resource/index_json/rid/%s/channel/movie' % rid
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
        # a 'Cookie' entry can be added here
        'Referer': 'http://www.zimuzu.tv%s' % link,
    }
    data = requests.get(from_url, headers=headers).content.decode('utf8')
    # Drop everything before the first '='. The join also removes every later '=',
    # which is why the pattern below matches 'href' without an '='.
    data = ''.join(data.split('=')[1:])
    print(data)
    # pattern='<h3><a href="(.*?)" target'
    pattern = '<h3><a href(.*?) target'
    # print(re.findall(pattern,data)[0].replace('\\',''))
    url = re.findall(pattern, data)[0].replace('\\', '').replace('"', '').strip()
    return url  # link that jumps to the page with the Baidu Cloud and other download links

Getting the download links

5.22 update: pretty-printed JSON output with readable Chinese

def get_download(keywd, url):
    # eD2k links are under div[id="tab-g1-MP4"]/ul/li/ul/li[2]/a/@href; the magnet link is the third li
    # Baidu Cloud links are under div[id="tab-g1-APP"]/ul/li/div/ul/li[2]/a/@href
    if 'zmz' not in url:
        # Resource pages can also link out to a torrent site, e.g.
        # https://rarbg.is/torrents.php?searchBattlestar%20Galactica%3A%20Blood%20and%20Chrome%20
        print('Not a download page:', url)
        return  # nothing to parse on an external page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    }
    res = requests.get(url, headers=headers).content.decode('utf-8')
    tree = html.fromstring(res)
    tree1 = tree.xpath('//div[@class="tab-content info-content"]//div[@class="tab-content info-content"]')
    if tree1:
        downloadList = []
        for item in tree1:
            ed2k = item.xpath('div[2]//ul[@class="down-links"]/li[2]/a/@href')  # eD2k
            name = item.xpath('div[1]//div[@class="title"]/span[1]/text()')  # name
            bdy = item.xpath('div[1]//div[@class="title"]/ul/li[2]/a/@href')  # Baidu Cloud
            for i, j, k in zip(name, bdy, ed2k):
                downloadList.append(dict(name=i, bdy=j, ed2k=k))
        # 'a+' appends, so calling this several times for one keyword
        # leaves several JSON arrays in the same file
        with open(keywd + '.json', 'a+', encoding='utf-8') as f:
            json.dump(downloadList, f, ensure_ascii=False, indent=2)  # save as a JSON file
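
The effect of the two json options can be seen in a small comparison. The record below is a made-up placeholder, not real scraped data.

```python
import json

# Placeholder record in the same shape the script saves.
records = [{"name": "西部世界", "ed2k": "ed2k://EXAMPLE"}]

# Default settings: one line, non-ASCII characters written as \uXXXX escapes.
escaped = json.dumps(records)

# ensure_ascii=False keeps the Chinese text readable; indent=2 pretty-prints.
readable = json.dumps(records, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the options only change how the file looks to a human reader. The file must be opened with a UTF-8 encoding for the readable form to be written correctly.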

Results

This is roughly what the output looks like.

Output after the 5.22 update

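That's the general idea and the code. There is no exception handling (the shows I wanted to follow didn't trigger any errors, o(￣︶￣)o), so there may be bugs, and the code isn't optimized either. Feel free to improve it, for example with exception handling or with multithreaded or multiprocess concurrency.

Next time I find the time, I'll write a script that automatically adds the links to Baidu Cloud offline download, or downloads them with FDM.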
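
As one possible starting point for those improvements, here is a sketch that runs a fetch function over several URLs in a thread pool and records failures instead of crashing. The names fetch_all and fake_fetch are hypothetical, and the stand-in fetcher never touches the network; in the real script a function wrapping requests.get would be passed instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL concurrently.

    Returns (results, errors): two dicts keyed by URL, holding the
    return value or the exception raised for that URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when one page fails
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Stand-in for a real download; fails on purpose for one URL."""
    if "bad" in url:
        raise ValueError("simulated failure for " + url)
    return "page for " + url


results, errors = fetch_all(
    ["http://example.invalid/a", "http://example.invalid/bad"], fake_fetch
)
print(results)
print(errors)
```

Threads suit this job because the work is mostly waiting on the network; collecting errors per URL means one missing resource no longer stops the whole run.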