Fetching Movie Download Links with a Multithreaded Python Crawler

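Movie resource sites tend to be cluttered with ads. I did not want to sit through them, so I wrote this program.

The first step is to analyze each site's search URL. Only two sites are used here: 爱下电影网 (aixia.cc) and 电影天堂 (dydytt.net).

爱下电影: http://www.aixia.cc/plus/search.php?searchtype=titlekeyword&q=%E9%80%9F%E5%BA%A6%E4%B8%8E%E6%BF%80%E6%83%85

电影天堂: http://s.dydytt.net/plus/so.php?kwtype=0&searchtype=title&keyword=%CB%D9%B6%C8%D3%EB%BC%A4%C7%E9

On both sites the front part of the search URL is fixed; only the percent-encoded keyword at the end changes. Both URLs above carry the same keyword, 速度与激情: the first encodes it as UTF-8 bytes, the second as GB2312 bytes.

So the search URL can be built by appending the encoded keyword to the fixed prefix.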
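The splicing described above can be sketched with the standard library's urllib.parse.quote, which percent-encodes a string in a chosen encoding. The helper name build_search_url is hypothetical and introduced only for illustration; the two prefixes and the keyword come from the URLs above.

```python
from urllib.parse import quote

# Fixed prefixes taken from the two search URLs above
AIXIA_PREFIX = 'http://www.aixia.cc/plus/search.php?searchtype=titlekeyword&q='
TIANTANG_PREFIX = 'http://s.dydytt.net/plus/so.php?kwtype=0&searchtype=title&keyword='


def build_search_url(prefix, keyword, encoding):
    """Append the keyword, percent-encoded in the site's encoding (hypothetical helper)."""
    return prefix + quote(keyword, encoding=encoding)


print(build_search_url(AIXIA_PREFIX, '速度与激情', 'utf-8'))
print(build_search_url(TIANTANG_PREFIX, '速度与激情', 'gb2312'))
```

Passing the encoding explicitly matters because the same keyword produces different bytes, and therefore a different query string, on each site.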

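Basic crawler design:

The second and third levels both run as threads.

First, the queue module. It provides a queue type with first-in, first-out behavior, and it is used here to hold the links waiting to be downloaded.

Usage:

1. Writing

q=queue.Queue()

q.put('what you want')

2. Reading

item=q.get()

q.task_done()

Note that get() blocks when there is nothing in the queue.

Always check q.empty() before reading. task_done() belongs on the reading side: call it after the item returned by get() has been processed, since it is what q.join() waits for.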
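As a small illustration of the queue calls above, here is a producer and a single consumer thread. It is only a sketch: it uses get(timeout=...) with queue.Empty as an alternative to checking empty() first, and the sample items are placeholders.

```python
import queue
import threading

tasks = queue.Queue()
results = []


def worker():
    """Consume items until the queue stays empty for one second."""
    while True:
        try:
            item = tasks.get(timeout=1)  # raises queue.Empty instead of blocking forever
        except queue.Empty:
            return
        results.append(item.upper())
        tasks.task_done()  # called by the consumer, once per processed item


for link in ['a', 'b', 'c']:
    tasks.put(link)

t = threading.Thread(target=worker)
t.start()
tasks.join()  # returns once task_done() has been called for every put()
t.join()
print(results)
```

With one consumer the items come out in the order they were put in, which is the first-in, first-out behavior described above.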

Next is the threading module.

My approach is to define custom classes that inherit from threading.Thread.
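A minimal sketch of that subclassing pattern follows. The Greeter class is a made-up example, not part of the crawler; it only shows where the base initializer, run(), start() and join() fit.

```python
import threading


class Greeter(threading.Thread):
    """Hypothetical example class: run() is what executes in the new thread."""

    def __init__(self, name_to_greet):
        threading.Thread.__init__(self)  # the base initializer must run before start()
        self.name_to_greet = name_to_greet
        self.message = None

    def run(self):
        self.message = 'hello, %s' % self.name_to_greet


g = Greeter('crawler')
g.start()  # start() spawns the thread, which then calls run()
g.join()   # wait for run() to finish
print(g.message)
```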

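I will not go over regular expressions here; I find them handier than BeautifulSoup or lxml.

For extensibility, each site's information is stored in a dictionary, so another site can be added later simply by adding another site dictionary.

A list holds all of the site dictionaries.

To make the task queues easy to manage, I keep all of them in one dictionary keyed by site name.

Variables: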

task1 = queue.Queue()
task2 = queue.Queue()
Cannel = {'爱下电影': task1, '电影天堂': task2}  # dictionary of task queues, keyed by site name
downloadurl = {'爱下电影': [], '电影天堂': []}  # collected download links, keyed by site name
"""
Format of each site dictionary in weblist:
{
    'name': 'site name',
    'url': 'search URL prefix (keyword not yet appended)',
    'pat': [regex 1, regex 2],
    'root': 'site root URL',
    'encode': 'page encoding',
}
"""
aixiamovie = {
    'name': '爱下电影',
    'url': r'http://www.aixia.cc/plus/search.php?searchtype=titlekeyword&q=',
    'root': r'http://www.aixia.cc',
    'pat': ['<h1 class=".*?"><a href="(.*?)" target="_blank">', 'onclick="copyUrl(.*?)">'],
    'encode': 'utf-8',
}
tiantang = {
    'name': '电影天堂',
    'url': r'http://s.dydytt.net/plus/so.php?kwtype=0&searchtype=title&keyword=',
    'root': r'http://s.dydytt.net',
    'pat': ["<td width='.*?'><b><a href='(.*?)'>", '<td style=.*? bgcolor=.*?><a href="(.*?)">'],
    'encode': 'gb2312',
}
weblist = []
weblist.append(aixiamovie)
weblist.append(tiantang)
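Since each site is described by one dictionary, the queue dictionary and the result dictionary could also be derived from the list rather than written by hand, so that adding a site touches only one place. This is a sketch under that assumption, not the original code; the third entry is a made-up placeholder, and the dictionaries are trimmed to the keys needed here.

```python
import queue

# Trimmed-down site dictionaries; 'example-site' is a made-up placeholder
weblist = [
    {'name': '爱下电影', 'encode': 'utf-8'},
    {'name': '电影天堂', 'encode': 'gb2312'},
    {'name': 'example-site', 'encode': 'utf-8'},
]

# One task queue and one result list per site, keyed by site name
Cannel = {site['name']: queue.Queue() for site in weblist}
downloadurl = {site['name']: [] for site in weblist}

print(sorted(Cannel) == sorted(downloadurl))
```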

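Task launcher class:

import urllib.parse  # used below to percent-encode the keyword

class taskstart():
    def __init__(self, keyword):
        sites = []
        for item in weblist:
            # Build the search URL: percent-encode the keyword in the site's own encoding
            site = dict(item)  # work on a copy so weblist keeps the bare prefix for later keywords
            site['url'] = item['url'] + urllib.parse.quote(keyword, encoding=item['encode'])
            sites.append(site)
        be = findurls(website=sites)
        be.start()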

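This class takes the keyword and, using each site's encoding, turns the site's URL prefix into a search-results URL. It then starts the thread that looks for detail-page links.

For several keywords, you can create several objects of this class.

Link-fetching class: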

class findurls(threading.Thread):
    def __init__(self, website):
        threading.Thread.__init__(self)
        # website is a list of site dictionaries (site name, URL, regular expressions, ...)
        self.website = website
        self.data = ''
        self.id = ''
        self.pat = ''
        self.root = ''
        self.encode = ''

    def connect(self, url, counts=3):
        try:
            webpage = requests.get(headers=headers, url=url, timeout=5)
            webpage.encoding = self.encode
            self.data = webpage.text
        except Exception as f:
            print(f)
            if counts > 0:
                print('%s connection failed, retrying' % url)
                time.sleep(1)
                counts -= 1
                self.connect(url=url, counts=counts)
            else:
                print("crawl failed")

    def urlgets(self):
        if self.data:
            res = re.findall(self.pat[0], self.data)
            canshu = {'name': self.id, 'pat': self.pat[1], 'encode': self.encode}
            if res:
                # The spider thread can be started here
                thread1 = spdier(dic=canshu)
                thread1.start()
                for item in res:
                    item = self.root + item
                    # Enqueue under the site name; task_done() is a consumer-side call, so it is not used here
                    Cannel[self.id].put(item)
            else:
                print("no matching results")
        else:
            print("no data returned, crawl failed")

    def run(self):
        for item in self.website:
            self.id = item['name']
            self.pat = item['pat']  # first regex finds detail-page links, second finds download links
            self.encode = item['encode']
            self.root = item['root']
            self.data = ''  # do not reuse the previous site's page if this request fails
            self.connect(url=item['url'])
            self.urlgets()
        print("task assignment finished")

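An object of this class requests each search URL in the list, extracts the detail-page links with a regular expression, and puts them on the site's queue. It starts the spider thread before loading the queue, so extraction from the detail pages begins right away.

Depending on your needs, you can create several objects of this class and split the URL pool among them.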
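The extraction step can be tried offline. The sketch below applies the first 爱下电影 pattern from the site dictionary to a hand-written HTML fragment; the fragment and its paths are invented for illustration and are not taken from the real site.

```python
import re

root = 'http://www.aixia.cc'
detail_pat = '<h1 class=".*?"><a href="(.*?)" target="_blank">'

# Invented sample of a search-results page
sample_html = (
    '<h1 class="title"><a href="/movie/1.html" target="_blank">A</a></h1>\n'
    '<h1 class="title"><a href="/movie/2.html" target="_blank">B</a></h1>\n'
)

# findall returns the captured href of every match; prefix each with the site root
detail_urls = [root + href for href in re.findall(detail_pat, sample_html)]
print(detail_urls)
```

Because the pattern has a single capture group, re.findall returns a list of plain strings, which is why each result can be concatenated with the root directly.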

The spdier class:

class spdier(threading.Thread):  # general-purpose spider
    def __init__(self, dic):  # dic is a dictionary with the keys pat, encode, name
        threading.Thread.__init__(self)
        self.id = dic['name']
        self.pat = dic['pat']
        self.encode = dic['encode']
        self.data = []
        self.wait = 5
        self.timeout = 3

    def connect(self):
        # A loop replaces the original self-recursion, so a long queue cannot exhaust the call stack
        while True:
            try:
                if not Cannel[self.id].empty():  # check whether the queue is empty
                    url = Cannel[self.id].get()
                    print("%s is fetching %s" % (self.id, url))
                    webpage = requests.get(url=url, headers=headers, timeout=5)
                    webpage.encoding = self.encode
                    self.data.append(webpage.text)
                    self.timeout = 3
                    self.wait = 5
                else:
                    print("%s waiting for tasks" % self.id)
                    if self.wait > 0:
                        self.wait -= 1
                        time.sleep(1)
                    else:
                        print("%s connections complete" % self.id)
                        return
            except Exception as f:
                # The failed URL has already left the queue, so the loop moves on to the next one
                print(f)
                if self.timeout > 0:
                    self.timeout -= 1
                    time.sleep(1)
                else:
                    print("connection failed")

    def getres(self):
        for each in self.data:
            res = re.findall(self.pat, each)
            # title = re.findall('<title>(.*?)</title>', each)  # get the title
            if res:
                for item in res:
                    downloadurl[self.id].append(item)
            else:
                print("no matching links")

    def run(self):
        self.connect()
        if self.data:
            self.getres()
            print("%s has produced results" % self.id)
            with open(r'f://' + self.id + '.txt', 'w', encoding='utf-8') as save:
                for d in downloadurl[self.id]:
                    save.write(d)
                    save.write('\n')
            print("%s work complete" % self.id)
        else:
            print("%s is missing the required data" % self.id)

This class uses that information to fetch the download links directly and saves them to a text file on the F: drive.
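The saving step can be isolated into a small function. This sketch writes into a temporary directory instead of the F: drive so that it runs on any machine; the save_links helper and the sample links are hypothetical.

```python
import os
import tempfile


def save_links(directory, site_name, links):
    """Write one link per line to <directory>/<site_name>.txt and return the path."""
    path = os.path.join(directory, site_name + '.txt')
    with open(path, 'w', encoding='utf-8') as f:  # the with block closes the file even on error
        for link in links:
            f.write(link + '\n')
    return path


out_dir = tempfile.mkdtemp()
saved = save_links(out_dir, 'demo', ['ftp://example.invalid/a.mkv', 'ftp://example.invalid/b.mkv'])
with open(saved, encoding='utf-8') as f:
    print(f.read().splitlines())
```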

Full code:

import requests
import re
import threading
import queue
import time
import urllib.parse

headers = {'User-Agent': r'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0;  TheWorld 7)', }
proxies = {}  # proxy configuration
task1 = queue.Queue()
task2 = queue.Queue()
Cannel = {'爱下电影': task1, '电影天堂': task2}  # dictionary of task queues, keyed by site name
downloadurl = {'爱下电影': [], '电影天堂': []}  # collected download links, keyed by site name
"""
Format of each site dictionary in weblist:
{
    'name': 'site name',
    'url': 'search URL prefix (keyword not yet appended)',
    'pat': [regex 1, regex 2],
    'root': 'site root URL',
    'encode': 'page encoding',
}
"""
aixiamovie = {
    'name': '爱下电影',
    'url': r'http://www.aixia.cc/plus/search.php?searchtype=titlekeyword&q=',
    'root': r'http://www.aixia.cc',
    'pat': ['<h1 class=".*?"><a href="(.*?)" target="_blank">', 'onclick="copyUrl(.*?)">'],
    'encode': 'utf-8',
}
tiantang = {
    'name': '电影天堂',
    'url': r'http://s.dydytt.net/plus/so.php?kwtype=0&searchtype=title&keyword=',
    'root': r'http://s.dydytt.net',
    'pat': ["<td width='.*?'><b><a href='(.*?)'>", '<td style=.*? bgcolor=.*?><a href="(.*?)">'],
    'encode': 'gb2312',
}
weblist = []
weblist.append(aixiamovie)
weblist.append(tiantang)


class spdier(threading.Thread):  # general-purpose spider
    def __init__(self, dic):  # dic is a dictionary with the keys pat, encode, name
        threading.Thread.__init__(self)
        self.id = dic['name']
        self.pat = dic['pat']
        self.encode = dic['encode']
        self.data = []
        self.wait = 5
        self.timeout = 3

    def connect(self):
        # A loop replaces the original self-recursion, so a long queue cannot exhaust the call stack
        while True:
            try:
                if not Cannel[self.id].empty():  # check whether the queue is empty
                    url = Cannel[self.id].get()
                    print("%s is fetching %s" % (self.id, url))
                    webpage = requests.get(url=url, headers=headers, timeout=5)
                    webpage.encoding = self.encode
                    self.data.append(webpage.text)
                    self.timeout = 3
                    self.wait = 5
                else:
                    print("%s waiting for tasks" % self.id)
                    if self.wait > 0:
                        self.wait -= 1
                        time.sleep(1)
                    else:
                        print("%s connections complete" % self.id)
                        return
            except Exception as f:
                # The failed URL has already left the queue, so the loop moves on to the next one
                print(f)
                if self.timeout > 0:
                    self.timeout -= 1
                    time.sleep(1)
                else:
                    print("connection failed")

    def getres(self):
        for each in self.data:
            res = re.findall(self.pat, each)
            # title = re.findall('<title>(.*?)</title>', each)  # get the title
            if res:
                for item in res:
                    downloadurl[self.id].append(item)
            else:
                print("no matching links")

    def run(self):
        self.connect()
        if self.data:
            self.getres()
            print("%s has produced results" % self.id)
            with open(r'f://' + self.id + '.txt', 'w', encoding='utf-8') as save:
                for d in downloadurl[self.id]:
                    save.write(d)
                    save.write('\n')
            print("%s work complete" % self.id)
        else:
            print("%s is missing the required data" % self.id)


class findurls(threading.Thread):
    def __init__(self, website):
        threading.Thread.__init__(self)
        # website is a list of site dictionaries (site name, URL, regular expressions, ...)
        self.website = website
        self.data = ''
        self.id = ''
        self.pat = ''
        self.root = ''
        self.encode = ''

    def connect(self, url, counts=3):
        try:
            webpage = requests.get(headers=headers, url=url, timeout=5)
            webpage.encoding = self.encode
            self.data = webpage.text
        except Exception as f:
            print(f)
            if counts > 0:
                print('%s connection failed, retrying' % url)
                time.sleep(1)
                counts -= 1
                self.connect(url=url, counts=counts)
            else:
                print("crawl failed")

    def urlgets(self):
        if self.data:
            res = re.findall(self.pat[0], self.data)
            canshu = {'name': self.id, 'pat': self.pat[1], 'encode': self.encode}
            if res:
                # The spider thread can be started here
                thread1 = spdier(dic=canshu)
                thread1.start()
                for item in res:
                    item = self.root + item
                    # Enqueue under the site name; task_done() is a consumer-side call, so it is not used here
                    Cannel[self.id].put(item)
            else:
                print("no matching results")
        else:
            print("no data returned, crawl failed")

    def run(self):
        for item in self.website:
            self.id = item['name']
            self.pat = item['pat']  # first regex finds detail-page links, second finds download links
            self.encode = item['encode']
            self.root = item['root']
            self.data = ''  # do not reuse the previous site's page if this request fails
            self.connect(url=item['url'])
            self.urlgets()
        print("task assignment finished")


class taskstart():
    def __init__(self, keyword):
        sites = []
        for item in weblist:
            # Build the search URL: percent-encode the keyword in the site's own encoding
            site = dict(item)  # work on a copy so weblist keeps the bare prefix for later keywords
            site['url'] = item['url'] + urllib.parse.quote(keyword, encoding=item['encode'])
            sites.append(site)
        be = findurls(website=sites)
        be.start()


keyword = input("Enter a movie title: ")
main = taskstart(keyword=keyword)

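The links finally obtained are shown in the figure.

The drawback of this program is that it only works for sites whose search uses GET requests. A general method for POST requests was added in a follow-up.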
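For comparison, a POST search sends the keyword in the request body instead of in the URL. The follow-up's actual method is not shown here, so the following is only a sketch using the standard library rather than requests; the endpoint, the form field name and the build_post_request helper are all placeholders. The request is built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, keyword, encoding):
    """Build (but do not send) a form-encoded POST request (hypothetical helper)."""
    # urlencode percent-encodes the value in the given encoding; the result is pure ASCII
    body = urlencode({'keyword': keyword}, encoding=encoding).encode('ascii')
    return Request(url, data=body, method='POST')


req = build_post_request('http://example.invalid/search', '速度与激情', 'utf-8')
print(req.get_method(), req.data)
```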
