It has been a while since I last updated this blog. I have actually built quite a few things in the meantime, but could never bring myself to sit down and write up my notes; today I finally talked myself into it. So which of those things should I write about? Ever since I picked up Python, web scraping has been my first love, and I have written plenty of crawler posts before, but never one about Scrapy, the most celebrated crawler framework of them all, so that is my starting point. I also studied someone's Scrapy-based crawler on GitHub (I can no longer find the repository, and that crawler has long since stopped working). The target site itself is great; I often use it to find linear algebra and probability videos to study. Standing on the shoulders of that giant, I implemented downloading of the site's videos and video thumbnails, as follows:

Project Preparation

1. A way to reach the site (VPN/proxy)

2. Python 3.7 (setting up the development environment is not covered here)

3. A MongoDB connection

Theoretical Background

Scrapy is an asynchronous processing framework built on Twisted: requests are handled in an event-driven way rather than by blocking threads, with a default limit of 16 concurrent requests. Its data flow is controlled by the Engine and proceeds as follows:

1. The Engine opens a site, finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.

2. The Engine gets the first URL from the Spider and schedules it, as a Request, with the Scheduler.

3. The Engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to the Engine, which forwards it through the Downloader Middlewares to the Downloader.

5. Once the page finishes downloading, the Downloader generates a Response for it and sends it back through the Downloader Middlewares to the Engine.

6. The Engine receives the Response from the Downloader and sends it through the Spider Middlewares to the Spider for processing.

7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8. The Engine hands the Spider's Items to the Item Pipeline and its new Requests to the Scheduler.

9. Steps 2-8 repeat until the Scheduler has no more Requests, at which point the Engine shuts down.
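The nine steps above can be sketched in miniature with a toy, pure-Python "engine": a scheduler queue of pending requests, a downloader stub, and a spider callback that yields either items or follow-up requests. All the names here (`PAGES`, `parse`, `crawl`) are invented for illustration; this is not Scrapy's API.

```python
from collections import deque

# toy "downloader": maps a URL to its page body
PAGES = {
    "/page1": "item:a item:b next:/page2",
    "/page2": "item:c",
}

def parse(url, body):
    """Spider callback: yield scraped items and follow-up requests."""
    for token in body.split():
        kind, _, value = token.partition(":")
        if kind == "item":
            yield {"item": value}        # would go to the Item Pipeline
        elif kind == "next":
            yield {"request": value}     # goes back to the Scheduler

def crawl(start_url):
    scheduler = deque([start_url])       # the Scheduler's request queue
    items = []
    while scheduler:                     # step 9: repeat until the queue is empty
        url = scheduler.popleft()        # steps 3-4: next URL from the Scheduler
        body = PAGES[url]                # steps 5-6: Downloader produces a Response
        for result in parse(url, body):  # steps 7-8: Spider returns Items/Requests
            if "item" in result:
                items.append(result["item"])
            else:
                scheduler.append(result["request"])
    return items

print(crawl("/page1"))  # ['a', 'b', 'c']
```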

Project Walkthrough

1. Create the project

# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# create the spider; note that its name (1) must not clash with the
# project name, and (2) the second argument is the site's domain
scrapy genspider pornhub pornhub.com

2. Create the Item

An Item is the container for scraped data, and it behaves much like a dict. To create one, subclass scrapy.Item and declare fields of type scrapy.Field. Looking at the target site, the information we can extract includes:

import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()     # video title
    image_urls = scrapy.Field()      # thumbnail download URL
    image_paths = scrapy.Field()     # local thumbnail path
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()     # view count
    video_rating = scrapy.Field()    # popularity rating
    link_url = scrapy.Field()        # online video URL
    file_urls = scrapy.Field()       # download URLs of the video segments
    file_paths = scrapy.Field()      # local paths of the video segments

3. Create the Spider

The Spider class is the heart of the project. It has two jobs: define how the site is crawled, and parse the pages that come back.

(1) Initialize Requests from the start URLs and set a callback. When a Request succeeds, the resulting Response is passed to that callback as an argument.

(2) Inside the callback, analyze the returned page. The result takes one of two forms. One is a dict or Item of extracted data, which can be saved directly. The other is the next link to follow (e.g. the next page); from it we build a new Request, set another callback, and return the Request for later scheduling.

(3) If a dict or Item is returned, it can be written to a file with components such as Feed Exports, or, if a Pipeline is configured, processed and saved there.

(4) If a Request is returned, then once it succeeds and a Response comes back, that Response is passed to the callback defined on the Request, where we can again use a selector (e.g. Selector) to analyze the new page and generate Items from it.

Looping through these steps crawls the entire site.

When building the start URLs, we first look at how the site classifies its resources: pornhub groups videos by popularity, total views, rating, and so on:

"""归纳PornHub资源链接"""
PH_TYPES = ['','recommended','video?o=ht', # hot'video?o=mv', # Most Viewed'video?o=tr', # Top Rate# Examples of certain categories# 'video?c=1',  # Category = Asian# 'video?c=111',  # Category = Japanese
]

Scrapy ships its own data-extraction tool, the Selector. It is built on lxml and supports XPath, CSS, and regular expressions, e.g.:

from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@href="image1.html"]/text()').extract_first()
# CSS
title = selector.css('a[href="image1.html"]::text').extract_first()
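Scrapy's Selector needs lxml, but the idea of attribute-predicate XPath can be tried with nothing more than the standard library's ElementTree, which supports a small XPath subset. The page snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# a tiny stand-in for a downloaded page (well-formed XML for ElementTree)
html = """
<html><body>
  <a href="image1.html">Name: My image 1</a>
  <a href="image2.html">Name: My image 2</a>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree understands simple attribute predicates like [@href='...']
title = root.find(".//a[@href='image1.html']").text
print(title)  # Name: My image 1
```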
The complete spider:

# -*- coding: utf-8 -*-
import re
import json
import logging

import requests
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES


class PornhubSpider(CrawlSpider):
    name = 'pornhub'                        # unique name within the project
    allowed_domains = ['www.pornhub.com']   # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))        # URL suffixes to crawl on startup

    # raise the requests library's log level to WARNING and log to a file
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')

    # build the initial Requests and set the callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type,
                          callback=self.parse_ph_key)

    # iterate over listing pages, yielding one Request per video
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # href = "...viewkey=******"; capture up to the closing quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # request the single-video page: its source holds the info we need
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and take its href attribute, e.g.
        # <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath(
            '//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0],
                          callback=self.parse_ph_key)

    # parse a video page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # [,|;] is a character class matching any one of ',', '|' or ';';
        # grab the flashvars JSON assigned in the page's inline script
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,|;]\n', selector.extract())
        logging.debug('flashvars JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        phItem['video_duration'] = _ph_info_json.get('video_duration')
        phItem['video_title'] = _ph_info_json.get('video_title')
        phItem['image_urls'] = _ph_info_json.get('image_url')
        phItem['link_url'] = _ph_info_json.get('link_url')
        phItem['file_urls'] = _ph_info_json.get('quality_480p')
        yield phItem

4. Create the Item Pipeline

After the Spider finishes parsing a Response, Items flow into the Item Pipeline, where the configured pipeline components are called in turn to perform a chain of processing:

  • clean HTML data
  • validate scraped data and check the fields
  • detect and drop duplicates
  • store the results in a database
import pymongo
from pymongo import IndexModel, ASCENDING
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

from pornhubBot import items


# store results in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # build unique indexes (a single index would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # the one method every pipeline must implement
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        # check the item type, then store it in MongoDB
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # the '$set' operator replaces the given fields, i.e. updates the record
                self.PhRes.update_one(
                    {'video_title': item['video_title']},
                    {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item


# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):
    # customize the thumbnail path (and name); the path is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name

    # after downloading, write the local thumbnail paths into the item's image_paths
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and download the file
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})


# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segment-video URLs from the item and download the files
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local video path (and name); the path is relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)

    # after downloading, write the local file paths into the item's file_paths
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
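The file_path overrides above are plain string manipulation, so their naming scheme can be checked in isolation. The URLs below are invented:

```python
# the same naming logic as the two pipelines' file_path methods
def thumb_path(url):
    file_name = url.split('/')[-1]          # last path segment, used verbatim
    return "%s/thumb.jpg" % file_name

def video_path(url):
    file_name = url.split('/')[-1]
    return "%s/%s.mp4" % (file_name, file_name)

print(thumb_path("https://cdn.example.com/thumbs/abc123"))  # abc123/thumb.jpg
print(video_path("https://cdn.example.com/videos/abc123"))  # abc123/abc123.mp4
```

Note that the last URL segment is used as-is, so a URL ending in a query string would leak `?key=value` into the directory name.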

Of course, the pipelines' calling order still has to be declared in Settings; the lower the value, the earlier a pipeline runs.

ITEM_PIPELINES = {
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
}
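Scrapy sorts this dict by value when assembling the pipeline chain, so the effective order can be previewed with a plain sorted() call (ties keep insertion order, since Python's sort is stable):

```python
ITEM_PIPELINES = {
    "pornhubBot.pipelines.VideoThumbPipeline": 1,
    "pornhubBot.pipelines.VideoFilesPipeline": 1,
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
}

# lower value = called earlier; keep just the class name for readability
order = [name.rsplit(".", 1)[-1]
         for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
# ['VideoThumbPipeline', 'VideoFilesPipeline', 'PornhubbotMongoDBPipeline']
```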

5. Create the Middleware

pornhub's anti-crawling measures are not especially strict, but before I set up a proxy my IP still got banned after a few crawls, so I use a proxy pool that I maintain myself. The usual crawler countermeasures, rotating the User-Agent, cookies, and proxies, all live here.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
import json
import logging

import requests


class UserAgentMiddleware(object):
    """Rotate the User-Agent"""
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class CookiesMiddleware(object):
    """Rotate the Cookie"""
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        # fill 'bs' with a random 32-character lowercase string
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)


class ProxyMiddleware(object):
    # http://localhost:5555/random returns a random usable proxy
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(proxy_url=settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        # request.meta is a plain dict; 'retry_times' is set by Scrapy's retry
        # middleware, so a proxy is only used once a request has already failed
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
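The bs value built in CookiesMiddleware is just a run of 32 random lowercase letters: 97-122 are the ASCII codes of 'a' through 'z'. The same logic, standalone:

```python
import random

def random_bs(length=32):
    # chr(97)..chr(122) == 'a'..'z'
    return ''.join(chr(random.randint(97, 122)) for _ in range(length))

bs = random_bs()
print(len(bs), bs.islower())  # 32 True
```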

I keep the common User-Agent strings in Settings:

USER_AGENTS = ["Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5","Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1","Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7","Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5","Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)","Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1","Mozilla/5.0 (Windows NT 6.1; Win64; x64; 
rv:2.0.1) Gecko/20100101 Firefox/4.0.1","Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre","Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )","Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)","Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a","Mozilla/2.02E (Win95; U)","Mozilla/3.01Gold (Win95; I)","Mozilla/4.8 [en] (Windows NT 5.1; U)","Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)","HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0","Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile 
Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3","Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1","Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3","Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; 
Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13","Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",]

6. Build the Settings

This is where the project-wide configuration lives:

# -*- coding: utf-8 -*-

# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'

DOWNLOAD_DELAY = 1  # delay between requests
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # defaults to 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# address that returns a random proxy
PROXY_URL = 'http://localhost:5555/random'

# Scrapy ships with Feed exports supporting several serialization formats;
# this encoding keeps Chinese text in the output file readable
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'CSV'

# download directories for files and images
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # custom URL field
FILES_URLS_FIELD = 'file_urls'    # custom URL field
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# filter out small images
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DOWNLOADER_MIDDLEWARES = {
    'pornhubBot.middlewares.UserAgentMiddleware': 401,
    'pornhubBot.middlewares.CookiesMiddleware': 402,
    'pornhubBot.middlewares.ProxyMiddleware': 403,
}
ITEM_PIPELINES = {
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
}

# By default Scrapy keeps pending requests in LIFO queues, i.e. crawls
# depth-first; these settings switch the crawl to breadth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# USER_AGENTS = [...]  (the full list of User-Agent strings shown in
# section 5 goes here; it is not repeated)

7. Define a quick launcher

from __future__ import absolute_import
from scrapy import cmdline

cmdline.execute("scrapy crawl pornhub".split())

Just run this script.


And with that, you can study advanced math offline!
