It has been a while since I last updated this blog. I have actually built quite a few things in the meantime, but could never bring myself to sit down and write up my notes; today I finally talked myself into it. So which of those things should I write about? Ever since I picked up Python, web scraping has been my first love, and I have written plenty of crawler posts before, but never one about Scrapy, the most celebrated crawler framework of them all, so that is my starting point. I also studied someone's Scrapy-based crawler on GitHub (I can no longer find the repository, and that crawler has long since stopped working). The target site itself is great; I often use it to find linear algebra and probability videos to study. Standing on the shoulders of that giant, I implemented downloading of the site's videos and video thumbnails, as follows:

Project Preparation

1. A way to reach the site (VPN/proxy)

2. Python 3.7 (setting up the development environment is not covered here)

3. A MongoDB connection

Theoretical Background

Scrapy is an asynchronous processing framework built on Twisted: requests are handled in an event-driven way rather than by blocking threads, with a default limit of 16 concurrent requests. Its data flow is controlled by the Engine and proceeds as follows:

1. The Engine opens a site, finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.

2. The Engine gets the first URL from the Spider and schedules it, as a Request, with the Scheduler.

3. The Engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to the Engine, which forwards it through the Downloader Middlewares to the Downloader.

5. Once the page finishes downloading, the Downloader generates a Response for it and sends it back through the Downloader Middlewares to the Engine.

6. The Engine receives the Response from the Downloader and sends it through the Spider Middlewares to the Spider for processing.

7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8. The Engine hands the Spider's Items to the Item Pipeline and its new Requests to the Scheduler.

9. Steps 2-8 repeat until the Scheduler has no more Requests, at which point the Engine shuts down.
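The nine steps above can be sketched in miniature with a toy, pure-Python "engine": a scheduler queue of pending requests, a downloader stub, and a spider callback that yields either items or follow-up requests. All the names here (`PAGES`, `parse`, `crawl`) are invented for illustration; this is not Scrapy's API.

```python
from collections import deque

# toy "downloader": maps a URL to its page body
PAGES = {
    "/page1": "item:a item:b next:/page2",
    "/page2": "item:c",
}

def parse(url, body):
    """Spider callback: yield scraped items and follow-up requests."""
    for token in body.split():
        kind, _, value = token.partition(":")
        if kind == "item":
            yield {"item": value}        # would go to the Item Pipeline
        elif kind == "next":
            yield {"request": value}     # goes back to the Scheduler

def crawl(start_url):
    scheduler = deque([start_url])       # the Scheduler's request queue
    items = []
    while scheduler:                     # step 9: repeat until the queue is empty
        url = scheduler.popleft()        # steps 3-4: next URL from the Scheduler
        body = PAGES[url]                # steps 5-6: Downloader produces a Response
        for result in parse(url, body):  # steps 7-8: Spider returns Items/Requests
            if "item" in result:
                items.append(result["item"])
            else:
                scheduler.append(result["request"])
    return items

print(crawl("/page1"))  # ['a', 'b', 'c']
```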

Project Walkthrough

1. Create the project

# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# create the spider; note that its name (1) must not clash with the
# project name, and (2) the second argument is the site's domain
scrapy genspider pornhub pornhub.com

2. Create the Item

An Item is the container for scraped data, and it behaves much like a dict. To create one, subclass scrapy.Item and declare fields of type scrapy.Field. Looking at the target site, the information we can extract includes:

import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()     # video title
    image_urls = scrapy.Field()      # thumbnail download URL
    image_paths = scrapy.Field()     # local thumbnail path
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()     # view count
    video_rating = scrapy.Field()    # popularity rating
    link_url = scrapy.Field()        # online video URL
    file_urls = scrapy.Field()       # download URLs of the video segments
    file_paths = scrapy.Field()      # local paths of the video segments

3. Create the Spider

The Spider class is the heart of the project. It has two jobs: define how the site is crawled, and parse the pages that come back.

(1) Initialize Requests from the start URLs and set a callback. When a Request succeeds, the resulting Response is passed to that callback as an argument.

(2) Inside the callback, analyze the returned page. The result takes one of two forms. One is a dict or Item of extracted data, which can be saved directly. The other is the next link to follow (e.g. the next page); from it we build a new Request, set another callback, and return the Request for later scheduling.

(3) If a dict or Item is returned, it can be written to a file with components such as Feed Exports, or, if a Pipeline is configured, processed and saved there.

(4) If a Request is returned, then once it succeeds and a Response comes back, that Response is passed to the callback defined on the Request, where we can again use a selector (e.g. Selector) to analyze the new page and generate Items from it.

Looping through these steps crawls the entire site.

When building the start URLs, we first look at how the site classifies its resources: pornhub groups videos by popularity, total views, rating, and so on:

"""归纳PornHub资源链接"""
PH_TYPES = ['','recommended','video?o=ht', # hot'video?o=mv', # Most Viewed'video?o=tr', # Top Rate# Examples of certain categories# 'video?c=1',  # Category = Asian# 'video?c=111',  # Category = Japanese
]

Scrapy ships its own data-extraction tool, the Selector. It is built on lxml and supports XPath, CSS, and regular expressions, e.g.:

from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@href="image1.html"]/text()').extract_first()
# CSS
title = selector.css('a[href="image1.html"]::text').extract_first()
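Scrapy's Selector needs lxml, but the idea of attribute-predicate XPath can be tried with nothing more than the standard library's ElementTree, which supports a small XPath subset. The page snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# a tiny stand-in for a downloaded page (well-formed XML for ElementTree)
html = """
<html><body>
  <a href="image1.html">Name: My image 1</a>
  <a href="image2.html">Name: My image 2</a>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree understands simple attribute predicates like [@href='...']
title = root.find(".//a[@href='image1.html']").text
print(title)  # Name: My image 1
```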
The complete spider:

# -*- coding: utf-8 -*-
import re
import json
import logging

import requests
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES


class PornhubSpider(CrawlSpider):
    name = 'pornhub'                        # unique name within the project
    allowed_domains = ['www.pornhub.com']   # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))        # URL suffixes to crawl on startup

    # raise the requests library's log level to WARNING and log to a file
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')

    # build the initial Requests and set the callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type,
                          callback=self.parse_ph_key)

    # iterate over listing pages, yielding one Request per video
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # href = "...viewkey=******"; capture up to the closing quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # request the single-video page: its source holds the info we need
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and take its href attribute, e.g.
        # <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath(
            '//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0],
                          callback=self.parse_ph_key)

    # parse a video page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # [,|;] is a character class matching any one of ',', '|' or ';';
        # grab the flashvars JSON assigned in the page's inline script
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,|;]\n', selector.extract())
        logging.debug('flashvars JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        phItem['video_duration'] = _ph_info_json.get('video_duration')
        phItem['video_title'] = _ph_info_json.get('video_title')
        phItem['image_urls'] = _ph_info_json.get('image_url')
        phItem['link_url'] = _ph_info_json.get('link_url')
        phItem['file_urls'] = _ph_info_json.get('quality_480p')
        yield phItem

4. Create the Item Pipeline

After the Spider finishes parsing a Response, Items flow into the Item Pipeline, where the configured pipeline components are called in turn to perform a chain of processing:

  • clean HTML data
  • validate scraped data and check the fields
  • detect and drop duplicates
  • store the results in a database
import pymongo
from pymongo import IndexModel, ASCENDING
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

from pornhubBot import items


# store results in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # build unique indexes (a single index would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # the one method every pipeline must implement
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        # check the item type, then store it in MongoDB
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # the '$set' operator replaces the given fields, i.e. updates the record
                self.PhRes.update_one(
                    {'video_title': item['video_title']},
                    {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item


# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):
    # customize the thumbnail path (and name); the path is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name

    # after downloading, write the local thumbnail paths into the item's image_paths
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and download the file
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})


# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segment-video URLs from the item and download the files
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local video path (and name); the path is relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)

    # after downloading, write the local file paths into the item's file_paths
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
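The file_path overrides above are plain string manipulation, so their naming scheme can be checked in isolation. The URLs below are invented:

```python
# the same naming logic as the two pipelines' file_path methods
def thumb_path(url):
    file_name = url.split('/')[-1]          # last path segment, used verbatim
    return "%s/thumb.jpg" % file_name

def video_path(url):
    file_name = url.split('/')[-1]
    return "%s/%s.mp4" % (file_name, file_name)

print(thumb_path("https://cdn.example.com/thumbs/abc123"))  # abc123/thumb.jpg
print(video_path("https://cdn.example.com/videos/abc123"))  # abc123/abc123.mp4
```

Note that the last URL segment is used as-is, so a URL ending in a query string would leak `?key=value` into the directory name.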

Of course, the pipelines' calling order still has to be declared in Settings; the lower the value, the earlier a pipeline runs.

ITEM_PIPELINES = {
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
}
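Scrapy sorts this dict by value when assembling the pipeline chain, so the effective order can be previewed with a plain sorted() call (ties keep insertion order, since Python's sort is stable):

```python
ITEM_PIPELINES = {
    "pornhubBot.pipelines.VideoThumbPipeline": 1,
    "pornhubBot.pipelines.VideoFilesPipeline": 1,
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
}

# lower value = called earlier; keep just the class name for readability
order = [name.rsplit(".", 1)[-1]
         for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)
# ['VideoThumbPipeline', 'VideoFilesPipeline', 'PornhubbotMongoDBPipeline']
```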

5. Create the Middleware

pornhub's anti-crawling measures are not especially strict, but before I set up a proxy my IP still got banned after a few crawls, so I use a proxy pool that I maintain myself. The usual crawler countermeasures, rotating the User-Agent, cookies, and proxies, all live here.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
import json
import logging

import requests


class UserAgentMiddleware(object):
    """Rotate the User-Agent"""
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class CookiesMiddleware(object):
    """Rotate the Cookie"""
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        # fill 'bs' with a random 32-character lowercase string
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)


class ProxyMiddleware(object):
    # http://localhost:5555/random returns a random usable proxy
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(proxy_url=settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        # request.meta is a plain dict; 'retry_times' is set by Scrapy's retry
        # middleware, so a proxy is only used once a request has already failed
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
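The bs value built in CookiesMiddleware is just a run of 32 random lowercase letters: 97-122 are the ASCII codes of 'a' through 'z'. The same logic, standalone:

```python
import random

def random_bs(length=32):
    # chr(97)..chr(122) == 'a'..'z'
    return ''.join(chr(random.randint(97, 122)) for _ in range(length))

bs = random_bs()
print(len(bs), bs.islower())  # 32 True
```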

I keep the common User-Agent strings in Settings:

USER_AGENTS = ["Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5","Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1","Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7","Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre","Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)","Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5","Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)","Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1","Mozilla/5.0 (Windows NT 6.1; Win64; x64; 
rv:2.0.1) Gecko/20100101 Firefox/4.0.1","Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1","Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre","Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )","Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)","Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a","Mozilla/2.02E (Win95; U)","Mozilla/3.01Gold (Win95; I)","Mozilla/4.8 [en] (Windows NT 5.1; U)","Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)","HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0","Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile 
Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3","Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1","Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3","Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; 
Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17","Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1","Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13","Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2","Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1","Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",]

6. Build the Settings

This is where the project-wide configuration lives:

# -*- coding: utf-8 -*-

# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'

DOWNLOAD_DELAY = 1  # delay between requests
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # defaults to 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# address that returns a random proxy
PROXY_URL = 'http://localhost:5555/random'

# Scrapy ships with Feed exports supporting several serialization formats;
# this encoding keeps Chinese text in the output file readable
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'CSV'

# download directories for files and images
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # custom URL field
FILES_URLS_FIELD = 'file_urls'    # custom URL field
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# filter out small images
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DOWNLOADER_MIDDLEWARES = {
    'pornhubBot.middlewares.UserAgentMiddleware': 401,
    'pornhubBot.middlewares.CookiesMiddleware': 402,
    'pornhubBot.middlewares.ProxyMiddleware': 403,
}
ITEM_PIPELINES = {
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
}

# By default Scrapy keeps pending requests in LIFO queues, i.e. crawls
# depth-first; these settings switch the crawl to breadth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# USER_AGENTS = [...]  (the full list of User-Agent strings shown in
# section 5 goes here; it is not repeated)

7. Define a quick launcher

from __future__ import absolute_import
from scrapy import cmdline

cmdline.execute("scrapy crawl pornhub".split())

Just run this script.


And with that, you can study advanced math offline!
