Table of Contents

I. Introduction to the Scrapy framework

II. Basic usage of the Scrapy framework

1) Installing the environment

2) Basic commands

3) Project layout

4) Creating a spider file

5) Scrapy architecture

6) The five core components

7) How Scrapy works

8) yield

9) scrapy shell

10) Steps for using pymysql

11) CrawlSpider

12) Log messages and log levels

13) Garbled output

14) Persistent storage in Scrapy

- Via terminal command

- Via pipelines

15) Image crawling with ImagesPipeline

Example: crawling images with Scrapy

16) Middleware (handling dynamically loaded data)

Example: crawling NetEase News

17) POST requests in Scrapy

18) Proxies

III. Applying the Scrapy framework -- the workflow of using Scrapy, illustrated with a BaiduSpider project

1) Create the project

2) Modify the items script

3) Create the spider script

4) Modify the settings script

5) Run the spider

6) Modify the pipelines script

7) Customize the middleware

IV. Hands-on project: crawling course information from the China University MOOC site

V. Summary


I. Introduction to the Scrapy Framework

- What is a framework?
    - A project template that integrates many features and is highly reusable.

- How do you learn a framework?
    - By studying in detail how to use each of the features the framework encapsulates.

- What is Scrapy?
    - A star framework for crawling. Features: high-performance persistent storage, asynchronous downloading, high-performance data parsing, and distributed crawling.

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

II. Basic Usage of the Scrapy Framework

1) Installing the environment

- macOS or Linux: pip install scrapy
- Windows:
    - pip install wheel
    - Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    - Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
    - pip install pywin32
    - pip install scrapy
- Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded.

2) Basic commands

- Create a project: scrapy startproject xxx
- cd xxx
- Create a spider file in the spiders subdirectory:
    - scrapy genspider spiderName www.xxx.com
- Run the project:
    - scrapy crawl spiderName

Note: run the crawl command from inside the project directory.

3) Project layout

spiders/
    __init__.py
    custom_spider.py   ---> created by ourselves; the file that implements the core crawling logic
__init__.py
items.py               ---> where the data structure is defined; a class that inherits from scrapy.Item
middlewares.py         ---> middleware, e.g. proxies
pipelines.py           ---> the pipeline file; it contains a single class that handles the follow-up processing of the downloaded data;
                            the default priority is 300, and the smaller the value the higher the priority (1-1000)
settings.py            ---> the configuration file, e.g. whether to obey the robots protocol, the User-Agent definition, etc.

4) Creating a spider file

(1) Change into the spiders folder: cd project_name/project_name/spiders

(2) scrapy genspider spider_name site_domain

Basic components of a spider file:

It inherits from the scrapy.Spider class

name = 'baidu'        ---> the name used when running the spider file
allowed_domains       ---> the domains the spider is allowed to crawl; URLs outside these domains are filtered out
start_urls            ---> the spider's start URLs; several can be listed, but usually there is only one
parse(self, response) ---> the callback that parses the data
response.text         ---> the response body as a string
response.body         ---> the response body as binary data
response.xpath()      ---> the xpath method returns a list of Selector objects
extract()             ---> extracts the data attribute of the Selector objects
extract_first()       ---> extracts the data of the first Selector in the list
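To make those return types concrete, here is a small, illustrative parse method; the XPath expression and variable names are placeholders, not taken from any real page:

def parse(self, response):
    selector_list = response.xpath('//div[@class="title"]/a/text()')  # a list of Selector objects
    first_title = selector_list.extract_first()   # data of the first Selector, or None if the list is empty
    all_titles = selector_list.extract()          # a plain list of the data strings of every Selector
    print(first_title, all_titles)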

5) Scrapy architecture

(1) Engine        ---> runs automatically and needs no attention; it organizes all request objects and dispatches them to the downloader
(2) Downloader    ---> receives request objects from the engine and fetches the data
(3) Spiders       ---> a Spider class defines how a site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data (items) from page content. In other words, the Spider is where you define the crawling behaviour and how a particular page (or set of pages) is parsed.
(4) Scheduler     ---> has its own scheduling rules and needs no attention
(5) Item Pipeline ---> the pipeline that finally processes the data; it exposes hooks for us to handle the data. Once an item has been collected in a Spider, it is passed to the Item Pipeline, where a series of components process it in a fixed order. Each item pipeline component is a Python class implementing a few simple methods; it receives an item, performs some action on it, and decides whether the item continues through the pipeline or is dropped and processed no further.

Typical uses of an item pipeline:

1. Cleaning HTML data

2. Validating crawled data (checking that an item contains certain fields)

3. De-duplication (and dropping duplicates)

4. Saving crawled results to a database

6) The five core components

Engine (Scrapy)
        Handles the data flow of the whole system and triggers events (the core of the framework).
    Scheduler
        Accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
    Downloader
        Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).
    Spiders
        The spiders do the real work: they extract the required information, i.e. the items, from specific pages. They can also extract links from those pages so Scrapy keeps crawling the next ones.
    Item Pipeline
        Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and cleaning out unneeded information. After a page has been parsed by a spider, its items are sent to the pipeline and processed in a fixed sequence of steps.

7) How Scrapy works

8) yield

1. A function containing yield is no longer an ordinary function but a generator, which can be iterated over.

2. yield is a keyword similar to return: on each iteration, when a yield is reached, the value to its right is returned. The key point: on the next iteration, execution resumes from the code immediately after that yield (the next line).

3. In short: yield returns a value like return does, but remembers where it returned from, so the next iteration continues from the line after that position.
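A tiny, self-contained illustration of this behaviour (not Scrapy-specific):

def count_up(n):
    for i in range(n):
        yield i            # pause here and hand i back to the caller

gen = count_up(3)          # calling the function only builds a generator
print(next(gen))           # 0
print(next(gen))           # 1 -- execution resumed right after the yield
print(list(gen))           # [2] -- the remaining values

In a spider, "yield item" and "yield scrapy.Request(...)" work the same way: each yielded object is handed to the engine one at a time.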

9) scrapy shell
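scrapy shell is an interactive console for trying out XPath/CSS expressions against a real response before writing them into a spider. A minimal, illustrative session (the URL is only an example):

scrapy shell "https://www.baidu.com"
# once the shell starts, a ready-made response object is available:
response.status                                   # HTTP status code of the fetched page
response.xpath('//title/text()').extract_first()  # try out an expression interactively
exit()                                            # leave the shell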

10) Steps for using pymysql

1. pip install pymysql

2. pymysql.connect(host, port, user, password, db, charset)

3. conn.cursor()

4. cursor.execute()
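Putting the four steps together, a minimal sketch might look like the following; the host, credentials, database, table, and SQL statement are all placeholders:

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='123456', db='spiderdb', charset='utf8')   # step 2: connect
cursor = conn.cursor()                                                     # step 3: get a cursor
cursor.execute('INSERT INTO news(title, link) VALUES (%s, %s)',            # step 4: run SQL
               ('some title', 'http://www.xxx.com'))
conn.commit()        # write the change to the database
cursor.close()
conn.close()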

11) CrawlSpider

- CrawlSpider: a class, a subclass of Spider
    - Ways to crawl an entire site:
        - Based on Spider: issue requests manually
        - Based on CrawlSpider
    - Using CrawlSpider (see the sketch below):
        - Create a project
        - cd xxx
        - Create the spider file (CrawlSpider template):
            - scrapy genspider -t crawl xxx www.xxxx.com
            - Link extractor:
                - Role: extract the links that match the specified rule (allow)
            - Rule parser:
                - Role: parse the pages behind the extracted links with the specified rule (callback)
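A minimal CrawlSpider sketch tying the link extractor and the rule parser together; the spider name, the allow pattern, and the XPath are placeholders:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):                  # created with: scrapy genspider -t crawl sun www.xxx.com
    name = 'sun'
    start_urls = ['http://www.xxx.com/']

    # link extractor: pull out every link whose URL matches the allow rule
    link = LinkExtractor(allow=r'/list_\d+\.html')
    # rule parser: request each extracted link and parse it with the callback
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//h1/text()').extract_first()
        yield {'title': title}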

How it works

12) Log messages and log levels

(1) Log levels:

CRITICAL: critical errors

ERROR:    ordinary errors

WARNING:  warnings

INFO:     general information

DEBUG:    debugging information

The default log level is DEBUG, so every message at DEBUG level or above is printed.

(2) settings.py configuration:

The default level, DEBUG, shows all of the messages above. In settings.py:

LOG_FILE:  writes everything that would appear on screen to a file instead (the file extension must be .log)

LOG_LEVEL: sets which log levels are shown and which are hidden

For example, to show only error messages, add the following to settings.py:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'

13) Garbled output

Add the following to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'

14) Persistent storage in Scrapy

- Via terminal command:

- Requirement: only the return value of the parse method can be stored, and only to a local text file
    - Note: the file type used for storage can only be one of: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    - Command: scrapy crawl xxx -o filePath
    - Advantage: simple, efficient, and convenient (a sketch follows this list)
    - Drawback: quite limited (data can only be stored to files with the extensions listed above)
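As a sketch of the terminal-based approach (the spider name, XPath, and field name are illustrative): parse only has to return a list of dict-like records, and the -o switch does the rest.

def parse(self, response):
    all_data = []
    div_list = response.xpath('//div[@class="content"]')
    for div in div_list:
        content = ''.join(div.xpath('.//span//text()').extract())
        all_data.append({'content': content})
    return all_data          # stored with: scrapy crawl xxx -o ./result.csv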

- Via pipelines:

- Workflow:
        - Parse the data
        - Define the corresponding fields in the item class
        - Store the parsed data in an item object
        - Submit the item object to the pipeline (with yield in the spider) for persistent storage
        - In the pipeline class's process_item, persist the data carried by the item it receives
        - Enable the pipeline in the configuration file
    - Advantage (a minimal pipeline sketch follows this list):
        - Very general.
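A minimal file-writing pipeline sketch that follows the workflow above; the class name, file name, and the 'content' field are illustrative, and the class still has to be enabled in ITEM_PIPELINES:

class DemoFilePipeline:
    fp = None

    def open_spider(self, spider):            # called once, when the spider starts
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):     # called once per item received from the spider
        self.fp.write(item['content'] + '\n')
        return item                           # pass the item on to the next pipeline class

    def close_spider(self, spider):           # called once, when the spider closes
        self.fp.close()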

15) Image crawling with ImagesPipeline

- How does crawling string data differ from crawling image data in Scrapy?
    - Strings: just parse them with XPath and submit them to a pipeline for persistent storage.
    - Images: XPath only yields the src attribute of the image; a separate request to that image address is needed to obtain the binary image data.

- ImagesPipeline:
    - You only need to parse the img src value and submit it to the pipeline; the pipeline then sends a request to the image src, fetches the binary image data, and also persists it for us.
- Goal: crawl the high-resolution images from the 站长素材 (sc.chinaz.com) site.
- Workflow:
    - Parse the data (the image addresses)
    - Submit the item holding the image address to the designated pipeline class
    - In the pipelines file, define a custom pipeline class based on ImagesPipeline and override:
        - get_media_requests
        - file_path
        - item_completed
    - In the configuration file:
        - Set the directory for storing images: IMAGES_STORE = './imgs_bobo'
        - Enable the custom pipeline class

Example: crawling images with Scrapy

img.py:

# -*- coding: utf-8 -*-
import scrapy
from imgsPro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # note: the site lazy-loads images, so the real address lives in the pseudo attribute src2
            src = div.xpath('./div/a/img/@src2').extract_first()
            item = ImgsproItem()
            item['src'] = src
            yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    # pass

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# class ImgsproPipeline(object):
#     def process_item(self, item, spider):
#         return item

from scrapy.pipelines.images import ImagesPipeline
import scrapy


class imgsPileLine(ImagesPipeline):

    # request the image data based on the image address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # specify the path used to store the image
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # hand the item on to the next pipeline class to be executed
middlewares.py: left as the auto-generated default template (not used in this example).

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for imgsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgsPro'

SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
#COOKIES_ENABLED = False# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgsPro.middlewares.ImgsproSpiderMiddleware': 543,
#}# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgsPro.middlewares.ImgsproDownloaderMiddleware': 543,
#}# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgsPro.pipelines.imgsPileLine': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# directory where the downloaded images are stored
IMAGES_STORE = './imgs_bobo'

16) Middleware (handling dynamically loaded data)

- Middleware
    - Downloader middleware (a short sketch follows this list)
        - Position: between the engine and the downloader
        - Role: intercept, in bulk, every request and response in the whole project
        - Intercepting requests:
            - UA spoofing: process_request
            - Proxy IP: process_exception, returning the request
        - Intercepting responses:
            - Tamper with the response data / the response object
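A compact downloader-middleware sketch of the two request-interception points listed above; the class name is illustrative, the User-Agent string reuses one from the examples further down, and the proxy address reuses the one from section 18) below:

import random


class DemoDownloaderMiddleware:               # enable it in DOWNLOADER_MIDDLEWARES
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    ]
    proxy_list = ['https://113.68.202.10:9999']

    def process_request(self, request, spider):
        # UA spoofing: rotate the User-Agent of every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None

    def process_exception(self, request, exception, spider):
        # proxy IP: when a request fails, attach a proxy and resubmit the request
        request.meta['proxy'] = random.choice(self.proxy_list)
        return request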

Example: crawling NetEase News

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

wangyi.py

# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem
class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.cccom']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the URLs of the five section pages

    # instantiate a browser object (used later by the downloader middleware)
    def __init__(self):
        self.bro = webdriver.Chrome(executable_path='/Users/bobo/Desktop/小猿圈爬虫课程/chromedriver')

    # parse the URLs of the five sections from the home page
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)
        # request each section page in turn
        for url in self.models_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    # the news titles on each section page are loaded dynamically
    def parse_model(self, response):
        # parse the news titles and detail-page URLs on each section page
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            item = WangyiproItem()
            item['title'] = title
            # request the news detail page
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # parse the news body
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    def closed(self, spider):
        self.bro.quit()

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for wangyiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wangyiPro'

SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
#COOKIES_ENABLED = False# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
#}# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}
LOG_LEVEL = 'ERROR'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WangyiproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep


class WangyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        return None

    # this method intercepts the responses of the five section pages and tampers with them
    def process_response(self, request, response, spider):   # spider is the spider object
        bro = spider.bro   # fetch the browser object defined in the spider class

        # pick out the responses that need tampering with:
        # the request identifies the response, and the URL identifies the request
        if request.url in spider.models_urls:
            bro.get(request.url)   # open the section page in the browser
            sleep(3)
            page_text = bro.page_source   # now contains the dynamically loaded news data

            # build a new response object that does contain the dynamically loaded news data
            # (obtained conveniently via selenium) and return it in place of the old one
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # the responses of all other requests are returned unchanged
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        pass

17) POST requests in Scrapy

(1) Override the start_requests method:

def start_requests(self)

(2) The return value of start_requests:

scrapy.FormRequest(url=url, headers=headers, callback=self.parse_item, formdata=data)

url:      the address the POST is sent to

headers:  custom request headers

callback: the callback function

formdata: the data carried by the POST request, a dict
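A minimal sketch of a POST-only spider built from the two steps above; the spider name, URL, and form data are placeholders:

import scrapy


class PostDemoSpider(scrapy.Spider):
    name = 'postdemo'
    # no start_urls: start_requests is overridden so the first request is a POST, not the default GET

    def start_requests(self):
        url = 'http://www.xxx.com/post'
        data = {'kw': 'scrapy'}               # formdata values must be strings
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)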

18) Proxies

(1) In settings.py, enable the downloader middleware:

DOWNLOADER_MIDDLEWARES = {
    'postproject.middlewares.Proxy': 543,
}

(2) In middlewares.py, write:

def process_request(self, request, spider):
    request.meta['proxy'] = 'https://113.68.202.10:9999'
    return None

III. Applying the Scrapy Framework -- the workflow of using Scrapy, illustrated by creating a BaiduSpider project

1) Create the project

Syntax: scrapy startproject project_name [path_for_the_project]

If no path is given, the project is generated in the directory where the command is run.

For example, when the command is run in PyCharm,

the project is generated under that directory.

2) Modify the items script

The scrapy library provides the Item class for turning crawled data into structured data. To use it, define an Item class (inheriting from scrapy.Item) and declare each of its fields as a scrapy.Field.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class NewsItem(scrapy.Item):
    index = scrapy.Field()     # rank
    title = scrapy.Field()     # title
    link = scrapy.Field()      # link
    newsNum = scrapy.Field()   # number of search results

3) Create the spider script

Syntax: scrapy genspider [template] <name> <domain>

template: the template to create from; if omitted, the default template is used

name: the name of the spider script; after creation, a .py file named after it is generated in the spiders directory

In the BaiduSpider project, create a spider script that crawls the hot-list news on the Baidu home page; name it news, with domain www.baidu.com.

# xpath returns a list whose elements are always Selector objects
# extract() pulls out the string stored in the data attribute of a Selector

Commonly used parameters of the Request method and what they mean:

The news spider in the BaiduSpider project works as follows:

1) In the parse() method, extract the rank, title, and link of each hot-list news item on the Baidu home page and write them into a NewsItem object, then call scrapy.Request() to request the new link, using the callback parameter to specify parse_newsnum as the callback and the meta parameter to pass the NewsItem object between the two parsing methods.

2) Define the parse_newsnum() method to parse the response of the new link, extract the number of search results for each news item, and write it into the NewsItem object.

news.py:

import scrapy                                   # import the scrapy module
from BaiduSpider.items import NewsItem          # import the NewsItem class from the items module
from copy import deepcopy                       # import deepcopy


class NewsSpider(scrapy.Spider):                # define the NewsSpider class
    name = 'news'                               # the spider's name
    allowed_domains = ['www.baidu.com']         # allowed domains
    start_urls = ['http://www.baidu.com/']      # start URLs

    def parse(self, response):                  # define the parse method
        # find all li nodes under the ul node
        news_list = response.xpath('//ul[@class="s-hotsearch-content"]/li')
        for news in news_list:                  # iterate over the li nodes
            item = NewsItem()                   # initialise the item object
            # find the rank node, get its text, and assign it to item['index']
            item['index'] = news.xpath('a/span[1]/text()').extract_first()
            # find the title node, get its text, and assign it to item['title']
            item['title'] = news.xpath('a/span[2]/text()').extract_first()
            # find the link node, get its value, and assign it to item['link']
            item['link'] = news.xpath('a/@href').extract_first()
            # send the request, specifying parse_newsnum as the callback;
            # when passing the item between the two parse methods via meta,
            # use a deep copy (and import deepcopy) to avoid the item data being corrupted
            yield scrapy.Request(
                item['link'],
                callback=self.parse_newsnum,
                meta={'item': deepcopy(item)}
            )

    def parse_newsnum(self, response):          # define the parse_newsnum method
        item = response.meta['item']            # receive the item
        # find the node holding the number of search results and assign its text to newsNum
        item['newsNum'] = response.xpath('//span[@class="nums_text"]/text()').extract_first()
        yield item                              # return the item

Tip:

When scrapy.Request() uses the meta parameter to pass an Item between two parsing methods, use a deep copy to prevent the Item data from being corrupted, e.g. meta={'item': deepcopy(item)}, and remember to import deepcopy.

Note:

To import the NewsItem class from the items module, the path must be set up: right-click the project name and choose Mark Directory as -> Sources Root.

Extension:

Scrapy also provides the FormRequest() method for sending requests that submit form data, such as POST requests. Its common parameters include url, callback, method, formdata, meta, and dont_filter. formdata is a dict holding the form data to submit; dont_filter is a bool, and if the same URL has to be submitted several times it must be set to True, otherwise the repeated requests are filtered out as duplicates.

4) Modify the settings script

The settings script provides a global namespace of key-value mappings whose configuration values can be read from code; the default settings file contains 25 settings.
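For instance, any component can read those values through crawler.settings, which is how the MongoPipeline defined later obtains MONGO_URI and MONGO_DB; a minimal sketch (the class name is illustrative):

class DemoComponent:
    def __init__(self, mongo_uri):
        self.mongo_uri = mongo_uri

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py as key-value pairs
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'))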

Two settings can be added.

One makes the log show only error messages:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'

The other keeps Chinese characters in the exported data from being garbled:

FEED_EXPORT_ENCODING = 'utf-8'

settings.py:

# Scrapy settings for BaiduSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BaiduSpider'

SPIDER_MODULES = ['BaiduSpider.spiders']
NEWSPIDER_MODULE = 'BaiduSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'   # set the USER_AGENT

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # do not obey the robots protocol

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
#COOKIES_ENABLED = False# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'BaiduSpider.middlewares.BaiduspiderSpiderMiddleware': 543,
#}# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # enable RandomUserAgentMiddleware and set its call order
    'BaiduSpider.middlewares.RandomUserAgentMiddleware': 350,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # enable TextPipeline and set its call order
    'BaiduSpider.pipelines.TextPipeline': 300,
    # enable MongoPipeline and set its call order
    'BaiduSpider.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'localhost'   # database connection address
MONGO_DB = 'baidu'        # database name

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

FEED_EXPORT_ENCODING = 'utf-8'

5) Run the spider

With the steps above, the BaiduSpider project is basically complete and can be run.

Syntax: scrapy crawl spider_name

Note: run it from inside the BaiduSpider directory.

The crawled content can also be saved to a file from the command line; for example, "scrapy crawl news -o news.json" saves the Item contents to a JSON file.

Tip:

When exporting a JSON file with -o, Unicode escaping is used by default, so a file containing Chinese is hard to read. In that case, change the default encoding in settings.py by adding FEED_EXPORT_ENCODING = 'utf-8'.

6) Modify the pipelines script

For more complex processing, such as filtering out useful data or saving data to a database, define an Item Pipeline in the pipelines script.

Defining an Item Pipeline only requires defining a class that implements the process_item() method. process_item() takes two parameters: item, to which every Item produced by the Spider is passed, and spider, the Spider instance (when several Spiders exist they can be told apart via spider.name). The method must return a dict containing the data or an Item object, or raise a DropItem exception to discard the Item.

In this project, for example, define a class that drops the Item whose index is 3,

and define a MongoPipeline class that stores the processed items in a MongoDB database.

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BaiduspiderPipeline:
    def process_item(self, item, spider):
        return item


from scrapy.exceptions import DropItem         # import DropItem


class TextPipeline:                            # define the TextPipeline class
    def process_item(self, item, spider):      # define the process_item method
        if item['index'] == '3':               # if the item's "index" is "3"
            raise DropItem()                   # drop the item
        else:                                  # if the item's "index" is not "3"
            return item                        # return the item


import csv


class CsvPipeline:
    def __init__(self):
        # location of the csv file; it does not need to exist beforehand
        store_file = 'news.csv'
        # open (create) the file
        self.file = open(store_file, 'w', newline='')
        # csv writer
        self.writer = csv.writer(self.file)  # , dialect="excel"

    def process_item(self, item, spider):
        # only write to the file when the field value is not empty
        if item['title']:
            # write a row to the csv file
            self.writer.writerow([item['index'], item['title'], item['link']])
        return item

    def close_spider(self, spider):
        # save and close the file when the spider closes
        self.file.close()


import pymongo                                 # import the pymongo module


class MongoPipeline:                           # define the MongoPipeline class
    def __init__(self, mongo_uri, mongo_db):   # define __init__
        self.mongo_uri = mongo_uri             # initialise mongo_uri
        self.mongo_db = mongo_db               # initialise mongo_db

    @classmethod                               # mark as a classmethod
    def from_crawler(cls, crawler):            # define from_crawler
        # read the database connection address and database name from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):             # define open_spider
        # connect to MongoDB
        self.client = pymongo.MongoClient(self.mongo_uri)
        # create the database
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):            # define close_spider
        self.client.close()                    # close the database connection

    def process_item(self, item, spider):      # define process_item
        data = {
            'index': item['index'],
            'title': item['title'],
            'link': item['link'],
            'newsNum': item['newsNum'],
        }                                      # build data
        table = self.db['news']                # create/get the collection
        table.insert_one(data)                 # insert the data into the database
        return item                            # return the item

After defining the TextPipeline and MongoPipeline classes, they still have to be enabled in ITEM_PIPELINES in the settings file, with their call order set, along with the database connection address and database name (see the settings.py listing above).

The value of each ITEM_PIPELINES entry is a number indicating the call priority: the smaller the number, the higher the priority.

In the BaiduSpider directory, run the command scrapy crawl news to save the content to the database.

7) Customize the middleware

In this project, a Downloader Middleware is customized to set a random User-Agent in the request headers, i.e. a RandomUserAgentMiddleware class is defined in middlewares.py:

middlewares.py:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class BaiduspiderSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class BaiduspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


import random                                  # import the random module


# define the RandomUserAgentMiddleware class
class RandomUserAgentMiddleware:
    def __init__(self):                        # define __init__
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)',
            'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)',
            'Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0',
            'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
        ]                                      # define user_agent_list

    def process_request(self, request, spider):   # define process_request
        # pick a User-Agent at random from user_agent_list
        useragent = random.choice(self.user_agent_list)
        # set the User-Agent in the request headers
        request.headers.setdefault('User-Agent', useragent)
        return None                            # return None


class RandomProxyMiddleware():
    def __init__(self):
        self.proxy_list = [
            'http://121.232.148.167:9000',
            'http://39.105.28.28:8118',
            'http://113.195.18.133:9999',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta.setdefault('proxy', proxy)
        print(request.meta['proxy'])

Once customized, the Downloader Middleware must be enabled in DOWNLOADER_MIDDLEWARES in settings, with its call order set, and the USER_AGENT setting removed (commented out).

The Spider Middleware that Scrapy provides and enables by default

is enough for most needs and normally does not need to be changed by hand.

IV. Hands-on Project: Crawling Course Information from the China University MOOC Site

What to build: use the Scrapy framework to crawl the course information returned by a search on the China University MOOC site (e.g. for "python"), including the course name, the school offering it, the course type, the number of participants, the course overview, the teaching objectives, and the prerequisites, and store the courses with more than 10,000 participants in MongoDB.

1) scrapy startproject MOOCSpider        # create the project

2) items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MoocspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class CourseItem(scrapy.Item):
    courseName = scrapy.Field()     # course name
    university = scrapy.Field()     # school offering the course
    category = scrapy.Field()       # course type
    enrollCount = scrapy.Field()    # number of participants
    overview = scrapy.Field()       # course overview
    objective = scrapy.Field()      # teaching objectives
    preliminaries = scrapy.Field()  # prerequisites


class SchoolItem(scrapy.Item):
    university = scrapy.Field()     # school
    courseName = scrapy.Field()     # course name
    enrollCount = scrapy.Field()    # number of participants
    teacher = scrapy.Field()        # teacher

3) Create the spider scripts

cd MOOCSpider

scrapy genspider course www.icourse163.org

scrapy genspider school www.icourse163.org

4) Modify course.py. Because the program sends a POST request at start-up, delete the default start_urls attribute and override the start_requests() method, in which scrapy.FormRequest() sends the POST request with the form data and specifies parse as the callback.

In the parse() method, extract the course name, school, course type, and number of participants, and also obtain the school abbreviation and course ID, which are joined into a new URL (e.g. https://www.icourse163.org/course/HENANNU-1003544138, where HENANNU is the school abbreviation and 1003544138 is the course ID). Then request the new URL with scrapy.Request(), specifying parse_section() as the callback to extract the course overview, teaching objectives, and prerequisites. Because the required content comes from different pages, scrapy.Request() must pass the item with the meta parameter, using a deep copy.

course.py:

import scrapy                                   # import the scrapy module
from MOOCSpider.items import CourseItem         # import the CourseItem class from the items module
import json                                     # import the json module
from copy import deepcopy                       # import deepcopy


# define the CourseSpider class
class CourseSpider(scrapy.Spider):
    name = 'course'                             # the spider's name
    allowed_domains = ['www.icourse163.org']    # allowed domains
    # the program sends a POST request at start-up, so the default start_urls attribute is removed
    # and start_requests() is overridden: scrapy.FormRequest() sends the POST with the form data
    # and specifies parse as the callback
    # start_urls = ['http://www.icourse163.org/']

    def start_requests(self):                   # override start_requests
        # define the URL
        url = 'https://www.icourse163.org/web/j/' \
              'mocSearchBean.searchCourse.rpc?csrfKey=' \
              '6d38afce3bd84a39b368f9175f995f2b'
        for i in range(7):                      # loop 7 times
            # build data_str
            data_dict = {
                'keyword': 'python',
                'pageIndex': str(i + 1),
                'highlight': 'true',
                'orderBy': 0,
                'stats': 30,
                'pageSize': 20
            }
            data_str = json.dumps(data_dict)
            data = {'mocCourseQueryVo': data_str}   # build data
            # send the POST request, specifying parse as the callback
            yield scrapy.FormRequest(method='POST', url=url, formdata=data,
                                     callback=self.parse, dont_filter=True)

    def parse(self, response):                  # define the parse method
        data = response.body.decode('utf-8')    # decode the response
        # get the course list
        course_list = json.loads(data)['result']['list']
        item = CourseItem()                     # initialise the item object
        for course in course_list:              # iterate over the courses
            # get the value of the mocCourseCardDto key
            CourseCard = course['mocCourseCard']['mocCourseCardDto']
            # extract the course name and write it into the item
            item['courseName'] = CourseCard['name']
            # extract the school and write it into the item
            item['university'] = CourseCard['schoolPanel']['name']
            if CourseCard['mocTagDtos']:
                # if the mocTagDtos key holds data, extract the course type and write it into the item
                item['category'] = CourseCard['mocTagDtos'][0]['name']
            else:
                item['category'] = 'NULL'       # otherwise set the course type to NULL
            # extract the number of participants and write it into the item
            item['enrollCount'] = CourseCard['termPanel']['enrollCount']
            # extract the school abbreviation
            shortName = CourseCard['schoolPanel']['shortName']
            # extract the course ID
            course_id = course['courseId']
            # join them into the course URL
            url = 'https://www.icourse163.org/course/' + \
                  shortName + '-' + str(course_id)
            # request the course page, specifying parse_section as the callback
            yield scrapy.Request(url, meta={'item': deepcopy(item)},
                                 callback=self.parse_section)

    def parse_section(self, response):          # define the parse_section method
        item = response.meta['item']            # receive the item
        item['overview'] = 'NULL'               # initialise "overview" to NULL
        item['objective'] = 'NULL'              # initialise "objective" to NULL
        item['preliminaries'] = 'NULL'          # initialise "preliminaries" to NULL
        # get the node list
        course_section = response.xpath('//div[@id="content-section"]')[0]
        for i in range(3, 10, 2):               # loop with a step of 2
            # build the node path and extract the heading text
            path_str = 'div[' + str(i) + ']/span[2]/text()'
            text = course_section.xpath(path_str).extract()
            # build the path of the body text
            path = 'div[' + str(i + 1) + ']/div//p//text()'
            if '课程概述' in text:              # if the heading text contains "课程概述" (course overview)
                overview = course_section.xpath(path).extract()
                overview = ''.join(overview)    # join the list elements
                item['overview'] = overview     # write into the item
            elif '授课目标' in text:            # if the heading text contains "授课目标" (teaching objectives)
                objective = course_section.xpath(path).extract()
                objective = ''.join(objective)  # join the list elements
                item['objective'] = objective   # write into the item
            elif '预备知识' in text:            # if the heading text contains "预备知识" (prerequisites)
                preliminaries = course_section.xpath(path).extract()
                preliminaries = ''.join(preliminaries)  # join the list elements
                item['preliminaries'] = preliminaries   # write into the item
        yield item                              # return the item

school.py

import scrapy
import re
from MOOCSpider.items import SchoolItem        # import the SchoolItem class from the items module
import json


class SchoolSpider(scrapy.Spider):
    name = 'school'
    allowed_domains = ['www.icourse163.org']
    start_urls = ['https://www.icourse163.org/university/PKU#/c']

    '''
    def parse(self, response):
        university_list = response.xpath('//div[@class="u-usitys f-cb"]/a')
        # for university in university_list:
        university = university_list[0]
        university_url = 'https://www.icourse163.org' + university.xpath('@href').extract_first()
        yield scrapy.Request(university_url, callback=self.parse_schoolID)
    '''

    def parse(self, response):
        text = re.search('window.schoolId = "(.*?)"', response.text, re.S)
        school_Id = text.group(1)
        url = 'https://www.icourse163.org/web/j/courseBean.getCourseListBySchoolId.rpc?csrfKey=6d38afce3bd84a39b368f9175f995f2b'
        for num in range(6):
            data = {
                'schoolId': school_Id,
                'p': str(num + 1),
                'psize': '20',
                'type': '1',
                'courseStatus': '30'
            }
            yield scrapy.FormRequest(method='POST', url=url, formdata=data,
                                     callback=self.parse_course, dont_filter=True)

    def parse_course(self, response):
        data = response.body.decode('utf-8')
        course_list = json.loads(data)['result']['list']
        item = SchoolItem()
        for course in course_list:
            item['university'] = course['schoolName']
            item['courseName'] = course['name']
            item['enrollCount'] = course['enrollCount']
            item['teacher'] = course['teacherName']
            yield item
            # print(university, courseName, enrollCount, teacher)

5) Modify pipelines.py: define a TextPipeline class that keeps only courses with more than 10,000 participants, and a MongoPipeline class that stores the data in a MongoDB database.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MoocspiderPipeline:
    def process_item(self, item, spider):
        return item


from scrapy.exceptions import DropItem         # import DropItem


class TextPipeline():                          # define the TextPipeline class
    def process_item(self, item, spider):      # define process_item
        if item['enrollCount'] > 10000:        # if the item's "enrollCount" is greater than 10000
            return item                        # return the item
        else:                                  # if "enrollCount" is 10000 or less
            raise DropItem('Missing item')     # drop the item


import pymongo                                 # import the pymongo module


class MongoPipeline():                         # define the MongoPipeline class
    def __init__(self, mongo_uri, mongo_db):   # define __init__
        self.mongo_uri = mongo_uri             # initialise mongo_uri
        self.mongo_db = mongo_db               # initialise mongo_db

    @classmethod                               # mark as a classmethod
    def from_crawler(cls, crawler):            # define from_crawler
        # read the database URI and database name from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):             # define open_spider
        # connect to MongoDB
        self.client = pymongo.MongoClient(self.mongo_uri)
        # create the database
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):            # define close_spider
        self.client.close()                    # close the database connection

    def process_item(self, item, spider):      # define process_item
        data = {
            '课程名称': item['courseName'],
            '开课学校': item['university'],
            '课程类型': item['category'],
            '参与人数': item['enrollCount'],
            '课程概述': item['overview'],
            '授课目标': item['objective'],
            '预备知识': item['preliminaries'],
        }                                      # build data (the keys are the Chinese field names)
        table = self.db['course']              # create/get the collection
        table.insert_one(data)                 # insert the data into the database
        return item                            # return the item

6) Modify middlewares.py: define a class that sets a random User-Agent in the request headers.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class MoocspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MoocspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


import random                                  # import random


# define the RandomUserAgentMiddleware class
class RandomUserAgentMiddleware:
    def __init__(self):                        # define __init__
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)',
            'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)',
            'Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0',
            'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
        ]                                      # define user_agent_list

    def process_request(self, request, spider):   # define process_request
        # pick a User-Agent at random from user_agent_list
        useragent = random.choice(self.user_agent_list)
        # set the User-Agent in the request headers
        request.headers.setdefault('User-Agent', useragent)
        return None                            # return None

7) Modify settings.py: set ROBOTSTXT_OBEY, DOWNLOADER_MIDDLEWARES, and ITEM_PIPELINES, and define the address and database name needed to connect to MongoDB.

# Scrapy settings for MOOCSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'MOOCSpider'

SPIDER_MODULES = ['MOOCSpider.spiders']
NEWSPIDER_MODULE = 'MOOCSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en','cookie': 'EDUWEBDEVICE=ad366d00d2a448df9e8b89e8ddb2abb8; hb_MA-A976-948FFA05E931_source=www.baidu.com; __yadk_uid=BJEprvWuVabEiTPe8yD4cxTxKeAOpusu; NTESSTUDYSI=6d38afce3bd84a39b368f9175f995f2b; Hm_lvt_77dc9a9d49448cf5e629e5bebaa5500b=1601255430,1601272424,1601272688,1601273453; WM_NI=edPVgwr6D7b1I0MgK58PF%2FAm%2FIyhZPldCt5b8sM%2FhscIGdXgkmsyDgzHAmRiUa7FH5TC8pZjD4KIBeRgKqNGbQSw0HaOZchEIuwNDn4YwcBaF2UrBM7WArc6W1IvlSUJZ2M%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee89ee69838da495d33de98e8fb3c15b829b8f85f552a69684aaf95caef5fdadd22af0fea7c3b92aa7bca0b7e67db2ea8cd8e13b8bf08388ca3ffc908ad0c467ed97b789d95cb0bc8d95b86afcad83d0eb79a1978985db6da9b3bd9ac76dba988f8ed16397bff9a7cb3f989df891d96288ec85aac16f92b98592cd4da28f9d98b344a3919684eb4f8babb9afc766f887b984c16b86ee9b93c147f5898f93e23e95ef8797ef59979696d3d037e2a3; WM_TID=gSj%2BsvyvzttFRAEVVQI7MbZOMjPj6zKS; Hm_lpvt_77dc9a9d49448cf5e629e5bebaa5500b=1601276871',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'MOOCSpider.middlewares.MoocspiderSpiderMiddleware': 543,
#}# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'MOOCSpider.middlewares.RandomUserAgentMiddleware': 350,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'MOOCSpider.pipelines.TextPipeline': 400,
    'MOOCSpider.pipelines.MongoPipeline': 500,
}

MONGO_URI = 'localhost'
MONGO_DB = 'MOOC'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

8) Run the spider script and inspect the data:

scrapy crawl course

V. Summary

(1) The Scrapy framework consists of the Engine, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middleware, and Spider Middleware.

(2) The general workflow for using the Scrapy framework: first, create a new project; next, modify the items script to define the structure of the Item data; then, create the spider script to parse responses and extract the data and new URLs; after that, modify the settings.py script to configure Scrapy's components and define global variables; finally, run the spider.

VI. An Example of Analysing the Crawled Data

飞桨 AI Studio - AI learning and training community (baidu.com)
