First, run the following command to create a Scrapy project:

# scrapy startproject projectName

The project has the following core files (the generated layout is sketched after this list):

items.py: in the project root directory

middlewares.py: in the project root directory

pipelines.py: in the project root directory

projectName.py: in the spiders directory

settings.py: in the project root directory
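For orientation, the layout that scrapy startproject generates looks roughly like this (the spider file itself is added by you; mine is called crawl.py):

crawlHexunRenwu/
    scrapy.cfg                # deploy configuration
    crawlHexunRenwu/          # the project package ("project root" above)
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            crawl.py          # the spider file (projectName.py above)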

My example crawls people profiles from Hexun; a sample page is http://renwu.hexun.com/figure_2789.shtml

The goal of the project: starting from a few seed URLs, fetch the HTML source, extract more URLs of the same pattern from that source, and keep crawling iteratively.

My project is named crawlHexunRenwu.

First, the items.py file:

import scrapy

class CrawlhexunrenwuItem(scrapy.Item):
    # define the fields for your item here like:
    filename = scrapy.Field()
    html_content = scrapy.Field()

This file declares the fields that will be extracted from the HTML.
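An Item behaves much like a dictionary, which is how the spider and pipeline below use it; a minimal sketch (not part of the project code):

from crawlHexunRenwu.items import CrawlhexunrenwuItem

item = CrawlhexunrenwuItem()
item['filename'] = 'figure_2789.shtml'     # keys must match the declared fields
item['html_content'] = b'<html>...</html>'
print(item['filename'])                    # read back like a dict
# assigning a key that was not declared as a Field raises KeyError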

Second, the spider file under the spiders directory (projectName.py); mine is named crawl.py:

import scrapy
import codecs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawlHexunRenwu.items import CrawlhexunrenwuItem
class CrawlHexun(CrawlSpider):
    name = 'crawl_hexun'
    allowed_domains = ['renwu.hexun.com']
    start_urls = [
        # Seed URLs to start crawling from
        "http://renwu.hexun.com/figure_2606.shtml",
        "http://renwu.hexun.com/figure_6782.shtml",
        "http://renwu.hexun.com/figure_4679.shtml",
        "http://renwu.hexun.com/figure_1001.shtml"
    ]
    rules = [
        # Rule for extracting follow-up URLs from the HTML for iterative crawling
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//a[contains(@href,"figure_")]')),
             callback='parse_item', follow=True)
    ]

    def parse_start_url(self, response):
        # Handle the URLs defined in start_urls; essentially the same as parse_item below
        item = CrawlhexunrenwuItem()
        item['html_content'] = response.body
        item['filename'] = response.url.split("/")[-1]
        return item  # write the extracted data into the item and return it

    def parse_item(self, response):
        # Handle the URLs discovered iteratively (everything except the seed URLs)
        item = CrawlhexunrenwuItem()
        item['html_content'] = response.body
        item['filename'] = response.url.split("/")[-1]
        return item  # write the extracted data into the item and return it

This file defines the seed URLs, extracts the fields from each response, extracts follow-up URLs via the Rule (I use LinkExtractor here, which replaces the deprecated SgmlLinkExtractor), and returns the extracted data. An alternative way to write the link rule is sketched below.
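If you prefer matching on the URL itself rather than on the anchor tags, roughly the same set of links can be selected with LinkExtractor's allow regex; a hedged alternative sketch, not the project's original rule:

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = [
    # Follow any link whose URL contains "figure_<digits>.shtml",
    # hand each response to parse_item, and keep following links from it
    Rule(LinkExtractor(allow=(r'figure_\d+\.shtml',)),
         callback='parse_item', follow=True)
]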

Third, pipelines.py:

import codecs
from random import randrange

class CrawlhexunrenwuPipeline(object):
    def process_item(self, item, spider):
        content = item['html_content']
        filename = item['filename']
        # Write each page's raw HTML to a file named after the last URL segment
        with open("/search/hexunrenwu/" + filename, 'wb') as f:
            f.write(content)
        return item

The items returned by parse_start_url and parse_item in step two end up here, so this is the place to write the crawled content to disk. A small hardening of this pipeline is sketched below.
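One caveat: the pipeline assumes the output directory /search/hexunrenwu/ already exists. A hedged variant (assuming Python 3) that creates the directory when the spider starts, using Scrapy's open_spider hook, with otherwise identical behavior:

import os

class CrawlhexunrenwuPipeline(object):
    output_dir = "/search/hexunrenwu/"

    def open_spider(self, spider):
        # Called once when the spider starts; create the target directory if needed
        os.makedirs(self.output_dir, exist_ok=True)

    def process_item(self, item, spider):
        path = os.path.join(self.output_dir, item['filename'])
        with open(path, 'wb') as f:
            f.write(item['html_content'])
        return item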

Fourth, the middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class CrawlhexunrenwuSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ProxyMiddleWare(object):
    lst_https_proxy = [
        'https://xxxxx1:9090',
        'https://xxxxx2:9090',
        'https://xxxxx3:9090',
        'https://xxxxx4:9090',
        'https://xxxxx5:9090',
        'https://xxxxx6:9090'
    ]
    lst_http_proxy = [
        'http://ttttt1:8080',
        'http://ttttt2:8080',
        'http://ttttt3:8080',
        'http://ttttt4:8080',
        'http://ttttt5:8080',
    ]

    def random_select_proxy(self):
        len_all = len(ProxyMiddleWare.lst_https_proxy) + len(ProxyMiddleWare.lst_http_proxy)
        idx = int(random.random() * len_all)
        if idx < len(ProxyMiddleWare.lst_https_proxy):
            return ProxyMiddleWare.lst_https_proxy[idx]
        else:
            return ProxyMiddleWare.lst_http_proxy[idx - len(ProxyMiddleWare.lst_https_proxy)]

    def random_select_https_proxy(self):
        idx = int(random.random() * len(ProxyMiddleWare.lst_https_proxy))
        return ProxyMiddleWare.lst_https_proxy[idx]

    # Override this method to preprocess outgoing requests; here it assigns a proxy.
    # Crawling straight from a single machine is easy to get banned, so proxies help.
    # If you do not need proxies, you can leave this class out.
    def process_request(self, request, spider):
        if request.url.find('https') == 0:
            request.meta['proxy'] = self.random_select_https_proxy()
        else:
            request.meta['proxy'] = self.random_select_proxy()


#class RotateUserAgentMiddleware(UserAgentMiddleware):
class RotateUserAgentMiddleware(object):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    # Override this method to set the User-Agent header
    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

This file has two key classes: ProxyMiddleWare and RotateUserAgentMiddleware.

ProxyMiddleWare: assigns a proxy to each request.

RotateUserAgentMiddleware: sets the User-Agent header on each request.

So the essence of this file is preprocessing outgoing requests. (A small simplification of the proxy-picking logic is sketched below.)
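As a side note, the manual index arithmetic in ProxyMiddleWare can be replaced with random.choice, which makes the same uniform pick; a minimal sketch under that assumption, not the author's original code:

import random

lst_https_proxy = ['https://xxxxx1:9090', 'https://xxxxx2:9090']
lst_http_proxy = ['http://ttttt1:8080', 'http://ttttt2:8080']

def random_select_proxy():
    # Uniform choice over the combined pool, equivalent to int(random.random() * len_all)
    return random.choice(lst_https_proxy + lst_http_proxy)

def random_select_https_proxy():
    return random.choice(lst_https_proxy)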

To actually use these two classes, though, they must be enabled in the project settings.

Fifth, the settings.py file:

# -*- coding: utf-8 -*-

# Scrapy settings for crawlHexunRenwu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawlHexunRenwu'

SPIDER_MODULES = ['crawlHexunRenwu.spiders']
NEWSPIDER_MODULE = 'crawlHexunRenwu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlHexunRenwu (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether to respect the robots.txt protocol
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Prevents the site from using cookies to identify the crawler
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawlHexunRenwu.middlewares.CrawlhexunrenwuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# Enable the custom downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'crawlHexunRenwu.middlewares.ProxyMiddleWare': 100,
    'crawlHexunRenwu.middlewares.RotateUserAgentMiddleware': 101
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Activate the custom pipeline component
ITEM_PIPELINES = {
    'crawlHexunRenwu.pipelines.CrawlhexunrenwuPipeline': 300,
}

# Download timeout in seconds
DOWNLOAD_TIMEOUT = 15

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# The default duplicate-request filter; to customize it, subclass RFPDupeFilter
# and override its request_fingerprint method
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Scrapy stores requests in LIFO queues by default, i.e. crawls depth-first.
# The settings below switch to FIFO queues, i.e. breadth-first crawling.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

And that's it.

Now start the spider:

# scrapy crawl crawl_hexun
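Note that scrapy crawl takes the spider's name attribute ('crawl_hexun' here), not the file name. If you are unsure which spiders are registered, scrapy list prints their names, and the log level can be raised for a quieter run; for example:

# scrapy list
crawl_hexun
# scrapy crawl crawl_hexun --loglevel=INFO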
