1. Create the crawler project

scrapy startproject mySpider

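This generates the following files and directories:

scrapy.cfg: the project's configuration file
mySpider/items.py: defines the storage format (the fields) of the scraped data
mySpider/pipelines.py: pipeline file, used to connect to a database or save files
mySpider/settings.py: settings file, e.g. cookies, headers, and the various components
mySpider/spiders/: spider files, where the HTML parsing rules are written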

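2. Create a spider file

scrapy genspider <spider name> <allowed domain>
scrapy genspider web "web.com"  # "web" becomes the .py file name and the spider's name attribute; "web.com" goes into allowed_domains

A spider file is generated in the spiders folder.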

3. Write the project files

3.1 web.py

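1. A scrapy.Spider subclass needs a parsing method named parse; it is the default callback for the responses to start_urls.
2. Mind where you start the spider: run it from inside the project directory.
3. Use yield in parse() to return data. Note: a parsing callback may only yield these objects: BaseItem, Request, dict, None.
4. response.xpath
response.xpath returns a list-like object containing Selector objects. It behaves like a list but has some extra methods:
extract(): returns a list of strings
extract_first(): returns the first string in the list, or None if the list is empty
5. Common attributes of the response object
response.url: URL of the current response
response.request.url: URL of the request that produced the current response
response.headers: response headers
response.request.headers: headers of the request that produced the current response
response.body: response body, i.e. the HTML, as bytes
response.status: response status code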

1) Spider class style

import re

import scrapy
from mySpider.items import webItem  # import the item class that declares the fields


class webSpider(scrapy.Spider):  # must inherit from scrapy.Spider
    name = "web"  # spider name; a class attribute, also reachable as self.name
    allowed_domains = ["web.cn"]  # restrict crawling to this domain; other domains are skipped
    start_urls = ('http://www.web.cn/',)  # first page to visit

    def parse(self, response):  # parse the page, usually with xpath
        for each in response.xpath("//div[@class='aaa']"):  # select the target divs
            # scrape the page data
            item = webItem()  # wrap the data in the item class we defined
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            item['name'] = name[0]  # keys must match the fields declared in items.py
            item['title'] = title[0]
            yield item  # hand the item over to the pipelines

        # add the next link (assumes the URL contains a page number)
        curpage = re.search(r'(\d+)', response.url).group(1)  # current page number
        page = int(curpage) + 10  # following the site's pattern, page=?? grows by 10
        url = re.sub(r'\d+', str(page), response.url)  # next link to crawl
        # queue the new request; self.parse is called with its response
        yield scrapy.Request(url, callback=self.parse)
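
The paging step at the end of parse() can be tried on its own with the standard library. The helper below is a sketch, not part of Scrapy: the name next_page_url, the step parameter, and the sample URL are assumptions made for illustration. Unlike the spider code, it replaces only the first run of digits and returns None when the URL has no number at all.

```python
import re


def next_page_url(url, step=10):
    """Return url with its first number increased by step, or None if it has none."""
    match = re.search(r"\d+", url)
    if match is None:
        return None  # nothing to increment, so there is no next page to request
    page = int(match.group()) + step
    # count=1 limits the substitution to the first run of digits
    return re.sub(r"\d+", str(page), url, count=1)


print(next_page_url("http://www.web.cn/list?page=20"))
print(next_page_url("http://www.web.cn/"))
```

Returning None for a URL without digits lets the caller skip the follow-up request instead of failing on re.search(...).group(1).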

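3.2 items.py

Define the data format of the processed page

import scrapy


class webItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()  # used by the spider in 3.1 as item['title']
    level = scrapy.Field()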

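3.3 pipelines.py

Use a pipeline to save the data
# process_item runs once for every item that the spider's parsing method yields
# process_item is a fixed method name

import json


class webJsonPipeline(object):
    def __init__(self):
        # text mode with an explicit encoding, because json.dumps returns str
        self.file = open('web.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()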
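
A pipeline is an ordinary Python class, so its lifecycle can be exercised without running a crawl. The sketch below is a variant written for that purpose: the class name JsonLinesPipeline, the path argument, and the sample items are assumptions, and plain dicts stand in for items. It opens the file in open_spider rather than __init__, which is one common arrangement.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes one JSON object per line; path is a parameter only for this demo."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # returning the item keeps it available to later pipelines

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand, in the order open -> process each item -> close.
path = os.path.join(tempfile.mkdtemp(), "web.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in [{"name": "张三", "title": "工程师"}, {"name": "Li", "title": "Editor"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

With ensure_ascii=False the Chinese text is written as-is, which is why the file needs a UTF-8 text encoding.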

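3.4 settings.py

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'mySpider.pipelines.webJsonPipeline': 400,
}

Each key is the pipeline class to use, written as a dotted path: the first part is the project directory, the second the file, and the third the pipeline class defined in it.
Each value is the pipeline's position in the run order: the smaller the number, the earlier it runs. The value is usually kept under 1000.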
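
The ordering rule can be pictured with plain Python. This is only an illustration of "smaller value runs first", not how Scrapy is implemented, and CleanupPipeline is a hypothetical second pipeline added so there is something to order.

```python
# Example settings dict; CleanupPipeline is hypothetical.
ITEM_PIPELINES = {
    "mySpider.pipelines.webJsonPipeline": 400,
    "mySpider.pipelines.CleanupPipeline": 300,
}

# Sort the dotted paths by their numeric value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

# Split one key into its module path and class name.
module_path, class_name = run_order[0].rsplit(".", 1)

print(run_order)
print(module_path, class_name)
```

Here the cleanup pipeline would see each item before the JSON writer does, because 300 is smaller than 400.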

4. Run the spider

scrapy crawl web

Other:

2. CrawlSpider class style

scrapy genspider -t crawl web web.com  # the generation command changes as well

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from mySpider.items import webItem


class webSpider(CrawlSpider):  # inherits from CrawlSpider here
    name = "web"
    allowed_domains = ["hr.web.com"]
    start_urls = ["http://hr.web.com/position.php?&start=0#a"]

    page_lx = LinkExtractor(allow=(r"start=\d+",))  # links matching this pattern are extracted
    # URLs extracted more than once are deduplicated; only the first is requested
    rules = [Rule(page_lx, callback="parseContent", follow=True)]

    def parseContent(self, response):
        for each in response.xpath('//*[@class="even"]'):
            # Strings are already Unicode in Python 3, so no .encode('utf-8') is needed
            item = webItem()  # webItem must declare every field assigned below
            item['name'] = each.xpath('./td[1]/a/text()').extract()[0]
            item['detailLink'] = each.xpath('./td[1]/a/@href').extract()[0]
            item['positionInfo'] = each.xpath('./td[2]/text()').extract()[0]
            item['peopleNumber'] = each.xpath('./td[3]/text()').extract()[0]
            item['workLocation'] = each.xpath('./td[4]/text()').extract()[0]
            item['publishTime'] = each.xpath('./td[5]/text()').extract()[0]
            yield item
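
To get a feel for which links the allow pattern picks up, the regular expression can be tested against a few URLs with the standard library. The hrefs below are made up, and the loop only approximates the behaviour: Scrapy itself does more (for example it also honors allowed_domains), so treat this as a sketch.

```python
import re

allow = re.compile(r"start=\d+")  # same pattern as the LinkExtractor

# Made-up hrefs standing in for the links found on one page.
hrefs = [
    "http://hr.web.com/position.php?&start=10#a",
    "http://hr.web.com/about.php",
    "http://hr.web.com/position.php?&start=20#a",
    "http://hr.web.com/position.php?&start=10#a",  # duplicate of the first
]

seen = set()
to_follow = []
for href in hrefs:
    # keep a link only if it matches the pattern and was not seen before
    if allow.search(href) and href not in seen:
        seen.add(href)
        to_follow.append(href)

print(to_follow)
```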

3. Simulated login

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # fill in and submit the login form found in the response
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # check login succeed before going on
        # response.body is bytes, so compare against the decoded response.text
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
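
Because response.body is bytes, a substring check against it needs either a bytes literal or decoded text. The sketch below shows the difference with the standard library only; the body content is made up.

```python
body = b"<html><p>authentication failed</p></html>"  # made-up response body

# Looking for a str inside bytes raises TypeError in Python 3.
try:
    "authentication failed" in body
    mixed_ok = True
except TypeError:
    mixed_ok = False

bytes_match = b"authentication failed" in body               # bytes in bytes
text_match = "authentication failed" in body.decode("utf-8")  # str in str

print(mixed_ok, bytes_match, text_match)
```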

http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/settings.html#topics-settings-ref

References:
http://scrapy-chs.readthedocs.io/zh_CN/1.0/index.html
