Scrapy: crawling multiple pages, sending POST requests, and following scraped URLs to crawl further pages

Crawling multiple pages with Scrapy

import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']
    # URL template for the follow-up pages; %d is replaced by the page number
    url = 'https://www.zhipin.com/c101010100/?query=python爬虫&page=%d&ka=page-2'
    page = 1

    # Parse the response and hand the items to the pipeline for persistent storage
    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()

            # Instantiate an item object
            item = BossproItem()
            # Store all of the parsed data in the item
            item['job_name'] = job_name
            item['salary'] = salary
            item['company'] = company

            # Submit the item to the pipeline
            yield item

        # After the current page is parsed, request the next one
        if self.page <= 3:
            print('if branch executed!')
            self.page += 1
            new_url = self.url % self.page
            print(new_url)

            # Send the request manually, reusing parse as the callback
            yield scrapy.Request(url=new_url, callback=self.parse)
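The paging logic in parse can be checked without running the spider. The following is a minimal standalone sketch, not part of the original project: next_page_urls is a hypothetical helper that replays the same counter and %d template and lists the URLs the spider would request after the start page.

```python
# Standalone sketch of the paging logic. next_page_urls is a hypothetical
# helper for illustration; it is not part of Scrapy or of the original spider.
URL_TEMPLATE = 'https://www.zhipin.com/c101010100/?query=python爬虫&page=%d&ka=page-2'


def next_page_urls(template, start_page=1, last_checked_page=3):
    """Replay `if self.page <= 3: self.page += 1` and collect the follow-up URLs."""
    page = start_page
    urls = []
    while page <= last_checked_page:
        # The spider increments the counter first, then formats the URL
        page += 1
        urls.append(template % page)
    return urls


if __name__ == '__main__':
    for url in next_page_urls(URL_TEMPLATE):
        print(url)
```

Because the check runs before the increment, the follow-up requests are for pages 2, 3 and 4, so the condition `self.page <= 3` stops one page later than it may appear at first glance.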

POST requests with Scrapy

import scrapy
from scrapy1.items import Scrapy1Item


class MyspiderSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']
    # Form fields sent in the POST body
    data = {'kw': 'cat'}

    # Override start_requests so the start URLs are requested with POST instead of GET
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=self.data, callback=self.parse)

    def parse(self, response):
        item = Scrapy1Item()
        item['title'] = 'cat'
        item['content'] = response.text

        yield item
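For reference, scrapy.FormRequest sends formdata as a URL-encoded request body and switches the method to POST when form data is supplied. The sketch below uses only the standard library to show roughly what that body looks like for the data dict above. build_form_body is a hypothetical helper for illustration, and Scrapy's own encoding may differ in details.

```python
from urllib.parse import urlencode


def build_form_body(formdata):
    """Encode a dict of form fields as an application/x-www-form-urlencoded body.

    Hypothetical helper that approximates what a form POST carries on the wire.
    """
    return urlencode(formdata).encode('utf-8')


if __name__ == '__main__':
    print(build_form_body({'kw': 'cat'}))  # b'kw=cat'
```

Non-ASCII values are percent-encoded as UTF-8 by urlencode, so a Chinese query word can be passed in the same dict without manual escaping.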

Following scraped URLs to crawl detail pages with Scrapy

import scrapy
from scrapy1.items import Scrapy1Item


class MyspiderSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # Callback for the detail page: complete the item passed in through meta
    def get_detail(self, response):
        item = response.meta['item']
        detail = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['content'] = detail

        yield item

    # Callback for the list page: extract each title and detail-page link
    def parse(self, response):
        div_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
        # print(div_list)
        for li in div_list:
            item = Scrapy1Item()
            name = li.xpath('./div/a/@title').extract_first()
            href = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            item['title'] = name

            # Request the detail page and pass the partially filled item along in meta
            yield scrapy.Request(url=href, callback=self.get_detail, meta={'item': item})
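The spider above builds the detail URL by string concatenation, which raises a TypeError when extract_first() returns None and assumes every href is site-relative. A more defensive variant is sketched below with the standard library only. The helper name build_detail_url and the sample path are hypothetical; inside a spider, response.urljoin(href) serves the same purpose.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the spider's start URL)
BASE_URL = 'https://www.4567tv.tv/frim/index1.html'


def build_detail_url(base_url, href):
    """Resolve a scraped href against the page URL; return None if it is missing.

    Hypothetical helper for illustration, not part of the original spider.
    """
    if not href:
        return None
    # urljoin handles site-relative, page-relative and absolute hrefs alike
    return urljoin(base_url, href)


if __name__ == '__main__':
    # '/movie/1.html' is a made-up path used only to demonstrate the call
    print(build_detail_url(BASE_URL, '/movie/1.html'))
    print(build_detail_url(BASE_URL, None))
```

In parse, a None result can then be skipped with `continue` instead of letting the request construction fail.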

Reposted from: https://www.cnblogs.com/NachoLau/p/10472664.html

