July Algorithm (七月算法) course "Python Crawlers", Lesson 5: Several Ways to Crawl with a Scrapy Spider

This lesson introduces the Scrapy crawling framework and focuses on one of its components, the spider.

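A spider can crawl in several ways:

1. Crawl a single page
2. Build links from a given list and crawl multiple pages
3. Find the "next page" link and follow it
4. Follow the links on a page and crawl the pages they point to

An example of each is given below.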

1. Crawl a single page

    # by 寒小阳 (hanxiaoyang.ml@gmail.com)
    import scrapy


    class JulyeduSpider(scrapy.Spider):
        name = "julyedu"
        start_urls = [
            'https://www.julyedu.com/category/index',
        ]

        def parse(self, response):
            # Each course on the page sits in a div with class "course_info_box"
            for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
                # Print the fields for a quick look while the spider runs
                print(julyedu_class.xpath('a/h4/text()').extract_first())
                print(julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first())
                print(julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first())
                print(response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()))
                print("\n")

                # Yield the same fields as one item
                yield {
                    'title': julyedu_class.xpath('a/h4/text()').extract_first(),
                    'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),
                    'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),
                    'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()),
                }
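To try the "one container, several relative paths" idea without touching the network, here is a small sketch that runs similar queries on a hand-written snippet. The markup and its values are made up for illustration and are not the real page source. The standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, stands in for Scrapy's selectors, so treat it as an approximation of what `response.xpath` does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that imitates the course list; not the real page source.
SAMPLE = """
<html><body>
  <div class="course_info_box">
    <a href="/course/1">
      <img src="/img/1.png"/>
      <h4>Course A</h4>
      <p class="course-info-tip">Intro to A</p>
      <p class="course-info-tip">Starts in May</p>
    </a>
  </div>
  <div class="course_info_box">
    <a href="/course/2">
      <img src="/img/2.png"/>
      <h4>Course B</h4>
      <p class="course-info-tip">Intro to B</p>
      <p class="course-info-tip">Starts in June</p>
    </a>
  </div>
</body></html>
"""


def extract_courses(markup):
    """Select every course box, then read each field relative to that box."""
    root = ET.fromstring(markup.strip())
    courses = []
    for box in root.findall(".//div[@class='course_info_box']"):
        tips = box.findall("a/p[@class='course-info-tip']")
        courses.append({
            "title": box.findtext("a/h4"),
            "desc": tips[0].text,
            "time": tips[1].text,
            "img_url": box.find("a/img").get("src"),
        })
    return courses


courses = extract_courses(SAMPLE)
for course in courses:
    print(course)
```

The outer query picks the repeating container and the inner queries are relative to it, which is the same shape as the `parse` method of the spider.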

2. Build links from a given list and crawl multiple pages

    # by 寒小阳 (hanxiaoyang.ml@gmail.com)
    import scrapy


    class CnBlogSpider(scrapy.Spider):
        name = "cnblogs"
        allowed_domains = ["cnblogs.com"]
        # Build one start URL per page number, 1 through 10
        start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

        def parse(self, response):
            # Each post on the page sits in a div with class "post_item"
            for article in response.xpath('//div[@class="post_item"]'):
                # Print the fields for a quick look while the spider runs
                print(article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip())
                print(response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip())
                print(article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip())
                print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip())
                print(response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip())
                print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip())
                print(article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip())
                print("")

                # Yield the same fields as one item
                yield {
                    'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip(),
                    'link': response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip(),
                    'summary': article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip(),
                    'author': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip(),
                    'author_link': response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip(),
                    'comment': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip(),
                    'view': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip(),
                }
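It can help to look at what the URL pattern in `start_urls` actually produces before running the spider. The sketch below (Python 3 assumed) expands the same pattern and then strips the fragment, the part after `#`, from each URL with the standard library's `urldefrag`. The variable name `requested` is introduced here only for illustration.

```python
from urllib.parse import urldefrag

# Same pattern as the spider's start_urls.
start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

# A URL fragment is handled by the client and is not sent to the server,
# so dropping it shows which resources are really being requested.
requested = {urldefrag(url).url for url in start_urls}

print(start_urls[0])
print(start_urls[-1])
print(len(start_urls), "start URLs,", len(requested), "distinct resource(s) requested")
```

Because the page number sits in the fragment, all ten URLs point at the same resource as far as the server is concerned. If the crawl returns the same posts ten times, a likely remedy is to look for a URL form that carries the page number in the path or query string and use that in the pattern instead.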

3. Find the "next page" link and follow it

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]

        def parse(self, response):
            # Yield one item per quote on the current page
            for quote in response.xpath('//div[@class="quote"]'):
                yield {
                    'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                    'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
                }

            # The "next" list item wraps an <a> tag; read its href attribute
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page is not None:
                # Turn the relative link into an absolute URL and parse it the same way
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
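The spider passes the link through `response.urljoin` before requesting it because "next page" links are often relative. As a rough illustration, the standard library's `urljoin` follows the same resolution rules when given the page URL as the base. The `href` values and the helper name `resolve` below are made up for the example.

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/tag/humor/'


def resolve(base, href):
    """Resolve an href found on a page against that page's URL."""
    return urljoin(base, href)


# Hypothetical href values of the kinds a "next page" link might carry.
examples = [
    '/tag/humor/page/2/',                  # root-relative
    'page/2/',                             # relative to the current path
    'http://quotes.toscrape.com/page/3/',  # already absolute
]

for href in examples:
    print(href, '->', resolve(page_url, href))
```

An absolute link passes through unchanged, so calling the join unconditionally is safe.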

4. Follow the links on a page and crawl the pages they point to

    # by 寒小阳 (hanxiaoyang.ml@gmail.com)
    import scrapy


    class QQNewsSpider(scrapy.Spider):
        name = 'qqnews'
        start_urls = ['http://news.qq.com/society_index.shtml']

        def parse(self, response):
            # Index page: request every article link and hand it to parse_question
            for href in response.xpath('//*[@id="news"]/div/div/div/div/em/a/@href'):
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            # Article page: print the fields for a quick look, then yield them as one item
            print(response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first())
            print(response.xpath('//span[@class="a_time"]/text()').extract_first())
            print(response.xpath('//span[@class="a_catalog"]/a/text()').extract_first())
            print("\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()))
            print("")

            yield {
                'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
                'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
                'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
                'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
            }
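The key idea in this pattern is that one callback yields requests and a second callback, named in each request, yields the items. The toy model below shows that hand-off with an in-memory "site" and a plain queue. It is only a sketch of the control flow, not how Scrapy's engine is implemented, and the `PAGES` data, the `Request` class, and the function names are all invented for the example.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to already-parsed page data.
PAGES = {
    "/index": {"links": ["/news/1", "/news/2"]},
    "/news/1": {"title": "First story"},
    "/news/2": {"title": "Second story"},
}


class Request:
    """Minimal stand-in for a crawl request: a URL plus the callback to run on it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url, page):
    # Index page: emit one request per article link.
    for href in page["links"]:
        yield Request(href, callback=parse_article)


def parse_article(url, page):
    # Article page: emit the scraped item.
    yield {"url": url, "title": page["title"]}


def crawl(start_url):
    """Process requests in order; queue new requests and collect items."""
    queue = deque([Request(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        page = PAGES[request.url]  # a real crawler would download the page here
        for result in request.callback(request.url, page):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("/index")
print(items)
```

Anything a callback yields is either more work (a request) or output (an item), which is why the spider's `parse` and `parse_question` can both be written as simple generators.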
