Learning Python web scraping: crawling and storing Lianjia listings with Scrapy (with pagination)

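This post scrapes listing information from Lianjia's rental channel, following the pagination and also fetching the content of each listing's detail page.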

items.py

import scrapy


class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()           # listing title
    price = scrapy.Field()           # price
    url = scrapy.Field()             # detail-page URL
    introduce_item = scrapy.Field()  # listing description

pipelines.py

import json


class ScrapytestPipeline(object):
    # Open the output file when the spider starts
    def open_spider(self, spider):
        self.file = open('58_chuzu.txt', 'w', encoding='utf-8')
        print('File opened')

    # Write each item as one JSON object per line
    def process_item(self, item, spider):
        line = '{}\n'.format(json.dumps(dict(item), ensure_ascii=False))
        self.file.write(line)
        return item

    # Close the output file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
        print('File closed')
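Because the pipeline writes one JSON object per line, the output file can be read back line by line. The following is a minimal sketch of that, not part of the original project: the helper name `load_items` and the sample record are made up for illustration, and the file is written to a temporary directory so the snippet runs on its own.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file with one JSON object per non-empty line (hypothetical helper)."""
    items = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Made-up record, serialized the same way process_item does it.
sample = {
    'title': ['示例房源'],
    'price': ['5000'],
    'url': 'https://bj.lianjia.com/zufang/EXAMPLE.html',
    'introduce_item': [],
}
path = os.path.join(tempfile.mkdtemp(), '58_chuzu.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('{}\n'.format(json.dumps(sample, ensure_ascii=False)))

items = load_items(path)
print(items[0]['title'])
```

With `ensure_ascii=False` the Chinese text is stored as readable characters rather than `\uXXXX` escapes, which is why both the writer and the reader open the file with `encoding='utf-8'`.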

spider

import scrapy
from scrapy.http import Request

from ..items import ScrapytestItem


class SpiderCity58Spider(scrapy.Spider):
    name = 'spider_city_58'  # spider name (required)
    allowed_domains = ['lianjia.com']
    start_urls = ['https://bj.lianjia.com/zufang/']

    def parse(self, response):
        # Extract the listings on the page
        info_list = response.xpath('//*[@id="content"]/div[1]/div[1]/div')
        for i in info_list:
            item = ScrapytestItem()
            item['title'] = i.xpath('normalize-space(./div/p[1]/a/text())').extract()
            item['price'] = i.xpath('./div/span/em/text()').extract()
            url = i.xpath('./div/p[1]/a/@href').extract_first()
            # Skip entries without a link. Test the raw href: joining an
            # empty href just returns the list-page URL, which is never empty.
            if not url:
                continue
            # Turn the relative address into an absolute one
            item['url'] = response.urljoin(url)
            # Request the detail page
            yield Request(
                item['url'],
                callback=self.detail_parse,
                meta={'item': item},  # meta takes a dict; it hands the item to detail_parse()
                priority=10,
                dont_filter=True,
            )

        # Pagination: build the URLs for pages 2 to 4
        for page in range(2, 5):
            url = 'https://bj.lianjia.com/zufang/pg{}/'.format(page)
            yield Request(url, callback=self.parse)

    # Extract the detail-page information
    def detail_parse(self, response):
        item = response.meta['item']
        item['introduce_item'] = response.xpath('//*[@id="desc"]/ul/li/p[1]/text()').extract()
        return item
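The spider turns each relative `href` into an absolute address with `response.urljoin`. As far as I know that method resolves links the same way as the standard library's `urljoin`, so the behavior can be sketched without Scrapy. The helper name `absolute_url` and the `EXAMPLE.html` path are made up for illustration. One detail worth knowing: joining an empty `href` returns the base URL itself, so an "is the URL empty?" check is only meaningful on the raw `href`, before joining.

```python
from urllib.parse import urljoin

BASE = 'https://bj.lianjia.com/zufang/'


def absolute_url(base_url, href):
    """Return an absolute URL, or None when the listing has no link (hypothetical helper)."""
    if not href:
        return None
    return urljoin(base_url, href)


detail = absolute_url(BASE, '/zufang/EXAMPLE.html')
missing = absolute_url(BASE, None)
joined_empty = urljoin(BASE, '')  # an empty href resolves to the base URL

print(detail)
print(missing)
print(joined_empty)
```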

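I originally planned to scrape 58.com, but I have not yet been able to get past its anti-scraping measures, so I switched to Lianjia.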

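So far I have scraped 4 pages, which shows that the approach works. What remains is to add code for things like timing control (throttling the request rate).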
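One possible way to add the timing control is through Scrapy's built-in throttling settings rather than custom code. The fragment below is a sketch of what could go into the project's `settings.py`; the setting names are standard Scrapy settings, but the values are illustrative guesses and have not been tuned against lianjia.com.

```python
# settings.py fragment (illustrative values, not tuned for any real site)

# Wait about 2 seconds between requests to the same site.
DOWNLOAD_DELAY = 2
# Vary the actual wait around DOWNLOAD_DELAY instead of using a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True
# Send at most one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30
```

Note that the detail-page requests in the spider use `priority=10`, so they are scheduled ahead of the pagination requests; the delay applies to all of them equally.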
