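Prerequisites

The Scrapy crawling framework.
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
For a detailed introduction, see: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html#

Basic Python syntax.
For a detailed introduction, see: https://www.runoob.com/python3/python3-tutorial.html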

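Abstract

This post collects 3000 deal records for the Pingguoyuan area of Shijingshan District, Beijing, from Lianjia, in preparation for later data cleaning and machine learning.
It uses only Scrapy's basic features and none of the advanced ones, so readers who want to study the Scrapy library in depth should look elsewhere.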

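Main Text

Creating the Scrapy project

Writing the items code

The item defines the fields recorded for each deal: deal date, price, layout, area, and so on.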

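from scrapy import Item, Field


class HomelinkItem(Item):
    # define the fields for your item here like:
    deal_time = Field()          # deal date
    deal_totalPrice = Field()    # total deal price
    deal_unitPrice = Field()     # deal price per unit area
    household_style = Field()    # layout (rooms, halls, kitchens, bathrooms)
    gross_area = Field()         # gross floor area
    usable_area = Field()        # usable (interior) area
    house_orientation = Field()  # orientation
    floor_number = Field()       # floor
    build_year = Field()         # year built
    year_of_property = Field()   # property right term
    with_elevator = Field()      # elevator available
    house_usage = Field()        # property usage
    is_two_five = Field()        # held for two / five years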

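Writing the Spider code

The spider that collects the deal data has four methods:
start_requests: builds the URL of the first deal-listing page for each region.
parse: reads the total number of listing pages and iterates over all of them.
parse_sale: iterates over every listing link on one page.
parse_content: parses the data from a single listing page.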

import json

import scrapy
from lxml import etree
from scrapy import Request

from ..items import HomelinkItem  # assumes the default Scrapy project layout


class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['http://bj.lianjia.com/chengjiao/']
    regions = {'pingguoyuan1': '苹果园'}

    def start_requests(self):
        for region in self.regions:
            url = "https://bj.lianjia.com/chengjiao/" + region + "/"
            # The first page is requested only to read the page count.
            yield Request(url=url, callback=self.parse, meta={'region': region})

    def parse(self, response):
        region = response.meta['region']
        selector = etree.HTML(response.text)
        # The page-data attribute holds a dict serialized as a string.
        sel = selector.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
        sel = json.loads(sel)  # convert to a dict
        total_pages = sel.get("totalPage")
        for i in range(int(total_pages)):
            url_page = "https://bj.lianjia.com/chengjiao/{}/pg{}/".format(region, i + 1)
            yield Request(url=url_page, callback=self.parse_sale)

    def parse_sale(self, response):
        selector = etree.HTML(response.text)
        # lxml returns a plain list of href strings.
        house_urls = selector.xpath("//div[@class='content']//div[@class='title']//a/@href")
        for house_url in house_urls:
            yield Request(url=house_url, callback=self.parse_content)

    def parse_content(self, response):
        item = HomelinkItem()
        # Deal date. All matches are joined, so a page with several dated
        # records yields several dates run together in one string.
        item["deal_time"] = ''.join(
            response.xpath("//section//p[@class='record_detail']/text()").re(r"\d{4}[-]\d{2}[-]\d{2}"))
        # Total deal price
        item["deal_totalPrice"] = response.xpath("//section//span/i/text()").extract_first()
        # Deal price per unit area
        item["deal_unitPrice"] = response.xpath("//section//div[@class='price']/b/text()").extract_first()
        # Other deal details. response.xpath returns Scrapy selector objects,
        # unlike lxml's selector.xpath, so extract the strings once here.
        deal_info = response.xpath("//section//ul/li/text()").extract()
        item["household_style"] = deal_info[0].strip()     # layout
        item["gross_area"] = deal_info[2].strip()          # gross floor area
        item["usable_area"] = deal_info[4].strip()         # usable area
        item["house_orientation"] = deal_info[6].strip()   # orientation
        item["build_year"] = deal_info[7].strip()          # year built
        item["floor_number"] = deal_info[1].strip()        # floor
        item["year_of_property"] = deal_info[12].strip()   # property right term
        item["with_elevator"] = deal_info[13].strip()      # elevator available
        item["house_usage"] = deal_info[17].strip()        # property usage
        item["is_two_five"] = deal_info[18].strip()        # held for two / five years
        yield item
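
The pagination step in parse can be tried on its own, without Scrapy. The sketch below assumes the page-data attribute holds a JSON string with a totalPage key, as the spider code implies; the sample attribute value and the helper name build_page_urls are illustrative and not part of the original project.

```python
import json


def build_page_urls(region, page_data):
    """Return the URL of every deal-listing page for one region.

    `page_data` is the raw string read from the `page-data` attribute.
    """
    total_pages = int(json.loads(page_data)["totalPage"])
    base = "https://bj.lianjia.com/chengjiao/{}/pg{}/"
    # Listing pages are numbered from 1.
    return [base.format(region, page) for page in range(1, total_pages + 1)]


# Hypothetical sample value of the page-data attribute.
sample_page_data = '{"totalPage": 3, "curPage": 1}'
urls = build_page_urls("pingguoyuan1", sample_page_data)
print(urls[0])
print(len(urls))
```

Keeping the URL construction in a small pure function like this makes it possible to check the page numbering before sending any requests.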

Running the program

In run.py, import cmdline from scrapy, call cmdline.execute("scrapy crawl lianjia -o linajia.csv".split()), and run the script.

Sample of the collected data:

build_year,deal_time,deal_totalPrice,deal_unitPrice,floor_number,gross_area,house_orientation,house_usage,household_style,is_two_five,usable_area,with_elevator,year_of_property
1999,2019-03-022012-11-25,269,40801,高楼层(共7层),65.93㎡,南 北,普通住宅,1室1厅1厨1卫,满五年,58.15㎡,无,70年
1994,2019-03-02,359,41876,顶层(共16层),85.73㎡,东 南 北,普通住宅,3室1厅1厨1卫,满两年,暂无数据,有,70年
1997,2019-03-02,296,50651,中楼层(共16层),58.44㎡,东 南,普通住宅,2室1厅1厨1卫,暂无数据,暂无数据,有,70年
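
The first sample row has two dates run together in deal_time (2019-03-022012-11-25), which is what joining every regex match produces when a page lists more than one dated record. As a hedged sketch of the later cleaning step, the helpers below split such a value back into separate dates and turn area strings into numbers. The function names, and the choice to map values without a number (such as 暂无数据) to None, are assumptions rather than part of the original project.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def split_dates(raw):
    """Return every YYYY-MM-DD date found in a possibly concatenated string."""
    return DATE_PATTERN.findall(raw)


def parse_area(raw):
    """Convert a value such as '65.93㎡' to a float, or None if it has no number."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


dates = split_dates("2019-03-022012-11-25")
print(dates)
print(parse_area("58.15㎡"))
print(parse_area("暂无数据"))
```

Which of the recovered dates is the deal of interest depends on the page layout, so that choice is left to the cleaning stage.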

In total, 3000 deal records were collected.

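Conclusion

This post used Scrapy to collect Lianjia's deal data, relying only on Scrapy's basic features.
The data is intended for later data cleaning and machine learning, so advanced Scrapy usage is not explored in depth here.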

Reference

https://www.cnblogs.com/cnkai/p/7404972.html
