Continuing the Scrapy review, let's scrape Lianjia's second-hand housing listings. The main Scrapy features involved are:

  • CrawlSpider

  • Rule

  • LinkExtractor

  • MySQL data storage

  • Downloading the property images

  • A quick look at Lianjia's second-hand housing site

  • The detail page

  • Pagination (at most 100 pages can be crawled)

  • The approach:

  • First collect the links to the housing detail pages

  • Visit each detail page and extract the key fields

  • Find the image links

  • Store the information in the database

  • Download the images and save them locally

  • Move on to the next page

  • All in all, Lianjia has basically no anti-scraping measures; adding a set of request headers is enough.

  • Straight to the code.

  • The spider file

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
from lianjia.items import LianjiaItem


class LjCrwalerSpider(CrawlSpider):
    name = 'lj_crawler'
    start_urls = ['https://qd.lianjia.com/ershoufang/']

    # Crawl rules
    rules = (
        # Links to property detail pages
        Rule(LinkExtractor(restrict_xpaths="//ul[@class='sellListContent']/li/div[@class='info clear']/div[@class='title']/a"),
             follow=True, callback="process_item"),
        # Pagination links
        Rule(LinkExtractor(restrict_xpaths="//div[@class='pagination_group_a']/a"), follow=True),
    )

    def process_item(self, response):
        item = LianjiaItem()
        # Extract the key fields
        item['title'] = response.css('title::text').extract_first()
        item['price'] = response.css('div.overview div.content > div.price > span.total::text').extract_first()
        item['unit_price'] = response.css('div.overview div.content > div.price span.unitPriceValue::text').extract_first()
        item['community_name'] = response.css('div.overview div.content > div.aroundInfo > div.communityName > a::text').extract_first()
        item['region'] = response.css('div.areaName span.info a::text').extract()
        item['linkman'] = response.xpath('//div[@class="brokerInfoText fr"]/div[@class="brokerName"]/a/text()').extract_first()
        item['linktel'] = response.xpath('//div[@class="brokerInfoText fr"]/div[@class="phone"]/text()').extract()
        item['type'] = response.css('#introduction div.base ul > li:first-child::text').extract_first()
        item['construction_area'] = response.css('#introduction div.base ul > li:nth-child(3)::text').extract_first()
        item['actual_area'] = response.css('#introduction div.base ul > li:nth-child(5)::text').extract_first()
        item['orientation'] = response.css('#introduction div.base ul > li:nth-child(7)::text').extract_first()
        item['decoration'] = response.css('#introduction div.base ul > li:nth-child(9)::text').extract_first()
        item['floor'] = response.css('#introduction div.base ul > li:nth-child(2)::text').extract_first()
        item['elevator'] = response.css('#introduction div.base ul > li:nth-child(12)::text').extract_first()
        item['property'] = response.css('#introduction div.base ul > li:nth-child(13)::text').extract_first()
        item['house_years'] = response.css('#introduction div.transaction li:nth-child(5) span:nth-child(2)::text').extract_first()
        item['mortgage'] = (response.css('#introduction div.transaction li:nth-child(7) span:nth-child(2)::text').extract_first() or '').strip()
        item['purposes'] = response.css('#introduction div.transaction ul > li:nth-child(4) span:nth-child(2)::text').extract_first()
        item['release_date'] = response.css('#introduction div.transaction ul > li:first-child span:nth-child(2)::text').extract_first()
        item['image_urls'] = response.css('div.content-wrapper img::attr(src)').extract()
        item['from_url'] = response.url
        yield item
  • The items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Title
    title = scrapy.Field()
    # Total price
    price = scrapy.Field()
    # Unit price
    unit_price = scrapy.Field()
    # Community name
    community_name = scrapy.Field()
    # District
    region = scrapy.Field()
    # Contact person
    linkman = scrapy.Field()
    # Contact phone
    linktel = scrapy.Field()
    # Floor plan / house type
    type = scrapy.Field()
    # Construction area
    construction_area = scrapy.Field()
    # Usable (actual) area
    actual_area = scrapy.Field()
    # Orientation
    orientation = scrapy.Field()
    # Decoration status
    decoration = scrapy.Field()
    # Floor
    floor = scrapy.Field()
    # Elevator
    elevator = scrapy.Field()
    # Property ownership term
    property = scrapy.Field()
    # Years of ownership
    house_years = scrapy.Field()
    # Mortgage status
    mortgage = scrapy.Field()
    # Intended use of the house
    purposes = scrapy.Field()
    # Listing date
    release_date = scrapy.Field()
    # Property photos
    image_urls = scrapy.Field()
    # Listing URL
    from_url = scrapy.Field()
  • The pipelines.py file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import hashlib
import pymysql
from scrapy.exceptions import DropItem
from urllib.request import urlretrieve
from scrapy.utils.python import to_bytes
import os


class LianjiaPipeline(object):
    def __init__(self, settings):
        self.host = settings.get('HOST')
        self.port = settings.get('PORT')
        self.user = settings.get('USER')
        self.passwd = settings.get('PASSWD')
        self.db = settings.get('DB')
        self.charset = settings.get('CHARSET')
        self.table = settings.get('TABLE')
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    passwd=self.passwd, db=self.db, charset=self.charset)
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def save_data(self, item):
        '''Store one item in MySQL.'''
        keys = ', '.join(item.keys())
        values = ', '.join(['%s'] * len(item.keys()))
        insert_sql = "insert into `{}`({}) values({})".format(self.table, keys, values)
        try:
            self.cursor.execute(insert_sql, tuple(item.values()))
            self.conn.commit()
        except Exception as e:
            print(e.args)
            self.conn.rollback()

    def select_data(self, item):
        '''Check whether this listing URL has already been stored (deduplication).'''
        value = item.get('from_url')
        select_sql = "select * from `{}` where from_url=%s;".format(self.table)
        try:
            self.cursor.execute(select_sql, (value,))
            res = self.cursor.fetchall()
            return bool(res)
        except Exception as e:
            print(e.args)
            return False

    def process_item(self, item, spider):
        item['linktel'] = '-'.join(item['linktel'])
        item['region'] = '/'.join(item['region'])
        item['image_urls'] = ','.join(item['image_urls'])
        if not self.select_data(item):
            self.save_data(item)
        return item


class ImageDownloadPipeline(object):
    def __init__(self, settings):
        self.imagepath = settings.get('IMAGES_STORE')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_item(self, item, spider):
        '''Download the listing's images into a per-house folder.'''
        for image in item['image_urls'].split(','):
            # Name the image by the SHA1 hash of its URL
            image_guid = hashlib.sha1(to_bytes(image)).hexdigest()
            image_name = '%s.jpg' % image_guid
            house_id = item['from_url'].split('/')[-1].replace('.html', '')
            file_path = '%s/%s' % (self.imagepath, house_id)
            if not os.path.exists(file_path):
                os.makedirs(file_path)
            image_path = '%s/%s/%s' % (self.imagepath, house_id, image_name)
            # Skip images that have already been downloaded
            if not os.path.exists(image_path):
                urlretrieve(image, image_path)
        return item
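For reference, LianjiaPipeline assumes the target table already exists. Below is a minimal sketch (not from the original post) of creating it with pymysql, using the column names from LianjiaItem plus an auto-increment id; the column types are assumptions and can be tightened as needed:

# Sketch only: create the MySQL table used by LianjiaPipeline (column types are assumptions)
import pymysql

FIELDS = ['title', 'price', 'unit_price', 'community_name', 'region', 'linkman',
          'linktel', 'type', 'construction_area', 'actual_area', 'orientation',
          'decoration', 'floor', 'elevator', 'property', 'house_years',
          'mortgage', 'purposes', 'release_date', 'image_urls', 'from_url']

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       passwd='123456', db='mycrawler', charset='UTF8')
columns = ', '.join('`{}` text'.format(field) for field in FIELDS)
create_sql = 'create table if not exists `lianjia` (id int primary key auto_increment, {})'.format(columns)
with conn.cursor() as cursor:
    cursor.execute(create_sql)
conn.commit()
conn.close()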
  • The settings.py file
# -*- coding: utf-8 -*-

# Scrapy settings for lianjia project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lianjia'

SPIDER_MODULES = ['lianjia.spiders']
NEWSPIDER_MODULE = 'lianjia.spiders'

HOST = '127.0.0.1'
PORT = 3306
USER = 'root'
PASSWD = '123456'
DB = 'mycrawler'
CHARSET = 'UTF8'
TABLE = 'lianjia'

IMAGES_STORE = 'C:/Users/wang/Desktop/lianjia/lianjia/images'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lianjia (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lianjia.middlewares.LianjiaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'lianjia.middlewares.LianjiaDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
    'lianjia.pipelines.ImageDownloadPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  • The run.py file
# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl lj_crawler".split())
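Instead of shelling out through cmdline, the spider can also be started in-process with Scrapy's CrawlerProcess; a minimal alternative sketch (same project settings assumed):

# Sketch only: run the spider programmatically instead of via cmdline
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('lj_crawler')  # the spider name defined on LjCrwalerSpider
process.start()              # blocks until the crawl is finished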
  • Results

# Summary:

  • For the image download I originally wanted to use Scrapy's built-in ImagesPipeline, but I could not work out how to store the images in separate directories: IMAGES_STORE seems to accept only a single fixed path. How can a folder be created dynamically for each property id? If anyone knows, please share how the pipeline's methods should be overridden (a tentative sketch is given after this list).
  • This crawl covers listings in Qingdao; for the whole country, start_urls can be extended accordingly (see the sketch after this list).
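On the ImagesPipeline question, one possible direction (an untested sketch, not from the original post) is to subclass scrapy.pipelines.images.ImagesPipeline, pass the house id through request.meta in get_media_requests, and override file_path so every listing gets its own sub-folder under IMAGES_STORE. The field names image_urls and from_url come from LianjiaItem; the class name HouseImagesPipeline and the rest of the details are assumptions:

# Sketch only: per-listing folders with the built-in ImagesPipeline
import hashlib
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class HouseImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # item['image_urls'] is assumed to still be a list here,
        # i.e. this pipeline runs before LianjiaPipeline joins it into a string
        house_id = item['from_url'].split('/')[-1].replace('.html', '')
        for url in item['image_urls']:
            yield Request(url, meta={'house_id': house_id})

    def file_path(self, request, response=None, info=None, *args, **kwargs):
        # Saved as <IMAGES_STORE>/<house_id>/<sha1-of-url>.jpg
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return '{}/{}.jpg'.format(request.meta['house_id'], image_guid)

To try it, this class would replace ImageDownloadPipeline in ITEM_PIPELINES, with a priority below LianjiaPipeline's 300 so the image_urls list is still intact, while IMAGES_STORE keeps pointing at the root directory.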
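On the second point, a minimal sketch of extending start_urls to more cities; Lianjia uses one subdomain per city, and the city codes listed below are assumptions to be verified:

# Sketch only: build start_urls for several cities (city codes are assumptions)
cities = ['qd', 'bj', 'sh', 'gz']  # Qingdao, Beijing, Shanghai, Guangzhou
start_urls = ['https://{}.lianjia.com/ershoufang/'.format(city) for city in cities]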
