Table of Contents

  • Scrapy implementation
    • Requirements
    • Analysis
      • URL analysis
      • Page structure analysis
    • Code
      • spiders
      • items
      • pipelines
      • middlewares
      • settings
      • start
  • Scrapy-Redis
    • Running the spider
    • Code
      • spiders
      • items
      • middlewares
      • settings
      • start

Note: if you just want to copy the distributed (Scrapy-Redis) code, skip straight to the end.
Environment: Python 3.6
IDE: PyCharm
Site: Fang.com (房天下)
URL: https://www.fang.com/SoufunFamily.htm
End result: local persistence as JSON files (30 MB+ of new-house data and 30 MB+ of second-hand-house data) plus a Redis database with 400,000+ records; all of it is implemented.
JSON files:
(screenshot: new-house JSON)
(screenshot: second-hand-house JSON)
(screenshot: Redis data)
Since scrapy-redis (the distributed version) only needs a few changed lines on top of a plain Scrapy project, the focus here is on the Scrapy implementation.

Scrapy implementation

Project layout: (screenshot of the project directory tree)

Requirements

  • New houses (新房)
  • Second-hand houses (二手房)

Analysis

URL analysis

  • Get the URL of every city
    • https://www.fang.com/SoufunFamily.htm
  • Build each city's new-house URL
    • Example, Beijing: https://bj.fang.com/
    • Beijing new houses: https://bj.newhouse.fang.com/house/s/ # "newhouse." is inserted between "bj." and "fang.com", and "/house/s" is appended
  • Build each city's second-hand-house URL
    • Beijing: https://bj.fang.com/
    • Beijing second-hand houses: https://bj.esf.fang.com/ # "esf." is inserted between "bj." and "fang.com"; note there is no "/house/s" suffix (a short sketch of this construction follows)
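The URL construction is a plain string substitution on each city's home-page URL. A minimal sketch of the idea (the helper name is only for illustration; the spider below does the same substitution inline):

# sketch: derive the new-house and second-hand-house URLs from a city's home-page URL
def build_city_urls(city_link):
    # e.g. city_link = 'https://bj.fang.com/'
    new_house_url = city_link.replace('fang.com', 'newhouse.fang.com/house/s')
    esf_url = city_link.replace('fang.com', 'esf.fang.com')
    return new_house_url, esf_url

print(build_city_urls('https://bj.fang.com/'))
# ('https://bj.newhouse.fang.com/house/s/', 'https://bj.esf.fang.com/')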

Page structure analysis


The rest is easier to follow in the code; walking through every XPath here would get tedious.
From testing: as long as you keep the crawl rate down there is very little anti-scraping to deal with. Go too fast and you get redirected to a captcha page; you could then solve the captchas with a coding service or an ML model (intercepting the request URL in a middleware), or switch to proxies, but the simplest option is to throttle the crawl. There is no need to finish it in a single day.

Code

spiders

# -*- coding: utf-8 -*-
import scrapy
import re
from fang.items import NewHouseItem, EsfHouseItem


class SfwSpider(scrapy.Spider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    # per-spider settings; custom_settings takes precedence over settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1,                  # base download delay
        'AUTOTHROTTLE_ENABLED': True,         # enable AutoThrottle
        'AUTOTHROTTLE_DEBUG': True,           # show AutoThrottle debug output
        'AUTOTHROTTLE_MAX_DELAY': 10,         # maximum download delay
        'DOWNLOAD_TIMEOUT': 15,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8   # limit concurrent requests to this site
    }

    # collect every city's province name plus its new-house and second-hand-house URLs
    def parse(self, response):
        trs = response.xpath('//div[@id="c02"]//tr')
        province = None
        for tr in trs:
            # first td: province, second td: cities
            tds = tr.xpath('.//td[not(@class)]')
            # the province cell may be empty
            province_text = tds[0].xpath('.//text()').extract_first()
            # strip all whitespace
            province_text = re.sub(r'\s', '', province_text)
            # a non-empty value starts a new province; an empty one belongs to the previous province
            if province_text:
                province = province_text
            # skip overseas listings
            if province == '其它':
                continue
            city_list = tds[1].xpath('.//a')
            for city in city_list:
                city_name = city.xpath('./text()').extract_first()
                city_link = city.xpath('./@href').extract_first()
                # build the new-house URL
                city_link_new = city_link.replace('fang.com', 'newhouse.fang.com/house/s')
                # build the second-hand-house URL
                city_link_esf = city_link.replace('fang.com', 'esf.fang.com')
                # callback handles the response; meta passes data on to that callback
                yield scrapy.Request(url=city_link_new, callback=self.parse_newhouse,
                                     meta={'info': [province, city_name]})
                yield scrapy.Request(url=city_link_esf, callback=self.parse_esfhouse,
                                     meta={'info': [province, city_name]})

    # parse a new-house listing page
    def parse_newhouse(self, response):
        province, city_name = response.meta['info']
        li_list = response.xpath('//div[@id="newhouse_loupai_list"]//li[not(@style)]')
        for li in li_list:
            try:
                house_name = li.xpath('.//div[@class="nlcd_name"]/a/text()').extract_first().strip()
            except AttributeError:
                house_name = ''
            rooms_area_list = li.xpath('.//div[contains(@class,"house_type")]//text()').extract()
            # strip whitespace and join, giving e.g. 1居/2居/3居-35~179平米
            # this map/lambda is used repeatedly and could be factored into a helper
            rooms_area = ''.join(list(map(lambda x: re.sub(r'\s', '', x), rooms_area_list)))
            # if the text is not room information, store an empty list instead
            if '居' not in rooms_area:
                rooms_area = []
            else:
                # make the format more readable
                rooms_area = rooms_area.replace(r'-', '/总面积:')
            address = li.xpath('.//div[@class="address"]/a/@title').extract_first()
            try:
                district = li.xpath('.//div[@class="address"]/a//text()').extract()
                # the district appears as bracketed text, e.g. [怀来] [门头沟]
                district = list(map(lambda x: re.sub(r'\s', '', x), district))[1][1:-1]
            except IndexError:
                district = ''
            sale = li.xpath('.//div[@class="fangyuan"]/span/text()').extract_first()
            price = li.xpath('.//div[@class="nhouse_price"]//text()').extract()
            price = ''.join(list(map(lambda x: re.sub(r'\s', '', x), price)))
            # response.urljoin completes a partial URL:
            # //feicuigongyuan.fang.com/ becomes https://feicuigongyuan.fang.com/; complete URLs are left untouched
            house_link_url = response.urljoin(li.xpath('.//div[@class="nlcd_name"]/a/@href').extract_first())
            phone = li.xpath('.//div[@class="tel"]/p/text()').extract_first()
            item = NewHouseItem(province=province, city_name=city_name, house_name=house_name,
                                price=price, rooms_area=rooms_area, address=address,
                                district=district, sale=sale,
                                house_link_url=house_link_url, phone=phone)
            yield item
        # get the next page URL
        # on the last few pages "next" actually points back to the previous page; could be improved
        next_url = response.urljoin(response.xpath('.//div[@class="page"]//a[@class="next"]/@href').extract_first())
        # follow pagination
        yield scrapy.Request(url=next_url, callback=self.parse_newhouse,
                             meta={'info': [province, city_name]})

    # parse a second-hand-house listing page
    def parse_esfhouse(self, response):
        # print(response.url)
        province, city_name = response.meta['info']
        dl_list = response.xpath('//div[@class="shop_list shop_list_4"]/dl[not(@dataflag="bgcomare")]')
        for dl in dl_list:
            house_name = dl.xpath('.//p[@class="add_shop"]/a/@title').extract_first()
            address = dl.xpath('.//p[@class="add_shop"]/span/text()').extract_first()
            try:
                price = dl.xpath('.//dd[@class="price_right"]/span[1]//text()').extract()
                price = price[1] + price[2]
            except IndexError:
                price = ''
            try:
                unit = dl.xpath('.//dd[@class="price_right"]/span[2]/text()').extract_first().strip()
            except AttributeError:
                unit = ''
            house_link_url = response.urljoin(dl.xpath('.//h4[@class="clearfix"]/a/@href').extract_first())
            infos = dl.xpath('.//p[@class="tel_shop"]/text()').extract()
            try:
                infos = list(map(lambda x: re.sub(r'\s', '', x), infos))
                # drop the few listings that do not follow the usual format
                if '厅' not in infos[0] or len(infos) != 7:
                    continue
                for info in infos:
                    if '厅' in info:
                        rooms = info
                    elif '层' in info:
                        floor = info
                    elif '向' in info:
                        orientation = info
                    elif '㎡' in info:
                        area = info
                    elif '建' in info:
                        year = info
                item = EsfHouseItem(province=province, city_name=city_name, house_name=house_name,
                                    address=address, price=price, unit=unit, rooms=rooms,
                                    floor=floor, area=area, year=year, orientation=orientation,
                                    house_link_url=house_link_url)
                yield item
            except (IndexError, UnboundLocalError):
                continue
        # follow pagination
        next_url = response.urljoin(response.xpath('.//div[@class="page_al"]/p[1]/a/@href').extract_first())
        # print(next_url)
        yield scrapy.Request(url=next_url, callback=self.parse_esfhouse,
                             meta={'info': [province, city_name]})

items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city name
    city_name = scrapy.Field()
    # development / complex name
    house_name = scrapy.Field()
    # price
    price = scrapy.Field()
    # rooms and floor area
    rooms_area = scrapy.Field()
    # address
    address = scrapy.Field()
    # district
    district = scrapy.Field()
    # sale status
    sale = scrapy.Field()
    # phone number
    phone = scrapy.Field()
    # Fang.com detail page URL
    house_link_url = scrapy.Field()


class EsfHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city name
    city_name = scrapy.Field()
    # complex name
    house_name = scrapy.Field()
    # address
    address = scrapy.Field()
    # total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # rooms
    rooms = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # floor area
    area = scrapy.Field()
    # year built
    year = scrapy.Field()
    # orientation
    orientation = scrapy.Field()
    # Fang.com detail page URL
    house_link_url = scrapy.Field()

pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

from fang.items import NewHouseItem


class FangPipeline(object):
    def open_spider(self, spider):
        self.new_f = open('new_house.json', 'w', encoding='utf-8')
        self.esf_f = open('esf_house.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # items built from NewHouseItem go to new_house.json,
        # everything else goes to esf_house.json
        # isinstance(obj, cls) returns True if obj is an instance of cls;
        # item.__class__.__name__ would give the class name as a string
        if isinstance(item, NewHouseItem):
            self.new_f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        else:
            self.esf_f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.esf_f.close()
        self.new_f.close()

middlewares

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from fake_useragent import UserAgent
from scrapy import signals
from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from scrapy.http import HtmlResponse
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError


# User-Agent middleware; a proxy could also be added here
class UserangentDemoDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = UserAgent().random
        return None

    def process_response(self, request, response, spider):
        return response


# exception-handling middleware
class ProcessAllExceptionMiddleware(object):
    ALL_EXCEPTIONS = (defer.TimeoutError, TimeoutError, DNSLookupError,
                      ConnectionRefusedError, ConnectionDone, ConnectError,
                      ConnectionLost, TCPTimedOutError, ResponseFailed,
                      IOError, TunnelError)

    def process_response(self, request, response, spider):
        # catch responses with 40x/50x status codes
        if str(response.status).startswith('4') or str(response.status).startswith('5'):
            # only log here; the spider could instead be given a replacement response
            # or a corrected URL, but bad statuses are rare enough that interception is enough
            print(response.status)
            print(response.url)
            pass
        # all other status codes pass through untouched
        return response
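As the comment on the first middleware says, a proxy could be wired into the same process_request hook. A minimal, hypothetical sketch (this project does not actually use proxies; the pool below is a placeholder, and the class would still need to be registered in DOWNLOADER_MIDDLEWARES):

import random

# hypothetical proxy middleware, not part of the project above
class RandomProxyDownloaderMiddleware(object):
    # placeholder pool; fill in real proxy addresses if you go this route
    PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None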

settings

# -*- coding: utf-8 -*-

# Scrapy settings for fang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fang'

SPIDER_MODULES = ['fang.spiders']
NEWSPIDER_MODULE = 'fang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36 Edg/84.0.522.11'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'fang.middlewares.UserangentDemoDownloaderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserangentDemoDownloaderMiddleware': 100,
    'fang.middlewares.ProcessAllExceptionMiddleware': 80
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'fang.pipelines.FangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start

# -*- coding: utf-8 -*-
from scrapy import cmdline
# running this file saves typing the crawl command on the command line every time
cmdline.execute('scrapy crawl sfw'.split())

Scrapy-Redis

This section assumes you can already connect to the Redis database remotely.
(You can also spin up a few virtual machines to try it out, but note that even if it runs, all the VMs still share one machine's bandwidth: the distributed crawl works, but the crawl speed will not improve much.)

Turning a Scrapy project into a Scrapy-Redis project only takes the following three changes:

  1. Change the spider class from scrapy.Spider to scrapy_redis.spiders.RedisSpider (or from scrapy.CrawlSpider to scrapy_redis.spiders.RedisCrawlSpider); a small sketch follows the settings block below.
  2. Replace the spider's start_urls with redis_key = 'XXX'. This redis_key is what you later use in Redis to start the crawl: the spider's first URL is pushed through this key.
  3. Add the following to the settings file:
# Use scrapy-redis as the item pipeline
ITEM_PIPELINES = {
    # 'fang.pipelines.FangPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# Scrapy-Redis settings
# Dedup container class: request fingerprints are stored in a Redis set, so request
# deduplication is persisted (this is the filter scrapy_redis ships with, which makes
# sense since the scheduler is shared through scrapy_redis anyway)
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Use the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Whether the scheduler persists its state, i.e. whether the request queue and the
# fingerprint set in Redis are kept when the spider finishes. True keeps them, which
# gives incremental crawling: URLs already crawled are skipped on the next run and
# only new data is fetched.
SCHEDULER_PERSIST = True
# Redis connection settings
REDIS_HOST = 'XXXXXX'                 # IP of the Redis server used for storage
REDIS_PORT = XXXX                     # port
REDIS_PARAMS = {'password': 'XXXX'}   # omit this line if Redis has no password
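Changes 1 and 2 touch only the top of the spider; a minimal sketch, mirroring the full spider listed later (including its redis_key):

from scrapy_redis.spiders import RedisSpider

class SfwSpider(RedisSpider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    # start_urls is removed; the first URL is pushed into Redis under this key
    # start_urls = ['https://www.fang.com/SoufunFamily.htm']
    redis_key = 'sfw:start_url'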

Running the spider

  1. On each crawler machine, cd into the directory that contains the spider file and run: scrapy runspider [spider file name]
  2. On the Redis server, push an initial URL to start the crawl: redis-cli> lpush [redis_key] start_url (a concrete example follows)
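For this project, with redis_key = 'sfw:start_url', the push would be lpush sfw:start_url https://www.fang.com/SoufunFamily.htm. The same seeding can be done from Python with redis-py; a rough sketch (host, port and password are placeholders for your own Redis server):

# seed the crawl from Python instead of redis-cli
import redis

r = redis.Redis(host='127.0.0.1', port=6379, password=None)
r.lpush('sfw:start_url', 'https://www.fang.com/SoufunFamily.htm')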

代码

spiders

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy_redis.spiders import RedisSpider
from fang.items import NewHouseItem, EsfHouseItem


class SfwSpider(RedisSpider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    # start_urls = ['https://www.fang.com/SoufunFamily.htm']
    redis_key = 'sfw:start_url'

    # per-spider settings; custom_settings takes precedence over settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1,                  # base download delay
        'AUTOTHROTTLE_ENABLED': True,         # enable AutoThrottle
        'AUTOTHROTTLE_DEBUG': True,           # show AutoThrottle debug output
        'AUTOTHROTTLE_MAX_DELAY': 10,         # maximum download delay
        'DOWNLOAD_TIMEOUT': 15,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8   # limit concurrent requests to this site
    }

    # collect every city's province name plus its new-house and second-hand-house URLs
    def parse(self, response):
        trs = response.xpath('//div[@id="c02"]//tr')
        province = None
        for tr in trs:
            # first td: province, second td: cities
            tds = tr.xpath('.//td[not(@class)]')
            # the province cell may be empty
            province_text = tds[0].xpath('.//text()').extract_first()
            # strip all whitespace
            province_text = re.sub(r'\s', '', province_text)
            # a non-empty value starts a new province; an empty one belongs to the previous province
            if province_text:
                province = province_text
            # skip overseas listings
            if province == '其它':
                continue
            city_list = tds[1].xpath('.//a')
            for city in city_list:
                city_name = city.xpath('./text()').extract_first()
                city_link = city.xpath('./@href').extract_first()
                # build the new-house URL
                city_link_new = city_link.replace('fang.com', 'newhouse.fang.com/house/s')
                # build the second-hand-house URL
                city_link_esf = city_link.replace('fang.com', 'esf.fang.com')
                # callback handles the response; meta passes data on to that callback
                yield scrapy.Request(url=city_link_new, callback=self.parse_newhouse,
                                     meta={'info': [province, city_name]})
                yield scrapy.Request(url=city_link_esf, callback=self.parse_esfhouse,
                                     meta={'info': [province, city_name]})

    # parse a new-house listing page
    def parse_newhouse(self, response):
        province, city_name = response.meta['info']
        li_list = response.xpath('//div[@id="newhouse_loupai_list"]//li[not(@style)]')
        for li in li_list:
            try:
                house_name = li.xpath('.//div[@class="nlcd_name"]/a/text()').extract_first().strip()
            except AttributeError:
                house_name = ''
            rooms_area_list = li.xpath('.//div[contains(@class,"house_type")]//text()').extract()
            # strip whitespace and join, giving e.g. 1居/2居/3居-35~179平米
            # this map/lambda is used repeatedly and could be factored into a helper
            rooms_area = ''.join(list(map(lambda x: re.sub(r'\s', '', x), rooms_area_list)))
            # if the text is not room information, store an empty list instead
            if '居' not in rooms_area:
                rooms_area = []
            else:
                # make the format more readable
                rooms_area = rooms_area.replace(r'-', '/总面积:')
            address = li.xpath('.//div[@class="address"]/a/@title').extract_first()
            try:
                district = li.xpath('.//div[@class="address"]/a//text()').extract()
                # the district appears as bracketed text, e.g. [怀来] [门头沟]
                district = list(map(lambda x: re.sub(r'\s', '', x), district))[1][1:-1]
            except IndexError:
                district = ''
            sale = li.xpath('.//div[@class="fangyuan"]/span/text()').extract_first()
            price = li.xpath('.//div[@class="nhouse_price"]//text()').extract()
            price = ''.join(list(map(lambda x: re.sub(r'\s', '', x), price)))
            # response.urljoin completes a partial URL:
            # //feicuigongyuan.fang.com/ becomes https://feicuigongyuan.fang.com/; complete URLs are left untouched
            house_link_url = response.urljoin(li.xpath('.//div[@class="nlcd_name"]/a/@href').extract_first())
            phone = li.xpath('.//div[@class="tel"]/p/text()').extract_first()
            item = NewHouseItem(province=province, city_name=city_name, house_name=house_name,
                                price=price, rooms_area=rooms_area, address=address,
                                district=district, sale=sale,
                                house_link_url=house_link_url, phone=phone)
            yield item
        # get the next page URL
        # on the last few pages "next" actually points back to the previous page; could be improved
        next_url = response.urljoin(response.xpath('.//div[@class="page"]//a[@class="next"]/@href').extract_first())
        # follow pagination
        yield scrapy.Request(url=next_url, callback=self.parse_newhouse,
                             meta={'info': [province, city_name]})

    # parse a second-hand-house listing page
    def parse_esfhouse(self, response):
        # print(response.url)
        province, city_name = response.meta['info']
        dl_list = response.xpath('//div[@class="shop_list shop_list_4"]/dl[not(@dataflag="bgcomare")]')
        for dl in dl_list:
            house_name = dl.xpath('.//p[@class="add_shop"]/a/@title').extract_first()
            address = dl.xpath('.//p[@class="add_shop"]/span/text()').extract_first()
            try:
                price = dl.xpath('.//dd[@class="price_right"]/span[1]//text()').extract()
                price = price[1] + price[2]
            except IndexError:
                price = ''
            try:
                unit = dl.xpath('.//dd[@class="price_right"]/span[2]/text()').extract_first().strip()
            except AttributeError:
                unit = ''
            house_link_url = response.urljoin(dl.xpath('.//h4[@class="clearfix"]/a/@href').extract_first())
            infos = dl.xpath('.//p[@class="tel_shop"]/text()').extract()
            try:
                infos = list(map(lambda x: re.sub(r'\s', '', x), infos))
                # drop the few listings that do not follow the usual format
                if '厅' not in infos[0] or len(infos) != 7:
                    continue
                for info in infos:
                    if '厅' in info:
                        rooms = info
                    elif '层' in info:
                        floor = info
                    elif '向' in info:
                        orientation = info
                    elif '㎡' in info:
                        area = info
                    elif '建' in info:
                        year = info
                item = EsfHouseItem(province=province, city_name=city_name, house_name=house_name,
                                    address=address, price=price, unit=unit, rooms=rooms,
                                    floor=floor, area=area, year=year, orientation=orientation,
                                    house_link_url=house_link_url)
                yield item
            except (IndexError, UnboundLocalError):
                continue
        # follow pagination
        next_url = response.urljoin(response.xpath('.//div[@class="page_al"]/p[1]/a/@href').extract_first())
        # print(next_url)
        yield scrapy.Request(url=next_url, callback=self.parse_esfhouse,
                             meta={'info': [province, city_name]})

items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city name
    city_name = scrapy.Field()
    # development / complex name
    house_name = scrapy.Field()
    # price
    price = scrapy.Field()
    # rooms and floor area
    rooms_area = scrapy.Field()
    # address
    address = scrapy.Field()
    # district
    district = scrapy.Field()
    # sale status
    sale = scrapy.Field()
    # phone number
    phone = scrapy.Field()
    # Fang.com detail page URL
    house_link_url = scrapy.Field()


class EsfHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city name
    city_name = scrapy.Field()
    # complex name
    house_name = scrapy.Field()
    # address
    address = scrapy.Field()
    # total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # rooms
    rooms = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # floor area
    area = scrapy.Field()
    # year built
    year = scrapy.Field()
    # orientation
    orientation = scrapy.Field()
    # Fang.com detail page URL
    house_link_url = scrapy.Field()

middlewares

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from fake_useragent import UserAgent
from scrapy import signals
from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from scrapy.http import HtmlResponse
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError


# User-Agent middleware; a proxy could also be added here
class UserangentDemoDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = UserAgent().random
        return None

    def process_response(self, request, response, spider):
        return response


# exception-handling middleware
class ProcessAllExceptionMiddleware(object):
    ALL_EXCEPTIONS = (defer.TimeoutError, TimeoutError, DNSLookupError,
                      ConnectionRefusedError, ConnectionDone, ConnectError,
                      ConnectionLost, TCPTimedOutError, ResponseFailed,
                      IOError, TunnelError)

    def process_response(self, request, response, spider):
        # catch responses with 40x/50x status codes
        if str(response.status).startswith('4') or str(response.status).startswith('5'):
            # only log here; the spider could instead be given a replacement response
            # or a corrected URL, but bad statuses are rare enough that interception is enough
            print(response.status)
            print(response.url)
            pass
        # all other status codes pass through untouched
        return response

settings

# -*- coding: utf-8 -*-

# Scrapy settings for fang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fang'

SPIDER_MODULES = ['fang.spiders']
NEWSPIDER_MODULE = 'fang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36 Edg/84.0.522.11'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'fang.middlewares.UserangentDemoDownloaderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fang.middlewares.UserangentDemoDownloaderMiddleware': 100,
    'fang.middlewares.ProcessAllExceptionMiddleware': 80,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Use scrapy-redis as the item pipeline
ITEM_PIPELINES = {
    # 'fang.pipelines.FangPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Scrapy-Redis settings
# Dedup container class: request fingerprints are stored in a Redis set, so request
# deduplication is persisted (this is the filter scrapy_redis ships with, which makes
# sense since the scheduler is shared through scrapy_redis anyway)
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Use the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Whether the scheduler persists its state, i.e. whether the request queue and the
# fingerprint set in Redis are kept when the spider finishes. True keeps them, which
# gives incremental crawling: URLs already crawled are skipped on the next run and
# only new data is fetched.
SCHEDULER_PERSIST = True
# Redis connection settings
REDIS_HOST = 'XXXX'
REDIS_PORT = XXXX
REDIS_PARAMS = {'password': 'XXXX'}

start

# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute('scrapy crawl sfw'.split())
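With RedisPipeline the scraped items accumulate in a Redis list, by default keyed '<spider name>:items', so 'sfw:items' here. A rough sketch of draining them into a local JSON-lines file with redis-py, assuming that default key (connection details are placeholders):

# drain items stored by scrapy_redis.pipelines.RedisPipeline into a local file
import json
import redis

r = redis.Redis(host='127.0.0.1', port=6379, password=None)
with open('fang_items.json', 'w', encoding='utf-8') as f:
    while True:
        data = r.lpop('sfw:items')   # each entry is a serialized JSON item
        if data is None:
            break
        f.write(json.dumps(json.loads(data), ensure_ascii=False) + '\n')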

If this helped, consider giving the GitHub repo a star (it contains a few other crawler projects and is still being updated): https://github.com/programday/crawler. Thanks!
Or leave a like ( •̀ ω •́ )✧; questions are welcome in the comments.
