The Scrapy crawler framework

1. Installing Scrapy

  • Installation on Windows

pip install scrapy
  • Installation on Linux

yum -y install scrapy
vim ~/.bashrc
alias scrapy="/home/user1/python3/bin/scrapy"
source ~/.bashrc
# Run the commands above one step at a time
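If yum cannot find a scrapy package (which is common, since Scrapy is normally distributed through PyPI rather than the system repositories), installing it with pip under Python 3 is a reasonable fallback; the interpreter path used in the alias above is only an example and should match your own installation:

pip3 install scrapy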

2. Creating a Scrapy project to crawl Lianjia rentals

  • Open the directory where the project should be created, hold the Shift key, right-click, and choose "Open command window here"

scrapy startproject <project_name>

  • Open the newly created project in PyCharm

cd spiders
scrapy genspider lianjiazufang https://sh.lianjia.com/zufang/
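After genspider runs, the generated lianjiazufang.py is roughly the following skeleton (the exact template varies slightly between Scrapy versions); the full implementation is filled in later in this post.

import scrapy


class LianjiazufangSpider(scrapy.Spider):
    name = 'lianjiazufang'
    allowed_domains = ['sh.lianjia.com']
    start_urls = ['https://sh.lianjia.com/zufang/']

    def parse(self, response):
        pass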

  • Running the Scrapy project

Method 1: run the crawl command directly in the terminal

scrapy crawl lianjiazufang

Method 2: create a main.py (typically next to scrapy.cfg) and simply run that script

from scrapy import cmdline
cmdline.execute("scrapy crawl lianjiazufang".split())
  • Details of the Lianjia crawler project

Overall structure
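The project layout after the two commands above is roughly the following (a sketch; main.py is the script added by hand in the previous step, usually placed next to scrapy.cfg):

lianjia/
├── scrapy.cfg
├── main.py
└── lianjia/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── lianjiazufang.py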

lianjiazufang.py

import scrapy
from lianjia.items import LianjiaItem
import re


class LianjiazufangSpider(scrapy.Spider):
    name = 'lianjiazufang'
    allowed_domains = ['sh.lianjia.com']
    start_urls = ['https://sh.lianjia.com/zufang/pg1/']

    def parse(self, response):
        name_item_list = response.xpath('//div[@class="content__list--item"]')
        print(response.request.headers)
        for name_item in name_item_list:
            info = {}
            # listing title
            info["content_title"] = name_item.xpath('.//div/p/a/text()').extract_first().strip()
            # detail-page url of the listing
            info["content_url"] = "https://sh.lianjia.com" + name_item.xpath('.//div/p/a/@href').extract_first()
            # listing price
            info["content_price"] = name_item.xpath('.//span[@class="content__list--item-price"]/em/text()').extract_first()
            # listing area, extracted with a regular expression (e.g. "35㎡")
            content_area_search = re.compile(r'\d+㎡')
            content_area = name_item.xpath('.//p[@class="content__list--item--des"]/text()').extract()
            info["content_area"] = content_area_search.findall(str(content_area))[0]
            # listing address
            info["content_addr"] = ''.join(name_item.xpath('.//p[@class="content__list--item--des"]/a/text()').extract())
            # use yield to send an asynchronous request via scrapy.Request;
            # pass the callback by name only (do not call the method) and carry info in meta
            yield scrapy.Request(url=info["content_url"], callback=self.handle_pic_parse, meta=info)
        # follow listing pages 2-9
        for p in range(2, 10):
            next_url = 'https://sh.lianjia.com/zufang/pg%d/' % p
            yield scrapy.Request(url=next_url, callback=self.parse)

    def handle_pic_parse(self, response):
        # print(response.request.meta)
        pic_url_list = response.xpath("//ul[@class='piclist']/li//img/@src").extract()
        for pic_item in pic_url_list:
            # print(pic_item)
            lianjia_info = LianjiaItem()
            lianjia_info['content_title'] = response.request.meta['content_title']
            lianjia_info['content_url'] = response.request.meta['content_url']
            lianjia_info['content_price'] = response.request.meta['content_price']
            lianjia_info['content_area'] = response.request.meta['content_area']
            lianjia_info['content_addr'] = response.request.meta['content_addr']
            lianjia_info['content_pic'] = pic_item
            # yield the item to the pipelines; the pipeline must be enabled in settings
            yield lianjia_info

items.py

import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # listing title
    content_title = scrapy.Field()
    # detail-page url
    content_url = scrapy.Field()
    # listing picture
    content_pic = scrapy.Field()
    # listing address
    content_addr = scrapy.Field()
    # listing price
    content_price = scrapy.Field()
    # listing area
    content_area = scrapy.Field()

main.py

from scrapy import cmdline
cmdline.execute("scrapy crawl lianjiazufang".split())

pipelines.py

import pymongo


class LianjiaPipeline:
    def __init__(self):
        # connect to the local MongoDB instance and select the target collection
        myclient = pymongo.MongoClient("mongodb://localhost:27017")
        mydb = myclient['db_lianjia']
        self.mycollection = mydb['collection_lianjia']

    def process_item(self, item, spider):
        # print('scraped item:', item)
        data = dict(item)
        self.mycollection.insert_one(data)
        return item
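To check that items really reach MongoDB, a quick standalone query can be run after a crawl (a minimal sketch; the database and collection names match the pipeline above, and MongoDB is assumed to be listening on the default local port):

import pymongo

# connect to the same database and collection used by LianjiaPipeline
client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client['db_lianjia']['collection_lianjia']

# number of stored listings and one sample document
print(collection.count_documents({}))
print(collection.find_one())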

middlewares.py

import base64
import random

from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class My_useragent(object):
    def process_request(self, request, spider):
        # pool of candidate User-Agent request headers
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
            "Opera/8.0 (Windows NT 5.1; U; en)",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
            # Firefox
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            # Safari
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            # Chrome
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
            # 360 Browser
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
            # TaoBrowser
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
            # Liebao (LBBROWSER)
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
            # QQ Browser
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
            # Sogou Browser
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
            # Maxthon Browser
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
            # UC Browser
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36"
        ]
        # pick a random User-Agent for each request
        agent = random.choice(user_agent_list)
        request.headers['User-Agent'] = agent


class My_proxy(object):
    def process_request(self, request, spider):
        # proxy address placeholder, usually in the form 'http://host:port'
        request.meta['proxy'] = 'ip:port'
        # proxy credentials placeholder ('username:password'), base64-encoded for Basic auth
        proxy_name_pass = 'username:password'.encode('utf-8')
        encode_pass_name = base64.b64encode(proxy_name_pass)
        request.headers['Proxy-Authorization'] = b'Basic ' + encode_pass_name
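A downloader middleware only takes effect once it is registered in DOWNLOADER_MIDDLEWARES. The settings file below enables only My_useragent; if the proxy middleware is needed as well, it can be added with its own priority, for example (a sketch; the priority values are only illustrative):

DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.My_useragent': 300,
    # middlewares with lower values have their process_request() called first
    'lianjia.middlewares.My_proxy': 310,
}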

settings.py

# Scrapy settings for lianjia project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lianjia'

SPIDER_MODULES = ['lianjia.spiders']
NEWSPIDER_MODULE = 'lianjia.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lianjia (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lianjia.middlewares.LianjiaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'lianjia.middlewares.LianjiaDownloaderMiddleware': 543,
    'lianjia.middlewares.My_useragent': 300,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
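With the middleware and pipeline enabled in these settings, the crawl is started exactly as in step 2. For a quick look at the scraped items without touching MongoDB, Scrapy's built-in feed export can also write them straight to a file (the output file name here is only an example):

scrapy crawl lianjiazufang -o items.json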
