Environment

1. First, install Python 3 and Scrapy in a Linux environment

Download and extract the archive Python-3.x.x.tgz, where 3.x.x is the version you downloaded:

# tar -zxvf Python-3.6.1.tgz
# cd Python-3.6.1
# ./configure
# make && make install
# python3 -V
Python 3.6.1

Add Python to the environment variables (see https://www.runoob.com/python3/python3-install.html for details), then install Scrapy:

pip3 install scrapy
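To confirm the installation, you can check the Scrapy version (the exact number printed depends on what pip installed):

# scrapy version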
Create a new project:
scrapy startproject lj_sz

2. A brief overview of what the generated directories and files are for (a typical layout is sketched after the list):

scrapy.cfg: the project's top-level configuration file; it usually needs no changes.
lj_sz: the project's Python module; the program imports its Python code from here.
lj_sz/items.py: defines the Item classes used by the project. An Item class is essentially a DTO (data transfer object) that declares a set of fields, and you define it yourself.
lj_sz/pipelines.py: the project's pipeline file, responsible for processing the scraped data; you write it yourself.
lj_sz/settings.py: the project's settings file, where project-level configuration lives.
lj_sz/spiders: the directory that holds the project's spiders; the spiders do the actual crawling of the pages you are interested in.
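For reference, the layout created by scrapy startproject lj_sz should look roughly like this (middlewares.py is also generated; it is edited in step 7):

lj_sz/
├── scrapy.cfg
└── lj_sz/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py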

3. Define the Item class in items.py; it only declares the fields the project will scrape:

import scrapy


class LjSzItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    total_price = scrapy.Field()
    unit_price = scrapy.Field()
    trade_time = scrapy.Field()
    region = scrapy.Field()
    location = scrapy.Field()
    url = scrapy.Field()
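An Item behaves like a dict, which is how the spider below fills it in; a quick sanity check in a Python shell (the values here are made up):

from lj_sz.items import LjSzItem

item = LjSzItem()
item['title'] = ['some listing title']   # made-up value
item['total_price'] = ['500']            # made-up value
print(dict(item))                        # {'title': ['some listing title'], 'total_price': ['500']}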

4. Write the Spider class

Scrapy provides the scrapy genspider command for creating spiders; its syntax is:

scrapy genspider [options] <name> <domain>

In a terminal, change into the lj_sz project directory and run the following command to create a spider:

scrapy genspider sz "sz.lianjia.com"

After running the command, you will find a sz.py file under the project's lj_sz/spiders directory. Edit sz.py:

import scrapy
from lj_sz.items import LjSzItem
from bs4 import BeautifulSoup
import sys
import json
import re


class SzSpider(scrapy.Spider):
    name = 'sz'
    allowed_domains = ['sz.lianjia.com']
    start_urls = [
        'https://sz.lianjia.com/chengjiao/luohuqu/pg1',
        'https://sz.lianjia.com/chengjiao/futianqu/pg1',
        'https://sz.lianjia.com/chengjiao/nanshanqu/pg1',
        'https://sz.lianjia.com/chengjiao/yantianqu/pg1',
        'https://sz.lianjia.com/chengjiao/baoanqu/pg1',
        'https://sz.lianjia.com/chengjiao/longgangqu/pg1',
        'https://sz.lianjia.com/chengjiao/longhuaqu/pg1',
        'https://sz.lianjia.com/chengjiao/guangmingqu/pg1',
        'https://sz.lianjia.com/chengjiao/pingshanqu/pg1',
        'https://sz.lianjia.com/chengjiao/dapengxinqu/pg1',
    ]

    def parse(self, response):
        # print(response.request.url)
        for li in response.xpath('/html/body/div[5]/div[1]/ul//li'):
            # create a fresh item per listing so the detail requests do not share state
            item = LjSzItem()
            item['region'] = [self.getRegion(response)]
            url = li.xpath('./div/div[@class="title"]/a/@href').extract_first()
            print(url)
            if url:
                # request the detail page
                yield scrapy.Request(url, callback=self.detail_parse, meta={"item": item})

        # crawl the next page recursively
        new_links = response.xpath('//div[contains(@page-data, "totalPage")]/@page-data').extract()
        totalPage = json.loads(new_links[0])['totalPage']
        nowPage = json.loads(new_links[0])['curPage']
        # print('页数情况', totalPage)
        print('当前------------------------------页', nowPage)
        print()
        if nowPage < totalPage:
            now_url = response.request.url
            urlList = now_url.split('/pg')
            # next_url = 'https://sz.lianjia.com/chengjiao/dapengxinqu/pg' + str(nowPage+1) + '/'
            next_url = urlList[0] + '/pg' + str(nowPage + 1) + '/'
            yield scrapy.Request(next_url,
                                 meta={'dont_redirect': True,
                                       'handle_httpstatus_list': [301, 302]},
                                 callback=self.parse)

    def getRegion(self, response):
        regionList = ['luohuqu', 'futianqu', 'nanshanqu', 'yantianqu', 'baoanqu',
                      'longgangqu', 'longhuaqu', 'guangmingqu', 'pingshanqu', 'dapengxinqu']
        regionMap = {
            'luohuqu': '罗湖区',
            'futianqu': '福田区',
            'nanshanqu': '南山区',
            'yantianqu': '盐田区',
            'baoanqu': '宝安区',
            'longgangqu': '龙岗区',
            'longhuaqu': '龙华区',
            'guangmingqu': '光明区',
            'pingshanqu': '坪山区',
            'dapengxinqu': '大鹏新区',
        }
        for region in regionList:
            if region in response.request.url:
                return regionMap[region]
        return None

    # parse the detail page
    def detail_parse(self, response):
        # pick up the data already scraped from the list page
        item = response.meta['item']
        # extract data from the detail page
        originHtml = response.xpath("/html/body/script[11]/text()").extract()[0]
        originHtml = str(originHtml)
        location = re.findall(r"resblockPosition:'(.*)'", originHtml)
        item['location'] = location
        item['total_price'] = response.xpath('/html/body/section[1]/div[2]/div[2]/div[1]/span/i/text()').extract()
        item['title'] = response.xpath('/html/body/div[4]/div/text()').extract()
        item['unit_price'] = response.xpath('/html/body/section[1]/div[2]/div[2]/div[1]/b/text()').extract()
        item['trade_time'] = response.xpath('/html/body/div[4]/div/span/text()').extract()
        item['url'] = [response.request.url]
        # crawl a second-level inner page
        # yield scrapy.Request(item['url'] + "&123", meta={'item': item}, callback=self.detail_parse2)
        # if a lower-level page is crawled, comment out this yield
        yield item
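The next-page logic above relies on the page-data attribute that Lianjia puts on the pager div: the XPath pulls that attribute out and json.loads turns it into a dict with totalPage and curPage keys. A minimal sketch of what the attribute value is assumed to look like (the numbers here are hypothetical):

import json

# hypothetical value of the page-data attribute extracted by the XPath above
page_data = '{"totalPage":100,"curPage":1}'
info = json.loads(page_data)
print(info['totalPage'], info['curPage'])  # 100 1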

5. Modify the pipeline class

This class performs the final processing of the scraped items; typically it writes the scraped data to a file or a database (a variant that reuses a single connection per crawl is sketched after the table definition below).

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pandas as pd
import pymysql


class LjSzPipeline:
    def process_item(self, item, spider):
        scrapyData = []
        # connect to the database
        conn = pymysql.connect('127.0.0.1', '***', '******')
        # select the database
        conn.select_db('ectouch')
        cur = conn.cursor()
        sql = "insert into lj_sz_scrapy (city_desc, region, title, trade_time, total_price, total_unit, unit_price, unit_unit, location, url) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.getData(item, scrapyData)
        print(scrapyData)
        try:
            # execute the SQL statement
            insert = cur.executemany(sql, scrapyData)
            print('批量插入返回受影响的行数:', insert)
            # commit to the database
            conn.commit()
        except:
            # roll back on error
            conn.rollback()
            print('错误')
        conn.close()
        # print("title:", item['title'])
        # print("url:", item['url'])
        # print("total_price:", item['total_price'])
        # print("unit_price:", item['unit_price'])
        # print("trade_time:", item['trade_time'])
        # print("region:", item['region'])
        # print("location:", item['location'])
        print('=============' * 10)
        return item

    def getData(self, item, scrapyData):
        df = pd.DataFrame({
            "city_desc": '深圳',
            "region": item['region'],
            "title": item["title"],
            "trade_time": item["trade_time"],
            "total_price": item["total_price"],
            "total_unit": '万',
            "unit_price": item['unit_price'],
            "unit_unit": '元/平',
            "location": item['location'],
            "url": item['url'],
        })

        def reshape(r):
            scrapyData.append(tuple(r))

        df.apply(reshape, axis=1)
        return


# MySQL table definition
# CREATE TABLE `lj_sz_scrapy` (
# `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
# `city_desc` varchar(155) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '城市',
# `region` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '区',
# `title` varchar(155) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '标题',
# `trade_time` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '交易时间',
# `total_price` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '0.00' COMMENT '总价',
# `total_unit` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '总价单位',
# `unit_price` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '0.00' COMMENT '单价',
# `unit_unit` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT '单价单位',
# `location` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT 'location',
# `url` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT 'url',
# `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
# `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
# `x` decimal(10,5) NOT NULL DEFAULT '0.00000' COMMENT 'x',
# `y` decimal(10,5) NOT NULL DEFAULT '0.00000' COMMENT 'y',
# PRIMARY KEY (`id`),
# KEY `title` (`title`) USING BTREE
# ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; 
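The pipeline in step 5 opens a new MySQL connection for every item. A common refinement, not part of the original code, is to open one connection per crawl with the pipeline's open_spider/close_spider hooks; a minimal sketch, keeping the same table and credential placeholders:

import pymysql


class LjSzPipeline:
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = pymysql.connect(host='127.0.0.1', user='***', password='******', db='ectouch')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = ("insert into lj_sz_scrapy (city_desc, region, title, trade_time, total_price,"
               " total_unit, unit_price, unit_unit, location, url)"
               " values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        # the spider stores every field as a single-element list
        row = ('深圳', item['region'][0], item['title'][0], item['trade_time'][0],
               item['total_price'][0], '万', item['unit_price'][0], '元/平',
               item['location'][0], item['url'][0])
        self.cur.execute(sql, row)
        self.conn.commit()
        return item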

6. Modify settings.py

ROBOTSTXT_OBEY = False  # change this to False

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DOWNLOAD_DELAY = 0.5  # be considerate, do not crawl too fast

DOWNLOADER_MIDDLEWARES = {
    'lj_sz.middlewares.LjSzDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'lj_sz.pipelines.LjSzPipeline': 300,
}

HTTPERROR_ALLOWED_CODES = [301]

7. Use a proxy so your IP does not get banned

Buy a KuaiDaiLi tunnel proxy: https://www.kuaidaili.com/usercenter/tps/

import base64

import urllib3
from scrapy import signals


class LjSzDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # proxy settings: tunnel proxy purchased from KuaiDaiLi
    proxy = 'tps156.kdlapi.com:15818'
    user_password = 'XXXXXXXXX:XXXX'
    b64_user_password = base64.b64encode(user_password.encode('utf-8'))
    proxyAuth = 'Basic ' + b64_user_password.decode('utf-8')

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # request.headers['User-Agent'] = random.choice(self.user_agents)  # needs a user_agents list, not defined here

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # suppress the warning about disabled certificate verification
        urllib3.disable_warnings()
        request.meta['proxy'] = self.proxy
        request.headers['Proxy-Authorization'] = self.proxyAuth
        # return request
        # return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
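Before wiring the tunnel proxy into Scrapy, it can help to sanity-check the credentials outside the framework. A minimal sketch with requests, reusing the same placeholder host and username:password as above (replace them with your own):

import requests

proxy = 'tps156.kdlapi.com:15818'
user_password = 'XXXXXXXXX:XXXX'
proxies = {
    'http': 'http://%s@%s' % (user_password, proxy),
    'https': 'http://%s@%s' % (user_password, proxy),
}
resp = requests.get('https://sz.lianjia.com/chengjiao/', proxies=proxies, timeout=10)
print(resp.status_code)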

8. Run

scrapy crawl sz
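If you just want to eyeball the scraped fields before involving MySQL, Scrapy's built-in feed export can also write the items to a file (comment LjSzPipeline out of ITEM_PIPELINES if you do not want the database writes to run at the same time):

scrapy crawl sz -o sz.csv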

9. Results

Learning sources:

https://blog.csdn.net/qq_41837900/article/details/96489994?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control

https://blog.csdn.net/weixin_30776273/article/details/96193833?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control

Dynamic IP: https://www.kuaidaili.com/usercenter/overview/

KuaiDaiLi tunnel proxy docs: https://www.kuaidaili.com/doc/api/

GitHub source: git@github.com:drt9527/Python3-Scrapy.git
