Main idea:

1. Send GET requests that carry the request headers and cookie copied from a Chrome session where I was already logged in (this reproduces a logged-in session without the hassle of scripting the username/password login flow).
2. Fetch each page and extract the needed fields with regular expressions.
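As a minimal sketch of the regex-extraction idea above (the HTML fragment is a made-up stand-in, not actual 5i5j markup):

```python
import re

# Toy HTML shaped like a listing page; the real pages embed 9-digit
# listing ids in links of the form /rent/<id>
html = '<a href="/rent/123456789">A</a> ... <a href="/rent/987654321">B</a>'

# findall with one capture group returns just the captured ids
ids = re.findall(r'<a href="/rent/(\d{9})"', html)
print(ids)  # ['123456789', '987654321']
```

The same pattern, applied to the real page HTML, is what the crawler below relies on.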

The concrete procedure has three steps:

1. GET pages 1 through 100 of the rental listings; each page's URL follows url_1='http://bj.5i5j.com/rent/n%d'. From each page, extract the listing ids.
2. Use each listing id to open that listing's page and scrape its price, layout, area, orientation, floor, and community name ('价格', '户型', '面积', '朝向', '楼层', '小区名称').
3. Save the scraped information to 5i5j_house_info.xlsx.

To do:

1. Fetching each page costs a long I/O wait, so a later version will put the listing ids into a queue and use a lock plus multiple threads to speed up the crawl.

2. 5i5j (我爱我家) currently has no anti-scraping measures; if it adds any, counters such as proxy IPs, multiple browsers, or multiple accounts can be used.
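The queue-plus-threads plan in item 1 might look like this sketch, where fetch() is a stand-in for the real per-listing network scrape (not shown here):

```python
import threading
import queue

task_q = queue.Queue()        # listing ids waiting to be scraped
results = []
lock = threading.Lock()       # guards the shared results list

def fetch(house_id):
    # stand-in for the real network fetch + regex parse of one listing
    return house_id.upper()

def worker():
    while True:
        try:
            house_id = task_q.get_nowait()
        except queue.Empty:
            return            # queue drained, worker exits
        info = fetch(house_id)
        with lock:
            results.append(info)

for hid in ('a1', 'b2', 'c3'):
    task_q.put(hid)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['A1', 'B2', 'C3']
```

Because each worker blocks on I/O independently, N workers can overlap N network waits, which is where the speed-up comes from.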
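If anti-scraping measures do appear, rotating proxies and User-Agents per request is one common counter. A sketch (the addresses and UA strings below are placeholders, not working values):

```python
import random

# placeholder pools; real proxy addresses / UA strings would go here
proxy_pool = [
    {'http': 'http://10.0.0.1:8080'},
    {'http': 'http://10.0.0.2:8080'},
]
ua_pool = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) ...',
]

def pick_identity():
    # a fresh (proxy, header) pair per request makes blocking a single IP/UA harder
    return random.choice(proxy_pool), {'User-Agent': random.choice(ua_pool)}

proxies, headers = pick_identity()
# each real request would then be:
#   requests.get(url, headers=headers, proxies=proxies)
print('http' in proxies and 'User-Agent' in headers)  # True
```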

Results: (screenshot omitted in this text version)

The full code:

# get_5i5j_house_info
import requests
import re
url_1='http://bj.5i5j.com/rent/n%d'
# request headers copied from the browser; used to simulate a logged-in session
httphead='''
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:suid=8715039984; BIGipServer=3647539722.20480.0000; PHPSESSID=a2a4l07dfeejokb2k2u1hgt7c6; yfx_c_g_u_id_10000001=_ck17101612002011032445773465255; renthistorys=%5B%7B%22id%22%3A%22166475545%22%2C%22imgurl%22%3A%22house%5C%2F3768%5C%2F37688980%5C%2Fshinei%5C%2Ffadhfope9b890c26.jpg%22%2C%22housetitle%22%3A%22%5Cu7802%5Cu8f6e%5Cu5382%5Cu5bbf%5Cu820d+2%5Cu5ba41%5Cu53851%5Cu536b%22%2C%22parentareaname%22%3A%22%5Cu6e05%5Cu6cb3%22%2C%22buildarea%22%3A%2263.85%22%2C%22hallhouse%22%3A%222%5Cu5ba41%5Cu5385%22%2C%22districtname%22%3A%22%5Cu6d77%5Cu6dc0%22%2C%22conmmunityname%22%3A%22%5Cu7802%5Cu8f6e%5Cu5382%5Cu5bbf%5Cu820d%22%2C%22price%22%3A%222200%22%2C%22onePrice%22%3A344558%7D%2C%7B%22id%22%3A%22166968941%22%2C%22imgurl%22%3A%22house%5C%2F3741%5C%2F37414743%5C%2Fshinei%5C%2Fnhecpeho0a3e23b4.jpg%22%2C%22housetitle%22%3A%22%5Cu8d22%5Cu5927%5Cu5bb6%5Cu5c5e%5Cu9662+2%5Cu5ba41%5Cu53851%5Cu536b%22%2C%22parentareaname%22%3A%22%5Cu4e0a%5Cu5730%22%2C%22buildarea%22%3A%2210%22%2C%22hallhouse%22%3A%222%5Cu5ba41%5Cu5385%22%2C%22districtname%22%3A%22%5Cu6d77%5Cu6dc0%22%2C%22conmmunityname%22%3A%22%5Cu8d22%5Cu5927%5Cu5bb6%5Cu5c5e%5Cu9662%22%2C%22price%22%3A%222000%22%2C%22onePrice%22%3A2000000%7D%2C%7B%22id%22%3A%22171185475%22%2C%22imgurl%22%3Anull%2C%22housetitle%22%3A%22%5Cu6c38%5Cu65fa%5Cu5bb6%5Cu56ed+2%5Cu5ba41%5Cu53851%5Cu536b%22%2C%22parentareaname%22%3A%22%5Cu4e0a%5Cu5730%22%2C%22buildarea%22%3A%2230%22%2C%22hallhouse%22%3A%222%5Cu5ba41%5Cu5385%22%2C%22districtname%22%3A%22%5Cu6d77%5Cu6dc0%22%2C%22conmmunityname%22%3A%22%5Cu6c38%5Cu65fa%5Cu5bb6%5Cu56ed%22%2C%22price%22%3A%221600%22%2C%22onePrice%22%3A533333%7D%2C%7B%22id%22%3A%22166983380%22%2C%22imgurl%22%3A%22house%5C%2F3772%5C%2F37726408%5C%2Fshinei%5C%2Foahhomjoe708d412.jpg%22%2C%22housetitle%22%3A%22%5Cu767e%5Cu65fa%5Cu5bb6%5Cu82d1+4%5Cu5ba41%5Cu53852%5Cu536b%22%2C%22parentareaname%22%3A%22%5Cu4e0a%5Cu5730%22%2C%22buildarea%22%3A%22140%22%2C%22hallhouse%22%3A%224%5Cu5ba41%5Cu5385%22%2C%22districtname%22%3A%22%5Cu6d77%5Cu6dc0%22%2C%22conmmunityname%22%3A%22%5Cu767e%5Cu65fa%5Cu5bb6%5Cu82d1%22%2C%22price%22%3A%222000%22%2C%22onePrice%22%3A142857%7D%5D; searchHistorys=%5B%7B%22name%22%3A%22%5Cu6d77%5Cu6dc0%22%2C%22spell%22%3A%22haidian%22%2C%22level%22%3A3%2C%22id%22%3A%225%22%7D%2C%7B%22name%22%3A%22nanwu%22%2C%22spell%22%3A%22%22%2C%22level%22%3A1%2C%22id%22%3A%22%22%7D%2C%7B%22name%22%3A%22%5Cu4e0a%5Cu5730%22%2C%22spell%22%3A%22shangdi%22%2C%22level%22%3A4%2C%22id%22%3A%2236854%22%7D%5D; yfx_f_l_v_t_10000001=f_t_1508126420098__r_t_1509969433602__v_t_1509969433602__r_c_1; __utmt=1; __utmt_t2=1; _va_ref=%5B%22%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98%22%2C%22%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98%22%2C1509969435%2C%22http%3A%2F%2Fbzclk.baidu.com%2Fadrc.php%3Ft%3D06KL00c00f7SfKC0mn7m0KFRQ00NH6Kp00000F9_U7b000000TGQTM.THYdpHNJcQMuVeLPSPyS0A3qmh7GuZR0T1dhuyN9P1mkn10snjubuywW0ZRqPWD1wHbznHcdnb7KfW0sn16zwW04PWbsn1wAwHRsrHn0mHdL5iuVmv-b5Hnsn1TvnWcvn1fhTZFEuA-b5HDv0ARqpZwYTjCEQvFJQWNGPyC8mvqVQ1qdIAdxTvqdThP-5yF9pywdFMNYUNqVuywGIyYqTZKlTiudIAdxIANzUHY-uHR-rHn-rjD-uHf-mW6-rHn-uHm-mH0-rjT-uHb-mHc-rH6hIgwVgvPEUMw-UMfqFyRdFHb1FH6kFyRYFyc3FHb1FyRvFyDsFH6LFyR4FyDzFHb3FMNYUNqWmydsmy-MUWY-uHR-rHn-rjD-uHf-mW6-rHn-uHm-mH0-rjT-uHb-mHc-rH6hUAVdUHYzPsKWThnqnHDvn1T%26tpl%3Dtpl_10144_15654_11145%26l%3D1500277912%26attach%3Dlocation%3D%26linkName%3D%25E6%25A0%2587%25E9%25A2%2598%26linkText%3D%25E6%2588%2591%25E7%2588%25B1%25E6%2588%2591%25E5%25AE%25B6%25EF%25BC%258C%25E5%2585%25A8%25E5%25BF%2583%25E5%2585%25A8%25E6%2584%258F%25E6%2589%25BE%25E6%2588%25BF%25EF%25BC%258C%25E7%259C%259F%25E5%25BF%2583%25E5%25AE%259E%25E6%2584%258F%26xp%3Did(%2522m501af8ab%2522)%252FDIV%255B1%255D%252FDIV%255B1%255D%252FDIV%255B1%255D%252FDIV%255B1%255D%252FH2%255B1%255D%252FA%255B1%255D%26linkType%3D%26checksum%3D211%26ie%3Dutf-8%26f%3D3%26tn%3Dbaidu%26wd%3D5i5j%20%E5%AE%98%E6%96%B9%E7%BD%91%E7%AB%99%26oq%3D525j%26rqlang%3Dcn%26inputT%3D8167%26rsp%3D0%22%5D; __utma=1.68281274.1508126420.1508130935.1509969435.3; __utmb=1.5.10.1509969435; __utmc=1; __utmz=1.1509969435.3.3.utmcsr=baidu|utmccn=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98|utmcmd=ppzq|utmctr=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98|utmcct=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98; __utma=228451417.694811314.1508126420.1508130935.1509969435.3; __utmb=228451417.5.10.1509969435; __utmc=228451417; __utmz=228451417.1509969435.3.3.utmcsr=baidu|utmccn=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98|utmcmd=ppzq|utmctr=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98|utmcct=%E5%93%81%E4%B8%93%E6%A0%87%E9%A2%98; _va_id=9086dafceef2e930.1508126421.3.1509969793.1509969435.; _va_ses=*; Hm_lvt_0bccd3f0d70c2d02eb727b5add099013=1508126420,1508130935,1509969434; Hm_lpvt_0bccd3f0d70c2d02eb727b5add099013=1509969793; Hm_lvt_fbfca6a323fa396dde12616e37bc1df9=1508126420,1508130935,1509969434; Hm_lpvt_fbfca6a323fa396dde12616e37bc1df9=1509969793; Hm_lvt_b3ad53a84ea4279d8124cc28d3c3220f=1508126420,1508130935,1509969434; Hm_lpvt_b3ad53a84ea4279d8124cc28d3c3220f=1509969793; _pzfxuvpc=1508126420423%7C1062778321885559810%7C33%7C1509969793161%7C3%7C7558331484128782074%7C4424313154138490995; _pzfxsvpc=4424313154138490995%7C1509969434144%7C5%7Chttp%3A%2F%2Fbzclk.baidu.com%2Fadrc.php%3Ft%3D06KL00c00f7SfKC0mn7m0KFRQ00NH6Kp00000F9_U7b000000TGQTM.THYdpHNJcQMuVeLPSPyS0A3qmh7GuZR0T1dhuyN9P1mkn10snjubuywW0ZRqPWD1wHbznHcdnb7KfW0sn16zwW04PWbsn1wAwHRsrHn0mHdL5iuVmv-b5Hnsn1TvnWcvn1fhTZFEuA-b5HDv0ARqpZwYTjCEQvFJQWNGPyC8mvqVQ1qdIAdxTvqdThP-5yF9pywdFMNYUNqVuywGIyYqTZKlTiudIAdxIANzUHY-uHR-rHn-rjD-uHf-mW6-rHn-uHm-mH0-rjT-uHb-mHc-rH6hIgwVgvPEUMw-UMfqFyRdFHb1FH6kFyRYFyc3FHb1FyRvFyDsFH6LFyR4FyDzFHb3FMNYUNqWmydsmy-MUWY-uHR-rHn-rjD-uHf-mW6-rHn-uHm-mH0-rjT-uHb-mHc-rH6hUAVdUHYzPsKWThnqnHDvn1T%26tpl%3Dtpl_10144_15654_11145%26l%3D1500277912%26attach%3Dlocation%253D%2526linkName%253D%2525E6%2525A0%252587%2525E9%2525A2%252598%2526linkText%253D%2525E6%252588%252591%2525E7%252588%2525B1%2525E6%252588%252591%2525E5%2525AE%2525B6%2525EF%2525BC%25258C%2525E5%252585%2525A8%2525E5%2525BF%252583%2525E5%252585%2525A8%2525E6%252584%25258F%2525E6%252589%2525BE%2525E6%252588%2525BF%2525EF%2525BC%25258C%2525E7%25259C%25259F%2525E5%2525BF%252583%2525E5%2525AE%25259E%2525E6%252584%25258F%2526xp%253Did(%252522m501af8ab%252522)%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FH2%25255B1%25255D%25252FA%25255B1%25255D%2526linkType%253D%2526checksum%253D211%26ie%3Dutf-8%26f%3D3%26tn%3Dbaidu%26wd%3D5i5j%2520%25E5%25AE%2598%25E6%2596%25B9%25E7%25BD%2591%25E7%25AB%2599%26oq%3D525j%26rqlang%3Dcn%26inputT%3D8167%26rsp%3D0; Hm_lvt_407473d433e871de861cf818aa1405a1=1508126427,1508130941,1509969440; Hm_lpvt_407473d433e871de861cf818aa1405a1=1509969798; domain=bj
Host:bj.5i5j.com
Referer:http://bj.5i5j.com/rent
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
'''
# parse the header block above into head and cookies dicts
def get_head_cookies(httphead):
    head = {}
    for i in httphead.strip().split('\n'):
        # split on the first ':' only, so values containing ':' (e.g. the
        # Referer URL) survive intact
        line = i.strip().split(':', 1)
        head[line[0]] = line[1]
    cookie = head['Cookie']
    cookies = {}
    for i in cookie.strip().split(';'):
        # likewise split on the first '=' only; cookie values may contain '='
        line = i.strip().split('=', 1)
        cookies[line[0]] = line[1]
    return head, cookies
head,cookies=get_head_cookies(httphead)
# extract the listing ids from one results page
house_id = r'<a href="/rent/([\d]{9})"'
def get_house_ids(url):
    return set(re.findall(house_id,
                          requests.get(url=url, headers=head, cookies=cookies).content.decode('utf-8')))

house_ids = set()
page_url_error = []
for i in range(1, 101):  # pages 1 through 100, as described above
    try:
        house_ids.update(get_house_ids(url_1 % i))
    except Exception:
        page_url_error.append(i)

# scrape the details of each listing
house_inf = r'''<ul class="house-info">.+?"font-price">([\d|\.]+?)</span> 元/月.+?<b>户型:</b>(.+?)&.+?<b>面积:</b>(.+?)</li>.+?<b>朝向:</b>(.+?)</li>.+?<b>楼层:</b>(.+?)</li>.+?<b>小区:</b>(.+?)\s+.+?</li>.+?</ul>'''
def get_house_inf(house_id):
    return re.findall(house_inf,
                      requests.get(url='http://bj.5i5j.com/rent/' + house_id,
                                   headers=head, cookies=cookies).content.decode('utf-8'),
                      re.S | re.M)[0]

from openpyxl import Workbook
file = 'F:\\临时工作\\1023\\5i5j_house_info.xlsx'
wb_bj = Workbook()
ws_bj = wb_bj.worksheets[0]
ws_bj.title = '房源信息表'
# a scraped row looks like ('6800', '2室1厅1卫', '50.45平米', '南', '中部/13层', '崇文门西大街')
line_1 = ['价格', '户型', '面积', '朝向', '楼层', '小区名称']
ws_bj.append(line_1)
house_id_error=[]
for house_id in house_ids:
    try:
        ws_bj.append(get_house_inf(house_id))
    except Exception:
        house_id_error.append(house_id)
wb_bj.save(file)
# print the page numbers and listing ids that failed
print(page_url_error,house_id_error)

