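Result screenshot first. No picture, no proof... hehe

I started from an article on scraping Lianjia listings for Shijiazhuang, but that code no longer works because the site's rules have changed, so I rewrote it.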

https://blog.csdn.net/hihell/article/details/84029492

1. Result screenshot

2. Code

import re
from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd


# Define a class, the variables it uses, and a get method that makes
# network requests through a connection pool
class LianjiaSpider(object):
    def __init__(self):
        self.ua = UserAgent()  # UserAgent generator
        self.head = {"User-Agent": self.ua.random}
        self._data = list()  # rows collected so far

    async def get_page_count(self):
        result = await self.get("https://bj.lianjia.com/zufang/pg1")
        page_html = etree.HTML(result)  # parse the page
        pageCount = page_html.xpath(".//div[@class='content__pg']/@data-totalpage")
        pageCount = list(map(int, pageCount))
        return pageCount[0]

    async def get(self, url):  # coroutine: suspends at await, resumes when the response arrives
        async with aiohttp.ClientSession() as session:  # session with a connection pool
            try:
                timeout = aiohttp.ClientTimeout(total=3)
                async with session.get(url, headers=self.head, timeout=timeout) as resp:
                    if resp.status == 200:
                        result = await resp.text()
                        return result
            except Exception as e:
                print(e.args)

    async def parse_html(self):
        count = await self.get_page_count()
        for page in range(1, count + 1):  # +1 so the last page is included
            url = "https://bj.lianjia.com/zufang/pg{}/".format(page)
            print("Crawling {}".format(url))
            html = await self.get(url)  # fetch the page content
            if html is None:  # request failed or returned a non-200 status
                continue
            html = etree.HTML(html)  # parse the page
            await self.parse_page(html)  # extract the fields we want
        print("Storing data....")
        print(len(self._data))
        ######################### write data
        data = pd.DataFrame(self._data)
        data.to_csv("链家网租房数据.csv", encoding='utf_8_sig')  # write the CSV file
        ######################### write data

    def run(self):
        asyncio.run(self.parse_html())  # start an event loop and run the crawl

    async def parse_page(self, html):
        # "//" searches at any depth for div elements with this class
        rst = html.xpath(".//div[@class='content__list--item']")
        print(rst)  # the matched lxml Element objects, one per listing
        for div in rst:
            imgurl = div.xpath(".//a[@class='content__list--item--aside']/img/@src")
            title = div.xpath(".//a[@class='content__list--item--aside']/img/@alt")
            floor = div.xpath(".//span[@class='hide']/text()")
            price = div.xpath(".//span[@class='content__list--item-price']/em/text()")
            des = div.xpath(".//p[@class='content__list--item--des']/text()")
            if len(floor) > 1:  # some listings omit the floor, so floor[1] would raise
                currentFloor = floor[1].replace("\n", "").replace(" ", "")
            else:
                currentFloor = ''
            strinfo = []  # holds area, orientation, and layout (rooms/halls/baths)
            strinfo.clear()  # no-op here, since the list was just created
            for text in des:
                info = text.replace(" ", "").replace("\n", "").replace("-", "")
                if info != '':
                    strinfo.append(info)
                    print(info)
            if len(strinfo) < 3:  # incomplete listing, skip it
                continue
            size = strinfo[0].replace(" ", "").replace("\n", "")  # e.g. 30㎡
            direction = strinfo[1].replace(" ", "").replace("\n", "")  # e.g. 南 (south)
            structure = strinfo[2].replace(" ", "").replace("\n", "")  # e.g. 5室1厅2卫
            structure = re.findall(r'\d+', structure)
            print(structure)
            print("imgurl:" + imgurl[0])  # image URL
            print("title:" + title[0])  # title
            print("price:" + price[0])  # price
            print("currentFloor:" + currentFloor)  # floor
            print(structure)  # rooms / halls / bathrooms
            # Column names: image URL, title, price, floor, area, orientation,
            # rooms (室), halls (厅), bathrooms (卫)
            if len(structure) == 3:
                one_data = {
                    "图片地址": imgurl[0], "标题": title[0], "价格": price[0],
                    "楼层": currentFloor, "大小": size, "朝向": direction,
                    "室": structure[0], "厅": structure[1], "卫": structure[2]}
            elif len(structure) == 2:
                one_data = {
                    "图片地址": imgurl[0], "标题": title[0], "价格": price[0],
                    "楼层": currentFloor, "大小": size, "朝向": direction,
                    "室": structure[0], "厅": 0, "卫": structure[1]}
            else:  # unexpected layout text, skip instead of reusing a stale row
                continue
            self._data.append(one_data)  # add the row


if __name__ == '__main__':
    l = LianjiaSpider()
    l.run()

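The spider above awaits each page inside a for loop, so pages are still downloaded one at a time even though it uses coroutines. Below is a minimal sketch of how several pages could be fetched concurrently with `asyncio.gather` and a semaphore. The names `fetch_all`, `fake_fetch`, `demo` and the `limit` parameter are my own and not part of the original code; `fake_fetch` only stands in for a real request so the sketch runs offline.

```python
import asyncio


async def fetch_all(urls, fetch, limit=5):
    """Run fetch(url) for every URL with at most `limit` calls in flight.

    `fetch` is any coroutine function that takes a URL (hypothetically,
    the spider's `get`). Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url):
        async with semaphore:  # wait here while `limit` fetches are running
            return await fetch(url)

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for a real network call (assumption, for illustration only).
    await asyncio.sleep(0.01)
    return "<html>{}</html>".format(url)


def demo():
    urls = ["https://bj.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 4)]
    return asyncio.run(fetch_all(urls, fake_fetch, limit=2))
```

In the spider, `parse_html` could presumably build the list of page URLs first and then call `await fetch_all(urls, self.get)`. A small `limit` keeps the load on the site modest.
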
3. Summary

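Iterating with for loops

Replacing substrings with str.replace

The len function for lengths

Parsing with etree by class attribute (XPath)

Extracting digits with the regular expression \d

Using fake_useragent to fake the request headers

Using coroutines

Converting a list of str to int with map: list(map(int, strList)); emptying a list with clear

Using if / elif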
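
Several of the points above (replacing substrings, `\d`, `map`, `if` / `elif`) come together when the layout text is split into numbers. Here is a small sketch of that step on its own. The function names `clean` and `parse_layout` are hypothetical, and unlike the spider, which stores the digits as strings, this version converts them to int.

```python
import re


def clean(text):
    """Remove the spaces, newlines and dashes that pad the scraped text."""
    return text.replace(" ", "").replace("\n", "").replace("-", "")


def parse_layout(text):
    """Split a layout string such as '5室1厅2卫' into (rooms, halls, baths).

    With only two numbers, they are read as rooms and bathrooms and halls
    defaults to 0, mirroring the spider's elif branch. Any other count
    returns None.
    """
    # re.findall returns strings; map(int, ...) converts them to integers
    numbers = list(map(int, re.findall(r"\d+", text)))
    if len(numbers) == 3:
        rooms, halls, baths = numbers
    elif len(numbers) == 2:
        rooms, baths = numbers
        halls = 0
    else:
        return None
    return rooms, halls, baths
```

Returning None for unexpected input lets the caller skip a listing rather than crash on it.
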
