Python Lianjia Rental Listing Scraper

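A scraper for rental listings on Lianjia for a given city (Hangzhou, Nanjing, etc.). Lianjia only exposes the first 100 result pages, with 30 listings per page, so in practice the scraper collects at most the first 3,000 listings.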
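The page limit maps directly onto the set of URLs to visit. A minimal sketch, assuming the `/zufang/pg<N>/#contentList` URL pattern that the script below builds; `page_urls`, `MAX_PAGES` and `PER_PAGE` are illustrative names that do not appear in the original script:

```python
ROOT_URL = 'https://hz.lianjia.com'  # Hangzhou subdomain, as in the script
MAX_PAGES = 100  # only the first 100 result pages are exposed
PER_PAGE = 30    # listings per result page


def page_urls(root_url=ROOT_URL, max_pages=MAX_PAGES):
    """Return the result-page URLs for pg1 .. pg<max_pages>."""
    return [f'{root_url}/zufang/pg{n}/#contentList'
            for n in range(1, max_pages + 1)]


urls = page_urls()
# Upper bound on the number of listings that can be collected
upper_bound = len(urls) * PER_PAGE
print(len(urls), upper_bound)
```

Swapping `hz` for another city subdomain should give the corresponding URLs for that city, assuming the site uses the same path layout there.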

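The project needs to analyze the rental listings published in a given area during a given period. For each listing the scraper collects the name (name), district (dist), area (square), price (price) and remarks (detail), and saves them to an Excel file with pandas.DataFrame.to_excel(). Pages are downloaded concurrently with futures.ThreadPoolExecutor.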
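One way to combine the thread pool with the DataFrame is to have each worker return its rows and assemble the table in the main thread, rather than updating shared state from the workers. The sketch below is an illustration under assumptions, not the script's own structure: `fake_crawl` is a hypothetical stand-in that fabricates rows without any network access, and `crawl_all` and `COLUMNS` are illustrative names.

```python
from concurrent import futures

import pandas as pd

COLUMNS = ['name', 'dist', 'square', 'price', 'detail']


def fake_crawl(page):
    """Hypothetical stand-in for a page crawler: returns three dummy rows."""
    return [(f'listing-{page}-{i}', 'dist', '50', '3000', 'detail')
            for i in range(1, 4)]


def crawl_all(pages, crawl=fake_crawl, max_workers=10):
    """Run `crawl` over all pages in a thread pool and build one DataFrame."""
    rows = []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; iterating over it also
        # re-raises any exception that occurred inside a worker.
        for page_rows in pool.map(crawl, pages):
            rows.extend(page_rows)
    return pd.DataFrame(rows, columns=COLUMNS)


df = crawl_all(range(1, 6))
print(df.shape)
```

The resulting frame could then be written out with `df.to_excel('ans.xlsx', index=False)`, which needs an Excel writer such as openpyxl to be installed.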

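For the time-based filtering, the scraper needs the URL of each listing's detail page. It finds the publication time on the detail page and strips the Chinese characters with time = re.sub('[\u4e00-\u9fa5]*', '', time), leaving only the date string, which is then parsed with pd.to_datetime().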
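The date handling can be tried in isolation. The following sketch uses the same character range and the same date window as the script; the sample string is a hypothetical example of detail-page text, and `parse_listing_date` and `in_window` are illustrative names. It also calls `.strip()` on the cleaned string, which the original does not do, to drop the whitespace left behind.

```python
import re

import pandas as pd

# Same character range as the script: the common CJK ideographs
CHINESE = re.compile('[\u4e00-\u9fa5]*')


def parse_listing_date(raw):
    """Strip Chinese characters and parse what remains as a date."""
    cleaned = CHINESE.sub('', raw).strip()
    return pd.to_datetime(cleaned)


def in_window(date, start='2018-07-01', end='2019-08-01'):
    """Return True if `date` falls inside the inclusive window."""
    return pd.to_datetime(start) <= date <= pd.to_datetime(end)


sample = '房源上架时间 2019-03-15'  # hypothetical detail-page text
date = parse_listing_date(sample)
print(date, in_window(date))
```

Note that the character class covers ideographs only, so punctuation such as a full-width colon would survive the substitution and would need to be removed separately.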

GitHub: Joovo/lianjia_spider

Code:

import requests
import pandas as pd
from lxml import etree
import re
from concurrent import futures

root_url = 'https://hz.lianjia.com'
s = requests.session()
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
# Columns: name, district, area, price, remarks
csv = pd.DataFrame(columns=['名称', '地区', '面积', '价格', '备注'])


def crawl(page):
    page = str(page)
    page_url = root_url + '/zufang/pg' + page + '/#contentList'
    r = s.request('GET', headers=header, url=page_url)
    r.encoding = r.apparent_encoding
    tree = etree.HTML(r.text)
    # Each result page holds up to 30 listings
    for num in range(1, 31):
        num = str(num)
        name_xpath = '//*[@id="content"]/div[1]/div[1]/div[' + num + ']/div/p[1]/a//text()'
        detail_xpath = '//*[@id="content"]/div[1]/div[1]/div[' + num + ']/div/p[2]//text()'
        detail_url_xpath = '//*[@id="content"]/div[1]/div[1]/div[' + num + ']/div/p[1]/a/@href'
        price_xpath = '//*[@id="content"]/div[1]/div[1]/div[' + num + ']/div/span/em//text()'
        # //*[@id="content"]/div[1]/div[1]/div[30]/div/p[1]/a
        detail_urls = tree.xpath(detail_url_xpath)
        if not detail_urls:
            # Fewer than 30 listings on this page
            continue
        detail_r = s.get(headers=header, url=root_url + detail_urls[0])
        # Skip listings published outside the time window
        if not check_time(detail_r):
            continue
        # name
        name = tree.xpath(name_xpath)
        name = name[0].strip('\n').strip('\t').strip()
        # detail
        detail = tree.xpath(detail_xpath)
        detail = [_.strip('\n').strip('\t').strip() for _ in detail]
        detail = ''.join(detail)
        # dist
        if detail.count('/') == 4:
            dist = detail.split('/')[0]
        else:
            dist = ''
        # square: the digits in front of '㎡', or '' if none are present
        match = re.search(r'\d+㎡', detail)
        square = match.group()[:-1] if match else ''
        # price
        price = tree.xpath(price_xpath)
        price = price[0]
        new_line = pd.DataFrame([[name, dist, square, price, detail]],
                                columns=['名称', '地区', '面积', '价格', '备注'])
        global csv
        csv = pd.concat([csv, new_line])
    print(page)


def check_time(r):
    r.encoding = r.apparent_encoding
    tree = etree.HTML(r.text)
    time_xpath = '/html/body/div[3]/div[1]/div[3]/div[1]/text()'
    time = tree.xpath(time_xpath)
    time = [_.strip('\n').strip('\t').strip() for _ in time]
    time = ''.join(time)
    # Remove the Chinese characters, leaving only the date string
    time = re.sub('[\u4e00-\u9fa5]*', '', time)
    date = pd.to_datetime(time)
    if pd.to_datetime('2018-07-01') <= date <= pd.to_datetime('2019-08-01'):
        return True
    else:
        return False


if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=10) as e:
        e.map(crawl, range(1, 101))
    # pandas 2.0 and later can no longer write .xls; use an .xlsx path there
    csv.to_excel('./ans.xls', index=False)
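The district and area extraction inside `crawl` can be pulled out into a small function that is easy to test without any network access. This is a hedged sketch that follows the same rules as the script (the first field is the district only when the description has exactly four `/` separators, and the area is the number in front of `㎡`) and returns an empty string when no `㎡` figure is present. `parse_detail` and the sample description are illustrative and not taken from the original repository.

```python
import re


def parse_detail(detail):
    """Return (dist, square) from a '/'-separated description string."""
    # Same rule as the script: trust the first field only when the
    # description has exactly five '/'-separated fields.
    dist = detail.split('/')[0] if detail.count('/') == 4 else ''
    # Capture the digits that precede the square-metre sign, if any.
    match = re.search(r'(\d+)㎡', detail)
    square = match.group(1) if match else ''
    return dist, square


sample = '西湖-文新/89㎡/南/2室1厅1卫/中楼层'  # hypothetical listing text
print(parse_detail(sample))
print(parse_detail('no area here'))
```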
