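Web Scraping Series: Table of Contents

This chapter introduces the basic libraries used in web scraping to select and filter page content, including requests, bs4, XPath, the re regular-expression module, and json. Together they are enough to fetch and parse simple pages.
Chapter 2: Basic Scraping Libraries - requests/bs4
Chapter 2: Regular Expressions
Chapter 2: Scraping Simple Pages with XPath and JSON
Chapter 2: Page Scraping in Practice - Missing Data, Image Downloads, Lazy Loading
Chapter 2: Scraping Case Study - Fetching Lianjia Rental Data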

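1. Requirements

Fetch the rental listings for a given city, including the listing name, location, area, orientation, layout, and rent, and save them to an Excel file. Also save the images from each listing's detail page.

Analysis: the rental listing pages expose no API that returns this data, so the information is extracted from the page HTML with XPath.
The crawl rate has to be controlled so that excessive requests neither overload the server nor trigger CAPTCHAs and other anti-scraping measures.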
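
One way to keep the request rate under control is to remember when the previous request was sent and sleep if the next one comes too soon. The sketch below is only an illustration of that idea using the standard library; the `Throttle` class, its `min_interval` parameter, and the 0.2 second interval are hypothetical names and values, not part of the original code.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.2)  # interval chosen only for the demo
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a request such as get_response(url) would follow here
elapsed = time.monotonic() - start
print("3 calls took {:.2f}s".format(elapsed))
```

The first call returns immediately and each later call waits out the rest of the interval, so the gap between requests never drops below `min_interval`. Adding a small random jitter on top is a common variation.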

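2. Code Structure

def main(num, path, temp_url):
    """Main entry point."""
    # Build the list of pages to crawl; the site is paginated, so iterate page by page
    target_url = get_url(num, temp_url)
    # Accumulated results; keys are listing name, location, area, orientation, layout, rent, detail-page URL
    all_info = {'房源名称': [], '位置': [], '面积': [], '朝向': [], '户型': [], '租金': [], 'next_url': []}
    for i, url in enumerate(target_url):
        # Fetch the response
        response = get_response(url)
        # Parse the response into the result dict
        all_info = get_info(response.text, all_info)
    # Build absolute detail-page URLs and download the images
    imge_urls = map(lambda x: "{}com{}".format(temp_url[: temp_url.index("com")], x), all_info["next_url"])
    house_name = all_info['房源名称']
    get_img_and_download(imge_urls, house_name, path)
    # Save the data to Excel
    save_data(path, all_info)


if __name__ == '__main__':
    temp_url = 'https://bj.lianjia.com/zufang/pg{}'
    path = './lianjia'
    num = input('Number of pages to crawl: ')
    if num.isdigit():
        num = int(num)
        main(num, path, temp_url)
    else:
        print('num must contain digits only')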

3. Code Notes

3.1 Setting the encoding automatically from the detected charset

response.encoding = response.apparent_encoding
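
Why this assignment matters can be shown without a network call. The sketch below uses only the standard library and assumes a UTF-8 page served without a charset in its Content-Type header, a case in which requests falls back to ISO-8859-1 for text responses. The sample string is made up for illustration.

```python
# Hypothetical page body: UTF-8 bytes as they would arrive over the wire.
raw = "链家租房".encode("utf-8")

# Decoding with the ISO-8859-1 fallback produces mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding detected from the content (assumed here to be
# UTF-8, which is what apparent_encoding is meant to report) restores the text.
fixed = raw.decode("utf-8")

print(garbled == "链家租房")  # False
print(fixed == "链家租房")    # True
```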

3.2 Extracting the data

3.2.1 Fetching a group of fields at once, then post-processing

On the page the fields are separated by /, so XPath string() is used to get all the text under the current tag in one go, and the result is then split on /.

def get_info(response, all_info):
    """Parse the response HTML and append the extracted fields to all_info."""
    etree_html = etree.HTML(response, etree.HTMLParser())
    titles = etree_html.xpath("//div[@class='content__list--item--main']/p[1]/a/text()")
    all_info['房源名称'].extend([title.strip() for title in titles])
    sub_title = etree_html.xpath("//div[@class='content__list--item']/div")
    for info in sub_title:
        # string() concatenates all text under the second <p>; the fields are separated by "/"
        sub_results = info.xpath("string(./p[2])").split("/")
        all_info["位置"].append(sub_results[0].strip())
        all_info["面积"].append(sub_results[1].strip())
        all_info["朝向"].append(sub_results[2].strip())
        all_info["户型"].append(sub_results[3].strip())
        all_info["租金"].append(info.xpath("string(./span)").strip())
    return all_info
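
Indexing the split result directly raises IndexError when a listing has fewer than four fields. A possible guard, shown here as a sketch rather than as part of the original code, pads the list to a fixed length first. The helper name `split_fields` and the sample strings are hypothetical.

```python
def split_fields(text, expected=4, sep="/"):
    """Split a separator-delimited summary into exactly `expected` stripped fields.

    Missing trailing fields become empty strings; surplus fields are ignored,
    which mirrors how the original code only reads indices 0 to 3.
    """
    parts = [p.strip() for p in text.split(sep)][:expected]
    parts += [""] * (expected - len(parts))
    return parts


# Made-up summaries: one complete, one with missing fields.
print(split_fields("Chaoyang-Wangjing / 58㎡ / South / 2 rooms"))
print(split_fields("Chaoyang-Wangjing / 58㎡"))
```

With such a helper, each list in `all_info` always receives one value per listing, so the columns stay the same length when the data is later turned into a DataFrame.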

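3.2.2 Getting lazy-loaded images

Because the images are lazy-loaded, the real image address is stored in the data-src attribute; reading src does not return it.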

        download_urls = img_html.xpath("//div[@class='content__article__slide__item']/img/@data-src")
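
The same idea, preferring data-src and falling back to src only when it is absent, can be tried offline. The sketch below swaps lxml for the standard library's html.parser so that it runs without extra dependencies. The `LazyImageParser` class and the sample markup with its example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Lazy-loaded images keep the real address in data-src.
        url = attr_map.get("data-src") or attr_map.get("src")
        if url:
            self.urls.append(url)


# Made-up markup imitating one lazy-loaded slide and one ordinary image.
sample = (
    '<div class="content__article__slide__item">'
    '<img src="placeholder.png" data-src="https://example.com/real.jpg">'
    '</div>'
    '<img src="https://example.com/eager.jpg">'
)
parser = LazyImageParser()
parser.feed(sample)
print(parser.urls)
```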

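3.2.3 Names containing / are treated as subfolders when saving

A listing name that contains / is interpreted as a nested directory when it is passed to os.path.join, so the sample code replaces / with - before building the path.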
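
A small helper can make this replacement explicit. The following is a minimal sketch: the function name `safe_dir_name` is hypothetical, and also replacing the backslash is an added assumption for Windows paths that the original code does not make.

```python
import os


def safe_dir_name(name, replacement="-"):
    """Replace path separators so the name maps to a single directory."""
    for sep in ("/", "\\"):
        name = name.replace(sep, replacement)
    return name.strip()


# Made-up listing name containing a slash.
title = "Wangjing/Huajiadi 2 rooms"
image_dir = os.path.join("lianjia", safe_dir_name(title))
print(os.path.basename(image_dir))
```

The sanitized name should be used both for the directory and for the file names inside it, otherwise the slash reappears in the file path.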

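4. Sample Code

Possible improvements:
1. Throttle get_response so that requests are not sent too quickly: record the time of the previous request and sleep if too little time has passed.
2. Collect the pages that have no images into the results instead of just printing them.
3. The crawl is slow; raise the request rate without being detected.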

import os
import random
import time

import pandas
import requests
from lxml import etree


def get_info(response, all_info):
    """Parse the response HTML and append the extracted fields to all_info."""
    etree_html = etree.HTML(response, etree.HTMLParser())
    titles = etree_html.xpath("//div[@class='content__list--item--main']/p[1]/a/text()")
    all_info['房源名称'].extend([title.strip() for title in titles])
    sub_title = etree_html.xpath("//div[@class='content__list--item']/div")
    for info in sub_title:
        # string() concatenates all text under the second <p>; the fields are separated by "/"
        sub_results = info.xpath("string(./p[2])").split("/")
        all_info["位置"].append(sub_results[0].strip())
        all_info["面积"].append(sub_results[1].strip())
        all_info["朝向"].append(sub_results[2].strip())
        all_info["户型"].append(sub_results[3].strip())
        all_info["租金"].append(info.xpath("string(./span)").strip())
    # Relative links to the detail pages
    all_info['next_url'].extend(etree_html.xpath("//div[@class='content__list--item']/a/@href"))
    return all_info


def get_url(num, temp_url):
    """Build and return the list of page URLs."""
    return [temp_url.format(i) for i in range(1, num + 1)]


def get_response(url, headers=None):
    """Fetch url and return the response."""
    if headers is None:
        # Default headers; previously a caller-supplied headers value left _headers undefined
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    return response


def get_img_and_download(urls, house_name, path, headers=None):
    """Visit each detail page and download its images into a per-listing folder."""
    # Placeholder image that the site shows when a listing has no photos
    no_img = 'https://image1.ljcdn.com/rent-front-image/2a778015dbbab30fcde0a978314ce007.png.780x439.jpg'
    for i, img_url in enumerate(urls):
        img_response = get_response(img_url)
        img_html = etree.HTML(img_response.text, etree.HTMLParser())
        # image_name = img_html.xpath("string(//div[@class='content clear w1150']/p)").strip()
        # Replace "/" so the name is not treated as a nested directory, in the folder and in the file names
        image_name = house_name[i].replace("/", "-")
        image_dir = os.path.join(path, image_name)
        download_urls = img_html.xpath("//div[@class='content__article__slide__item']/img/@data-src")
        if len(download_urls) == 1 and download_urls[0] == no_img:
            print('{} has no images, html: {}'.format(image_name, img_url))
            continue
        os.makedirs(image_dir, exist_ok=True)
        for j, download_url in enumerate(download_urls):
            with open(os.path.join(image_dir, '{}_{}.jpg'.format(image_name, j)), 'wb') as f:
                f.write(get_response(download_url, headers=headers).content)


def save_data(path, all_info):
    """Write the collected data to result.xlsx under path (to_excel needs an Excel writer such as openpyxl)."""
    os.makedirs(path, exist_ok=True)
    df_data = pandas.DataFrame(all_info)
    df_data.to_excel(os.path.join(path, "result.xlsx"))


def main(num, path, temp_url):
    """Main entry point."""
    target_url = get_url(num, temp_url)
    # Accumulated results; keys are listing name, location, area, orientation, layout, rent, detail-page URL
    all_info = {'房源名称': [], '位置': [], '面积': [], '朝向': [], '户型': [], '租金': [], 'next_url': []}
    for i, url in enumerate(target_url):
        response = get_response(url)
        all_info = get_info(response.text, all_info)
        # Random pause between list pages
        time.sleep(random.random() * 4)
    # Turn the relative detail-page links into absolute URLs
    imge_urls = map(lambda x: "{}com{}".format(temp_url[: temp_url.index("com")], x), all_info["next_url"])
    house_name = all_info['房源名称']
    get_img_and_download(imge_urls, house_name, path)
    save_data(path, all_info)


if __name__ == '__main__':
    temp_url = 'http://bj.lianjia.com/zufang/pg{}'
    path = './lianjia/image'
    # num = input('Number of pages to crawl: ')
    num = "2"
    if num.isdigit():
        num = int(num)
        main(num, path, temp_url)
    else:
        print('num must contain digits only')
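
The main function builds the detail-page URLs by slicing temp_url at "com" and appending each relative link. An alternative that does not depend on where "com" appears in the address is urllib.parse.urljoin from the standard library. The sketch below is only an illustration; the relative hrefs are hypothetical stand-ins for the values collected in next_url.

```python
from urllib.parse import urljoin

page_url = "https://bj.lianjia.com/zufang/pg1"
# Hypothetical relative links, standing in for the values stored in next_url.
hrefs = ["/zufang/EXAMPLE1.html", "/zufang/EXAMPLE2.html"]

# urljoin resolves each link against the page it was found on.
detail_urls = [urljoin(page_url, href) for href in hrefs]
print(detail_urls)
```

For root-relative links like these the result matches the slicing approach, and urljoin also handles links that are already absolute.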

Chapter 2: Scraping Case Study - Fetching Lianjia Rental Data 2021-09-16
