The two hottest recent additions to the Python web-scraping toolbox are httpx and parsel.

httpx bills itself as a next-generation HTTP client: it supports essentially everything the requests library does and can also send asynchronous requests, which makes writing async crawlers much easier. parsel was originally bundled with the well-known Python scraping framework Scrapy and was later split out into a standalone package. It supports XPath selectors, CSS selectors and regular expressions for extraction, and is said to parse faster than BeautifulSoup.
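Before diving in, here is a quick, self-contained sketch of what the two libraries' basic APIs look like (example.com and the selectors below are placeholders, not part of the crawler we build later):

import asyncio

import httpx
from parsel import Selector

# httpx: the synchronous API mirrors requests...
resp = httpx.get("https://example.com")

# ...and the same library also supports async requests.
async def fetch(url):
    async with httpx.AsyncClient() as client:
        return await client.get(url)

resp_async = asyncio.run(fetch("https://example.com"))
print(resp.status_code, resp_async.status_code)

# parsel: wrap the HTML text in a Selector, then query with CSS, XPath or regex.
selector = Selector(text=resp.text)
title = selector.css("title::text").get()          # CSS selector
links = selector.xpath("//a/@href").getall()       # XPath selector
digits = selector.css("title::text").re(r"\d+")    # regex on top of a selector
print(title, links[:3], digits)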

Today we'll put httpx and parsel through their paces by scraping the for-sale second-hand listings on Lianjia. To save time, we'll limit the crawl to properties in Pudong, Shanghai priced between 5 and 8 million RMB.

The requests + BeautifulSoup combination

First up is the requests + BeautifulSoup combination, which is what most people start with when learning Python scraping. The crawler's entry URL is https://sh.lianjia.com/ershoufang/pudong/a3p5/. We first request this page to find the maximum page number, then loop over every page, parsing each one to extract the fields we want (neighborhood name, floor, orientation, total price, unit price, and so on), and finally export everything to a CSV file. Since you're reading this, you presumably already know some Python scraping, so we won't explain every line of code.

The full project code is shown below:

# homelink_requests.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import csv
import re
import time


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            a = soup.select('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data string into a dict
            max_page = eval(a[0].attrs["page-data"])["totalPage"]
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            ul = soup.find_all("ul", class_="sellListContent")
            li_list = ul[0].select("li")
            for li in li_list:
                detail = dict()
                detail['title'] = li.select('div[class="title"]')[0].get_text()

                # e.g. 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.select('div[class="houseInfo"]')[0].get_text()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                # search anywhere in the string
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文兰小区 - 塘桥: extract the neighborhood name and area
                position_info = li.select('div[class="positionInfo"]')[0].get_text().split(' - ')
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650万 -> 650
                price_pattern = re.compile(r'\d+')
                total_price = li.select('div[class="totalPrice"]')[0].get_text()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 单价64182元/平米 -> 64182
                unit_price = li.select('div[class="unitPrice"]')[0].get_text()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()

                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Cause: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
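A small aside on get_max_page(): the page-data attribute it reads holds a JSON string, so json.loads is a safer choice than eval, assuming the attribute really is plain JSON. A minimal sketch (the literal below is a hypothetical example of what that attribute looks like):

import json

# Hypothetical page-data value for illustration; in the crawler it comes from
# a[0].attrs["page-data"] inside get_max_page().
page_data = '{"totalPage": 30, "curPage": 1}'
max_page = json.loads(page_data)["totalPage"]
print(max_page)  # 30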
Note: this example uses fake_useragent, requests and BeautifulSoup, all of which must be installed with pip beforehand.
Looking at the result: the crawl took about 18.5 seconds and collected 580 records in total.
The requests + parsel combination
This time we again use requests to fetch the target pages, but parse them with the parsel library (install it with pip first). parsel's usage is similar to BeautifulSoup's: you create an instance and then extract DOM elements and data with various selectors, but the syntax differs slightly. BeautifulSoup has its own query methods, whereas parsel supports standard CSS and XPath selectors and returns text or attribute values via the get() and getall() methods, which is arguably more convenient.

# BeautifulSoup usage
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
ul = soup.find_all("ul", class_="sellListContent")[0]

# parsel usage: the Selector class
from parsel import Selector
selector = Selector(response.text)
ul = selector.css('ul.sellListContent')[0]

# parsel: getting a text or attribute value
selector.css('div.title span::text').get()
selector.css('ul li a::attr(href)').get()

>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
Note: older versions of parsel used extract() and extract_first() to get text or attribute values; in newer versions these have been superseded by get() and getall().
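The old and new method names map onto each other one-to-one, as this tiny self-contained example shows (the HTML snippet is made up):

from parsel import Selector

# A made-up document purely to show the method mapping.
selector = Selector(text="<ul><li>a</li><li>b</li></ul>")

print(selector.css("li::text").extract_first())  # old style, same as .get()     -> 'a'
print(selector.css("li::text").extract())        # old style, same as .getall()  -> ['a', 'b']
print(selector.css("li::text").get())
print(selector.css("li::text").getall())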
The full code is shown below:

# homelink_parsel.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
import csv
import re
import time
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # locate the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文兰小区 - 塘桥: extract the neighborhood name and area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650万 -> 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 单价64182元/平米 -> 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()

                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Cause: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
Looking at the result: scraping the same 580 records took about 16.5 seconds, saving roughly 2 seconds. So parsel does parse faster than BeautifulSoup; the difference is small for a light crawl, but it may grow as the workload increases.

The httpx (sync) + parsel combination
Now let's go one step further and swap requests out for httpx. Sending synchronous requests with httpx works essentially the same way as with requests, so we only need to change two lines in the previous example, replacing requests with httpx; everything else stays identical.
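To illustrate how drop-in the change is, here is a minimal standalone comparison showing that both clients expose the status_code and text attributes this crawler relies on (example.com is a placeholder; the real crawler keeps using the Lianjia URLs):

import httpx
import requests

headers = {"User-Agent": "Mozilla/5.0"}
url = "https://example.com"

# The two calls are interchangeable for this crawler's purposes.
r1 = requests.get(url, headers=headers)
r2 = httpx.get(url, headers=headers)
print(r1.status_code, r2.status_code)
print(len(r1.text), len(r2.text))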
from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        # changed here: use httpx instead of requests
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # locate the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            # changed here: use httpx instead of requests
            response = httpx.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文兰小区 - 塘桥: extract the neighborhood name and area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650万 -> 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 单价64182元/平米 -> 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()

                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Cause: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))

The whole crawl took 16.1 seconds, so when sending synchronous requests httpx is essentially on par with requests.
Note: installing httpx with pip on Windows may fail with an error asking you to install the Visual Studio C++ build tools; installing those resolves the problem.
Next comes the trump card: let's write an asynchronous crawler with httpx and asyncio and see how long it really takes to pull those 580 records from Lianjia.
The httpx (async) + parsel combination
What makes httpx powerful is its ability to send asynchronous requests. The async crawler works as follows: first send a synchronous request to get the maximum page number, then turn the fetching and parsing of each single page into an asyncio coroutine task (defined with async), and finally run all the tasks on an event loop.
Most of the code is the same as in the synchronous crawler; there are two main changes:

# async: a coroutine that fetches and parses a single page, given its url
async def parse_single_page(self, url):
    # fetch the page asynchronously with httpx
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=self.headers)
        selector = Selector(response.text)
        # the rest is the same as before

def parse_page(self):
    max_page = self.get_max_page()
    loop = asyncio.get_event_loop()
    # before Python 3.7, create individual coroutine tasks with asyncio.ensure_future
    # or loop.create_task; from Python 3.7 on, asyncio.create_task also works
    tasks = []
    for i in range(1, max_page + 1):
        url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
        tasks.append(self.parse_single_page(url))
    # asyncio.gather(*tasks) could also be used to add the tasks to the event loop
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()

The full project code is shown below:

from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx
import asyncio


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # create a Selector instance
            selector = Selector(response.text)
            # locate the pagination div with a CSS selector
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # use eval to turn the page-data JSON string into a dict
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    # async: a coroutine that fetches and parses a single page, given its url
    async def parse_single_page(self, url):
        async with httpx.AsyncClient() as client:
            response = await client.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # e.g. 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")

                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])  # search anywhere in the string
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # match the build year
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # e.g. 文兰小区 - 塘桥: extract the neighborhood name and area
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # e.g. 650万 -> 650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # e.g. 单价64182元/平米 -> 64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()

                self.data.append(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        loop = asyncio.get_event_loop()
        # before Python 3.7, create individual coroutine tasks with asyncio.ensure_future
        # or loop.create_task; from Python 3.7 on, asyncio.create_task also works
        tasks = []
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            tasks.append(self.parse_single_page(url))
        # asyncio.gather(*tasks) could also be used to add the tasks to the event loop
        loop.run_until_complete(asyncio.wait(tasks))
        loop.close()

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层",
                "年份", "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Cause: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
Now for the moment of truth: the asynchronous crawler written with httpx pulled the same 580 records from Lianjia in just 2.5 seconds!
Comparison and conclusions
Scraping the same content with different tool combinations takes very different amounts of time. The httpx (async) + parsel combination is the undisputed winner, and requests + BeautifulSoup can indeed retire with honour:
requests + BeautifulSoup: 18.5 s
requests + parsel: 16.5 s
httpx (sync) + parsel: 16.1 s
httpx (async) + parsel: 2.5 s
