Python 网络爬虫工具:httpx 和 parsel(对比测评)
Python 网络爬虫领域最近比较火的两个工具莫过于 httpx 和 parsel 了。
httpx 号称是下一代网络请求库,不仅支持 requests 库的所有操作,还能发送异步请求,为编写异步爬虫提供了便利。parsel 最初集成在著名 Python 爬虫框架 Scrapy 中,后来独立出来成为一个单独的模块,支持 XPath 选择器、CSS 选择器和正则表达式等多种解析提取方式,据说相比于 BeautifulSoup,parsel 的解析效率更高。
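在正式开测之前,先用一段最小的示意代码感受一下 httpx 同步和异步两种用法(URL 为示例值,这段示意代码并非本文后面的爬虫代码):

import asyncio
import httpx

# 同步用法, 和 requests 几乎一致
r = httpx.get('https://www.example.com')
print(r.status_code)

# 异步用法, 这是 requests 做不到的
async def main():
    async with httpx.AsyncClient() as client:
        r = await client.get('https://www.example.com')
        print(r.status_code)

asyncio.run(main())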
今天我们就以爬取链家网上的在售二手房信息为例,来测评一下 httpx 和 parsel 这两个库。为了节约时间,我们只爬取上海市浦东新区总价 500 万 - 800 万元的三房房源。
requests + BeautifulSoup 组合
首先上场的是 requests + BeautifulSoup 组合,这也是大多数人刚学习 Python 爬虫时使用的组合。本例中爬虫的入口 url 是 https://sh.lianjia.com/ershoufang/pudong/a3p5/,先发送请求获取最大页数,然后循环发送请求解析单个页面,提取我们所要的信息(比如小区名、楼层、朝向、总价、单价等),最后导出 csv 文件。如果你正在阅读本文,相信你对 Python 爬虫已经有了一定了解,所以我们不会详细解释每一行代码。
整个项目代码如下所示:
# homelink_requests.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import csv
import re
import time


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            a = soup.select('div[class="page-box house-lst-page-box"]')
            # 使用eval将page-data属性的json字符串转化为字典格式
            max_page = eval(a[0].attrs["page-data"])["totalPage"]
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            ul = soup.find_all("ul", class_="sellListContent")
            li_list = ul[0].select("li")
            for li in li_list:
                detail = dict()
                detail['title'] = li.select('div[class="title"]')[0].get_text()

                # 形如: 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.select('div[class="houseInfo"]')[0].get_text()
                house_info_list = house_info.split(" | ")
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # 从字符串任意位置匹配1-2位的楼层数字
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # 匹配年份
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # 形如: 文兰小区 - 塘桥, 提取小区名和位置
                position_info = li.select('div[class="positionInfo"]')[0].get_text().split(' - ')
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # 650万, 匹配650
                price_pattern = re.compile(r'\d+')
                total_price = li.select('div[class="totalPrice"]')[0].get_text()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # 单价64182元/平米, 匹配64182
                unit_price = li.select('div[class="unitPrice"]')[0].get_text()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
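顺带一提,page-data 属性的值本身就是一段合法的 JSON 字符串,如果不想用 eval 执行任意表达式,也可以换成 json.loads。下面是一个最小示意(其中的数值是虚构的示例,实际值应取自 a[0].attrs["page-data"]):

import json

# 示例字符串, 实际来自页面上div的page-data属性, 数值为虚构
page_data = '{"totalPage":20,"curPage":1}'
# json.loads只解析数据, 不会像eval那样执行任意表达式
max_page = json.loads(page_data)["totalPage"]
print(max_page)  # 输出: 20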
注意:我们使用了 fake_useragent、requests 和 BeautifulSoup,这些都需要事先通过 pip 安装好才能使用。
现在我们来看下爬取结果,耗时约 18.5 秒,总共爬取 580 条数据。
requests + parsel 组合
这次我们同样采用 requests 获取目标网页内容,但改用 parsel 库(事先需通过 pip 安装)来解析。parsel 库的用法和 BeautifulSoup 相似,都是先创建实例,然后使用各种选择器提取 DOM 元素和数据,只是语法上稍有不同。BeautifulSoup 有自己的一套语法规则,而 parsel 库支持标准的 CSS 选择器和 XPath 选择器,通过 get 方法或 getall 方法获取文本或属性值,使用起来更方便。
# BeautifulSoup的用法
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
ul = soup.find_all("ul", class_="sellListContent")[0]

# Parsel的用法, 使用Selector类
from parsel import Selector
selector = Selector(response.text)
ul = selector.css('ul.sellListContent')[0]

# Parsel获取文本值或属性值案例
selector.css('div.title span::text').get()
selector.css('ul li a::attr(href)').get()

>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
注:老版的 parsel 库使用extract()或extract_first()方法获取文本或属性值,在新版中已被get()和getall()方法替代。
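为了更直观地感受 get() 和 getall() 的区别,下面是一个可以直接运行的最小示例(HTML 字符串为虚构数据,仅作演示):

from parsel import Selector

html = '<ul><li><a href="/a">甲</a></li><li><a href="/b">乙</a></li></ul>'
selector = Selector(text=html)

print(selector.css('li a::text').get())           # 只取第一个匹配: 甲
print(selector.css('li a::text').getall())        # 取全部匹配: ['甲', '乙']
print(selector.css('li a::attr(href)').getall())  # 提取属性值: ['/a', '/b']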
全部代码如下所示:

# homelink_parsel.py
# Author: 大江狗
from fake_useragent import UserAgent
import requests
import csv
import re
import time
from parsel import Selector


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # 创建Selector类实例
            selector = Selector(response.text)
            # 采用css选择器获取最大页码所在的div
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # 使用eval将page-data的json字符串转化为字典格式
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            response = requests.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # 形如: 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # 从字符串任意位置匹配1-2位的楼层数字
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # 匹配年份
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # 形如: 文兰小区 - 塘桥, 提取小区名和位置
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # 650万, 匹配650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # 单价64182元/平米, 匹配64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
现在我们来看下爬取结果:爬取 580 条数据耗时约 16.5 秒,比上一个组合节省了约 2 秒。可见 parsel 的解析效率确实比 BeautifulSoup 高:爬取任务少时差别不大,任务多时差距可能会更明显。
httpx 同步 + parsel 组合
我们现在来更进一步,使用 httpx 替代 requests 库。httpx 发送同步请求的方式和 requests 基本一样,所以我们只需要把上例中发送请求的两行代码里的 requests 替换成 httpx 即可,其余代码一模一样。
全部代码如下所示:
from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        # 修改这里: 把requests换成httpx
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # 创建Selector类实例
            selector = Selector(response.text)
            # 采用css选择器获取最大页码所在的div
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # 使用eval将page-data的json字符串转化为字典格式
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    def parse_page(self):
        max_page = self.get_max_page()
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            # 修改这里: 把requests换成httpx
            response = httpx.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # 形如: 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # 从字符串任意位置匹配1-2位的楼层数字
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # 匹配年份
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # 形如: 文兰小区 - 塘桥, 提取小区名和位置
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # 650万, 匹配650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # 单价64182元/平米, 匹配64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))

整个爬取过程耗时 16.1 秒,可见使用 httpx 发送同步请求时,效率和 requests 基本无差别。
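另外值得一提的是,httpx 也像 requests.Session 一样支持客户端连接复用。下面是一个示意性的小改动(URL 沿用本文的列表页,页码范围 1-3 只是示例值),循环请求同一站点时可以省去重复建立连接的开销:

import httpx

headers = {"User-Agent": "Mozilla/5.0"}  # 示例请求头, 实际可换成fake_useragent生成的值
# httpx.Client类似requests.Session, 同一个Client内的请求会复用底层连接
with httpx.Client(headers=headers) as client:
    for i in range(1, 4):  # 示例: 只请求前3页
        url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
        response = client.get(url)
        print(url, response.status_code)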
注意:在 Windows 上使用 pip 安装 httpx 可能会报错,提示需要安装 Visual Studio C++ 相关组件,下载安装好之后即可正常使用。
接下来就要放王炸了:使用 httpx 和 asyncio 编写一个异步爬虫,看看从链家网上爬取这 580 条数据到底需要多长时间。
httpx 异步 + parsel 组合
httpx 厉害的地方就是能发送异步请求。整个异步爬虫的实现原理是:先发送同步请求获取最大页码,再把每个单页的爬取和数据解析变成一个 asyncio 协程任务(使用 async 定义),最后交给事件循环 loop 执行。
大部分代码与同步爬虫相同,主要变动的地方有两个:
# 异步 - 使用协程函数解析单页面, 需传入单页面url地址
async def parse_single_page(self, url):
    # 使用httpx发送异步请求获取单页数据
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=self.headers)
        selector = Selector(response.text)
        # 其余地方一样

def parse_page(self):
    max_page = self.get_max_page()
    loop = asyncio.get_event_loop()
    # Python 3.6之前用asyncio.ensure_future或loop.create_task方法创建单个协程任务
    # Python 3.7以后可以使用asyncio.create_task方法创建单个协程任务
    tasks = []
    for i in range(1, max_page + 1):
        url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
        tasks.append(self.parse_single_page(url))
    # 还可以使用asyncio.gather(*tasks)将多个协程任务加入到事件循环
    # 注: Python 3.8起直接向asyncio.wait传裸协程已被弃用, 3.11起会报错,
    # 届时需先用ensure_future包装或改用asyncio.gather(*tasks)
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
整个项目代码如下所示:
from fake_useragent import UserAgent
import csv
import re
import time
from parsel import Selector
import httpx
import asyncio


class HomeLinkSpider(object):
    def __init__(self):
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}
        self.data = list()
        self.path = "浦东_三房_500_800万.csv"
        self.url = "https://sh.lianjia.com/ershoufang/pudong/a3p5/"

    def get_max_page(self):
        response = httpx.get(self.url, headers=self.headers)
        if response.status_code == 200:
            # 创建Selector类实例
            selector = Selector(response.text)
            # 采用css选择器获取最大页码所在的div
            a = selector.css('div[class="page-box house-lst-page-box"]')
            # 使用eval将page-data的json字符串转化为字典格式
            max_page = eval(a[0].xpath('//@page-data').get())["totalPage"]
            print("最大页码数:{}".format(max_page))
            return max_page
        else:
            print("请求失败 status:{}".format(response.status_code))
            return None

    # 异步 - 使用协程函数解析单页面, 需传入单页面url地址
    async def parse_single_page(self, url):
        async with httpx.AsyncClient() as client:
            response = await client.get(url, headers=self.headers)
            selector = Selector(response.text)
            ul = selector.css('ul.sellListContent')[0]
            li_list = ul.css('li')
            for li in li_list:
                detail = dict()
                detail['title'] = li.css('div.title a::text').get()

                # 形如: 2室1厅 | 74.14平米 | 南 | 精装 | 高楼层(共6层) | 1999年建 | 板楼
                house_info = li.css('div.houseInfo::text').get()
                house_info_list = house_info.split(" | ")
                detail['bedroom'] = house_info_list[0]
                detail['area'] = house_info_list[1]
                detail['direction'] = house_info_list[2]

                # 从字符串任意位置匹配1-2位的楼层数字
                floor_pattern = re.compile(r'\d{1,2}')
                match1 = re.search(floor_pattern, house_info_list[4])
                if match1:
                    detail['floor'] = match1.group()
                else:
                    detail['floor'] = "未知"

                # 匹配年份
                year_pattern = re.compile(r'\d{4}')
                match2 = re.search(year_pattern, house_info_list[5])
                if match2:
                    detail['year'] = match2.group()
                else:
                    detail['year'] = "未知"

                # 形如: 文兰小区 - 塘桥, 提取小区名和位置
                position_info = li.css('div.positionInfo a::text').getall()
                detail['house'] = position_info[0]
                detail['location'] = position_info[1]

                # 650万, 匹配650
                price_pattern = re.compile(r'\d+')
                total_price = li.css('div.totalPrice span::text').get()
                detail['total_price'] = re.search(price_pattern, total_price).group()

                # 单价64182元/平米, 匹配64182
                unit_price = li.css('div.unitPrice span::text').get()
                detail['unit_price'] = re.search(price_pattern, unit_price).group()
                self.data.append(detail)

    def parse_page(self):
        max_page = self.get_max_page()
        loop = asyncio.get_event_loop()
        # Python 3.6之前用asyncio.ensure_future或loop.create_task方法创建单个协程任务
        # Python 3.7以后可以使用asyncio.create_task方法创建单个协程任务
        tasks = []
        for i in range(1, max_page + 1):
            url = 'https://sh.lianjia.com/ershoufang/pudong/pg{}a3p5/'.format(i)
            tasks.append(self.parse_single_page(url))
        # 还可以使用asyncio.gather(*tasks)将多个协程任务加入到事件循环
        loop.run_until_complete(asyncio.wait(tasks))
        loop.close()

    def write_csv_file(self):
        head = ["标题", "小区", "房厅", "面积", "朝向", "楼层", "年份",
                "位置", "总价(万)", "单价(元/平方米)"]
        keys = ["title", "house", "bedroom", "area", "direction",
                "floor", "year", "location", "total_price", "unit_price"]
        try:
            with open(self.path, 'w', newline='', encoding='utf_8_sig') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if head is not None:
                    writer.writerow(head)
                for item in self.data:
                    row_data = []
                    for k in keys:
                        row_data.append(item[k])
                    writer.writerow(row_data)
            print("Write a CSV file to path %s Successful." % self.path)
        except Exception as e:
            print("Fail to write CSV to path: %s, Case: %s" % (self.path, e))


if __name__ == '__main__':
    start = time.time()
    home_link_spider = HomeLinkSpider()
    home_link_spider.parse_page()
    home_link_spider.write_csv_file()
    end = time.time()
    print("耗时:{}秒".format(end - start))
现在到了见证奇迹的时刻:同样从链家网上爬取 580 条数据,使用 httpx 编写的异步爬虫仅仅花了 2.5 秒!
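需要说明的是,上面的异步爬虫是一次性把所有页面的请求都发出去的,页面很多时容易给目标网站造成压力、触发反爬。一个常见的改进思路是用 asyncio.Semaphore 限制并发量,下面是一个示意性的草稿(并发数 5 和 fetch_with_limit、crawl 等函数名均为假设,并非原文代码):

import asyncio
import httpx

async def fetch_with_limit(sem, client, url, headers):
    # 同一时刻最多只有信号量大小个协程能进入这段代码
    async with sem:
        response = await client.get(url, headers=headers)
        return response.text

async def crawl(urls, headers, limit=5):
    sem = asyncio.Semaphore(limit)
    async with httpx.AsyncClient() as client:
        tasks = [fetch_with_limit(sem, client, url, headers) for url in urls]
        # gather会保持结果顺序与urls一致
        return await asyncio.gather(*tasks)

# 用法示例:
# pages = asyncio.run(crawl(url_list, {"User-Agent": "Mozilla/5.0"}))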
对比与总结
爬取同样的内容,采用不同的工具组合,耗时差别明显。httpx 异步 + parsel 组合毫无疑问是最大的赢家,requests 和 BeautifulSoup 确实可以功成身退啦。
requests + BeautifulSoup: 18.5 秒
requests + parsel: 16.5 秒
httpx 同步 + parsel: 16.1 秒
httpx 异步 + parsel: 2.5 秒