02 Web Crawling: Parsing Web Pages with Beautiful Soup

This post continues from the previous one:
https://blog.csdn.net/qq_41865229/article/details/121546222

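1. Installing Beautiful Soup

The requests library can already fetch the full source of a web page; the next step is to extract the data we need from that source. Beautiful Soup is a Python library whose main job is to extract data from web pages.

1. Beautiful Soup 4 is published as the beautifulsoup4 package and imported from the bs4 module, so that package must be installed before Beautiful Soup can be imported.

Run the install command in a terminal:

pip install beautifulsoup4

Installation succeeded.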

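2. After installing bs4, install the lxml library as well.

Beautiful Soup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers. Without lxml, only the standard-library parser is available, and explicitly requesting 'lxml' raises an error. lxml is more capable and faster, so installing it is recommended.

Run the install command in a terminal:

pip install lxml

Installation succeeded.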
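As a quick check that both installs worked, the sketch below parses a one-line document and falls back to the standard-library parser when lxml is missing. The helper name pick_parser is hypothetical and not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Return 'lxml' if it is installed, else the built-in 'html.parser'."""
    # find_spec returns None when the module cannot be located.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


soup = BeautifulSoup("<p>Hello</p>", pick_parser())
print(pick_parser(), soup.p.get_text())
```

If this prints html.parser, bs4 is working but lxml was not found in the current environment.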

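2. Parsing a Web Page with Beautiful Soup

With the libraries above installed, Beautiful Soup is ready to use.

We take the China Tourism Network site, http://www.cntour.cn/, as the example
and scrape the blue headlines on that page, as shown in the figure below.

1. Inspect the element that holds the data to scrape.

2. Copy the selector of that element: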

#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(1) > a

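3. Modify it slightly so that it selects all the blue headlines. Dropping :nth-child(1) makes the selector match every li instead of only the first: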

#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a
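The effect of dropping :nth-child(1) can be seen on a small inline document. This is a minimal sketch: the HTML below is a simplified, made-up stand-in for the real page structure, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has more nesting.
HTML = """
<div id="main"><div><ul class="newsList">
<li><a href="/news/1/">First</a></li>
<li><a href="/news/2/">Second</a></li>
<li><a href="/news/3/">Third</a></li>
</ul></div></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# With :nth-child(1), only the link inside the first <li> matches.
first_only = soup.select("#main > div > ul.newsList > li:nth-child(1) > a")

# Without it, the link inside every <li> matches.
all_links = soup.select("#main > div > ul.newsList > li > a")

print(len(first_only), len(all_links))
```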

4. Scrape the selected data with Beautiful Soup:

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    strhtml = requests.get(url)  # download the page
    soup = BeautifulSoup(strhtml.text, 'lxml')  # parse it with lxml
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    print(data)


if __name__ == '__main__':
    main()

Output

3. Code Analysis

1. Beautiful Soup parses the page content. It lives in the bs4 package and is imported from there when needed:

from bs4 import BeautifulSoup

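2. First, the HTML document is converted to Unicode. Beautiful Soup then picks the most suitable parser unless one is named; here the lxml parser is specified explicitly. Parsing turns the complex HTML document into a tree in which every node is a Python object. The parsed document is stored in a new variable, soup:

url = 'http://www.cntour.cn/'
strhtml = requests.get(url)
soup = BeautifulSoup(strhtml.text, 'lxml')

3. Take the selector of the data to scrape and filter with soup.select: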

data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
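One detail worth knowing about soup.select is that it returns a list, which is simply empty when nothing matches (for example after the site changes its layout). The sketch below illustrates this on made-up markup; select_one is the bs4 variant that returns a single tag or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
soup = BeautifulSoup(
    '<ul class="newsList"><li><a href="/a">A</a></li></ul>', "html.parser"
)

matches = soup.select("ul.newsList > li > a")   # list with one Tag
missing = soup.select("ul.otherList > li > a")  # empty list, no exception
first = soup.select_one("ul.newsList > li > a")   # a single Tag
nothing = soup.select_one("ul.otherList > li > a")  # None

if not missing:
    print("selector matched nothing; check the page structure")
```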

4. Cleaning and Organizing the Data

1. At this point we have the HTML for the selected elements, stored in the variable data:

[<a href="http://www.cntour.cn/news/21122/" target="_blank" title="戴斌：冰雪之上 旅游新局">戴斌：冰雪之上 旅游新局</a>, <a href="http://www.cntour.cn/news/21117/" target="_blank" title="服务“国之大者”拓展旅游业高质量发展新格局">服务“国之大者”拓展旅游业高质量发展新格局</a>,<a href="http://www.cntour.cn/news/21098/" target="_blank" title="首批国家夜间文化和旅游消费集聚区">首批国家夜间文化和旅游消费集聚区</a>,<a href="http://www.cntour.cn/news/20081/" target="_blank" title="发挥旅游优势 共建开放型世界经济">发挥旅游优势 共建开放型世界经济</a>, <a href="http://www.cntour.cn/news/14988/" target="_blank" title="2021中国旅游向内发力">[2021中国旅游向内发力]</a>,<a href="http://www.cntour.cn/news/14987/" target="_blank" title="2020中国旅游浴火重生">[2020中国旅游浴火重生]</a>, <a href="http://www.cntour.cn/news/14977/" target="_blank" title="“云旅游”赋能旅游业创新发展">[“云旅游”赋能旅游业创]</a>, <a href="http://www.cntour.cn/news/14970/" target="_blank" title="旅游为幸福生活添彩">[旅游为幸福生活添彩]</a>, <a href="http://www.cntour.cn/news/14965/" target="_blank" title="RCEP为旅游业带来机遇">[RCEP为旅游业带来机遇]</a>,<a href="http://www.cntour.cn/news/14943/" target="_blank" title="大数据读懂中国旅游新引力">[大数据读懂中国旅游新引]</a>,<a href="http://www.cntour.cn/news/13916/" target="_blank" title="假日旅游复苏 市场平稳有序">[假日旅游复苏 市场平稳]</a>,<a href="http://www.cntour.cn/news/13907/" target="_blank" title="全球旅游业呈现持续向好势头">[全球旅游业呈现持续向好]</a>
]

2. The data has not been extracted yet, though: the results still carry HTML tags. Modify the code slightly in PyCharm:

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    strhtml = requests.get(url)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    for item in data:
        result = {
            'title': item.get_text(),  # text inside the <a> tag
            'link': item.get('href'),  # value of the href attribute
        }
        print(result)


if __name__ == '__main__':
    main()

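3. First, be clear that the data to extract is the title and the link. The title is the text of the <a> tag, which get_text() returns. The link is in the href attribute of the <a> tag; get() returns an attribute's value when given its name, here get('href').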
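The two methods can be tried on a single tag. This is a small sketch; the headline text and the data-id attribute name are placeholders, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# One <a> tag shaped like the scraped results; the text is a placeholder.
html = '<a href="http://www.cntour.cn/news/21122/" target="_blank">  Sample headline </a>'
link = BeautifulSoup(html, "html.parser").a

raw_title = link.get_text()              # text exactly as it appears
clean_title = link.get_text(strip=True)  # surrounding whitespace removed
href = link.get('href')                  # attribute value
absent = link.get('data-id')             # missing attribute gives None

print(clean_title, href, absent)
```

Using get() rather than link['href'] avoids a KeyError when an attribute is missing.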

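Output

4. The figure above shows that each article link contains a numeric ID.
Next we extract this ID with a regular expression.
Python provides regular expressions through the re module, which is part of the standard library and needs no installation.

Modify the code in PyCharm: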

import re

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    strhtml = requests.get(url)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href'),
            # Raw string so that \d is not treated as a string escape.
            'ID': re.findall(r'\d+', item.get('href')),
        }
        print(result)


if __name__ == '__main__':
    main()

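This uses the findall function of the re module: the first argument is the regular expression and the second is the text to search. It returns a list of every match, so ID is a list such as ['21122'].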
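Because \d+ matches every run of digits, it can return more than the ID when the host name or path contains other numbers. The sketch below compares it with a more specific pattern; the second URL is a made-up example, and the /news/<id>/ path shape is assumed from the links shown earlier.

```python
import re

link = 'http://www.cntour.cn/news/21122/'
print(re.findall(r'\d+', link))  # ['21122']

# Hypothetical URL whose host name also contains a digit.
other = 'http://www2.example.com/news/42/'
print(re.findall(r'\d+', other))  # ['2', '42']


def extract_id(url):
    """Return the digits after /news/ as a string, or None if absent."""
    match = re.search(r'/news/(\d+)/', url)
    return match.group(1) if match else None


print(extract_id(other))  # 42
```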

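Output

5. Crawlers Versus Anti-Crawling Measures

A crawler imitates a person browsing a site in order to collect data in bulk. When many users crawl a site at once, the load on its servers rises sharply, so site developers deploy anti-crawling measures against them.

1. The first way a server identifies a crawler is by checking the User-Agent of the connection to tell whether the request comes from a browser or from code. If it comes from code and the request volume grows, the server may simply block the client's IP.

This basic check can be bypassed by constructing browser-like request headers to disguise the crawler.

Take the crawler built above as the example. In the browser's developer tools we can find not only the URL and Form Data but also the Request Headers, which we can replicate in our own request. The server recognizes browser traffic by inspecting the User-Agent field under Request Headers, as shown in the figure below.

So we only need to set this header field. The code with the request header added: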

import re

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    # Present the request as coming from a desktop Chrome browser.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
    strhtml = requests.get(url, headers=headers)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href'),
            'ID': re.findall(r'\d+', item.get('href')),
        }
        print(result)


if __name__ == '__main__':
    main()

Output
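To see what the server would receive, the request can be built without being sent. The sketch below compares the default User-Agent that requests announces with the browser-like one set above; it makes no network call.

```python
import requests

url = 'http://www.cntour.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}

# Identity that requests sends when no User-Agent header is supplied.
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, then inspect its headers.
prepared = requests.Request('GET', url, headers=headers).prepare()
sent_agent = prepared.headers['User-Agent']

print(default_agent)
print(sent_agent)
```

The default value starts with python-requests/, which is what makes an unmodified script easy for a server to recognize.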

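2. A person looks at roughly one image per second, while a crawler can fetch hundreds in the same time, which inevitably increases the load on the server. Downloading in bulk from a single IP within a short period does not match how people browse a site, so that IP is likely to be blocked.

IP blocking works simply: the server counts how often each IP makes requests, and once the frequency exceeds a threshold it returns a CAPTCHA. A real user in a browser solves the CAPTCHA and carries on; a script cannot, so it can no longer download data.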
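The frequency check described above can be sketched as a sliding-window counter per IP. This is an illustration of the idea only, not how any particular site implements it; the class name, threshold, and window size are all assumptions.

```python
from collections import defaultdict, deque


class FrequencyLimiter:
    """Allow at most max_requests per IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: the server would show a CAPTCHA
        hits.append(now)
        return True


limiter = FrequencyLimiter(max_requests=3, window_seconds=1.0)
decisions = [limiter.allow('10.0.0.1', t) for t in (0.0, 0.1, 0.2, 0.3)]
print(decisions)  # [True, True, True, False]
```

Timestamps are passed in explicitly so the behaviour can be checked without waiting in real time.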

There are two solutions. The first is to add a delay, for example one fetch every 3 seconds:

import re
import time

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
    strhtml = requests.get(url, headers=headers)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href'),
            'ID': re.findall(r'\d+', item.get('href')),
        }
        print(result)
        # Pause 3 seconds per item. In a crawler that fetches many pages,
        # place this delay between successive requests.get calls.
        time.sleep(3)


if __name__ == '__main__':
    main()
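When several pages are fetched, the delay belongs between requests. The sketch below shows one way to do that with a randomized pause; the function name crawl_politely, its parameters, and the delay range are assumptions, and the fetch function is passed in so the loop can be tried without network access.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=2.0, max_delay=4.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # A randomized pause looks less mechanical than a fixed interval.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Dry run with stand-ins for the network call and the sleep function.
recorded_delays = []
pages = crawl_politely(
    ['page-1', 'page-2', 'page-3'],
    fetch=lambda url: url.upper(),
    sleep=recorded_delays.append,
)
print(pages, len(recorded_delays))
```

In real use, fetch could be something like lambda u: requests.get(u, headers=headers).text, with sleep left at its default.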

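3. The point of writing a crawler is to collect data in bulk efficiently, and fetching once every 3 seconds is too slow. So there is a second, more important solution that addresses the root of the problem.

However the site is accessed, the server's goal is to work out which requests come from code and then block the IP.

Solution: to avoid having the IP blocked, data collection often goes through proxies.

We build our own IP proxy pool (the next post covers how). For now, think of it as a large set of usable proxy IPs, which are easy to find online. We assign them to proxies as a dictionary and pass that to requests: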

import re

import requests  # HTTP client used to download the page
from bs4 import BeautifulSoup


def main():
    url = 'http://www.cntour.cn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
    # requests picks the proxy whose key matches the URL scheme. Keys must
    # be unique: a repeated "https" key would keep only the last entry.
    # These sample addresses may no longer be reachable.
    proxies = {
        'http': 'http://60.163.85.0:9000',
        'https': 'http://218.91.13.2:46332',
    }
    strhtml = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
    for item in data:
        result = {
            'title': item.get_text(),
            'link': item.get('href'),
            'ID': re.findall(r'\d+', item.get('href')),
        }
        print(result)


if __name__ == '__main__':
    main()
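A dictionary can hold only one proxy per scheme, so using several proxy IPs means choosing one per request. The sketch below picks a random address from a small pool and builds the proxies dictionary from it. The helper names are hypothetical, the two addresses are the sample ones from this post, and both are assumed to be plain HTTP proxies.

```python
import random

PROXY_POOL = ['60.163.85.0:9000', '218.91.13.2:46332']


def build_proxies(address):
    """Route both http:// and https:// URLs through the same proxy."""
    proxy_url = f'http://{address}'
    return {'http': proxy_url, 'https': proxy_url}


def pick_proxies(pool, rng=random):
    """Choose one proxy at random from the pool."""
    return build_proxies(rng.choice(pool))


# Repeating a key in a dict literal silently keeps only the last value.
overwritten = {"https": "60.163.85.0:9000", "https": "218.91.13.2:46332"}
print(overwritten)  # {'https': '218.91.13.2:46332'}

print(pick_proxies(PROXY_POOL))
```

The result of pick_proxies can be passed as the proxies argument of requests.get on each call.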