Web Scraping: Data Parsing with bs4 and XPath (Case Study: Fetching Runoob HTML Pages with Scrapy)

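Article structure

1. Introduction to web crawlers
2. The crawling workflow
3. Data parsing
- (1) bs4 parsing
  - (I) Find by tag name: soup.a
  - (II) Get attributes: soup.a[attr]
  - (III) soup.a.text: get the text content (a string)
  - (IV) soup.find("target"): find the first matching tag, optionally filtered by attribute
  - (V) soup.find_all("target"): find all matching tags
  - (VI) Select content with CSS selectors via select
- (2) xpath parsing
  - (I) Locate tag content with etree.xpath()
  - (II) Locate the div with class=song
  - (III) Get data across several tag levels
  - (IV) /text() gets a tag's direct text, //text() gets all text
  - (V) Get attributes: tree.xpath("//a/@href")
- (3) Regex parsing
  - (I) Single characters
  - (II) Quantifiers
  - (III) Boundaries
  - (IV) Grouping
4. Hands-on crawler case study
5. Sample data for the parsing examples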

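1. Introduction to web crawlers

1. Crawler: a program that simulates a browser going online and fetches data from the internet.

2. Types of crawler:

a) General crawler: fetches entire pages
b) Focused crawler: fetches a specific part of a page (usually filtered with regular expressions)
c) Incremental crawler: keeps track of a site and fetches data as it is updated
Anti-crawling mechanisms: techniques a website uses to stop crawlers from fetching its data
Counter-anti-crawling strategies: techniques for getting past anti-crawling mechanisms to obtain the data

3. HTTP: the protocol through which client and server exchange data.

Commonly used header fields:

User-Agent: identifies the client making the request
Connection: 'close' (asks for the connection to be closed once the response has been sent)
Content-Type: in a response, the type of the data the server returns to the client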
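
As a rough illustration of these header fields, the sketch below builds a request object with the standard library's urllib.request and inspects its headers without sending anything over the network. The URL and the User-Agent string are placeholders made up for this example. The rest of this article uses the requests library, where the same dictionary would simply be passed as headers=.

```python
from urllib.request import Request

# Placeholder values for illustration only
headers = {
    "User-Agent": "Mozilla/5.0 (example)",  # identifies the client
    "Connection": "close",                  # ask the server to close the connection afterwards
}

# Building a Request object does not touch the network
req = Request("https://www.example.com/", headers=headers)

# urllib stores header names via str.capitalize(), hence 'User-agent'
user_agent = req.get_header("User-agent")
connection = req.get_header("Connection")
print(user_agent, connection)
```

Many sites reject requests whose User-Agent looks like a script, which is why the crawler code later in the article sets this header explicitly.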

4. HTTPS: the secure version of HTTP, with traffic encrypted using certificates and keys.

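2. The crawling workflow

1) Specify the URL, i.e. the page address. Check whether the data is loaded by an Ajax request; if it is, you need to locate the URL of the request that actually carries the data you want.
2) Send the request: requests.get(url), requests.post(url)
3) Get the response data. The request in step (2) returns a response; depending on the type of data, use content for binary data, text for text, and json() for JSON.
4) Parse the data. Why parse at all? Once we have the whole page we rarely need all of it, and the raw page makes it hard to find the valuable information. Parsing, for example with xpath or the BeautifulSoup library, extracts just the part we need.
5) Persist the data. After step (4) the data is clean and in a consistent format, so it can be stored in a local database or elsewhere.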
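
Steps 3 to 5 can be sketched in miniature with the standard library alone. This is only an illustration under stated assumptions: the bytes below are made-up data standing in for response.content, and the table name pages and its columns are hypothetical. It shows how the three views of a response relate (bytes, decoded text, parsed JSON) and one possible way to persist a cleaned record.

```python
import json
import sqlite3

# Stand-in for response.content: raw bytes a server might return (made-up data)
raw = '{"title": "HTML Tutorial", "url": "/html/html-tutorial.html"}'.encode("utf-8")

# response.text corresponds to the decoded string
text = raw.decode("utf-8")

# response.json() corresponds to parsing that string as JSON
data = json.loads(text)

# Step 5: persist the record; an in-memory SQLite database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)", (data["title"], data["url"]))
conn.commit()

rows = conn.execute("SELECT title, url FROM pages").fetchall()
print(rows)
conn.close()
```

In a real crawler the database would live in a file rather than in memory, and binary downloads such as images would be written from content directly.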

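3. Data parsing

Data parsing is one of the key parts of the whole crawling workflow. Imagine handing your boss the raw data of an entire page: he cannot make anything of it and has to put on his glasses and hunt for the information he wants. If your data has no structure at all, is it valuable? Is it usable?

Purpose of data parsing: narrow the fetched data down to what matters
Parsing methods:
- Regular expressions
- xpath
- bs4
How parsing works:
- 1. Locate the tag
- 2. Extract the text stored in the tag, or the data stored in the tag's attributes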

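(1) bs4 parsing

How it works:
- Instantiate a BeautifulSoup object and load the page source into it
- Use the object's attributes and methods to locate tags and extract data
Instantiating the object:
- BeautifulSoup(page_text, 'lxml') loads response data fetched from the web (the page source)
- BeautifulSoup(fp, 'lxml') loads data from a local file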

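(I) Find by tag name: soup.a

soup.a finds only the first matching tag. The same applies to any other tag name, such as soup.div.

from bs4 import BeautifulSoup

# test.html contains the sample HTML listed in section 5
with open('./test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.div.text)  # text of the first <div>

The printed text has a newline above and below it. That is not always the case; it depends on how the source data is formatted.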

(II) Get attributes: soup.a[attr]

soup.a.attrs returns all attributes of a and their values as a dictionary
soup.a.attrs['href'] gets the href attribute
soup.a['href'] is the shorthand form

print('\n', soup.a.attrs)
# Output: {'href': 'http://www.song.com/', 'title': '赵匡胤', 'target': '_self'}
print(soup.a['href'])
# Output: http://www.song.com/
print(soup.a.attrs['href'])
# Output: http://www.song.com/
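
One thing worth knowing when reading attributes: indexing with soup.a['x'] raises a KeyError when the attribute is missing, while Tag.get returns None or a default. The sketch below shows this on a small made-up HTML snippet. It uses the built-in 'html.parser' instead of 'lxml' purely so that it runs without lxml or test.html.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <a> has no href
html = '<div><a href="http://www.song.com/" title="song">first</a><a class="du">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

href = first["href"]                # attribute exists: plain indexing works
missing = second.get("href")        # attribute absent: get() returns None instead of raising
fallback = second.get("href", "#")  # or a default of your choosing
classes = second["class"]           # class is multi-valued, so bs4 returns a list

print(href, missing, fallback, classes)
```

When looping over many links on a real page, get() is usually the safer choice, because a single tag without the attribute would otherwise stop the whole crawl.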

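(III) soup.a.text: get the text content (a string)

soup.a.string
string is not recommended: it only works when the tag has a single child. If the tag contains another tag alongside its own text, string is None, whereas the other approaches return all of the text.
soup.a.text gets all text under the tag
soup.a.get_text() gets all text under the tag

print('string: ',soup.a.string)  # None for the first <a> in the sample data, which holds a <span> plus text
print('text: ',soup.a.text)
print('get_text(): ', soup.a.get_text())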
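
The difference between string and get_text() is easier to see on a tiny example. The HTML below is made up for illustration, and 'html.parser' is used only to keep the sketch free of extra dependencies. It also shows the separator and strip parameters of get_text(), which are handy for cleaning up the result.

```python
from bs4 import BeautifulSoup

html = "<p><span>this is span</span> and some direct text</p><b>only text</b>"
soup = BeautifulSoup(html, "html.parser")

p_string = soup.p.string      # None: <p> has two children (a <span> and a text node)
b_string = soup.b.string      # 'only text': <b> has a single text child
p_text = soup.p.get_text()    # all text under <p>, concatenated
p_parts = soup.p.get_text(separator="|", strip=True)  # pieces stripped, then joined with '|'

print(p_string, b_string, p_text, p_parts)
```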

(IV) soup.find("target"): find the first matching tag, optionally filtered by attribute

print(soup.find('a'))    # the first <a> tag
print(soup.find('a', title="qing"))  # the first <a> whose title attribute is "qing"

(V) soup.find_all("target"): find all matching tags

soup.find_all('a')
soup.find_all(['a', 'b']) finds all a and b tags
soup.find_all('a', limit=2) returns only the first two

print(soup.find_all('a'))
print(soup.find_all(['div', 'a'], limit=2))
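
Beyond tag names, find_all can also filter on attributes. The following is a minimal sketch on made-up HTML (again with 'html.parser' so it runs on its own). It shows class_, the attrs dictionary, limit, and the fact that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/a" class="du">one</a></li>
  <li><a href="/b" class="du">two</a></li>
  <li><a href="/c" id="feng">three</a></li>
  <li><b>four</b></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="du")           # class_ because class is a Python keyword
by_attrs = soup.find_all("a", attrs={"id": "feng"})  # attrs takes a dict of attribute filters
first_two = soup.find_all(["a", "b"], limit=2)       # stop after two matches
nothing = soup.find("img")                           # no match: find() returns None

texts = [tag.text for tag in by_class]
print(texts, by_attrs[0]["href"], len(first_two), nothing)
```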

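(VI) Select content with CSS selectors via select

soup.select('#feng')

Common selectors: tag selector (a), class selector (.), id selector (#), hierarchical selectors
- Hierarchical selectors:
- div .dudu, for example [lala .meme .xixi]: a descendant relationship that matches at any depth below
- div > p > a > .lala: each step matches only the level directly below
  [Note] select always returns a list; use an index to get a specific element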
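
To make the difference between the descendant selector and the child selector concrete, here is a small sketch on made-up HTML, parsed with 'html.parser' so it has no extra dependencies. It also shows select_one, which returns a single element (or None) instead of a list.

```python
from bs4 import BeautifulSoup

html = """
<div class="tang">
  <p><a id="feng" href="/x">inside p</a></p>
  <a class="du" href="/y">directly inside div</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("div a")   # any <a> below a <div>, at any depth
children = soup.select("div > a")    # only <a> tags that are direct children of a <div>
by_id = soup.select("#feng")         # still a list, even though the id is unique
one = soup.select_one("#feng")       # first match, or None if nothing matches

print(len(descendants), len(children), by_id[0]["href"], one.text)
```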

print(soup.select('#feng'))
print(soup.select('div span'))
"""Output
[<a href="http://www.haha.com" id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a>]
[<span class="tag-item">语文</span>, <span class="tag-item">数学</span>, <span class="tag-item">物理</span>, <span class="kong">'上看见看见啦'</span>, <span class="wu">上看见</span>, <span>this is span</span>]
"""

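(2) xpath parsing

The principle is similar to bs4's selectors, but XPath is more concise and more general, so it is the approach recommended here.

etree.parse('path') loads from a local file
etree.HTML(page_text) loads from an HTML string, such as the page source of a response.

Everything in XPath works by locating tags step by step and then extracting content.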
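
The two ways of building a tree can be sketched as follows. The HTML string is made up, and io.BytesIO merely stands in for a file on disk so the example runs on its own. Passing etree.HTMLParser() to etree.parse is a choice made for this sketch: the default parser expects well-formed XML, which real web pages often are not.

```python
import io
from lxml import etree

page_text = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# From a string, for example the text of a response
tree_from_string = etree.HTML(page_text)

# From a file-like object; io.BytesIO stands in for a local file here
fake_file = io.BytesIO(page_text.encode("utf-8"))
tree_from_file = etree.parse(fake_file, etree.HTMLParser())

title_a = tree_from_string.xpath("//title/text()")
title_b = tree_from_file.xpath("//title/text()")
print(title_a, title_b)
```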

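(I) Locate tag content with etree.xpath()

from lxml import etree
# Assumption: tree is built from test.html, the sample HTML in section 5
tree = etree.parse('./test.html', etree.HTMLParser())
print(tree.xpath('//title/text()'))
# Output: ['测试bs4']

(II) Locate the div with class=song

print(tree.xpath('//div[@class="song"]'))  # a list of Element objects

(III) Get data across several tag levels

print(tree.xpath('//div[@class="tang"]/ul/li[4]/a'))  # li[4] is the fourth <li>; XPath indices start at 1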

(IV) /text() gets a tag's direct text, //text() gets all text

print(tree.xpath('//title/text()'))   # text directly inside <title>
print(tree.xpath('//title//text()'))  # text at any depth under <title>
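
For <title> the two expressions return the same thing, because the tag contains only text. The difference shows up once there are nested tags, as in this small sketch on made-up HTML:

```python
from lxml import etree

html = '<div class="song">direct<span>nested</span><p>deeper <b>still</b></p></div>'
tree = etree.HTML(html)

direct = tree.xpath('//div[@class="song"]/text()')      # only text nodes that are children of the div
everything = tree.xpath('//div[@class="song"]//text()')  # text nodes at any depth below the div
joined = "".join(everything)                             # xpath returns a list, so join it if needed

print(direct, everything, joined)
```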

(V) Get attributes: tree.xpath("//a/@href")

href is one of the tag's attributes; prefix the attribute name with the @ symbol.

print(tree.xpath('//a/@href'))
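
A few related patterns, sketched on a shortened, made-up version of the sample list: combining a position with an attribute lookup, filtering on an attribute value, and handling the empty list that xpath returns when nothing matches.

```python
from lxml import etree

html = """
<ul>
  <li><a href="http://www.baidu.com" title="qing">one</a></li>
  <li><a href="http://www.163.com">two</a></li>
  <li><a href="http://www.sina.com" class="du">three</a></li>
</ul>
"""
tree = etree.HTML(html)

all_hrefs = tree.xpath("//a/@href")                 # every href, as a list of strings
second = tree.xpath("//li[2]/a/@href")              # XPath positions are 1-based
by_title = tree.xpath('//a[@title="qing"]/text()')  # filter on an attribute, then read the text
no_match = tree.xpath("//a/@id")                    # nothing matches: an empty list, not an error

first_href = all_hrefs[0] if all_hrefs else None    # guard before indexing
print(all_hrefs, second, by_title, no_match, first_href)
```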

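(3) Regex parsing

Common regular expression syntax:

(I) Single characters:

. : any character except a newline
[] : [aoe] [a-w] an "or" relationship; matches any one character in the set
\d : a digit [0-9]
\D : a non-digit
\w : a digit, letter, underscore, or Chinese character
\W : anything other than a digit, letter, underscore, or Chinese character
\s : any whitespace character, including spaces, tabs, form feeds, and so on. For ASCII text this is equivalent to [ \f\n\r\t\v].
\S : a non-whitespace character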
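
A quick way to get a feel for these classes is to run them through re.findall. The test string below is made up for this sketch.

```python
import re

s = "tel: 138-0000 宋朝_a\n"

digits = re.findall(r"\d", s)                  # each digit on its own
word_runs = re.findall(r"\w+", s)              # for str patterns, \w also covers Chinese characters
ascii_runs = re.findall(r"\w+", s, re.ASCII)   # with re.ASCII, \w is limited to [a-zA-Z0-9_]
vowels = re.findall(r"[aoe]", s)               # any one character from the set
dot = re.findall(r".", "a\nb")                 # '.' skips the newline

print(digits, word_runs, ascii_runs, vowels, dot)
```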

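(II) Quantifiers:

* : any number of times (>= 0)
+ : at least once (>= 1)
? : optional, 0 or 1 time
{m} : exactly m times, e.g. hello{3}
{m,} : at least m times, e.g. hello{3,}
{m,n} : between m and n times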
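
One detail that often trips people up: a quantifier applies only to the single item directly before it, so in hello{3} it is just the final o that repeats. The sketch below checks this with re.fullmatch; the sample strings are made up.

```python
import re

exact = re.fullmatch(r"hello{3}", "hellooo") is not None         # 'o' exactly three times
too_few = re.fullmatch(r"hello{3}", "hello") is not None         # only one 'o'
at_least = re.fullmatch(r"hello{3,}", "helloooooo") is not None  # three or more 'o'
grouped = re.fullmatch(r"(?:ab){2,3}", "ababab") is not None     # a group makes 'ab' the repeated unit
optional = re.findall(r"colou?r", "color colour")                # 'u' may appear 0 or 1 time
star = re.match(r"\d*", "abc").group()                           # zero digits is still a match
plus = re.match(r"\d+", "abc")                                   # at least one digit is required

print(exact, too_few, at_least, grouped, optional, repr(star), plus)
```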

(III) Boundaries:

$ : ends with
^ : starts with
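
By default ^ and $ refer to the start and end of the whole string. With the re.M flag they also match at the start and end of each line, as this small sketch on a made-up two-line string shows.

```python
import re

text = "cast first\nlast cast"

starts = re.findall(r"^cast", text)                 # only the very start of the string
starts_multi = re.findall(r"^\w+", text, re.M)      # with re.M, ^ also matches after each newline
ends_multi = re.findall(r"\w+$", text, re.M)        # and $ also matches before each newline
whole = re.fullmatch(r"\d{4}", "2024") is not None  # fullmatch requires the entire string to match

print(starts, starts_multi, ends_multi, whole)
```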

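(IV) Grouping:

() : groups part of a pattern; the text matched inside the parentheses can be retrieved on its own
Greedy mode: .* keeps matching as long as the pattern is satisfied, taking as much as possible
Non-greedy (lazy) mode: .*? takes as little as possible, stopping as soon as the rest of the pattern can match
re.I : ignore case
re.M : multi-line matching (^ and $ match at the start and end of every line)
re.S : single-line matching (. also matches newlines)
re.sub(pattern, replacement, string)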
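
These modes and flags are easiest to compare side by side. The two-line HTML string below is made up for this sketch. Regex is used on it only to demonstrate the syntax; for real pages bs4 or xpath is usually more robust.

```python
import re

html = '<a href="/a">one</a>\n<a href="/b">two</a>'

greedy = re.findall(r"<a.*>", html)              # runs to the last '>' on each line
lazy = re.findall(r"<a.*?>", html)               # stops at the first '>' after '<a'
across_lines = re.findall(r"<a.*>", html, re.S)  # with re.S, '.' crosses the newline as well
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)  # two groups: the href and the link text
upper = re.findall(r"<A", html, re.I)            # case-insensitive match
cleaned = re.sub(r"<.*?>", "", html)             # remove the tags, keep the text

print(greedy, lazy, across_lines, pairs, upper, cleaned)
```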

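import re

pattern = 'cast'
s = 'ihellotcast hellolima'
res = re.match(pattern, s)   # None: re.match only matches at the start of the string
print(re.match('.', 'i'))    # a Match object
print(re.match('..', 'i'))   # None
print(re.match('..', 'ii'))  # a Match object (truthy), not the literal True
print('================')
print(re.match(r'\d*', 'abc').group())  # '' because zero digits still counts as a match
# Aside, unrelated to regex: CPython caches small integers, so equal small ints share one id
a = [5,5,6]
b = [3,5,9]
print(id(5),id(6))
print(id(a[0]), id(a[1]), id(a))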
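
Since re.match only looks at the start of the string, searching for 'cast' in the string above needs re.search or re.findall instead. A minimal sketch of the three functions, reusing the same test string:

```python
import re

s = 'ihellotcast hellolima'

at_start = re.match('cast', s)      # None: 'cast' is not at position 0
anywhere = re.search('cast', s)     # first occurrence anywhere in the string
all_hello = re.findall('hello', s)  # every non-overlapping occurrence, as a list of strings

# match and search return None on failure, so check before calling .group()
found = anywhere.group() if anywhere else None
position = anywhere.span() if anywhere else None

print(at_start, found, position, all_hello)
```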

Regular expressions are one of the harder topics; it is worth looking up more material online.

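4. Hands-on crawler case study

This case study fetches the Runoob (菜鸟教程) HTML tutorial: every detail page linked from the titles in the left-hand column is downloaded in full and saved locally.

Project structure: a Scrapy project named spider_engine_cainiao, with the spider in spiders/cnHtml.py.

url = https://www.runoob.com/html/html-basic.html

The main file, cnHtml.py, contains the following code: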

import os
import scrapy
from lxml import etree
import kuser_agent as kua


class CnhtmlSpider(scrapy.Spider):
    name = 'cnHtml'
    # Create the download directory once, when the class is defined
    if not os.path.exists(r'D:\Home\python爬虫\spider_engine_cainiao\HTML_download'):
        os.mkdir(r'D:\Home\python爬虫\spider_engine_cainiao\HTML_download')

    def start_requests(self):
        # Set the callback explicitly; without it Scrapy falls back to parse(), which is not defined here
        yield scrapy.Request(url='https://www.runoob.com/html/html-editors.html',
                             headers={'User-Agent': kua.get()},
                             callback=self.parse_start)

    def parse_start(self, response):
        # Collect the links in the left-hand column
        href_all = etree.HTML(response.body).xpath('//div[@id="leftcolumn"]/a/@href')
        url_prefix = 'https://www.runoob.com'
        for href in href_all:
            url = url_prefix + href
            yield scrapy.Request(url=url, callback=self.parse1)

    def parse1(self, response):
        # Build the file name from the part of <title> before the '|'
        file_name = etree.HTML(response.body).xpath('//title/text()')[0]
        file_name = file_name.split('|')[0].strip().replace(' ', '-') + '.html'
        yield {'file_name': file_name, 'content': response.body}

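The pipeline file, pipelines.py, is responsible for saving the data:

class SpiderEngineCainiaoPipeline:
    def process_item(self, item, spider):
        name = item['file_name']
        print(name)
        content = item['content']
        # Write the raw page bytes to the download directory
        with open(r'D:\Home\python爬虫\spider_engine_cainiao\HTML_download\%s' % name, 'wb') as f:
            f.write(content)
        return item  # hand the item on to any later pipeline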

The launcher file, start.py:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
from spider_engine_cainiao.spiders.cnHtml import CnhtmlSpider


def spider_interface():
    settings = get_project_settings()
    cp = CrawlerProcess(settings)
    cp.crawl(CnhtmlSpider)
    cp.start()


if __name__ == '__main__':
    spider_interface()

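In addition, in the configuration file settings.py we need to enable the downloader middleware and the pipeline, that is, uncomment the statements that are commented out by default:

DOWNLOADER_MIDDLEWARES = {
    'spider_engine_cainiao.middlewares.SpiderEngineCainiaoDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'spider_engine_cainiao.pipelines.SpiderEngineCainiaoPipeline': 300,
}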

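I also tried crawling the pages directly by brute force with a plain general crawler, but ran into problems from requesting too quickly. This will be improved later, for example by adding an IP proxy pool or by other means.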
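
Before reaching for a proxy pool, one simple mitigation for requesting too quickly is to retry with an increasing delay. The sketch below is only one possible approach, not something the original project implements: fetch_with_backoff is a hypothetical helper, and it takes the fetching function as a parameter so that it can be demonstrated here with a stand-in instead of real network calls.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait base_delay, then 2x, 4x, ... and retry.

    fetch is any callable that returns the page or raises an exception.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the error surface
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in fetcher that fails twice, then succeeds
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("too many requests")
    return b"<html>ok</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_backoff(flaky_fetch, "https://www.example.com/", sleep=waits.append)
print(page, waits)
```

With requests, fetch could be something like lambda u: requests.get(u, headers=headers, timeout=10).content. In the Scrapy version, the DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings serve a similar purpose.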

import os
import time
import requests
import kuser_agent as kua
from lxml import etree

url = 'https://www.runoob.com/html/html-tutorial.html'
headers = {'User-Agent': kua.get()}
page_text = requests.get(url=url, headers=headers, timeout=100).text
# Each <a> in the left-hand column links to one tutorial page
text = etree.HTML(page_text).xpath('//div[@id="leftcolumn"]/a')
os.makedirs('./cainiao', exist_ok=True)  # make sure the output directory exists
for idx, a in enumerate(text):
    href = 'https://www.runoob.com' + a.xpath('./@href')[0]
    filename = a.xpath('./text()')[0].strip() + '.html'
    content = requests.get(url=href, headers={'User-Agent': kua.get()}).content
    with open('./cainiao/%s' % filename, 'wb') as f:
        f.write(content)
    time.sleep(3)  # pause between requests to avoid hitting the site too fast
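
The code above builds each link by concatenating a fixed prefix with the href, which assumes every href starts with a slash. A slightly more forgiving alternative, shown here as a sketch and not as what the original code does, is urllib.parse.urljoin from the standard library. The href values below are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://www.runoob.com/html/html-tutorial.html"

absolute_path = urljoin(page_url, "/html/html-editors.html")   # href that starts at the site root
relative_path = urljoin(page_url, "html-basic.html")           # href relative to the current page
already_full = urljoin(page_url, "https://www.example.com/x")  # full URLs pass through unchanged

print(absolute_path, relative_path, already_full)
```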

5. Sample data for the parsing examples

<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试bs4</title>
</head>
<body>
    <ul>'就看见离开家'</ul>
    <div class="hhh">
        <span class="tag-item">语文</span>
        <span class="tag-item">数学</span>
        <span class="tag-item">物理</span>
    </div>
    <div class="song">
        <span class="kong">'上看见看见啦'</span>
        <span class="wu">上看见</span>
        <p>李清照</p>
        <p>王安石</p>
        <p>苏轼</p>
        <p>柳宗元</p>
        <a href="http://www.song.com/" title="赵匡胤" target="_self"><span>this is span</span>宋朝是最强大的王朝，不是军队的强大，而是经济很强大，国民都很有钱</a>
        <a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
        <img src="http://www.baidu.com/meinv.jpg" alt="" />
    </div>
    <div class="qingkong">
        <p>百里守约</p>
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="qing">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村</a></li>
            <li><a href="http://www.163.com" title="秦">秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山</a></li>
            <li><a href="http://www.126.com" alt="qi">岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君</a></li>
            <li><a href="http://www.sina.com" class="du">杜甫</a></li>
            <li><a href="http://www.dudu.com" class="du">杜牧</a></li>
            <li><b>杜小月</b></li>
            <li><i>度蜜月</i></li>
            <li><a href="http://www.haha.com" id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a></li>
        </ul>
    </div>
</body>
</html>