Three ways to extract data in a Python crawler

There are three common ways to extract data from a web page: regular expressions, Beautiful Soup, and lxml.

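1. Regular expressions

The big drawbacks of regular expressions are that they are hard to construct, hard to read, and brittle when the page layout changes later.

Extraction steps: create the regex object --> match/search --> extract and save the data

A rough skeleton: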

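import re
import requests

url = 'http://xxxx.com/sdffs'  # placeholder URL
html = requests.get(url).text  # download the page source (assumes requests in place of a custom download helper)
results = re.findall(r'regular expression', html)  # substitute a real pattern here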

Sample HTML:

<html>
<div><a href='www.baidu.com'>正则</a></div>
<div>111111</div>
<div><a href='www.baidu1.com'>正则1</a></div>
<div>222222</div>
<div><a href='www.baidu2.com'>正则2</a></div>
<div>333333</div>
<div><a href='www.baidu3.com'>正则3</a></div>
<div>444444</div>
</html>

Example: extract the text of every a tag

pattern = re.compile(r'<a.*?>(.*?)</a>', re.S)

a_text = re.findall(pattern, html)
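The three steps (create the regex object, match, extract and save) can be sketched end to end on the sample HTML above. This is only a minimal sketch: the HTML is inlined so nothing is downloaded, and the names `link_pattern` and `records` are made up for illustration.

```python
import re

# The sample HTML from above, inlined so the sketch runs without a network call.
html = """<html>
<div><a href='www.baidu.com'>正则</a></div>
<div>111111</div>
<div><a href='www.baidu1.com'>正则1</a></div>
<div>222222</div>
<div><a href='www.baidu2.com'>正则2</a></div>
<div>333333</div>
<div><a href='www.baidu3.com'>正则3</a></div>
<div>444444</div>
</html>"""

# Step 1: create the regex object. Two capture groups: href value and link text.
link_pattern = re.compile(r"<a\s+href='(.*?)'.*?>(.*?)</a>", re.S)

# Step 2: match. With two groups, findall returns a list of (href, text) tuples.
matches = link_pattern.findall(html)

# Step 3: extract and save, here into a list of dicts.
records = [{'text': text, 'href': href} for href, text in matches]

for record in records:
    print(record)
```

Note that the pattern hard-codes single-quoted href attributes, which is exactly the kind of detail that breaks when the page changes.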

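Key points:

findall returns a list. With one capture group (or none) each item is a string; with two or more groups each item is a tuple.

search returns a match object (or None), so you normally call group() or groups() on it to get the text.

re.S makes . match newlines as well, so a pattern is no longer limited to a single line and can span the whole HTML string.

.*? is non-greedy matching; .* is greedy matching.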
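Each of these points is easy to check in a few lines. The snippet below is a small sketch on a made-up two-link string (the second link's text deliberately contains a newline); the variable names are only for illustration.

```python
import re

html = "<div><a href='a.com'>first</a></div>\n<div><a href='b.com'>second\nline</a></div>"

# One capture group: findall returns a list of strings.
texts = re.findall(r'<a.*?>(.*?)</a>', html, re.S)

# Two capture groups: findall returns a list of tuples.
pairs = re.findall(r"<a href='(.*?)'>(.*?)</a>", html, re.S)

# search returns a match object (or None); group()/groups() pull out the text.
m = re.search(r"<a href='(.*?)'>(.*?)</a>", html, re.S)
first_href = m.group(1)
first_groups = m.groups()

# Without re.S, '.' stops at a newline, so the second link is missed.
no_dotall = re.findall(r'<a.*?>(.*?)</a>', html)

# Greedy '.*' swallows as much as it can, so the two links collapse into one match.
greedy = re.findall(r'<a.*>(.*)</a>', html, re.S)

print(texts, pairs, first_href, first_groups, no_dotall, greedy, sep='\n')
```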

2. Beautiful Soup

This is a very popular Python module. Install it with:

pip install beautifulsoup4

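The first step with this module is to parse the downloaded HTML into a soup document. Because many web pages are not well-formed HTML, Beautiful Soup works out the intended structure and repairs the markup into well-formed HTML. How exactly it repairs a given page depends on the underlying parser you choose.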
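To see the repair step in action, here is a minimal sketch on a made-up malformed snippet (unquoted attribute value, unclosed li tags). It uses Python's built-in html.parser so no extra parser is needed; other parsers such as lxml or html5lib may repair the same input into a slightly different tree.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed snippet: unquoted attribute value, unclosed <li> tags.
broken_html = "<ul class=country><li>Area<li>Population</ul>"

# html.parser ships with Python; the choice of parser affects how the
# broken markup is repaired.
soup = BeautifulSoup(broken_html, 'html.parser')

ul = soup.find('ul')
items = soup.find_all('li')

# prettify() shows the repaired document with quoted attributes and closing tags.
print(soup.prettify())
print(ul['class'])
print(len(items))
```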

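Suppose we want to grab the title and link of each news item and print them combined as a dictionary. First, inspect the HTML source to see how the news title information is organized.

We can see that the target information lives in the text and the href attribute of the a tag under an em tag. We can build the request directly with the requests library and parse the response with BeautifulSoup or lxml.

Method 1: `requests` + `BeautifulSoup` + `select` CSS selectors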

# select method
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
url = 'http://news.qq.com/'
soup = BeautifulSoup(requests.get(url=url, headers=headers).text, 'lxml')
# select returns the <a> tags nested under the matching <em> tags
em = soup.select('em[class="f14 l24"] a')
for i in em:
    title = i.get_text()
    link = i['href']
    print({'title': title, 'link': link})

A very conventional approach.

Method 2: `requests` + `BeautifulSoup` + `find_all` for extraction

# find_all method
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
url = 'http://news.qq.com/'
soup = BeautifulSoup(requests.get(url=url, headers=headers).text, 'lxml')
# find_all returns the <em> tags; the link is the <a> child of each one
em = soup.find_all('em', attrs={'class': 'f14 l24'})
for i in em:
    title = i.a.get_text()
    link = i.a['href']
    print({'title': title, 'link': link})

This is the same requests + BeautifulSoup combination, but the extraction uses find_all instead.
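Since the live page may have changed since this was written, the two Beautiful Soup methods can also be compared offline. The snippet below is a sketch on a made-up fragment that mimics the em/a structure described above; the example.com URLs and headline texts are invented, and html.parser is used so no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure described above.
html = """
<div>
  <em class="f14 l24"><a href="http://example.com/news/1">First headline</a></em>
  <em class="f14 l24"><a href="http://example.com/news/2">Second headline</a></em>
  <em class="other"><a href="http://example.com/ad">Not a headline</a></em>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Method 1: CSS selector; select returns the matching <a> tags directly.
by_select = [{'title': a.get_text(), 'link': a['href']}
             for a in soup.select('em[class="f14 l24"] a')]

# Method 2: find_all returns the <em> tags; step into .a for each one.
by_find_all = [{'title': em.a.get_text(), 'link': em.a['href']}
               for em in soup.find_all('em', attrs={'class': 'f14 l24'})]

print(by_select)
print(by_find_all)
```

Both methods match on the exact class string "f14 l24" here, so an em tag with the same classes in a different order would be skipped by either one.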

3. lxml

lxml is a Python wrapper around the libxml2 XML parsing library. The module is written in C and parses faster than Beautiful Soup.
Install it with:

pip install lxml
pip install cssselect

Method 3: `requests` + `lxml/etree` + `xpath` expressions

# lxml/etree method
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
url = 'http://news.qq.com/'
html = requests.get(url=url, headers=headers)
con = etree.HTML(html.text)
# two parallel lists: link texts and href values
title = con.xpath('//em[@class="f14 l24"]/a/text()')
link = con.xpath('//em[@class="f14 l24"]/a/@href')
for i in zip(title, link):
    print({'title': i[0], 'link': i[1]})

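This parses with the etree module of the lxml library and then extracts the information with XPath expressions, which is somewhat more efficient than the BeautifulSoup + select method. The two result lists are combined with zip.

Method 4: `requests` + `lxml/html/fromstring` + `xpath` expressions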

# lxml/html/fromstring method
import requests
import lxml.html as HTML

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
url = 'http://news.qq.com/'
con = HTML.fromstring(requests.get(url=url, headers=headers).text)
title = con.xpath('//em[@class="f14 l24"]/a/text()')
link = con.xpath('//em[@class="f14 l24"]/a/@href')
for i in zip(title, link):
    print({'title': i[0], 'link': i[1]})

Similar to Method 3, except that parsing uses the fromstring function from the lxml.html module.
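The two lxml entry points can be checked side by side without hitting the network. Below is a minimal sketch on a made-up fragment with the same em/a structure; the example.com URLs, the headline texts, and the helper name `extract` are all invented for illustration.

```python
from lxml import etree
import lxml.html

# Hypothetical snippet mimicking the structure described above.
html = """
<div>
  <em class="f14 l24"><a href="http://example.com/news/1">First headline</a></em>
  <em class="f14 l24"><a href="http://example.com/news/2">Second headline</a></em>
  <em class="other"><a href="http://example.com/ad">Not a headline</a></em>
</div>
"""

# Method 3 parses with etree.HTML, Method 4 with lxml.html.fromstring.
tree_a = etree.HTML(html)
tree_b = lxml.html.fromstring(html)

title_xpath = '//em[@class="f14 l24"]/a/text()'
link_xpath = '//em[@class="f14 l24"]/a/@href'


def extract(tree):
    """Run both XPath queries and pair the results up by position."""
    titles = tree.xpath(title_xpath)
    links = tree.xpath(link_xpath)
    return [{'title': t, 'link': l} for t, l in zip(titles, links)]


from_etree = extract(tree_a)
from_fromstring = extract(tree_b)

print(from_etree)
print(from_fromstring)
```

One thing to keep in mind with the zip approach: the two lists are paired purely by position, so if some a tag had no text or no href the titles and links would drift out of step. Selecting the a elements once and reading both values from each element avoids that.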

Comparison of the three approaches

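Method	Performance	Difficulty of use	Difficulty of installation
Regular expressions	Fast	Hard	Easy (built-in module)
Beautiful Soup	Slow	Easy	Easy (pure Python)
lxml	Fast	Easy	Relatively hard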

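In general, lxml is the best choice for scraping because it is both fast and robust, while regular expressions and Beautiful Soup are mainly useful in certain specific scenarios.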
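If you want to check the performance column on your own machine, a rough timing harness might look like the sketch below. Everything in it is an assumption for illustration: the generated page of 200 made-up headlines, the function names, and the use of html.parser for Beautiful Soup (with the lxml parser it would likely be faster). Absolute numbers will vary by machine and library version, so treat the output as a rough indication only.

```python
import re
import timeit

import lxml.html
from bs4 import BeautifulSoup

# Build a hypothetical page with 200 headline links.
row = '<em class="f14 l24"><a href="http://example.com/news/{0}">Headline {0}</a></em>\n'
html = '<html><body>' + ''.join(row.format(i) for i in range(200)) + '</body></html>'


def with_regex(page):
    # Two capture groups, so findall returns (href, text) tuples.
    return re.findall(r'<em class="f14 l24"><a href="(.*?)">(.*?)</a>', page, re.S)


def with_bs4(page):
    soup = BeautifulSoup(page, 'html.parser')
    return [(a['href'], a.get_text()) for a in soup.select('em[class="f14 l24"] a')]


def with_lxml(page):
    tree = lxml.html.fromstring(page)
    return [(a.get('href'), a.text) for a in tree.xpath('//em[@class="f14 l24"]/a')]


# All three should extract the same (href, text) pairs.
results = {f.__name__: f(html) for f in (with_regex, with_bs4, with_lxml)}

for func in (with_regex, with_bs4, with_lxml):
    seconds = timeit.timeit(lambda: func(html), number=20)
    print(f'{func.__name__}: {seconds:.4f}s for 20 runs')
```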

Reference: https://blog.csdn.net/apple9005/article/details/54930982

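Many people find crawlers hard to master because there is so much to know: you need some front-end knowledge, fluent Python, and databases, not to mention regular expressions and XPath expressions. In practice, for a simple page it is worth trying several extraction approaches and carrying what you learn from one over to the others, which also gives you a deeper understanding of Python crawlers. Keep at it and you will have seen all kinds of page structures, and the experience will build up naturally.

Hope this helps.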