Web Scraping: Font-Based Anti-Scraping (for study and reference only)

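This article covers the problems I ran into when a site used font-based anti-scraping, and how I solved each of them.
I am fairly new to web scraping, so there are likely places I have not thought through. If you spot a problem in this article, please point it out. Thank you in advance.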

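Requesting the target URL

1. Request the target URL with the requests module

import requests

urll = ''
header = {  # Cookie, User-Agent, Referer (hotlink protection) and similar headers
}
dataa = {  # other parameters
}
# For a GET request, extra parameters are sent in the query string via params=
reque = requests.get(url=urll, headers=header, params=dataa)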

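2. Request the target URL with the selenium module (this program does not actually use selenium; to use it, replace the requests calls above)

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# Configure a headless browser and reduce automation detection
option = Options()
# Headless browser
option.add_argument("--headless")
option.add_argument("--disable-gpu")
# Hide the automation flag from detection scripts
option.add_argument('--disable-blink-features=AutomationControlled')
web = Chrome(options=option)
# Request the target URL
web.get('')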

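Getting the response data

1. requests module

reque.encoding = 'UTF-8'    # set the response encoding
y_html = reque.text         # get the page source

2. selenium module

y_html = web.page_source  # get the page source

3. A locally saved HTML file, mainly for testing the data processing

with open('./dzdp.html', mode='r', encoding='utf-8') as f:  # locally saved HTML file, mainly for testing the data processing
    y_html = f.read()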

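Parsing the data

After fetching the page source I found that the data is embedded in the page, but part of it is garbled. I first extracted data with xpath; the obfuscated fields came out entirely garbled and would not print properly. Further investigation showed that this is font obfuscation: the page renders those characters through a custom font. I then searched for common font formats, and the search for woff turned up what I needed. The extraction and decoding steps follow.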
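One way to confirm that garbled fields are font-obfuscated is to look for character entities that fall in the Unicode Private Use Area, since those code points have no standard glyph. The following is a minimal sketch, assuming each obfuscated character is wrapped in an `<svgmtsi>` tag holding a `&#x....;` entity, as the fragments handled later in this article suggest. The sample HTML and the helper names `find_obfuscated` and `in_private_use_area` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical fragment shaped like the obfuscated markup discussed in this article.
sample_html = (
    '<b><svgmtsi class="shopNum">&#xe0a1;</svgmtsi>'
    '<svgmtsi class="shopNum">&#xf3b2;</svgmtsi>1</b>'
    '<span class="tag"><svgmtsi class="tagName">&#xe5c7;</svgmtsi>菜</span>'
)

ENTITY_RE = re.compile(
    r'<svgmtsi class="(?P<cls>[^"]+)">&#x(?P<code>[0-9a-fA-F]+);</svgmtsi>'
)


def find_obfuscated(html):
    """Return (css_class, code_point) pairs for every obfuscated character."""
    return [(m.group('cls'), int(m.group('code'), 16)) for m in ENTITY_RE.finditer(html)]


def in_private_use_area(code_point):
    """U+E000..U+F8FF carry no standard glyph; a custom font has to supply one."""
    return 0xE000 <= code_point <= 0xF8FF


hits = find_obfuscated(sample_html)
per_class = Counter(cls for cls, _ in hits)
print(per_class)  # how many obfuscated characters each CSS class carries
print(all(in_private_use_area(cp) for _, cp in hits))
```

Grouping the hits by CSS class is useful because each class is typically styled with its own font file, which hints at how many fonts need to be downloaded.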
1. Extract the unobfuscated data with xpath

from lxml import etree

tree = etree.HTML(y_html)  # parse the page source so that xpath can be used
# Shop names
store_name_list = tree.xpath('/html/body/div[2]/div[3]/div[1]/div[1]/div[2]/ul/li/div[2]/div[1]/a/h4/text()')
# print(store_name_list)
# Links
link_list = tree.xpath('/html/body/div[2]/div[3]/div[1]/div[1]/div[2]/ul/li/div[2]/div[1]/a[1]/@href')
# print(link_list)
# Recommended dishes
dish_recommendation_list = []
for i in range(1, 16):
    dish_recommendation = list(tree.xpath(f'/html/body/div[2]/div[3]/div[1]/div[1]/div[2]/ul/li[{i}]/div[2]/div[4]/a/text()'))
    dish_recommendation_list.append(dish_recommendation)
# print(dish_recommendation_list)

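2. Parse the obfuscating fonts (search for woff in the page, open the matching css file, search for woff inside it, then paste the link into a browser to download the font)

from fontTools.ttLib import TTFont

# Characters in the woff font's glyph order, excluding glyphs 0 and 1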
woff_str = '1234567890店中美家馆小车大市公酒行国品发电金心业商司超生装园场食有新限天面工服海华水房饰城乐汽香部利子老艺花专东肉菜学福饭人百餐茶务通味所山区门药银农龙停尚安广鑫一容动南具源兴鲜记时机烤文康信果阳理锅宝达地儿衣特产西批坊州牛佳化五米修爱北养卖建材三会鸡室红站德王光名丽油院堂烧江社合星货型村自科快便日民营和活童明器烟育宾精屋经居庄石顺林尔县手厅销用好客火雅盛体旅之鞋辣作粉包楼校鱼平彩上吧保永万物教吃设医正造丰健点汤网庆技斯洗料配汇木缘加麻联卫川泰色世方寓风幼羊烫来高厂兰阿贝皮全女拉成云维贸道术运都口博河瑞宏京际路祥青镇厨培力惠连马鸿钢训影甲助窗布富牌头四多状吉苑沙恒隆春干饼氏里二管诚制售嘉长轩杂副清计黄讯太鸭号街交与叉附近层旁对巷栋环省桥湖段乡厦府铺内侧元购前幢滨处向座下県凤港开关景泉塘放昌线湾政步宁解白田町溪十八古双胜本单同九迎第台玉锦底后七斜期武岭松角纪朝峰六振珠局岗洲横边济井办汉代临弄团外塔杨铁浦字年岛陵原梅进荣友虹央桂沿事津凯莲丁秀柳集紫旗张谷的是不了很还个也这我就在以可到错没去过感次要比觉看得说常真们但最喜哈么别位能教境非为欢然他挺着价那意种想出员两推做排实分间甜度起满给热完格荐喝等其再几只现朋候样直而买于般豆量选奶打每评少算又因情找些份置适什蛋师气你姐棒试总定啊足级整带虾如态且尝主话强当更板知己无酸让入啦式笑赞片酱差像提队走嫩才刚午接重串回晚微周值费性桌拍跟块调糕'
woff_character = ['null', 'x'] + list(woff_str)  # prepend the two special entries, null and x
woff_open_one = TTFont(r'./ziti/af7eec51.woff')  # load font file 1
woff_open_two = TTFont(r'./ziti/436ffe72.woff')  # load font file 2
woff_unicode_one = woff_open_one.getGlyphOrder()  # glyph names of font 1 (uniXXXX), in glyph order
woff_unicode_two = woff_open_two.getGlyphOrder()  # glyph names of font 2 (uniXXXX), in glyph order
woff_dict_one = dict(zip(woff_unicode_one, woff_character))  # font 1: glyph name -> real character
woff_dict_two = dict(zip(woff_unicode_two, woff_character))  # font 2: glyph name -> real character
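The mapping above is purely positional: the n-th glyph name is paired with the n-th character. The small sketch below shows the idea with a made-up glyph order (real values come from the font's glyph order) and a made-up three-character string. It also checks the lengths first, because `zip()` silently stops at the shorter input and a mismatch would otherwise go unnoticed.

```python
# Made-up glyph order standing in for the list read from a real font file.
glyph_order = ['glyph00000', 'x', 'unie0a1', 'unif3b2', 'unie5c7']
# Characters in the order they appear in the font (assumed for this example).
visible_chars = '123'

characters = ['null', 'x'] + list(visible_chars)

# zip() truncates to the shorter input, so verify the lengths before pairing.
if len(glyph_order) != len(characters):
    raise ValueError('glyph order and character list differ in length')

glyph_to_char = dict(zip(glyph_order, characters))
print(glyph_to_char['unif3b2'])  # → 2
```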

3. Convert the HTML content into the glyph names used by the obfuscating fonts (the code blocks in the source are unordered and mix in unobfuscated data, so the values cannot be pulled out with a single regex; here re first extracts each code block, and the data is then extracted from that block)

import re

# Number of reviews
number_appraisers_list = []
nal_com = re.compile(r'module="list-readreview"(.*?)rel="nofollow">(.*?)<b>(?P<number>.*?)</b>(.*?)条评价</a>',re.S)  # regex that captures the relevant block of page source
nal_fin = nal_com.finditer(y_html)
# Process the captured source, extract the data and store it in a list
for i in nal_fin:
    num_list = []
    nal = i.group('number').replace(r'<svgmtsi class="shopNum">', '').replace(r';</svgmtsi>', '').replace(r'&#x',r'uni').split('\n')  # strip the markup that is not needed
    for y in nal:
        nal_y = y.strip().split('\n')
        for x in nal_y:
            if x != '':
                num_list.append(x)
    number_appraisers_list.append(num_list)
# Average price per person
price_average_list = []
pal_com = re.compile(r'target="_blank" rel="nofollow">(.*?)人均(.*?)<b>(?P<price>.*?)</b>', re.S)
pal_fin = pal_com.finditer(y_html)
# Process the captured source, extract the data and store it in a list
for i in pal_fin:
    pri_list = []
    pal_i = i.group('price').replace(r'<svgmtsi class="shopNum">', '').replace(r';</svgmtsi>', '').replace(r'&#x', r'uni').split('\n')
    for y in pal_i:
        pal_y = y.strip().split('\n')
        for x in pal_y:
            if x != '':
                pri_list.append(x)
    price_average_list.append(pri_list)
# Cuisine and address
style_cooking_location = []
scl_com = re.compile(r'<span(.*?)class="tag">(?P<style>.*?)</span>', re.S)
scl_fin = scl_com.finditer(y_html)
# Process the captured source, extract the data and store it in a list
for i in scl_fin:
    scl_i = i.group('style').replace(r'<svgmtsi', '').replace(r'class="tagName">', '').replace(r';</svgmtsi>','').replace(r'&#x',r'uni').split('\n')
    joint_list = []
    for y in scl_i:
        scl_y = list(y.strip().replace(r'/', ''))
        x_joint = ''  # accumulate characters into one string
        for x in scl_y:
            if '\u4e00' <= x <= '\u9fff':  # is this a Chinese character?
                joint_list.append(x)
            elif x != '':
                x_joint += x
        joint_list.append(x_joint.strip(' '))
    style_cooking_location.append(joint_list)
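The chained replace calls above depend on how the tags are laid out across lines. As an alternative sketch, not the approach used in this article, a fragment can be tokenized in a single pass so that obfuscated and plain characters keep their original order. It assumes the fragment contains only `<svgmtsi>` tags and plain text; the sample fragment and the helper `tokenize` are hypothetical.

```python
import re

# Matches either one obfuscated entity wrapped in <svgmtsi>, or one plain character.
TOKEN_RE = re.compile(
    r'<svgmtsi class="[^"]*">&#x(?P<code>[0-9a-fA-F]+);</svgmtsi>'
    r'|(?P<plain>[^<>\s])'
)


def tokenize(fragment):
    """Split a fragment into glyph names ('uniXXXX') and plain characters, in order."""
    tokens = []
    for m in TOKEN_RE.finditer(fragment):
        if m.group('code'):
            tokens.append('uni' + m.group('code'))
        else:
            tokens.append(m.group('plain'))
    return tokens


sample = (
    '<svgmtsi class="shopNum">&#xe0a1;</svgmtsi>1'
    '<svgmtsi class="shopNum">&#xf3b2;</svgmtsi>'
)
print(tokenize(sample))
```

Each token is then either a glyph name that can be looked up in the font mapping or a character that is already readable.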

4. Study the patterns in the data and write the processing functions (analysis shows that the two fonts handle digits and Chinese characters respectively)

# Digit processing
def woff_processing(processing_list):
    data_processing = []
    for i in processing_list:
        processing = ''  # accumulates the real data
        for y in i:
            if y in woff_unicode_one:
                processing += woff_dict_one[y]
            else:
                processing += y
        data_processing.append(processing)
    return data_processing

# Chinese character processing
def woff_processing_two(processing_list):
    data_processing = []
    for i in processing_list:
        processing = ''  # accumulates the real data
        for y in i:
            if y in woff_unicode_two:
                processing += woff_dict_two[y]
            else:
                processing += y
        data_processing.append(processing)
    return data_processing
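The two functions differ only in which mapping they use, so one possible refactoring is a single function that takes the mapping as a parameter. This is a sketch of that idea rather than the code used in this article; `decode_tokens` and the sample mapping are made up for illustration.

```python
def decode_tokens(rows, glyph_to_char):
    """Decode each row of tokens; tokens missing from the mapping pass through unchanged."""
    return [
        ''.join(glyph_to_char.get(token, token) for token in row)
        for row in rows
    ]


# Made-up mapping and rows for demonstration.
digit_map = {'unie0a1': '3', 'unif3b2': '7'}
rows = [['unie0a1', '1', 'unif3b2'], ['1', '2']]
print(decode_tokens(rows, digit_map))  # → ['317', '12']
```

Using `dict.get` with the token itself as the default gives the same pass-through behaviour as the if/else branches, with a dictionary lookup instead of a list scan for each token.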

5. Process the target data

# Number of reviews
number_appraisers_list = woff_processing(number_appraisers_list)  # decode
# print(number_appraisers_list)
# Average price per person
price_average_list = woff_processing(price_average_list)  # decode
# print(price_average_list)
# Cuisine and address
style_cooking_location = woff_processing_two(style_cooking_location)  # decode
# print(style_cooking_location)

6. Compare the data
Check whether the decoded data matches what the page displays; if it does not, repeat the steps above.
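Part of this comparison can be automated before checking against the page by eye. The sketch below only runs two basic checks that I assume are useful here: the lists have the same length, and no decoded value still contains a leftover `uni` glyph name. The function name `check_rows` and the sample values are hypothetical.

```python
def check_rows(names, review_counts, prices):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not (len(names) == len(review_counts) == len(prices)):
        problems.append(
            f'length mismatch: {len(names)}, {len(review_counts)}, {len(prices)}'
        )
    for label, values in (('review count', review_counts), ('price', prices)):
        for index, value in enumerate(values):
            # A leftover glyph name means the font mapping did not cover this token.
            if 'uni' in value:
                problems.append(f'{label} #{index} is not fully decoded: {value!r}')
    return problems


# Made-up sample: the second review count was not fully decoded.
problems = check_rows(['Shop A', 'Shop B'], ['317', 'unie0a12'], ['58', '120'])
for problem in problems:
    print(problem)
```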

Persistent storage

1. Save to a local text file

with open('./dzdp.txt', mode='w', encoding='UTF-8') as f:
    for i in range(15):
        f.write(f'{store_name_list[i]}|{link_list[i]}|{number_appraisers_list[i]}|{price_average_list[i]}|{style_cooking_location[2 * i]}|{style_cooking_location[2 * i + 1]}|{dish_recommendation_list[i]}\n')
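The loop above assumes exactly 15 shops per page. A variation, sketched here with the standard csv module and made-up sample data, uses `zip()` so that a shorter page does not raise IndexError, and slices the cuisine/address list, whose entries alternate. The function name `write_shops` and the output path are hypothetical.

```python
import csv
import os
import tempfile


def write_shops(path, names, links, cuisines_and_addresses):
    """Write one '|'-delimited row per shop; cuisine and address alternate in the last list."""
    cuisines = cuisines_and_addresses[0::2]
    addresses = cuisines_and_addresses[1::2]
    with open(path, mode='w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter='|')
        # zip() stops at the shortest list, so a page with fewer shops still works.
        for row in zip(names, links, cuisines, addresses):
            writer.writerow(row)


# Made-up sample data written to a temporary location.
out_path = os.path.join(tempfile.gettempdir(), 'dzdp_sample.txt')
write_shops(
    out_path,
    ['Shop A', 'Shop B'],
    ['http://example.com/a', 'http://example.com/b'],
    ['Hotpot', 'Road 1', 'Noodles', 'Road 2'],
)
with open(out_path, encoding='utf-8') as f:
    print(f.read())
```

A side benefit of `csv.writer` is that a field containing the delimiter is quoted automatically, which a hand-built f-string does not do.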

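2. Save to a database

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='username', password='password', database='database_name', charset='utf8')
cursor = conn.cursor()  # create a cursor object
sql = ''  # the SQL statement to run
cursor.execute(sql)  # execute the SQL statement
conn.commit()  # commit so that INSERT/UPDATE statements take effect
cursor.close()
conn.close()  # close the database connection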
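When inserting scraped rows, placeholders are safer than building the SQL string by hand, since the driver then handles quoting. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; it is not the pymysql code from this article. The table schema and sample rows are made up. With pymysql the placeholder is `%s` rather than `?`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory stand-in for the MySQL database
cursor = conn.cursor()
# Hypothetical schema for the scraped fields.
cursor.execute(
    'CREATE TABLE shop (name TEXT, link TEXT, review_count TEXT, avg_price TEXT)'
)

rows = [
    ('Shop A', 'http://example.com/a', '317', '58'),
    ('Shop B', 'http://example.com/b', '12', '120'),
]
# Placeholders let the driver quote the values; sqlite3 uses '?', pymysql uses '%s'.
cursor.executemany('INSERT INTO shop VALUES (?, ?, ?, ?)', rows)
conn.commit()

cursor.execute('SELECT COUNT(*) FROM shop')
row_count = cursor.fetchone()[0]
print(row_count)

cursor.close()
conn.close()
```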

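Problems and solutions

1. Error: SyntaxError: Non-UTF-8 code starting with '\xe4' in file
Solution: add # coding=UTF-8 at the top of the file
2. Obfuscated page data comes out garbled when extracted with xpath
Solution: extract it from the page source with regular expressions instead
3. The raw data does not show up when printed
Solution: right after extracting the data, replace r'&#x' in it with r'uni'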