Scraping Maoyan Movies (parsing the woff font file)

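On the Maoyan movie site, a film's rating displays normally in the browser, but if you inspect the page source, the rating is a string of unrecognizable characters. The reason is that the site renders these digits with a custom font.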

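The page uses a custom font named stonefont. Searching the page source for stonefont quickly turns up a standard @font-face definition. We visit the woff URL given there and download the font file locally. We then need to convert the woff font to otf; Baidu's online font editor can do the conversion directly: http://fontstore.baidu.com/static/editor/index.html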
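As a minimal sketch of the "find the woff address" step, the font file name can be pulled out of the page source with a regular expression. The pattern is the one used in the full script in this post; the helper name `extract_font_file` and the sample HTML are made up for illustration and only mimic the shape of an @font-face rule.

```python
import re

# Pattern from the full script: captures the woff file name
FONT_RE = re.compile(r'vfile\.meituan\.net/colorstone/(\w+\.woff)')

def extract_font_file(html):
    """Return the first woff file name referenced in html, or None."""
    match = FONT_RE.search(html)
    return match.group(1) if match else None

# Hypothetical page fragment, not copied from the real site
sample_html = """
@font-face {
  font-family: stonefont;
  src: url('//vfile.meituan.net/colorstone/abc123def456.woff') format('woff');
}
"""

print(extract_font_file(sample_html))
```

Returning None when nothing matches lets the caller decide what to do with a page that has no obfuscated font, instead of failing on an empty match list.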

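The converted font shows the glyph for each digit.

We can see that the glyphs in this file are not stored in numeric order. So we first sort this file by hand and save the glyph shapes in a list in digit order, so that index i of the list holds the shape of digit i: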

from fontTools.ttLib import TTFont

def mist():
    # Open the reference font file
    font = TTFont('./fonts/9f1eed3c6cfa21fa95ee464955b750162088.woff')
    # Glyph names in the order they are stored in the font
    uniList = font.getGlyphOrder()
    # Positions in uniList of the glyphs for digits 0-9, read off by hand
    # from this file (digit 0 is at index 4, digit 1 at index 3, and so on)
    order = [4, 3, 6, 5, 7, -1, -2, -4, -3, 2]
    # a[i] is the glyph shape of digit i
    a = [font['glyf'][uniList[i]] for i in order]
    return a

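We find that every request uses a woff file with a different name, so we use the sorted file above as the reference. Taking the digit 1 as an example, the process is:

The value obtained from the page is $E1DE (it appears in the page source as the entity &#xe1de;). From it we look up the corresponding glyph in the newly downloaded font, and from the shape of that glyph we can tell that the digit is 1.

A computer cannot tell which digit a glyph is just by "looking" at it, so this is where the sorted list comes in: the position of the matching shape in that list is the digit.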
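The lookup just described can be sketched without any font library. In the sketch below each glyph shape is stood in for by a tuple of invented coordinates, and the names `reference`, `new_font` and `decode` are hypothetical; the actual script compares fontTools glyph objects instead.

```python
import re

# Reference list built once from the sorted font: index == digit.
# Each "shape" is a stand-in tuple of coordinates (invented values).
reference = [
    ((0, 0), (5, 9)),  # shape of digit 0
    ((1, 0), (1, 9)),  # shape of digit 1
    ((2, 0), (7, 3)),  # shape of digit 2
]

# A newly downloaded font: same shapes, different code points.
new_font = {
    'uniE1DE': ((1, 0), (1, 9)),  # drawn like 1
    'uniF3A0': ((0, 0), (5, 9)),  # drawn like 0
    'uniE8B2': ((2, 0), (7, 3)),  # drawn like 2
}

ENTITY_RE = re.compile(r'&#x([0-9a-fA-F]{4});')

def decode(text, font, ref):
    """Replace each &#xXXXX; entity with the digit whose shape matches."""
    def swap(match):
        glyph_name = 'uni' + match.group(1).upper()
        shape = font.get(glyph_name)
        if shape in ref:
            return str(ref.index(shape))
        return match.group(0)  # unknown glyph: leave the entity untouched
    return ENTITY_RE.sub(swap, text)

print(decode('&#xe1de;&#xf3a0;.&#xe8b2;', new_font, reference))  # prints 10.2
```

The code points change with every font file, but the shapes do not, which is why the shape (not the code point) is used as the lookup key.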

The full code is as follows:

import requests
from lxml import etree
import json
import time
from fontTools.ttLib import TTFont
import re
import os


def mist():
    # Sort the glyphs of the first woff file we downloaded and return
    # a list in which index i holds the glyph shape of digit i
    font = TTFont('./fonts/9f1eed3c6cfa21fa95ee464955b750162088.woff')  # open the file
    uniList = font.getGlyphOrder()  # glyph names in stored order
    # Positions of the glyphs for digits 0-9 in this particular file
    # (digit 0 is at index 4, digit 1 at index 3, and so on)
    order = [4, 3, 6, 5, 7, -1, -2, -4, -3, 2]
    return [font['glyf'][uniList[i]] for i in order]


def find_num(a, font, num):
    # num is the position of the glyph to identify; font is the opened woff file
    uniList_new = font.getGlyphOrder()  # glyph names of the unsorted file
    st = font['glyf'][uniList_new[num]]  # the glyph to identify
    if st in a:
        return a.index(st)  # the digit this glyph represents
    return None


def num_clean(data):
    # Unfinished in the original and never called; kept as-is
    datas = ''
    for i in data:
        if i == '1':
            datas += ''


def create_font(font_file):
    os.makedirs('./fonts', exist_ok=True)
    # List the files already downloaded
    file_list = os.listdir('./fonts')
    # Download the font only if we do not have it yet
    if font_file not in file_list:
        print('Not in the font library, downloading:', font_file)
        url = 'http://vfile.meituan.net/colorstone/' + font_file
        new_file = get_data(url)
        with open('./fonts/' + font_file, 'wb') as f:
            f.write(new_file)
    # Open the font file
    font = TTFont('./fonts/' + font_file)
    return font


def modify_data(data, font, a):
    # Map the obfuscated entities in data back to the real digits
    gly_list = font.getGlyphOrder()
    # number is the index used to look up the glyph; gly is the glyph name
    for number, gly in enumerate(gly_list):
        # Convert the glyph name to the form used in the page,
        # e.g. uniE1DE -> &#xe1de;
        gly = gly.replace('uni', '&#x').lower() + ';'
        # If the entity occurs in the string, replace it with its digit
        if gly in data:
            data = data.replace(gly, str(find_num(a, font, number)))
    return data


def fill_date(data):
    # Append one record per line, as JSON
    data = json.dumps(data)
    with open('电影.txt', 'a', encoding='utf-8') as f:
        f.write(data + '\n')


def xml_data(data):
    html = etree.HTML(data)
    return html


def get_data(url, ref=''):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
        'referer': ref,
    }
    data = requests.get(url, headers=headers).content
    return data


def run():
    a = mist()  # glyph list in digit order
    url = 'https://maoyan.com/board/4'
    while True:
        html = xml_data(get_data(url).decode('utf-8'))
        links = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd/a/@href')
        print(links)
        if not links:
            break
        for i in links:
            time.sleep(1)
            movie = {}
            # Use a separate name so the board page URL in url is not overwritten
            detail_url = 'https://maoyan.com' + i
            html_old = get_data(detail_url).decode('utf-8')
            font_file = re.findall(r'vfile\.meituan\.net\/colorstone\/(\w+\.woff)', html_old)[0]
            font = create_font(font_file)
            html_news = xml_data(html_old)
            # Title, genre and release date are plain text
            movie['电影名'] = html_news.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()')[0]
            movie['电影类型'] = html_news.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[1]/text()')[0]
            movie['电影上映时间'] = html_news.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()')[0]
            # Rating, number of raters and box office use the custom font,
            # so they are taken with regular expressions and then decoded
            movie['电影用户评分'] = modify_data(re.findall(r'<span class="index-left info-num ">\s+<span class="stonefont">(.*?)</span>\s+</span>', html_old)[0], font, a)
            movie['评分人数'] = modify_data(re.findall(r'''<span class="stonefont">(.*?)</span>人评分</span>''', html_old)[0], font, a)
            try:
                movie['电影累计票房'] = modify_data(''.join(re.findall(r'''<span class="stonefont">(.*?)</span><span class="unit">(.*)</span>''', html_old)[0]), font, a)
            except IndexError:
                movie['电影累计票房'] = '暂无'  # no box office figure on the page
            print(movie)
            fill_date(movie)
        next_url = 'https://maoyan.com/board/4' + html.xpath('//*[@id="app"]/div/div/div[2]/ul/li/a/@href')[-1]
        if next_url == url:
            break
        url = next_url


if __name__ == '__main__':
    run()
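The script writes one JSON object per line to 电影.txt. Because json.dumps escapes non-ASCII characters by default, the file looks like a run of \uXXXX sequences, but json.loads restores the original text. A minimal sketch for reading the records back, assuming that one-object-per-line layout (the helper name `load_records` and the sample values are placeholders, not real scraped data):

```python
import json

def load_records(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# One line in the shape fill_date writes; the values are placeholders
sample_line = json.dumps({'电影名': '示例电影', '电影用户评分': '9.0'}) + '\n'

records = load_records([sample_line, '\n'])
print(records[0]['电影名'], records[0]['电影用户评分'])
```

With the real output, an open file object can be passed in place of the list, since iterating over a text file yields one line at a time.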
