Autohome (汽车之家) detailed car specs: getting past the CSS anti-scraping

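Hey there. I'm a big fan of cars, I just can't afford one (ha, broke as ever), so I often browse Autohome to look at the cars I like and check their specs. One day I decided to scrape all of that data (itchy fingers, I want to scrape everything, haha),
and so began my hair-losing journey.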

Decide which page to scrape
Mercedes-Benz S-Class specification table

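Autohome is really quite generous: it gives you a very detailed spec comparison.
Check how the data is obfuscated
The "CSS encryption" (specific to this site) is not actually the work of a CSS file; it is done by a piece of JS. So the first step is to find that JS in the page source.
Pull it out with a regex (there is more than one such JS snippet):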

js_list = re.findall(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', ret)
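To make the pattern concrete, here is a minimal sketch run against a made-up page fragment. The two inline scripts and their bodies are invented for illustration; the real scripts on the page are much longer and obfuscated.

```python
import re

# Hypothetical page fragment: two self-invoking scripts that receive `document`.
sample_page = (
    "<script>(function(ab_){ /* obfuscated body 1 */ })(document);</script>"
    "<p>other markup</p>"
    "<script>(function(cd_){ /* obfuscated body 2 */ })(document);</script>"
)

# Same pattern as above, written as a raw string.
PATTERN = r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)'

# With a single capturing group, findall returns the captured strings.
js_snippets = re.findall(PATTERN, sample_page)

for snippet in js_snippets:
    print(snippet)
```

Since `.` does not match newlines by default, the pattern only finds scripts that sit on a single line; if that were not the case, passing `re.S` would be one option.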

Then concatenate them with a piece of JS (a stub that lets the JS extracted above run) and write the result into an HTML file:

DOM = (
    "var rules = '2';"
    "var document = {};"
    "function getRules(){return rules}"
    "document.createElement = function() {"
    "      return {"
    "              sheet: {"
    "                      insertRule: function(rule, i) {"
    "                              if (rules.length == 0) {"
    "                                      rules = rule;"
    "                              } else {"
    "                                      rules = rules + '#' + rule;"
    "                              }"
    "                      }"
    "              }"
    "      }"
    "};"
    "document.querySelectorAll = function() {"
    "      return {};"
    "};"
    "document.head = {};"
    "document.head.appendChild = function() {};"
    "var window = {};"
    "window.decodeURIComponent = decodeURIComponent;"
)
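As far as I can tell, the stub swaps the real `document` for a fake one whose `insertRule` just collects every CSS rule into the `rules` string. Here is a rough Python imitation of that accumulation logic, only to show what the string ends up looking like. The class `FakeSheet` and the two sample rules (including the class names) are made up for illustration.

```python
class FakeSheet:
    """Python stand-in for the stub's fake stylesheet object."""

    def __init__(self):
        # Same non-empty starting value as `var rules = '2';` in the stub.
        self.rules = '2'

    def insert_rule(self, rule, index=0):
        # Mirrors insertRule in the stub: join rules with '#'.
        if len(self.rules) == 0:
            self.rules = rule
        else:
            self.rules = self.rules + '#' + rule


sheet = FakeSheet()
# Hypothetical rules of the kind the obfuscated scripts insert.
sheet.insert_rule('.hs_kw0_x::before { content:"座" }')
sheet.insert_rule('.hs_kw1_x::before { content:"椅" }')
print(sheet.rules)
```

Because `rules` starts out non-empty, every collected rule is preceded by `#`, including the first one, which gives each rule a uniform prefix to match on later.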

Then use selenium to read out the mapping:


# Append every extracted script to the stub so it runs against the fake document
for js in js_list:
    DOM = DOM + js
html_type = "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><head></head><body>    <script type='text/javascript'>"
js = html_type + DOM + " document.write(rules)</script></body></html>"  # JS string to execute
with open("js.html", "w", encoding="utf-8") as f:
    f.write(js)
# Selenium 3-style calls, kept as in the original run;
# Selenium 4 uses find_element(By.TAG_NAME, 'body') instead
browser = webdriver.Chrome(executable_path=r"G:\爬虫项目\b站登陆\chromedriver.exe")
browser.get("file://G:/爬虫项目/汽车之家汽车参数配置数据字体反爬/js.html")
true_text = browser.find_element_by_tag_name('body').text
browser.close()
# Each rule looks like: #.<class>::before { content:"<char>" }
ans = re.findall(r'#\.(.*?)::before { content:"(.*?)" }', true_text)
font_dict = {}
for k, v in ans:
    font_dict[k] = v
with open('font.txt', 'w', encoding='utf8') as f:
    json.dump(font_dict, f)
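To check the parsing step without launching a browser, the regex can be tried on a hand-written sample. This is only a sketch: the `true_text` value below is invented (the class names `hs_kw0_x` and `hs_kw1_x` are placeholders), and it assumes the browser renders the rules in exactly this spacing.

```python
import re

# Hypothetical text as it might be read from the <body> of js.html.
true_text = (
    '2#.hs_kw0_x::before { content:"座" }'
    '#.hs_kw1_x::before { content:"椅" }'
)

# Group 1 is the class name, group 2 is the character it stands for.
pairs = re.findall(r'#\.(.*?)::before { content:"(.*?)" }', true_text)

# A list of (class, char) tuples converts directly into the mapping.
font_dict = dict(pairs)
print(font_dict)
```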

That gives the mapping between the tags' class attribute values and the Chinese characters.
5. Find where the raw data comes from

You can't see all of it in the browser, so save the page locally and inspect it there.

Extract it with regexes:

config = re.findall('config = (.*?);', ret)
option = re.findall('option = (.*?)};', ret)
bag = re.findall('bag = (.*?)};', ret)
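One detail worth spelling out: the `option` and `bag` patterns stop just before the final `};`, so the captured text is missing its closing brace and needs a `"}"` appended before it can be parsed as JSON. A small sketch on an invented page fragment (the real objects are far larger):

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the saved page source.
sample_page = (
    'var config = {"result": {"paramtypeitems": []}};'
    'var option = {"result": {"configtypeitems": []}};'
)

config = re.findall(r'config = (.*?);', sample_page)
option = re.findall(r'option = (.*?)};', sample_page)

# `config` is captured up to the first ';', so it is complete JSON here.
config_data = json.loads(config[0])

# `option` lost its last '}' to the pattern, so put it back before parsing.
option_data = json.loads(option[0] + "}")

print(config_data)
print(option_data)
```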

Then substitute using the mapping obtained in step 4.
6. Write to a table
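The substitution itself is plain string replacement: every `<span class='...'></span>` placeholder in the raw JSON text is swapped for the character its class maps to. A minimal sketch, where `restore_text`, the mapping, and the sample record are all made up for illustration:

```python
import json

# Hypothetical mapping of the kind stored in font.txt.
font_dict = {'hs_kw0_x': '座', 'hs_kw1_x': '椅'}

# Hypothetical raw JSON text with two obfuscated characters.
raw = (
    '{"name": "电动<span class=\'hs_kw0_x\'></span>'
    '<span class=\'hs_kw1_x\'></span>调节"}'
)


def restore_text(text, mapping):
    """Replace each placeholder span with the character it stands for."""
    for cls, char in mapping.items():
        text = text.replace("<span class='" + cls + "'></span>", char)
    return text


restored = restore_text(raw, font_dict)
item = json.loads(restored)
print(item['name'])
```

Doing the replacement on the text before calling `json.loads` means the parsed data never contains any span markup.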

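The full code is below. It's a bit messy, but it follows the approach above. (Honestly, I don't know how that DOM snippet was worked out; I borrowed that part from someone far more skilled, and the rest I wrote myself.) If something is unclear, add me on WeChat (18300485357) and we can learn together.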

import json
import csv
import requests
import re
from lxml import etree
from selenium import webdriver

# headers = {
#     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
# }
#
# url = 'https://car.autohome.com.cn/config/series/59.html#pvareaid=3454437'
# ret = requests.get(url=url,headers=headers).text
# with open('qi_data.html','w',encoding='utf8') as f:
#     f.write(ret)

# Read the page that was saved locally by the commented-out request above
with open('qi_data.html', 'r', encoding='utf8') as f:
    ret = f.read()
# keyLink = re.findall('keyLink = (.*?);',ret)
#
config = re.findall('config = (.*?);', ret)
option = re.findall('option = (.*?)};', ret)
bag = re.findall('bag = (.*?)};', ret)
js_list = re.findall(r'(\(function\([a-zA-Z]{2}.*?_\).*?\(document\);)', ret)

# DOM stub used to run the JS
DOM = (
    "var rules = '2';"
    "var document = {};"
    "function getRules(){return rules}"
    "document.createElement = function() {"
    "      return {"
    "              sheet: {"
    "                      insertRule: function(rule, i) {"
    "                              if (rules.length == 0) {"
    "                                      rules = rule;"
    "                              } else {"
    "                                      rules = rules + '#' + rule;"
    "                              }"
    "                      }"
    "              }"
    "      }"
    "};"
    "document.querySelectorAll = function() {"
    "      return {};"
    "};"
    "document.head = {};"
    "document.head.appendChild = function() {};"
    "var window = {};"
    "window.decodeURIComponent = decodeURIComponent;"
)
# for js in js_list:
#        DOM = DOM + js
# html_type = "<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><head></head><body>    <script type='text/javascript'>"
# js = html_type + DOM + " document.write(rules)</script></body></html>"  # JS string to execute
#
# with open("js.html", "w", encoding="utf-8") as f:
#     f.write(js)
# browser = webdriver.Chrome(executable_path=r"G:\爬虫项目\b站登陆\chromedriver.exe")
# browser.get("file://G:/爬虫项目/汽车之家汽车参数配置数据字体反爬/js.html")
# true_text = browser.find_element_by_tag_name('body').text
# browser.close()
# ans= re.findall('#\.(.*?)::before { content:"(.*?)" }',true_text)
# font_dict = {}
# for k,v in ans:
#        font_dict[k]=v
# with open('font.txt','w',encoding='utf8') as f:
#        json.dump(font_dict,f)

# Load the class -> character mapping saved by the commented-out block above
with open('font.txt', 'r', encoding='utf8') as f:
    ans = json.load(f)
print(ans)
# '"<span class='" + info.group(1) + "'></span>"'
#
data_list = []

# Replace every placeholder span with its real character, then parse the JSON
str_config = str(config[0])
for k in ans:
    str_config = str_config.replace(r"<span class='" + k + r"'></span>", ans[k])
data_list.append(json.loads(str_config))

# The option and bag regexes swallow the final '}', so add it back
str_option = str(option[0])
for k in ans:
    str_option = str_option.replace(r"<span class='" + k + r"'></span>", ans[k])
data_list.append(json.loads(str_option + "}"))
print(data_list)

str_bag = str(bag[0])
for k in ans:
    str_bag = str_bag.replace(r"<span class='" + k + r"'></span>", ans[k])
data_list.append(json.loads(str_bag + "}"))

configItem = data_list[0]['result']['paramtypeitems']
optionItem = data_list[1]['result']['configtypeitems']
bagItem = data_list[2]['result']['bagtypeitems'][0]['bagitems']

# car_item maps each parameter name to a list with one value per car model
car_item = {}
for typ in configItem:
    for car in typ['paramitems']:
        car_item[car['name']] = []
        for value in car['valueitems']:
            k = value['value']
            if k == '':
                # An empty value means the details are in 'sublist'
                try:
                    k = ''
                    for i in value['sublist']:
                        k += ' ' + i['subname']
                except (KeyError, TypeError):
                    k = value['value']
            car_item[car['name']].append(k)

for typ in optionItem:
    for car in typ['configitems']:
        car_item[car['name']] = []
        for value in car['valueitems']:
            k = value['value']
            if k == '':
                try:
                    k = ''
                    for i in value['sublist']:
                        k += ' ' + i['subname']
                except (KeyError, TypeError):
                    k = value['value']
            car_item[car['name']].append(k)

# Optional packages: store id, price description and description
for car in bagItem[0]['valueitems']:
    car_item[car['name']] = []
    car_item[car['name']].append(car['bagid'])
    car_item[car['name']].append(car['pricedesc'])
    car_item[car['name']].append(car['description'])
print(car_item)

# 1. Create the file object
f = open('汽车参数.csv', 'w', encoding='utf-8', newline='')
# 2. Build the csv writer on top of the file object
csv_writer = csv.writer(f)
li = [i for i in car_item]
# 3. Write the header row
csv_writer.writerow(li)
# 4. Write the csv rows, one per car model
for i in range(len(car_item['车型名称'])):
    lis = []
    for k in li:
        # Guard: package columns hold only three values, and bagid may not be a string
        cell = car_item[k][i] if i < len(car_item[k]) else ''
        lis.append(str(cell).replace('&nbsp;', ''))
    csv_writer.writerow(lis)
# 5. Close the file
f.close()
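The last part of the script turns a dict of columns (parameter name to one value per car model) into CSV rows. For what it's worth, the same transposition can be written more compactly with `zip`. This is just an alternative sketch, not what the script above does: the sample data is invented, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical, tiny stand-in for car_item: column name -> values per model.
car_item = {
    '车型名称': ['Model A', 'Model B'],
    '厂商指导价': ['100', '200'],
}

buffer = io.StringIO()
writer = csv.writer(buffer)

# Header row: the parameter names.
writer.writerow(car_item.keys())

# zip(*columns) yields one tuple per car model, i.e. one CSV row.
for row in zip(*car_item.values()):
    writer.writerow(row)

print(buffer.getvalue())
```

Note that `zip` stops at the shortest column, so columns with fewer entries than there are car models would cut the output short; `itertools.zip_longest` with a fill value avoids that.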
