Scraping China Judgments Online (中国裁判文书网) with Python + Cracking the Font and JS Encryption
Full code download: https://github.com/tanjunchen/SpiderProject/tree/master/wenshu
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import json
import random
import time
import uuid

import execjs
import requests

# User-Agent strings used to imitate different browsers
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]


# Build a request header with a randomly chosen User-Agent
def Header():
    return {'User-Agent': random.choice(USER_AGENTS)}


# Get the guid by running the getGuid function in getGuid.js
def get_guid():
    with open(r'getGuid.js') as f:
        guid = execjs.compile(f.read()).call('getGuid')
    return guid


'''
Get the guid from a UUID instead.
A UUID relies on the MAC address, a timestamp, a namespace, random numbers and
pseudo-random numbers to keep the generated ID unique, and has a fixed size (128 bits).
Python's uuid module provides the UUID class and the functions uuid1(), uuid3(),
uuid4() and uuid5(), which generate version 1, 3, 4 and 5 UUIDs:
uuid.uuid1([node[, clock_seq]]) : based on the timestamp
uuid.uuid3(namespace, name) : based on the MD5 hash of a name
uuid.uuid4() : based on random numbers
uuid.uuid5(namespace, name) : based on the SHA-1 hash of a name
'''


# Get the guid (UUID variant)
def get_guid_uuid():
    return str(uuid.uuid1())


# Get number
def get_num(guid):
    url = "http://wenshu.court.gov.cn/ValiCode/GetCode"
    data = {'guid': guid}
    response = requests.post(url, data=data, headers=Header())
    if response.ok and response.text:
        return response.text
    # The request failed or the body was empty: wait, then retry.
    # The retry result must be returned, otherwise the caller gets None.
    time.sleep(5)
    return get_num(guid)


# Get vjkl5 from the cookie set by the list page
def get_vjkl5(number, guid, search_content):
    url = 'http://wenshu.court.gov.cn/list/list/?sorttype=1&number={}&guid={}&conditions=searchWord QWJS 全文检索:{}'.format(number, guid, search_content)
    response = requests.get(url, headers=Header())
    if response.ok:
        # .get() avoids a KeyError when the response sets no cookie
        set_cookie = response.headers.get('Set-Cookie')
        if set_cookie:
            vjkl5 = set_cookie.split(";")[0].split("=")[1]
            if vjkl5:
                return vjkl5
    return None


# Get vl5x by running the getKey function in getKey.js on vjkl5
def get_real_vl5x(vjkl5):
    if vjkl5:
        with open(r'getKey.js') as f:
            vl5x = execjs.compile(f.read()).call('getKey', vjkl5)
        if vl5x:
            return vl5x
    return None


# Get the list of judgment documents
def get_list(vjkl5, guid, number, vl5x, search, order, index, page):
    url = 'http://wenshu.court.gov.cn/List/ListContent'
    data = {
        'Param': search,
        'Index': index,
        'Page': page,
        'Order': order,
        'Direction': 'desc',
        'vl5x': vl5x,
        'number': number,
        'guid': guid
    }
    header = {
        'User-Agent': random.choice(USER_AGENTS),
        'Cookie': 'vjkl5=' + vjkl5
    }
    res = requests.post(url=url, params=data, headers=header)
    return res


# Parse the list response: the first element holds RunEval and Count,
# the remaining elements each hold one document ID under the key '文书ID'
def get_real_docIds(data):
    # data is a string, so json.loads is needed (json.load expects a file object)
    result = json.loads(data)
    # Guard in case the body is a JSON-encoded string that wraps the actual list
    if isinstance(result, str):
        result = json.loads(result)
    result = tuple(result)
    RunEval = result[0]['RunEval']
    Count = result[0]['Count']
    docIds = []
    for i in range(1, len(result)):
        docIds.append(result[i]['文书ID'])
    return {
        "RunEval": RunEval,
        "Count": Count,
        "docIds": docIds
    }


# Placeholder, not implemented yet
def parser_str():
    return


def spider():
    # Search condition: case type = enforcement cases, full-text search = 新吴区
    search = '案件类型:执行案件,全文检索:新吴区'
    # Sort by judgment date
    order = '裁判日期'
    # Get guid
    guid = get_guid()
    # Get number
    number = get_num(guid)
    # Get vjkl5 from the cookie
    vjkl5 = get_vjkl5(number, guid, '无锡')
    # Get vl5x
    vl5x = get_real_vl5x(vjkl5)
    # Fetch the document list, pages 1 to 10, 10 items per page
    for i in range(1, 11):
        res = get_list(vjkl5, guid, number, vl5x, search, order, i, 10)
        print(res.text)
        get_real_docIds(res.text)


if __name__ == '__main__':
    spider()
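The two parsing steps in the script (reading `vjkl5` out of the `Set-Cookie` header, and collecting the `文书ID` fields from the list response) can be tried offline without touching the site. The following is a minimal sketch under stated assumptions: the helper names `extract_vjkl5` and `extract_doc_ids` are hypothetical and not part of the original repository, the sample header and payload are made up for illustration, and the list response is assumed to be either a JSON array or a JSON-encoded string that wraps one. Unlike the script, which takes the first cookie in the header, this version looks the cookie up by name using the standard library's `http.cookies`.

```python
import json
from http.cookies import SimpleCookie


def extract_vjkl5(set_cookie_header):
    """Return the value of the vjkl5 cookie, or None if it is absent."""
    if not set_cookie_header:
        return None
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get('vjkl5')
    return morsel.value if morsel is not None else None


def extract_doc_ids(text):
    """Split a list response into its metadata and its document IDs."""
    result = json.loads(text)
    # Assumption: the body may be a JSON string that itself contains JSON
    if isinstance(result, str):
        result = json.loads(result)
    meta = result[0]
    doc_ids = [item['文书ID'] for item in result[1:]]
    return {
        'RunEval': meta.get('RunEval'),
        'Count': meta.get('Count'),
        'docIds': doc_ids,
    }


# Made-up sample data, only for illustration
sample_header = 'vjkl5=abc123def456; path=/'
sample_payload = [
    {'RunEval': 'dummy', 'Count': '2'},
    {'文书ID': 'id-1'},
    {'文书ID': 'id-2'},
]
# Encode twice to imitate a JSON string wrapping a JSON array
sample_body = json.dumps(json.dumps(sample_payload))

print(extract_vjkl5(sample_header))
print(extract_doc_ids(sample_body))
```

Keeping the parsing separate from the network calls makes it possible to check these steps against saved responses before running the full spider.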