Day 02: Python Web Scraping Study Notes: the Cosmetics Production License Service Platform (化妆品生产许可证关系系统服务平台)


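1. Analyzing the page

Target site: 化妆品生产许可证关系系统服务平台 (the cosmetics production license service platform)


What we want to scrape is the detailed information behind each entry (the detail pages), not the summary shown on the listing page itself.

So we first try fetching the page with a GET request.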

import requests

url = "http://scxk.nmpa.gov.cn:81/xk/"
head = {
    # Mimic a browser by sending a browser User-Agent header
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0"
}
response_text = requests.get(url=url, headers=head).text
print(response_text)

The returned HTML does not contain the information we need.
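
One quick way to confirm this is to search the downloaded HTML for a value that is visible in the browser, such as a company name. The sketch below is only an illustration: the helper name `missing_from_html` and the sample HTML string are made up for the example and do not come from the site.

```python
def missing_from_html(html, expected_values):
    """Return the expected values that do not occur in the HTML text."""
    return [value for value in expected_values if value not in html]


# Hypothetical stand-in for response_text: a page shell whose table is
# filled in later by JavaScript.
sample_html = (
    "<html><body><div id='list'></div>"
    "<script src='list.js'></script></body></html>"
)

# A company name that is visible in the browser but absent from the raw HTML
# suggests the data arrives through a separate (Ajax) request.
missing = missing_from_html(sample_html, ["广东彤姿生物科技有限公司"])
print(missing)
```

In practice `response_text` from the GET request would be passed in place of `sample_html`.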

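We therefore assume the data is loaded asynchronously via Ajax and analyze further.

Use the browser's F12 network tools to capture the requests the page sends.

Findings:
A POST request returns JSON that contains the information we need.
So we scrape that endpoint with a POST request instead.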


import requests

url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
head = {
    # Mimic a browser by sending a browser User-Agent header
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0"
}
response_json = requests.post(url=url, headers=head).json()
print(response_json)

Response:
{"filesize": "","keyword": "","list": [{"ID": "c3854166c00f46b5b29fe2a55d3df929","EPS_NAME": "广东彤姿生物科技有限公司","PRODUCT_SN": "粤妆20210037","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 3,"day": 1,"hours": 0,"minutes": 0,"month": 7,"nanos": 0,"seconds": 0,"time": 7308115200000,"timezoneOffset": -480,"year": 301},"XK_DATE": "2026-02-02","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440607MA55AD6D5C","XC_DATE": "2201-08-03","NUM_": 1},{"ID": "c3ae6ecabbcf4ba68038321b80819753","EPS_NAME": "佛山拜澳生物科技有限公司","PRODUCT_SN": "粤妆20160229","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 3,"day": 1,"hours": 0,"minutes": 0,"month": 7,"nanos": 0,"seconds": 0,"time": 7308115200000,"timezoneOffset": -480,"year": 301},"XK_DATE": "2026-02-02","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440607579693554T","XC_DATE": "2201-08-03","NUM_": 2},{"ID": "ed59438f34ae47e794f4c7ee5137c1f7","EPS_NAME": "海南京润珍珠生物技术股份有限公司","PRODUCT_SN": "琼妆20160001","CITY_CODE": "311","XK_COMPLETE_DATE": {"date": 25,"day": 0,"hours": 0,"minutes": 0,"month": 3,"nanos": 0,"seconds": 0,"time": 1619280000000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-04-25","QF_MANAGER_NAME": "海南省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91460000294121210Y","XC_DATE": "2021-04-25","NUM_": 3},{"ID": "a810f850c54f4cf7a002057cfb4ec279","EPS_NAME": "滁州向日葵药业有限公司","PRODUCT_SN": "皖妆20210005","CITY_CODE": "220","XK_COMPLETE_DATE": {"date": 18,"day": 4,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1613577600000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-17","QF_MANAGER_NAME": "安徽省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91341171MA2UE0W596","XC_DATE": "2021-02-18","NUM_": 4},{"ID": "73bb06d774f44c2b9d7c006be3711718","EPS_NAME": "扬州倍加洁日化有限公司","PRODUCT_SN": "苏妆20210011","CITY_CODE": "86","XK_COMPLETE_DATE": {"date": 18,"day": 4,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1613577600000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-17","QF_MANAGER_NAME": 
"江苏省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91321000760545492E","XC_DATE": "2021-02-18","NUM_": 5},{"ID": "d51920e18414449fa2bda604a2a6b93f","EPS_NAME": "克劳丽化妆品(南通)有限公司","PRODUCT_SN": "苏妆20210010","CITY_CODE": "85","XK_COMPLETE_DATE": {"date": 18,"day": 4,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1613577600000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-17","QF_MANAGER_NAME": "江苏省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91320623MA24WAKU9P","XC_DATE": "2021-02-18","NUM_": 6},{"ID": "6321fa3a8cad4edba7b5597c3fdea52e","EPS_NAME": "广东永佳日化实业有限公司","PRODUCT_SN": "粤妆20210050","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 10,"day": 3,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612886400000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-09","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440514071881319E","XC_DATE": "2021-02-10","NUM_": 7},{"ID": "c2cf1364781447e1a3b6dcd90bda6fff","EPS_NAME": "江西初美化妆品有限公司","PRODUCT_SN": "赣妆20180013","CITY_CODE": "93","XK_COMPLETE_DATE": {"date": 10,"day": 3,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612886400000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2023-07-17","QF_MANAGER_NAME": "江西省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91360121MA37RM989B","XC_DATE": "2021-02-10","NUM_": 8},{"ID": "23e8220096bf40f99ed1d453824b729d","EPS_NAME": "江西珍视明药业有限公司","PRODUCT_SN": "赣妆20170002","CITY_CODE": "97","XK_COMPLETE_DATE": {"date": 10,"day": 3,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612886400000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2022-01-05","QF_MANAGER_NAME": "江西省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91361003789743169E","XC_DATE": "2021-02-10","NUM_": 9},{"ID": "6626aa180dca43f7b10c45a3dd91c13a","EPS_NAME": "中山新妍化妆品有限公司","PRODUCT_SN": "粤妆20180230","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 9,"day": 2,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612800000000,"timezoneOffset": 
-480,"year": 121},"XK_DATE": "2023-11-04","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91442000MA51EQLWXE","XC_DATE": "2021-02-09","NUM_": 10},{"ID": "3e18d72df9c24f0b8f15e76d0fbcc913","EPS_NAME": "东莞市百丽达生物科技有限公司","PRODUCT_SN": "粤妆20170591","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 9,"day": 2,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612800000000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2022-11-30","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91441900MA4WHBJX7T","XC_DATE": "2021-02-09","NUM_": 11},{"ID": "889f31556d0149b6ac1be15b09f35b80","EPS_NAME": "深圳市创亿鑫生物科技有限公司","PRODUCT_SN": "粤妆20160938","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 9,"day": 2,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612800000000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-08","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "914403007230102536","XC_DATE": "2021-02-09","NUM_": 12},{"ID": "a96e67d64344436cbc71ae72077894c4","EPS_NAME": "朝日化妆品(深圳)有限公司","PRODUCT_SN": "粤妆20160028","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 9,"day": 2,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612800000000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-08","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440300772747550X","XC_DATE": "2021-02-09","NUM_": 13},{"ID": "fffd38b18a13480d8d872f0d28f71a53","EPS_NAME": "广州振业医药科技有限公司","PRODUCT_SN": "粤妆20210056","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 8,"day": 1,"hours": 0,"minutes": 0,"month": 1,"nanos": 0,"seconds": 0,"time": 1612713600000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-07","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440101MA9UYENHXQ","XC_DATE": "2021-02-08","NUM_": 14},{"ID": "89cf33bca751479281d1988e842d9ca1","EPS_NAME": "广州中燊医药科技有限公司","PRODUCT_SN": "粤妆20210055","CITY_CODE": null,"XK_COMPLETE_DATE": {"date": 8,"day": 1,"hours": 0,"minutes": 0,"month": 
1,"nanos": 0,"seconds": 0,"time": 1612713600000,"timezoneOffset": -480,"year": 121},"XK_DATE": "2026-02-07","QF_MANAGER_NAME": "广东省药品监督管理局","BUSINESS_LICENSE_NUMBER": "91440101MA5D0HBF7L","XC_DATE": "2021-02-08","NUM_": 15}],"orderBy": "createDate","orderType": "desc","pageCount": 369,"pageNumber": 1,"pageSize": 15,"property": "","totalCount": 5534
}

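Further analysis shows that the response does not include the URLs of the detail pages.

Looking more closely, a pattern emerges.

The URL of the first detail page is

http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=c3854166c00f46b5b29fe2a55d3df929

and the URL of the second is

http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp?id=c3ae6ecabbcf4ba68038321b80819753

Only the part after id= changes. Searching the JSON for that value finds: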

"ID": "c3854166c00f46b5b29fe2a55d3df929","EPS_NAME": "广东彤姿生物科技有限公司","PRODUCT_SN": "粤妆20210037","CITY_CODE": null,"XK_COMPLETE_DATE": {

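The ID field in the JSON matches the id parameter in the URL.
So a detail page URL can be built by appending an ID scraped from the list response to the fixed detail page address.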
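
The concatenation described above can be sketched as follows. The function name `detail_page_urls` and the trimmed `sample` dictionary are illustrative assumptions; the base address and the two IDs are the ones shown earlier in these notes.

```python
from urllib.parse import urlencode

DETAIL_PAGE = "http://scxk.nmpa.gov.cn:81/xk/itownet/portal/dzpz.jsp"


def detail_page_urls(list_json):
    """Build one detail-page URL per entry in a getXkzsList response."""
    return [
        DETAIL_PAGE + "?" + urlencode({"id": entry["ID"]})
        for entry in list_json["list"]
    ]


# Trimmed copy of the list response (only the ID field is kept).
sample = {
    "list": [
        {"ID": "c3854166c00f46b5b29fe2a55d3df929"},
        {"ID": "c3ae6ecabbcf4ba68038321b80819753"},
    ]
}

urls = detail_page_urls(sample)
for url in urls:
    print(url)
```

The two printed URLs are the same as the two detail page addresses observed in the browser.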
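Given the experience with the listing page, the next question is whether the detail page data is also loaded dynamically.

Checking confirms that it is: the data arrives in a POST request.

Analyzing the Ajax requests sent by the detail pages gives:

First page: http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById
Second page: http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById

So every detail page sends its Ajax request to the same URL.
Further analysis shows that only the id parameter carried by the request differs.
We can therefore collect the ID of every entry and send each one to this fixed URL.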
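
As a small illustration of "same URL, different id", the sketch below pairs the fixed endpoint with one form payload per ID without sending anything. The helper name `detail_requests` is an assumption made for this example.

```python
DETAIL_API = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"


def detail_requests(id_list):
    """Pair the fixed endpoint with one form payload per ID."""
    return [(DETAIL_API, {"id": license_id}) for license_id in id_list]


ids = [
    "c3854166c00f46b5b29fe2a55d3df929",
    "c3ae6ecabbcf4ba68038321b80819753",
]

pairs = detail_requests(ids)
for url, form in pairs:
    # Each pair could later be sent with requests.post(url, data=form, headers=head).
    print(url, form)
```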


Code

  1. Collect the IDs in bulk
  2. Combine the shared URL with each ID and send a POST request
  3. Receive the returned JSON and extract the data
  4. Store the data

import json
import requests

url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"  # list endpoint, used to collect the IDs
head = {
    # Mimic a browser by sending a browser User-Agent header
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0"
}
page1 = int(input("Enter the total number of pages: "))
id_list = []
all_datalist = []  # holds every detail record

# Step 1: collect the IDs page by page
for page in range(1, page1 + 1):  # +1 so that the last page is included
    data = {  # form parameters sent with the list request
        "on": "true",
        "page": page,
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": ""
    }
    response_json = requests.post(url=url, headers=head, data=data).json()
    for dic in response_json['list']:  # walk through the entries and keep each ID
        id_list.append(dic['ID'])

# Steps 2 and 3: send each ID to the shared detail endpoint and keep the returned JSON
post_url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"
for license_id in id_list:  # renamed from id to avoid shadowing the built-in
    post_data = {"id": license_id}
    detail_json = requests.post(url=post_url, data=post_data, headers=head).json()
    all_datalist.append(detail_json)

# Step 4: store the data; the with block closes the file when done
with open('./alldata.json', 'w', encoding='utf-8') as fp:
    json.dump(all_datalist, fp=fp, ensure_ascii=False)
print("over!")
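
A possible refinement, sketched here under stated assumptions rather than taken from the original notes: the list response already carries a pageCount field, so the page total can be read from the first response instead of being typed in. The function `crawl`, its `post` parameter (any callable taking a URL and form data and returning the decoded JSON), and the `fake_post` stand-in are all hypothetical names introduced so the control flow can be tried without network access.

```python
import time

LIST_API = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
DETAIL_API = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0"
}


def list_form(page):
    """Form parameters for one page of the list endpoint."""
    return {"on": "true", "page": page, "pageSize": "15", "productName": "",
            "conditionType": "1", "applyname": "", "applysn": ""}


def crawl(post, max_pages=None, delay=0.0):
    """Collect detail records, taking the page count from the first response."""
    first = post(LIST_API, list_form(1))
    page_count = first["pageCount"]
    if max_pages is not None:
        page_count = min(page_count, max_pages)

    ids = [entry["ID"] for entry in first["list"]]
    for page in range(2, page_count + 1):
        time.sleep(delay)  # pause between requests to go easy on the server
        ids.extend(entry["ID"] for entry in post(LIST_API, list_form(page))["list"])

    details = []
    for license_id in ids:
        time.sleep(delay)
        details.append(post(DETAIL_API, {"id": license_id}))
    return details


def requests_post(url, data):
    """Real network call; needs the third-party requests package."""
    import requests
    response = requests.post(url, data=data, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json()


def fake_post(url, data):
    """Offline stand-in that imitates the shape of the two responses."""
    if url == LIST_API:
        return {"pageCount": 3, "list": [{"ID": "id-%d" % data["page"]}]}
    return {"id": data["id"]}


print(crawl(fake_post, max_pages=2))
```

To try it against the real site, `requests_post` would be passed instead, for example `crawl(requests_post, max_pages=1, delay=1.0)`; whether the endpoint still responds as it did when these notes were written is not something this sketch can guarantee.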
