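I originally wanted to scrape classical Chinese poetry from a certain website, but the site is limited: only ten pages of data can be fetched, and anything beyond that makes the endpoint return a 500. I then noticed that the site also has a mobile app, so I captured the app's API calls with Fiddler and scraped roughly 100,000 classical poems through those endpoints.
It is written in plain Python (requests, plus a bit of regex cleanup) with no framework. The table design also has some problems, but I couldn't be bothered to optimize it. Enough talk, here is the code.
I wrote several versions at the time and have forgotten which one is complete, so I pasted one that is close enough.
In the end I scraped everything in the app: shi, ci, qu, the Four Books and Five Classics and so on, together with their translations, annotations and appreciations.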

import random
import re
import time

import pymysql
import requests

requests.packages.urllib3.disable_warnings()

# Free HTTP proxies. Note: every entry only has an 'http' key, so requests
# does not route the https:// API calls below through them.
proxy_list = [
    {'http': '118.113.246.131:9999'}, {'http': '36.249.109.18:9999'}, {'http': '114.104.142.65:9999'},
    {'http': '113.128.31.217:9999'}, {'http': '171.12.112.155:9999'}, {'http': '1.197.16.218:9999'},
    {'http': '182.34.36.100:9999'}, {'http': '36.249.119.34:9999'}, {'http': '175.43.59.4:9999'},
    {'http': '113.124.87.65:9999'}, {'http': '125.108.81.211:9999'}, {'http': '175.42.68.194:9999'},
    {'http': '183.166.97.166:9999'}, {'http': '180.118.128.112:9000'}, {'http': '60.13.42.151:9999'},
    {'http': '182.149.83.194:9999'}, {'http': '60.205.132.71:80'}, {'http': '120.79.64.147:8118'},
    {'http': '121.232.194.144:9000'}, {'http': '171.35.160.55:9999'}, {'http': '36.248.129.82:9999'},
    {'http': '171.15.48.137:9999'}, {'http': '163.204.245.210:9000'}, {'http': '117.88.5.116:3000'},
    {'http': '144.123.71.3:9999'}, {'http': '125.108.81.211:9999'}, {'http': '120.234.138.102:53779'},
    {'http': '175.42.68.194:9999'}, {'http': '120.83.105.247:9999'}, {'http': '112.111.217.56:9999'},
]


def get_json(url):
    # Keep retrying (with a new random proxy each time) until the API answers 200
    while True:
        proxy = random.choice(proxy_list)
        response = requests.get(url, verify=False, proxies=proxy)
        if response.status_code == 200:
            return response.json()
        print(str(response.status_code))
        print('Waiting 1 second...')
        time.sleep(1)


def get_Yijson(url):
    # Single attempt; returns '' when the API does not answer 200
    proxy = random.choice(proxy_list)
    response = requests.get(url, verify=False, proxies=proxy)
    if response.status_code == 200:
        return response.json()
    return ''


# Scrape author names and biographies
def get_author():
    authorList = []
    for i in range(1, 101):
        data = get_json('https://app.gushiwen.cn:443/api/author/Default10.aspx?c=&page=' + str(i) + '&token=gswapi')
        for author in data.get('authors'):
            authorList.append([author.get('nameStr'), author.get('cont'), author.get('idnew')])
    return authorList


# Collect the idnew of every poem by the given author
def get_shi(authorName):
    print(authorName)
    shiciList = []
    url = 'https://app.gushiwen.cn:443/api/shiwen/Default11.aspx?token=gswapi&page=0&astr=' + str(authorName)
    data = get_json(url)
    if not data:
        print('Nothing fetched')
        return shiciList
    sumCount = data.get('sumCount')  # number of poems
    sumPage = data.get('sumPage')  # number of pages
    print(sumCount, sumPage)
    for page in range(1, sumPage + 1):
        print('Page ' + str(page))
        url = 'https://app.gushiwen.cn:443/api/shiwen/Default11.aspx?token=gswapi&page=' + str(page) + '&astr=' + str(authorName)
        pageJson = get_json(url)
        time.sleep(0.3)
        if not pageJson:
            print('Nothing fetched')
            continue
        for gushiwen in pageJson.get('gushiwens'):
            idnew = gushiwen['idnew']
            print(idnew)
            shiciList.append(idnew)
    print(str(len(shiciList)))
    return shiciList


# Fetch the content of a single work
def gei_authorContent(idnews):
    url = 'https://app.gushiwen.cn:443/api/shiwen/shiwenv11.aspx?id=' + str(idnews) + '&token=gswapi'
    data = get_json(url)
    time.sleep(0.3)
    gushiwen = data.get('tb_gushiwen')
    nameStr = gushiwen.get('nameStr')  # title
    author = gushiwen.get('author')  # author
    chaodai = gushiwen.get('chaodai')  # dynasty
    cont = gushiwen.get('cont')  # text of the poem
    tag = gushiwen.get('tag')  # e.g. "高中古诗|乐府|唐诗三百首|抒情|哲理|忧愤"
    poem_type = '文言文'  # category stored in the type column
    langsongAuthor = gushiwen.get('langsongAuthor')  # reciter
    langsongAuthorPY = gushiwen.get('langsongAuthorPY')  # reciter, in pinyin
    fanyiList = data.get('tb_fanyis').get('fanyis')
    fanyicont = ''  # translation text
    fanyicankao = ''  # translation references
    if len(fanyiList) > 0:
        fanyicont = fanyiList[0].get('cont')
        fanyicankao = fanyiList[0].get('cankao')
    shangxiList = []  # appreciations
    for shangxi in data.get('tb_shangxis').get('shangxis'):
        shangxiList.append({
            'nameStr': shangxi['nameStr'],
            'cont': re.sub(r'\u3000', '', shangxi['cont']),  # strip full-width spaces
            'cankao': shangxi['cankao'],
        })
    # Author introduction
    # authorjieshao = data.get('tb_author').get('cont')
    return (str(nameStr), str(author), str(chaodai), str(cont), str(tag), str(poem_type),
            str(langsongAuthor), str(langsongAuthorPY), str(fanyicont), str(fanyicankao), str(shangxiList))


def get_yizhushang(idnews):
    # Disclaimer appended to the references by the site; stripped before storing
    resub = r'译注内容由匿名网友上传,原作者已无法考证。古诗文网免费发布仅供学习参考,其观点不代表古诗文网立场。邮箱:service@gushiwen.org'
    base = 'https://app.gushiwen.cn:443/api/shiwen/ajaxshiwencont11.aspx?token=gswapi&idnew=' + str(idnews) + '&value='
    result = []
    # Same (cont, cankao) order as the columns of fa_yizhushang
    for value in ('yi', 'zhu', 'shang', 'yizhu', 'yishang', 'zhushang', 'yizhushang'):
        try:
            data = get_Yijson(base + value)
            cont = data.get('cont')
            cankao = re.sub(resub, '', data.get('cankao'))
        except Exception:
            cont = ''
            cankao = ''
        result.extend([str(cont), str(cankao)])
    return tuple(result)


def conn_mysql():
    # Keyword arguments: recent PyMySQL releases no longer accept positional ones
    return pymysql.connect(host='127.0.0.1', user='root', password='root', database='gushiwen')


def get_authorziliaos(idnew):
    url = 'https://app.gushiwen.cn/api/author/author10.aspx?id=' + str(idnew) + '&token=gswapi'
    data = get_json(url)
    ziliaoList = []  # author materials
    for ziliao in data.get('tb_ziliaos').get('ziliaos'):
        ziliaoList.append({
            'nameStr': ziliao['nameStr'],
            'cont': re.sub(r'\u3000', '', ziliao['cont']),
            'cankao': ziliao['cankao'],
        })
    return str(ziliaoList)


def get_ci():
    id_list = []
    for i in range(1, 101):
        data = get_json('https://app.gushiwen.cn/api/shiwen/Default11.aspx?xstr=%E8%AF%8D&page=' + str(i) + '&token=gswapi')
        for gushiwen in data.get('gushiwens'):
            id_list.append(gushiwen['idnew'])
    print(id_list)
    # for i in id_list:
    #     print(i)
    #     data = get_json('https://app.gushiwen.cn/api/shiwen/shiwenv11.aspx?t=1568883126&id=' + str(i) + '&token=gswapi')


def getQuList():
    id_list = []
    for i in range(1, 81):
        # data = get_json('https://app.gushiwen.cn/api/shiwen/Default11.aspx?xstr=%E6%9B%B2&page=' + str(i) + '&token=gswapi')  # qu
        # data = get_json('https://app.gushiwen.cn/api/shiwen/Default11.aspx?xstr=%E8%AF%8D&page=' + str(i) + '&token=gswapi')  # ci
        data = get_json('https://app.gushiwen.cn/api/shiwen/Default11.aspx?xstr=%E6%96%87%E8%A8%80%E6%96%87&page=' + str(i) + '&token=gswapi')  # classical prose (文言文)
        for gushiwen in data.get('gushiwens'):
            id_list.append(gushiwen['idnew'])
    return id_list


# Import the works
if __name__ == '__main__':
    db = conn_mysql()
    a = 1
    b = 1
    id_list = getQuList()
    print(id_list)
    for i in id_list:
        print('Importing work #' + str(a))
        a += 1
        print('Fetching poem content--start')
        print(i)
        try:
            cont = gei_authorContent(i)
        except Exception:
            continue
        print('Fetching poem content--end')
        cursor = db.cursor()
        # cont = (nameStr, author, chaodai, cont, tag, type, langsongAuthor,
        #         langsongAuthorPY, fanyicont, fanyicankao, shangxiList)
        cursor.execute('select cont from fa_shici where author= %s and nameStr=%s', [cont[1], cont[0]])
        number = len(cursor.fetchall())
        print(cont[0])
        if number > 0:
            # Already stored: only update the type
            try:
                print('Updating')
                updateSql = 'update fa_shici set type= %s where cont = %s '
                updateData = ['文言文', cont[3]]
                db.cursor().execute(updateSql, updateData)
                db.commit()
                print('Update done')
            except Exception:
                continue
        else:
            b += 1
            print(str(b))
            # Insert into the fa_shici table
            print('Importing into fa_shici--start')
            authorsSql = 'insert into fa_shici(author,nameStr,chaodai,cont,tag,type,langsongAuthor,langsongAuthorPY,fanyicont,fanyicankao,shangxiList) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
            authorsData = [cont[1], cont[0], cont[2], cont[3], cont[4], cont[5], cont[6], cont[7], cont[8],
                           cont[9], cont[10]]
            db.cursor().execute(authorsSql, authorsData)
            db.commit()
            print('Importing into fa_shici--end')
            print('Fetching translations, annotations and appreciations-start')
            yizhushang = get_yizhushang(i)
            print('Fetching translations, annotations and appreciations-end')
            print('Importing into fa_yizhushang-start')
            # Insert into the fa_yizhushang table; the counter b is used as shici_id
            try:
                yizhushangSql = 'insert into fa_yizhushang(namestr,shici_id,yicont,yicankao,zhucont,zhucankao,shangcont,shangcankao,yizhucont,yizhucankao,yishangcont,yishangcankao,zhushangcont,zhushangcankao,yizhushangcont,yizhushangcankao) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
                yizhushangData = [cont[0], b] + list(yizhushang)
                db.cursor().execute(yizhushangSql, yizhushangData)
                db.commit()
            except Exception:
                print('Importing translations failed')
            print('Importing into fa_yizhushang-end')
    db.close()
    print('Scraping finished!!!')
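One thing worth noting about get_json above: it retries forever when the API keeps answering with a non-200 status. Below is a minimal sketch of a bounded retry with exponential backoff that could replace that loop. It is not part of the original script: fetch_with_retry, its parameters, and the convention that the fetch callable returns None on failure are all assumptions made up for illustration.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], Optional[dict]],
                     max_attempts: int = 5,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call fetch() until it returns a non-None result or attempts run out.

    fetch is a hypothetical zero-argument callable that returns parsed JSON
    on success and None on failure (for example a non-200 response).
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```

Passing the request as a callable keeps the retry logic independent of requests, so it can be tried out without touching the network; the caller then decides what to do when None comes back (skip the poem, log it, and so on).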

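As for the table-design problems I mentioned, one of them is that the script links fa_yizhushang to fa_shici through the hand-maintained counter b, which only works if it happens to match the real row id. The sketch below shows one possible alternative, not what the script actually does: take the id from cursor.lastrowid right after the insert and store one row per kind of note. It uses an in-memory sqlite3 database so it runs on its own; the table and column names (shici, yizhushang, kind) and both helper functions are hypothetical. PyMySQL cursors also expose lastrowid, so the same idea should carry over to MySQL.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE shici (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            author TEXT NOT NULL,
            name_str TEXT NOT NULL,
            cont TEXT NOT NULL,
            UNIQUE (author, name_str)
        );
        CREATE TABLE yizhushang (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            shici_id INTEGER NOT NULL REFERENCES shici(id),
            kind TEXT NOT NULL,   -- 'yi', 'zhu', 'shang', ...
            cont TEXT,
            cankao TEXT
        );
    """)
    return conn


def insert_poem_with_notes(conn, poem, notes):
    """Insert one poem, then its notes keyed by the poem's real row id."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO shici (author, name_str, cont) VALUES (?, ?, ?)",
        (poem["author"], poem["name_str"], poem["cont"]),
    )
    shici_id = cur.lastrowid  # id generated by the database, not a counter
    cur.executemany(
        "INSERT INTO yizhushang (shici_id, kind, cont, cankao) VALUES (?, ?, ?, ?)",
        [(shici_id, kind, cont, cankao) for kind, (cont, cankao) in notes.items()],
    )
    conn.commit()
    return shici_id
```

Here notes is a dict such as {"yi": (cont, cankao), "zhu": (cont, cankao)}, which replaces the fourteen fixed columns with one row per note type.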