Python crawler: scraping Baidu Wenku VIP content

Reposted from: Scraping Baidu Wenku (爬取百度文库)

Code implementation

import requests
import re
import json
import os

# One shared session so cookies persist across requests.
session = requests.session()


def fetch_url(url):
    # The Wenku pages are served as GBK-encoded text.
    return session.get(url).content.decode('gbk')


def get_doc_id(url):
    return re.findall('view/(.*).html', url)[0]


def parse_type(content):
    return re.findall(r"docType.*?\:.*?\'(.*?)\'\,", content)[0]


def parse_title(content):
    return re.findall(r"title.*?\:.*?\'(.*?)\'\,", content)[0]


def parse_doc(content):
    # 'doc' type: collect the per-page JSON URLs embedded in the page source.
    result = ''
    url_list = re.findall('(https.*?0.json.*?)\\\\x22}', content)
    url_list = [addr.replace("\\\\\\/", "/") for addr in url_list]
    for url in url_list[:-5]:
        content = fetch_url(url)
        y = 0
        txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),', content)
        for item in txtlists:
            # A change in the y coordinate marks the start of a new line.
            if not y == item[1]:
                y = item[1]
                n = '\n'
            else:
                n = ''
            result += n
            result += item[0].encode('utf-8').decode('unicode_escape', 'ignore')
    return result


def parse_txt(doc_id):
    # 'txt' type: read the document metadata, then request the plain text.
    content_url = 'https://wenku.baidu.com/api/doc/getdocinfo?callback=cb&doc_id=' + doc_id
    content = fetch_url(content_url)
    md5 = re.findall('"md5sum":"(.*?)"', content)[0]
    pn = re.findall('"totalPageNum":"(.*?)"', content)[0]
    rsign = re.findall('"rsign":"(.*?)"', content)[0]
    content_url = 'https://wkretype.bdimg.com/retype/text/' + doc_id + '?rn=' + pn + '&type=txt' + md5 + '&rsign=' + rsign
    content = json.loads(fetch_url(content_url))
    result = ''
    for item in content:
        for i in item['parags']:
            result += i['c'].replace('\\r', '\r').replace('\\n', '\n')
    return result


def parse_other(doc_id):
    # Any other type (e.g. ppt): download each page as an image.
    content_url = "https://wenku.baidu.com/browse/getbcsurl?doc_id=" + doc_id + "&pn=1&rn=99999&type=ppt"
    content = fetch_url(content_url)
    url_list = re.findall('{"zoom":"(.*?)","page"', content)
    url_list = [item.replace("\\", '') for item in url_list]
    if not os.path.exists(doc_id):
        os.mkdir(doc_id)
    for index, url in enumerate(url_list):
        content = session.get(url).content
        path = os.path.join(doc_id, str(index) + '.jpg')
        with open(path, 'wb') as f:
            f.write(content)
    print("Images saved in folder " + doc_id)


def save_file(filename, content):
    with open(filename, 'w', encoding='utf8') as f:
        f.write(content)
    print('Saved as: ' + filename)


def main():
    url = input('Enter the Wenku document URL to download: ')
    content = fetch_url(url)
    doc_id = get_doc_id(url)
    doc_type = parse_type(content)  # renamed from `type` to avoid shadowing the builtin
    title = parse_title(content)
    if doc_type == 'doc':
        result = parse_doc(content)
        save_file(title + '.txt', result)
    elif doc_type == 'txt':
        result = parse_txt(doc_id)
        save_file(title + '.txt', result)
    else:
        parse_other(doc_id)


if __name__ == "__main__":
    main()
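
The line-reassembly step inside parse_doc can be tried offline. The sketch below is only an illustration: SAMPLE is a hand-written string shaped like the page JSON the regex expects (the real response format is not shown in this post), and join_fragments is a hypothetical helper name that mirrors the inner loop of parse_doc.

```python
import re

# Hypothetical fragment, invented for this example. Each "c" is a text
# fragment and "y" is its vertical position on the page.
SAMPLE = (
    '{"c":"Hello ","p":{"h":18,"y":100,"x":10},'
    '"c":"world","p":{"h":18,"y":100,"x":60},'
    '"c":"\\u4f60\\u597d","p":{"h":18,"y":130,"x":10}}'
)


def join_fragments(page_json):
    """Join text fragments, starting a new line whenever y changes."""
    result = ''
    last_y = None
    for text, y in re.findall(r'"c":"(.*?)".*?"y":(.*?),', page_json):
        if y != last_y:
            last_y = y
            result += '\n'
        # Fragments carry \uXXXX escapes; decode them into real characters.
        result += text.encode('utf-8').decode('unicode_escape', 'ignore')
    return result


print(repr(join_fragments(SAMPLE)))
```

The first two fragments share y = 100 and end up on one line. The third has y = 130, so it starts a new line, and its \u escapes are decoded into Chinese characters.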

