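Introduction

This script only supports scraping Word documents from Baidu Wenku. The extracted text is written to a Word document or a plain text file (.txt), and the work is done mainly with Python's requests library.
requests is one of the most popular and convenient HTTP request libraries for Python scraping; the standard library's urllib package is another common choice. Beyond request libraries, the Python scraping ecosystem also includes the parsing libraries lxml and Beautiful Soup and the scraping framework Scrapy.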

Requesting the URL

This section shows how to set request headers and how to fetch a document page by page. In most cases the User-Agent header alone is enough.

def get_url(self):
    url = input("Enter the Baidu Wenku document URL: ")
    headers = {
        # Media types the client can accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Content encodings the client supports (decoding 'br' requires the brotli package)
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the client prefers
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Caching directive; max-age=0 asks for revalidation
        'Cache-Control': 'max-age=0',
        # Keep the connection open for further requests until one side closes it
        'Connection': 'keep-alive',
        # Domain name of the server
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Client identification string
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The page embeds a JSON array that lists one entry per document page
    json_data = re.findall(r'"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    for index, page_load_urls in enumerate(json_data):
        page_load_url = page_load_urls['pageLoadUrl']
        self.get_data(index, page_load_url)
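
The extraction step in get_url can be tried offline. The sketch below applies the same regular expression to a made-up page fragment. SAMPLE_HTML, its example.com URLs, and the helper name extract_page_urls are assumptions for illustration, not real Baidu Wenku output, and the live page layout may differ.

```python
import json
import re

# Hypothetical page fragment; the real page is assumed to embed a similar "json" array.
SAMPLE_HTML = (
    'var pageData = {"title":"demo","json":['
    '{"pageIndex":1,"pageLoadUrl":"https://example.com/0.json?p=1"},'
    '{"pageIndex":2,"pageLoadUrl":"https://example.com/0.json?p=2"}'
    '],"other":1};'
)


def extract_page_urls(html):
    """Return the pageLoadUrl of every page listed in the embedded JSON array."""
    # Non-greedy match: capture from after "json": up to the first "}]".
    matches = re.findall(r'"json":(.*?}])', html)
    if not matches:
        raise ValueError('no "json" array found in the page')
    pages = json.loads(matches[0])
    return [page["pageLoadUrl"] for page in pages]


page_urls = extract_page_urls(SAMPLE_HTML)
for index, page_url in enumerate(page_urls):
    print(index, page_url)
```

Checking for an empty match list before indexing gives a readable error when the page layout changes, instead of an IndexError from [0].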

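Scraping the data

The method below fetches each page from the server and writes the extracted text to a file. To produce a text file instead, change .docx to .txt in with open('百度文库.docx', 'a', encoding='utf-8'). Because open() writes plain UTF-8 text, the .txt extension describes the output more accurately. Line breaks between pages are not added yet.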

def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    data = response.content.decode('unicode_escape')
    # Each page is wrapped in a callback named wenku_<page number>(...)
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    json_data = json.loads(json_data)
    # Collect the text fragment ("c") of every item in the page body
    result = []
    for item in json_data['body']:
        result.append(item["c"])
    text = ''.join(result).replace('    ', '\n')
    print(text)
    print("")
    # Append the page text to the output file
    with open('百度文库.docx', 'a', encoding='utf-8') as f:
        f.write(text)
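
The parsing step in get_data can also be exercised without a network request. The following is a minimal sketch under stated assumptions: SAMPLE_JSONP is a made-up response in the wenku_N({...}) shape that the method expects, and parse_page is a hypothetical helper name, not part of the original script.

```python
import json
import re

# Hypothetical response body; real responses are assumed to follow the same wrapper shape.
SAMPLE_JSONP = (
    'wenku_1({"body":[{"c":"Hello","t":"word"},{"c":" world","t":"word"}],'
    '"page":{"v":1}})'
)


def parse_page(index, text):
    """Unwrap the wenku_<n>(...) callback and join the "c" field of each body item."""
    callback = "wenku_" + str(index + 1)
    # Capture everything between the callback's parentheses, ending at the first "}})".
    matches = re.findall(re.escape(callback) + r"\((.*?}})\)", text)
    if not matches:
        raise ValueError("callback %s not found in response" % callback)
    payload = json.loads(matches[0])
    return "".join(item["c"] for item in payload["body"])


page_text = parse_page(0, SAMPLE_JSONP)
print(page_text)
```

The pattern is written as a raw string so that \( and \) reach the regex engine unchanged, which avoids the invalid-escape warning that newer Python versions raise for "\(" in a normal string.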

Full code

import json
import re

import requests


class WenKu:
    def __init__(self):
        # One session is reused for every request
        self.session = requests.Session()

    def get_url(self):
        url = input("Enter the Baidu Wenku document URL: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # The page embeds a JSON array that lists one entry per document page
        json_data = re.findall(r'"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        for index, page_load_urls in enumerate(json_data):
            page_load_url = page_load_urls['pageLoadUrl']
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        data = response.content.decode('unicode_escape')
        # Each page is wrapped in a callback named wenku_<page number>(...)
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        json_data = json.loads(json_data)
        # Collect the text fragment ("c") of every item in the page body
        result = []
        for item in json_data['body']:
            result.append(item["c"])
        text = ''.join(result).replace('    ', '\n')
        print(text)
        print("")
        # Append the page text to the output file
        with open('百度文库.docx', 'a', encoding='utf-8') as f:
            f.write(text)


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()
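
The script appends each page directly after the previous one, with no line break in between. One way the missing line break could be added is sketched below. append_page, the file name wenku.txt, and the page strings are hypothetical, and the output is plain UTF-8 text rather than a true Word file.

```python
import os
import tempfile


def append_page(path, text):
    """Append one page of text to path, making sure it ends with a single newline."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text.rstrip("\n") + "\n")


# Demonstration in a temporary directory, using made-up page contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "wenku.txt")
    for page_text in ["first page", "second page"]:
        append_page(out_path, page_text)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

In the script itself, the same effect could be had by calling such a helper from get_data in place of the bare f.write call.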


Scraping Baidu Wenku text with Python and writing it to a Word document
