Converting WeChat Official Account articles to PDF with Python

Extract the hyperlinks from a docx file, then convert each linked article to a PDF.

https://www.bbsmax.com/A/Ae5RRb7m5Q/

import zipfile

import pdfkit
import wechatsogou
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_linked_text(soup):
    """Collect the hyperlinks found in a parsed word/document.xml."""
    links = []

    # This kind of link has a corresponding URL in the _rels file.
    for tag in soup.find_all("hyperlink"):
        # try/except because some hyperlinks have no id.
        try:
            links.append({"id": tag["r:id"], "text": tag.text})
        except KeyError:
            pass

    # This kind does not.
    for tag in soup.find_all("instrText"):
        # They're identified by the word HYPERLINK.
        if "HYPERLINK" in tag.text:
            # Get the URL. Probably.
            url = tag.text.split('"')[1]

            # The actual linked text is stored in nearby tags.
            # Loop through the siblings starting here.
            temp = tag.parent.next_sibling
            text = ""
            while temp is not None:
                # Skip bare strings between elements; only tags can be searched.
                if isinstance(temp, Tag):
                    # Text comes in <t> tags.
                    maybe_text = temp.find("t")
                    # Keep only the ones that have text in them.
                    if maybe_text is not None and maybe_text.text.strip() != "":
                        text += maybe_text.text.strip()

                    # Links end with <w:fldChar w:fldCharType="end" />.
                    maybe_end = temp.find("fldChar")
                    if maybe_end is not None and maybe_end.get("w:fldCharType") == "end":
                        break
                temp = temp.next_sibling

            links.append({"id": None, "href": url, "text": text})

    return links


if __name__ == '__main__':
    ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)

    file_name = "1公告消息.docx"
    with zipfile.ZipFile(file_name, "r") as archive:
        file_data = archive.read("word/document.xml")
    doc_soup = BeautifulSoup(file_data, "xml")
    linked_text = get_linked_text(doc_soup)

    # Only field-code links carry an 'href'. Links that have just a
    # relationship id would raise KeyError here, so they are skipped.
    urls = [link["href"] for link in linked_text if link.get("href")]

    path_wkhtmltopdf = r'D:\soft\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)

    for j, url in enumerate(urls):
        try:
            content_info = ws_api.get_article_content(url)
        except Exception as exc:
            # Without the article content there is nothing to render.
            print(f"Failed to fetch {url}: {exc}")
            continue

        html = f'''<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>{j}</title></head>
<body>
<h2 style="text-align: center;font-weight: 400;">{j}</h2>
{content_info['content_html']}
</body>
</html>'''
        pdfkit.from_string(html, 'D:/a/' + str(j) + '.pdf', configuration=config)
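
The script above only converts links that come from HYPERLINK field codes. For `<w:hyperlink>` elements, `get_linked_text` records a relationship id but no URL, because the target is stored in the relationships part of the package. Below is a minimal sketch of how those ids could be resolved using only the standard library. The helper names `load_hyperlink_targets` and `resolve_links` are hypothetical and not part of the original script. The sketch also assumes the usual (transitional) OOXML namespace URIs and that the relationships live in `word/_rels/document.xml.rels`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of OPC relationship parts (assumes a transitional OOXML file).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that marks a hyperlink target.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def load_hyperlink_targets(docx_file):
    """Return {relationship id: target URL} for the hyperlinks in a docx.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file, "r") as archive:
        rels_xml = archive.read("word/_rels/document.xml.rels")
    root = ET.fromstring(rels_xml)
    targets = {}
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        # Other relationship types (styles, images, ...) are ignored.
        if rel.get("Type") == HYPERLINK_TYPE:
            targets[rel.get("Id")] = rel.get("Target")
    return targets


def resolve_links(links, targets):
    """Give every link an 'href', dropping those that cannot be resolved."""
    resolved = []
    for link in links:
        # Prefer an href that is already present, else look up the id.
        href = link.get("href") or targets.get(link.get("id"))
        if href:
            resolved.append({**link, "href": href})
    return resolved
```

With these helpers, `resolve_links(get_linked_text(doc_soup), load_hyperlink_targets(file_name))` would yield a list in which both kinds of link carry an `href`, so the conversion loop could process all of them.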
