Scraping Web Pages with Python and Saving Them as PDF

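1. Setting up the development environment
(1) Installing Python 2.7.13: see Liao Xuefeng's website (廖雪峰).
(2) Installing pip, the Python package manager: see the pip installation documentation.
Python 2.7.9 and later ship with pip, so with 2.7.13 there is no need to install it separately, but it does need to be upgraded. The pip documentation above explains how.
(3) A PDF tool for Python: PyPDF2.
Install command: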

pip install PyPDF2

Documentation: PyPDF2 1.26.0 documentation

A simple PyPDF2 example:

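from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
# Open the input files in binary mode and append them in order
input1 = open("1.pdf", "rb")
input2 = open("2.pdf", "rb")
merger.append(input1)
merger.append(input2)
# Write the merged result to the output PDF
with open("hql_all.pdf", "wb") as output:
    merger.write(output)
# Release the merger and the input files once the output is written
merger.close()
input1.close()
input2.close()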
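
The order in which files are appended determines the page order of the merged PDF. When the inputs are numbered files collected from a directory listing rather than named one by one, plain string sorting puts 10.pdf before 2.pdf. The following is a minimal sketch of numeric ordering. The helper name sort_numbered and the sample file names are hypothetical and are not part of PyPDF2.

```python
import re


def sort_numbered(names):
    """Sort file names such as '10.pdf' by the first integer they contain."""
    def key(name):
        match = re.search(r"\d+", name)
        # Names without a number sort first, then ties are broken by name
        return (int(match.group()) if match else -1, name)
    return sorted(names, key=key)


if __name__ == "__main__":
    # Hypothetical file names; plain sorted() would order them 1, 10, 2
    print(sort_numbered(["10.pdf", "2.pdf", "1.pdf"]))
```

The sorted list can then be passed to merger.append one file at a time, as in the example above.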

(4) A Microsoft Word 2007 (.docx) tool for Python: python-docx.

pip install python-docx

Documentation: python-docx 0.8.6 documentation

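(5) Installing the dependencies
requests and beautifulsoup are the two workhorses of web scraping: requests handles the network requests, and beautifulsoup works with the HTML data. With these two in hand the job goes smoothly. A scraping framework such as scrapy is not needed here; for a program this small it would be overkill. Since the goal is to convert HTML files to PDF, a library for that is needed as well. wkhtmltopdf is a very good tool that converts HTML to PDF on multiple platforms, and pdfkit is a Python wrapper around wkhtmltopdf. First install the following dependencies: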

pip install requests
pip install beautifulsoup4
pip install pdfkit

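(6) Installing wkhtmltopdf manually
On Windows, download the stable release of wkhtmltopdf from http://wkhtmltopdf.org/downloads.html and install it. After installation, add the directory that contains the executable to the system PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and fails with the error "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line: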

$ sudo apt-get install wkhtmltopdf  # Ubuntu
$ sudo yum install wkhtmltopdf      # CentOS
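
If pdfkit still reports "No wkhtmltopdf executable found" after installation, it can help to check whether the executable is actually reachable through PATH. The sketch below does this with the standard library only, so it runs on Python 2.7 as well as Python 3. The function name find_executable is a hypothetical helper written for this check, not a pdfkit API.

```python
import os


def find_executable(name, path=None):
    """Return the full path of `name` if it is found on PATH, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows the executable carries an .exe suffix
    candidates = [name, name + ".exe"] if os.name == "nt" else [name]
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for candidate in candidates:
            full = os.path.join(directory, candidate)
            if os.path.isfile(full) and os.access(full, os.X_OK):
                return full
    return None


if __name__ == "__main__":
    location = find_executable("wkhtmltopdf")
    if location is None:
        print("wkhtmltopdf is not on PATH")
    else:
        print("wkhtmltopdf found at " + location)
```

As an alternative to editing PATH, pdfkit also accepts an explicit location through pdfkit.configuration(wkhtmltopdf=...), whose result is passed to the conversion functions as the configuration argument.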

2. Source code

# coding=utf-8
from __future__ import print_function

import logging
import os
import re
import time

import pdfkit
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger

# Unicode template, so that non-ASCII page content can be inserted safely
html_template = u"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""


def parse_url_to_html(url, name):
    """
    Fetch a URL and save its main content as an HTML file.
    :param url: URL to parse
    :param name: file name of the HTML file to save
    :return: the saved file name, or None if parsing failed
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Main content
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()

        # Put the title at the top of the content, centered
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        # decode() returns a unicode string on both Python 2 and Python 3
        html = body.decode()

        # Rewrite relative src paths of img tags in the body to absolute URLs
        pattern = "(<img .*?src=\")(.*?)(\")"

        def func(m):
            # group(2) is the src value; group(3) is only the closing quote
            if not m.group(2).startswith("http"):
                return m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
            return m.group(1) + m.group(2) + m.group(3)

        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception:
        logging.error("Parse error", exc_info=True)
        return None


def get_url_list():
    """
    Collect the URLs of all pages listed in the tutorial's table of contents.
    :return: list of URLs
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls


def save_pdf(htmls, file_name):
    """
    Convert one HTML file, or a list of HTML files, to a PDF file.
    :param htmls: HTML file name or list of HTML file names
    :param file_name: PDF file name
    """
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls, file_name, options=options)


def main():
    start = time.time()
    file_name = u"liaoxuefeng_Python3_tutorial"
    urls = get_url_list()

    # Download every page; pages that fail to parse are skipped
    htmls = []
    for index, url in enumerate(urls):
        name = parse_url_to_html(url, str(index) + ".html")
        if name is not None:
            htmls.append(name)

    # Convert each HTML file to its own PDF
    # (the page count is no longer hard-coded as 124)
    pdfs = []
    for i, html in enumerate(htmls):
        pdf = file_name + str(i) + '.pdf'
        save_pdf(html, pdf)
        pdfs.append(pdf)
        print(u"Converted HTML file %d" % i)

    # Merge the per-page PDFs into a single document
    merger = PdfFileMerger()
    inputs = []
    for i, pdf in enumerate(pdfs):
        f = open(pdf, 'rb')
        inputs.append(f)
        merger.append(f)
        print(u"Merged PDF %d: %s" % (i, pdf))

    with open(u"廖雪峰Python_all.pdf", "wb") as output:
        merger.write(output)
    print(u"PDF written successfully!")

    # Close the inputs before deleting them (open files cannot be removed on Windows)
    merger.close()
    for f in inputs:
        f.close()

    for html in htmls:
        os.remove(html)
        print(u"Removed temporary file " + html)
    for pdf in pdfs:
        os.remove(pdf)
        print(u"Removed temporary file " + pdf)

    total_time = time.time() - start
    print(u"Total time: %f seconds" % total_time)


if __name__ == '__main__':
    main()
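
The image-path rewriting step in parse_url_to_html can be tried in isolation, without any network access. The sketch below is a variation on that step rather than a copy of it: it uses urljoin from the standard library instead of string concatenation, so relative paths without a leading slash are resolved as well. The function name absolutize_img_src and the sample HTML are made up for illustration.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Three groups: everything up to the src value, the value, the closing quote
IMG_SRC = re.compile(r'(<img [^>]*?src=")([^"]*)(")')


def absolutize_img_src(html, base_url):
    """Rewrite relative img src attributes in `html` against `base_url`."""
    def replace(match):
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)
    return IMG_SRC.sub(replace, html)


if __name__ == "__main__":
    sample = '<p><img alt="a" src="/files/a.png"><img src="http://example.com/b.png"></p>'
    print(absolutize_img_src(sample, "http://www.liaoxuefeng.com/wiki/page"))
```

In the sample, the first image is rewritten to http://www.liaoxuefeng.com/files/a.png and the second, already absolute, is left untouched.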

3. Learning resources
(1) Runoob tutorial (菜鸟教程)
(2) Official Python 2.7 documentation
