Scraping Wang Yin's Blog and Generating PDFs

This script is not yet polished and still needs improvement.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'jiangwenwen'

import time

import pdfkit
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Request headers
ua = UserAgent()
headers = {
    "cache-control": "no-cache",
    "Host": "www.yinwang.org",
    "User-Agent": ua.random,
    "Referer": "http://www.yinwang.org/",
}

# Proxy IP pool
ip_pool = [
    '123.55.114.217:9999',
    '110.52.235.91:9999',
    '183.163.43.61:9999',
    '119.101.126.52:9999',
    '119.101.124.165:9999',
    '119.101.125.38:9999',
    '119.101.125.84:9999',
    '110.52.235.80:9999',
    '119.101.125.49:9999',
    '110.52.235.162:9999',
    '119.101.124.23:9999',
]


# Print a page to PDF
def print_pdf(url, file_name):
    start = time.time()
    print("Printing...")
    # Pick a fresh random User-Agent for every request
    headers["User-Agent"] = ua.random
    print("User-Agent: {0}".format(headers["User-Agent"]))
    content = requests.get(url, headers=headers, timeout=3,
                           proxies=get_proxy(ip_pool)).text
    pdfkit.from_string(content, file_name)
    end = time.time()
    print("Printed successfully, took %0.2f seconds" % (end - start))


# Find a working proxy
def get_proxy(ip_pool):
    url = "http://www.yinwang.org/"
    for ip in ip_pool:
        # Use requests to check whether this proxy is usable
        try:
            requests.get(url, proxies={"http": "http://{}".format(ip)}, timeout=3)
        except requests.RequestException:
            continue
        return {
            "http": "http://{}".format(ip),
            "https": "http://{}".format(ip),
        }
    # No proxy responded; requests treats None as "connect directly"
    return None


response = requests.get("http://www.yinwang.org/", headers=headers,
                        proxies=get_proxy(ip_pool), timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')
tags = soup.find_all("li", class_="list-group-item title")
for child in tags:
    article_url = "http://www.yinwang.org" + child.a.get('href')
    # "桌面" means "Desktop"; this relative directory must already exist
    article_file_name = "桌面\\" + child.a.get_text(strip=True) + ".pdf"
    print_pdf(article_url, article_file_name)
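
The script builds each article URL by string concatenation and uses the link text directly as the PDF file name. A title containing characters such as `?`, `:` or `/` would probably fail on Windows, and an absolute `href` would produce a malformed URL. The following is only a rough sketch of one way to handle both cases using the standard library; the helper names `article_url` and `pdf_path`, and the default output directory `pdf`, are hypothetical and not part of the original script.

```python
import re
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = "http://www.yinwang.org/"

# Characters that Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def article_url(href, base=BASE_URL):
    """Resolve a link from the index page against the site root (hypothetical helper)."""
    return urljoin(base, href)


def pdf_path(title, out_dir="pdf"):
    """Build an output path from an article title (hypothetical helper)."""
    # Replace illegal characters; fall back to a placeholder for blank or missing titles.
    name = _ILLEGAL.sub("_", (title or "").strip()) or "untitled"
    return Path(out_dir) / (name + ".pdf")
```

In the main loop, the two concatenations could then become `article_url(child.a.get('href'))` and `str(pdf_path(child.a.get_text()))`. This sketch does not create the output directory and does not handle every Windows naming rule (reserved names such as `CON`, or trailing dots), so treat it as a starting point.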

Reposted from: https://www.cnblogs.com/jiangwenwen1/p/10328339.html
