Scraping Duzhe (读者) Magazine with Python and Building a PDF

After learning a bit of BeautifulSoup, I wrote a web crawler that scrapes Duzhe magazine and uses reportlab to turn the articles into a PDF.

crawler.py

#!/usr/bin/env python
#coding=utf-8
"""
Author: Anemone
Filename: crawler.py
Last modified: 2015-02-19 16:47
E-mail: [email protected]
"""
import urllib2
from bs4 import BeautifulSoup
import re
import sys

# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('utf-8')

def getEachArticle(url):
    """Download one article page and return its title, writer, source and body."""
    # Example URL: http://www.52duzhe.com/2015_01/duzh20150104.html
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)  #.decode("utf-8").encode("gbk"))
    #for i in soup.find_all('div'):
    #    print i,1
    title = soup.find("h1").string
    writer = soup.find(id="pub_date").string.strip()
    _from = soup.find(id="media_name").string.strip()
    text = soup.get_text()  #.encode("utf-8")
    # The article body is the text that follows the BAIDU_CLB ad script call.
    main = re.split("BAIDU_CLB.*;", text)
    result = {"title": title, "writer": writer, "from": _from, "context": main[1]}
    return result

#new=open("new.txt","w")
#new.write(result["title"]+"\n\n")
#new.write(result["writer"]+" "+result["from"])
#new.write(result["context"])
#new.close()

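def getCatalog(issue):
    """Return {column name: [article dict, ...]} for an issue such as "201424"."""
    url = "http://www.52duzhe.com/" + issue[:4] + "_" + issue[-2:] + "/"
    duzhe = dict()
    # Follow the first link in the index page's table to the page that lists the columns.
    response = urllib2.urlopen(url + "index.html")
    html = response.read()
    soup = BeautifulSoup(html)
    firstUrl = url + soup.table.a.get("href")
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup = BeautifulSoup(html)
    # Each <h2> is a column heading; its parent element holds the column's article links.
    for i in soup.find_all("h2"):
        print i.string
        duzhe[i.string] = list()
        for link in i.parent.find_all("a"):
            href = url + link.get("href")
            print href
            # Retry a failed download a few times, then skip the article,
            # rather than looping forever on a page that cannot be parsed.
            for attempt in range(3):
                try:
                    duzhe[i.string].append(getEachArticle(href))
                    break
                except Exception:
                    continue
    return duzhe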

def readDuZhe(duzhe):
    """Print the title of every downloaded article."""
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            print eachArticle["title"]

if __name__ == '__main__':
    # issue=raw_input("issue(201501):")
    readDuZhe(getCatalog("201424"))
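
Two spots in the crawler are fragile: the body is taken as the second piece of a regex split, which raises IndexError when the marker is missing, and each article download is retried inside a loop. The following is a minimal sketch of both steps as standalone helpers, written for Python 3 and the standard library only. The names `extract_body` and `fetch_with_retry`, the default of three attempts, and the one-second delay are assumptions for illustration and are not part of the original script.

```python
import re
import time


def extract_body(page_text, marker=r"BAIDU_CLB.*;"):
    """Return the text after the first marker match, or None if there is no match."""
    parts = re.split(marker, page_text)
    # Like the original, take the piece right after the first match,
    # but report a missing marker instead of raising IndexError.
    return parts[1] if len(parts) > 1 else None


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times; return None if every call fails.

    `fetch` is any callable that returns a result or raises on failure
    (hypothetical stand-in for the crawler's getEachArticle).
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None
```

A caller can then skip any article for which `fetch_with_retry` returns None, so one unparseable page does not stall the whole issue.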

getpdf.py

#!/usr/bin/env python
#coding=utf-8
"""
Author: Anemone
Filename: getpdf.py
Last modified: 2015-02-20 19:19
E-mail: [email protected]
"""
import reportlab.rl_config
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib import fonts
import copy
from reportlab.platypus import Paragraph, SimpleDocTemplate, flowables
from reportlab.lib.styles import getSampleStyleSheet
import crawler

def writePDF(issue, duzhe):
    """Build duzhe<issue>.pdf: a table of contents followed by every article."""
    reportlab.rl_config.warnOnMissingFontGlyphs = 0
    # Register the Chinese fonts; the font files must be where reportlab can find them.
    pdfmetrics.registerFont(TTFont('song', "simsun.ttc"))
    pdfmetrics.registerFont(TTFont('hei', "msyh.ttc"))
    # Family 'song': regular and italic map to 'song', bold and bold-italic to 'hei'.
    fonts.addMapping('song', 0, 0, 'song')
    fonts.addMapping('song', 0, 1, 'song')
    fonts.addMapping('song', 1, 0, 'hei')
    fonts.addMapping('song', 1, 1, 'hei')

    stylesheet = getSampleStyleSheet()
    # Body text.
    normalStyle = copy.deepcopy(stylesheet['Normal'])
    normalStyle.fontName = 'song'
    normalStyle.fontSize = 11
    normalStyle.leading = 11
    normalStyle.firstLineIndent = 20
    # Column and article titles.
    titleStyle = copy.deepcopy(stylesheet['Normal'])
    titleStyle.fontName = 'song'
    titleStyle.fontSize = 15
    titleStyle.leading = 20
    # Issue title on the first page.
    firstTitleStyle = copy.deepcopy(stylesheet['Normal'])
    firstTitleStyle.fontName = 'song'
    firstTitleStyle.fontSize = 20
    firstTitleStyle.leading = 20
    firstTitleStyle.firstLineIndent = 50
    # Writer and source line under each article title.
    smallStyle = copy.deepcopy(stylesheet['Normal'])
    smallStyle.fontName = 'song'
    smallStyle.fontSize = 8
    smallStyle.leading = 8

    story = []
    story.append(Paragraph("读者{0}期".format(issue), firstTitleStyle))
    # Table of contents: each column name followed by its article titles.
    for eachColumn in duzhe:
        story.append(Paragraph('__' * 28, titleStyle))
        story.append(Paragraph('{0}'.format(eachColumn), titleStyle))
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph(eachArticle["title"], normalStyle))
    story.append(flowables.PageBreak())
    # Articles: title, writer and source, then the body split into paragraphs
    # on the two full-width spaces that indent each paragraph in the source text.
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            story.append(Paragraph("{0}".format(eachArticle["title"]), titleStyle))
            story.append(Paragraph(" {0} {1}".format(eachArticle["writer"], eachArticle["from"]), smallStyle))
            para = eachArticle["context"].split("　　")
            for eachPara in para:
                story.append(Paragraph(eachPara, normalStyle))
            story.append(flowables.PageBreak())
    #story.append(Paragraph("context",normalStyle))
    doc = SimpleDocTemplate("duzhe" + issue + ".pdf")
    print "Writing PDF..."
    doc.build(story)

def main(issue):
    duzhe = crawler.getCatalog(issue)
    writePDF(issue, duzhe)

if __name__ == '__main__':
    issue = raw_input("Enter issue(201501):")
    main(issue)
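
reportlab's Paragraph treats its text as XML-like markup, so article text that happens to contain characters such as `<` or `&` can be misread as tags or entities. A minimal sketch of preparing the body text before it is handed to Paragraph, written for Python 3 with the standard library only, is shown below. The helper name `to_paragraph_markup` and the choice to drop empty pieces are assumptions for illustration and are not part of the original script.

```python
from xml.sax.saxutils import escape

# Two full-width (ideographic) spaces, the paragraph indent used in the article text.
PARA_INDENT = "\u3000\u3000"


def to_paragraph_markup(context):
    """Split article text into paragraphs and escape each one for Paragraph()."""
    pieces = [p.strip() for p in context.split(PARA_INDENT)]
    # Drop empty pieces (for example the one before the first indent),
    # then escape &, < and > so they are rendered literally.
    return [escape(p) for p in pieces if p]
```

Each returned string can be passed to `Paragraph(text, normalStyle)` in place of the raw split result.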

That is all for this article; I hope you find it useful.

Original article: http://www.jb51.net/article/61975.htm
