Parsing HTML with Python to extract data and generate a Word document: a worked example

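Introduction

Today I tried using Python to scrape web page content and generate a Word document from it. The feature is simple; I am writing it down so I can refer to it later.

Generating the Word file relies on the third-party package python-docx, so that has to be installed first. Python installed on Windows does not include the setuptools module by default, so setuptools must be installed before python-docx.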

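Installation

1. Find https://bootstrap.pypa.io/ez_setup.py via the official Python site, save the code locally, and run it: python ez_setup.py

2. Download python-docx (https://pypi.python.org/pypi/python-docx/0.7.4). When the download finishes, unpack it, change into XXX\python-docx-0.7.4, and install python-docx: python setup.py install

python-docx is now installed and can be used to work with Word documents. For generating the document I followed https://python-docx.readthedocs.org/en/latest/index.html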

HTML parsing uses SGMLParser from sgmllib; fetching URL content uses urllib and urllib2.
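sgmllib and urllib2 exist only in Python 2. As a rough sketch of the same link-collecting idea on Python 3, the standard library's html.parser.HTMLParser can be used instead. The class name LinkCollector, the sample markup, and the depth counter are assumptions made for illustration and are not part of the original script. The depth counter is there so that a nested closing div does not end collection early.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags found inside a div with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # 0 means "outside the target div"
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the target
            elif attrs.get("class") == self.target_class:
                self.depth = 1           # entering the target div
        elif tag == "a" and self.depth and attrs.get("href"):
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<div class="ChairmanCont Bureau">
  <div><a href="http://example.com/a" target="_blank">A</a></div>
  <a href="http://example.com/b">B</a>
</div>
<a href="http://example.com/ignored">outside</a>
"""

collector = LinkCollector("ChairmanCont Bureau")
collector.feed(SAMPLE)
collector.close()
print(collector.urls)
```

Only the href attribute is kept, so other attributes such as target never end up in the URL list.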

Implementation

# -*- coding: cp936 -*-
from sgmllib import SGMLParser
import os
import sys
import urllib
import urllib2
from docx import Document
from docx.shared import Inches
import time

## Collect the URLs to parse
class GetUrl(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.start = False
        self.urlArr = []

    def start_div(self, attr):
        for name, value in attr:
            if value == "ChairmanCont Bureau":  # fixed value taken from the page source
                self.start = True

    def end_div(self):
        self.start = False

    def start_a(self, attr):
        if self.start:
            for name, value in attr:
                if name == "href":  # keep only the link target, not other attributes
                    self.urlArr.append(value)

    def getUrlArr(self):
        return self.urlArr

## Parse the URLs collected above and pull out the useful data
class getManInfo(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.start = False
        self.p = False
        self.dl = False
        self.manInfo = []   # one list per <dl> block
        self.subInfo = []   # items of the <dl> currently being read

    def start_div(self, attr):
        for name, value in attr:
            if value == "SpeakerInfo":  # fixed value taken from the page source
                self.start = True

    def end_div(self):
        self.start = False

    def start_p(self, attr):
        if self.dl:
            self.p = True

    def end_p(self):
        self.p = False

    def start_img(self, attr):
        if self.dl:
            # Every attribute value is stored; the first item of the
            # record is later treated as the image URL.
            for name, value in attr:
                self.subInfo.append(value)

    def handle_data(self, data):
        if self.p:
            self.subInfo.append(data.decode('utf-8'))

    def start_dl(self, attr):
        if self.start:
            self.dl = True

    def end_dl(self):
        self.manInfo.append(self.subInfo)
        self.subInfo = []
        self.dl = False

    def getManInfo(self):
        return self.manInfo

urlSource = "http://www.XXX"
sourceData = urllib2.urlopen(urlSource).read()
startTime = time.clock()

## get urls
getUrl = GetUrl()
getUrl.feed(sourceData)
urlArr = getUrl.getUrlArr()
getUrl.close()
print "get url use:" + str((time.clock() - startTime))
startTime = time.clock()

## get maninfos
manInfos = getManInfo()
for url in urlArr:  # one url one person
    data = urllib2.urlopen(url).read()
    manInfos.feed(data)
infos = manInfos.getManInfo()
manInfos.close()
print "get maninfos use:" + str((time.clock() - startTime))

startTime = time.clock()

# word
saveFile = os.path.join(os.getcwd(), "xxx.docx")
doc = Document()

## word title
doc.add_heading("HEAD".decode('gbk'), 0)
p = doc.add_paragraph("HEADCONTENT:".decode('gbk'))

## write info
imgDir = os.path.join(os.getcwd(), "imgs")
for infoArr in infos:
    for i, info in enumerate(infoArr):
        if i == 0:  # img url
            suffix = info.split('.')[-1]   # file extension
            prefix = info.split('/')[-2]   # parent directory name in the URL
            imgFile = os.path.join(imgDir, prefix + "." + suffix)
            if not os.path.exists(imgDir):
                os.mkdir(imgDir)
            imgData = urllib2.urlopen(info).read()
            try:
                # Save the image to a temporary file, embed it, then delete it.
                with open(imgFile, 'wb') as f:
                    f.write(imgData)
                doc.add_picture(imgFile, width=Inches(1.25))
                os.remove(imgFile)
            except Exception as err:
                print (err)
        elif i == 1:  # second item becomes the heading
            doc.add_heading(info + ":", level=1)
        else:  # remaining items become bullets
            # 'ListBullet' is the style id used with python-docx 0.7.4;
            # later releases look styles up by name ('List Bullet').
            doc.add_paragraph(info, style='ListBullet')
doc.save(saveFile)
print "word use:" + str((time.clock() - startTime))
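The loop above relies on a positional convention: the first item of each record is the image URL, the second becomes the heading, and everything after it becomes a bullet. A small Python 3 sketch of that convention, with the file-name rule pulled into its own function, might look like the following. The names image_filename and plan_record and the sample record are hypothetical. The sketch only builds a list of planned steps and does not call python-docx.

```python
import posixpath
from urllib.parse import urlsplit


def image_filename(url):
    """Name the local file after the URL's parent directory plus its extension."""
    path = urlsplit(url).path
    parent = posixpath.basename(posixpath.dirname(path))
    ext = posixpath.splitext(path)[1]      # includes the leading dot, e.g. ".jpg"
    return parent + ext


def plan_record(record):
    """Turn [img_url, name, detail, ...] into (step, text) pairs."""
    img_url, name, *details = record
    steps = [("picture", image_filename(img_url)), ("heading", name + ":")]
    steps += [("bullet", text) for text in details]
    return steps


# Hypothetical record in the shape produced by the parser.
record = ["http://example.com/speakers/1024/photo.jpg", "Alice", "Engineer", "Keynote"]
for step in plan_record(record):
    print(step)
```

Each planned step corresponds to one call in the script: "picture" to doc.add_picture, "heading" to doc.add_heading, and "bullet" to doc.add_paragraph.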

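Summary

That covers this worked example of parsing HTML with Python to extract data and generating a Word document from it. I hope it is useful; if you spot anything missing or wrong, please leave a comment.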
