Parsing HTML is an important data-processing step after crawling. Below I record several ways to parse HTML.

First, a basic helper function, used to fetch the HTML and hand it to a parser, returning the parsed result.

import gzip
import io
import urllib.request  # Python 2 used urllib2 here

# Pass the parsing function in as a parameter, so it is easy to swap out below
def get_html(url, paraser=bs4_paraser):
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Host': 'www.360kan.com',
        'Proxy-Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    if response.code == 200:
        # The body is gzip-compressed (see Accept-Encoding above), so decompress before decoding
        data = io.BytesIO(response.read())
        gzipper = gzip.GzipFile(fileobj=data)
        data = gzipper.read().decode('utf-8')
        value = paraser(data)  # open('E:/h5/haPkY0osd0r5UB.html').read()
        return value
    else:
        return None

# paraser can be any of the parser functions defined in the sections below
value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
for row in value:
    print(row)
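The decompress-then-decode flow in get_html is easy to get wrong, and it can be checked without a network call. A minimal sketch using a synthetic gzip body (the snippet content is made up for illustration):

```python
import gzip
import io

# Build a fake gzip-compressed HTTP body in memory
raw = gzip.compress('<html>demo</html>'.encode('utf-8'))

# Decompress it the same way get_html does: wrap the bytes in a file-like object
gzipper = gzip.GzipFile(fileobj=io.BytesIO(raw))
page = gzipper.read().decode('utf-8')
print(page)  # → <html>demo</html>
```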

1. Parsing with lxml.html

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2
and libxslt. It is unique in that it combines the speed and XML
feature completeness of these libraries with the simplicity of a
native Python API, mostly compatible but superior to the well-known
ElementTree API. The latest release works with all CPython versions
from 2.6 to 3.5. See the introduction for more information about
background and goals of the lxml project. Some common questions are
answered in the FAQ. (From the lxml official site.)
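Before the full parser below, a minimal self-contained XPath sketch. The markup here is made up to mirror the review structure of the real page, not copied from it:

```python
import re
from lxml import etree

page = '''
<div class="yingping-list-wrap">
  <div class="item">
    <div class="g-clear title-wrap">
      <a href="/review/1">Great movie</a>
      <span style="width:80%"></span>
    </div>
  </div>
</div>
'''
doc = etree.HTML(page)
titles = doc.xpath('//div[@class="item"]//a/text()')  # text nodes
hrefs = doc.xpath('//div[@class="item"]//a/@href')    # attribute values
# The score is encoded in the style width, e.g. width:80% -> 80 / 20 = 4.0 stars
score = int(re.search(r'\d+', doc.xpath('//span/@style')[0]).group()) / 20
print(titles, hrefs, score)  # → ['Great movie'] ['/review/1'] 4.0
```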

import re
from lxml import etree

def lxml_parser(page):
    data = []
    doc = etree.HTML(page)
    all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
    for row in all_div:
        # Get each review, i.e. each review item
        all_div_item = row.xpath('.//div[@class="item"]')  # find_all('div', attrs={'class': 'item'})
        for r in all_div_item:
            value = {}
            # The title block of the review
            title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
            value['title'] = title[0].xpath('./a/text()')[0]
            value['title_href'] = title[0].xpath('./a/@href')[0]
            score_text = title[0].xpath('./div/span/span/@style')[0]
            score_text = re.search(r'\d+', score_text).group()
            value['score'] = int(score_text) / 20
            # Post time
            value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
            # How many people liked it
            value['people'] = int(re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
            data.append(value)
    return data

2. Using BeautifulSoup. It is a bit dated by now; I won't say much more, there is plenty of material online.
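A minimal self-contained sketch of the same find_all / attribute-access pattern used in the full parser below (markup made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = ('<div class="item">'
        '<a href="/review/1">Great movie</a>'
        '<span class="time">2016-08-01</span>'
        '<div class="num"><span>123 people liked</span></div>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', attrs={'class': 'item'})
title = item.a.string                                       # tag text
href = item.a['href']                                       # tag attribute
posted = item.find('span', attrs={'class': 'time'}).string
people = int(re.search(r'\d+', item.find('div', attrs={'class': 'num'}).span.string).group())
print(title, href, posted, people)  # → Great movie /review/1 2016-08-01 123
```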

import re
from bs4 import BeautifulSoup

def bs4_paraser(html):
    all_value = []
    value = {}
    soup = BeautifulSoup(html, 'html.parser')
    # The review list section
    all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
    for row in all_div:
        # Get each review, i.e. each review item
        all_div_item = row.find_all('div', attrs={'class': 'item'})
        for r in all_div_item:
            # The title block of the review
            title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
            if title is not None and len(title) > 0:
                value['title'] = title[0].a.string
                value['title_href'] = title[0].a['href']
                score_text = title[0].div.span.span['style']
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Post time
                value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                # How many people liked it
                value['people'] = int(re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
                all_value.append(value)
                value = {}
    return all_value

3. Using SGMLParser, which works through start/end tag handler methods. The parsing flow is clear, but it is a bit tedious, and this case's scenario is not really a good fit for the method. (Note: the sgmllib module that provides SGMLParser was removed in Python 3, so this example runs only on Python 2.)

import re
from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class CommentParaser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.__start_div_yingping = False
        self.__start_div_item = False
        self.__start_div_gclear = False
        self.__start_div_ratingwrap = False
        self.__start_div_num = False
        # <a> tag state
        self.__start_a = False
        # <span> state machine (several states)
        self.__span_state = 0
        # collected data
        self.__value = {}
        self.data = []

    def start_div(self, attrs):
        for k, v in attrs:
            if k == 'class' and v == 'yingping-list-wrap':
                self.__start_div_yingping = True
            elif k == 'class' and v == 'item':
                self.__start_div_item = True
            elif k == 'class' and v == 'g-clear title-wrap':
                self.__start_div_gclear = True
            elif k == 'class' and v == 'rating-wrap g-clear':
                self.__start_div_ratingwrap = True
            elif k == 'class' and v == 'num':
                self.__start_div_num = True

    def end_div(self):
        if self.__start_div_yingping:
            if self.__start_div_item:
                if self.__start_div_gclear:
                    if self.__start_div_num or self.__start_div_ratingwrap:
                        if self.__start_div_num:
                            self.__start_div_num = False
                        if self.__start_div_ratingwrap:
                            self.__start_div_ratingwrap = False
                    else:
                        self.__start_div_gclear = False
                else:
                    self.data.append(self.__value)
                    self.__value = {}
                    self.__start_div_item = False
            else:
                self.__start_div_yingping = False

    def start_a(self, attrs):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
            self.__start_a = True
            for k, v in attrs:
                if k == 'href':
                    self.__value['href'] = v

    def end_a(self):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
            self.__start_a = False

    def start_span(self, attrs):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
            if self.__start_div_ratingwrap:
                if self.__span_state != 1:
                    for k, v in attrs:
                        if k == 'class' and v == 'rating':
                            self.__span_state = 1
                        elif k == 'class' and v == 'time':
                            self.__span_state = 2
                else:
                    for k, v in attrs:
                        if k == 'style':
                            score_text = re.search(r'\d+', v).group()
                            self.__value['score'] = int(score_text) / 20
                            self.__span_state = 3
            elif self.__start_div_num:
                self.__span_state = 4

    def end_span(self):
        self.__span_state = 0

    def handle_data(self, data):
        if self.__start_a:
            self.__value['title'] = data
        elif self.__span_state == 2:
            self.__value['time'] = data
        elif self.__span_state == 4:
            score_text = re.search(r'\d+', data).group()
            self.__value['people'] = int(score_text)

def sgl_parser(html):
    parser = CommentParaser()
    parser.feed(html)
    return parser.data

4. HTMLParser, which works on the same principle as 3; only the methods you override differ, so most of the logic can be shared.
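The override points are easiest to see on a toy parser first. A minimal stdlib-only sketch that collects link targets and texts using the same handle_starttag / handle_data / handle_endtag pattern as the full class below:

```python
from html.parser import HTMLParser  # stdlib in Python 3

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.in_a = False   # are we currently inside an <a> tag?
        self.href = None    # href of the <a> we are inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            for k, v in attrs:
                if k == 'href':
                    self.href = v

    def handle_data(self, data):
        # Text between tags arrives here; keep it only when inside an <a>
        if self.in_a:
            self.links.append((self.href, data))

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

p = LinkCollector()
p.feed('<p><a href="/x">one</a> and <a href="/y">two</a></p>')
print(p.links)  # → [('/x', 'one'), ('/y', 'two')]
```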

import re
from html.parser import HTMLParser  # Python 2 named this module HTMLParser

class CommentHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__start_div_yingping = False
        self.__start_div_item = False
        self.__start_div_gclear = False
        self.__start_div_ratingwrap = False
        self.__start_div_num = False
        # <a> tag state
        self.__start_a = False
        # <span> state machine (several states)
        self.__span_state = 0
        # collected data
        self.__value = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_div_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_div_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_div_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_div_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_div_num = True
        elif tag == 'a':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v
        elif tag == 'span':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                if self.__start_div_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        for k, v in attrs:
                            if k == 'style':
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                                self.__span_state = 3
                elif self.__start_div_num:
                    self.__span_state = 4

    def handle_endtag(self, tag):
        if tag == 'div':
            if self.__start_div_yingping:
                if self.__start_div_item:
                    if self.__start_div_gclear:
                        if self.__start_div_num or self.__start_div_ratingwrap:
                            if self.__start_div_num:
                                self.__start_div_num = False
                            if self.__start_div_ratingwrap:
                                self.__start_div_ratingwrap = False
                        else:
                            self.__start_div_gclear = False
                    else:
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_div_item = False
                else:
                    self.__start_div_yingping = False
        elif tag == 'a':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                self.__start_a = False
        elif tag == 'span':
            self.__span_state = 0

    def handle_data(self, data):
        if self.__start_a:
            self.__value['title'] = data
        elif self.__span_state == 2:
            self.__value['time'] = data
        elif self.__span_state == 4:
            score_text = re.search(r'\d+', data).group()
            self.__value['people'] = int(score_text)

def html_parser(html):
    parser = CommentHTMLParser()
    parser.feed(html)
    return parser.data

Methods 3 and 4 really are not a great fit for this case, but since I have some spare time I am recording them here for study and reference.

That is everything I wanted to share about these several ways of parsing HTML in Python 3; I hope it can serve as a useful reference.
