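Parsing HTML in Java: A First Try with HTMLParser

To scrape data from a web page, I tried building a small crawler with HTMLParser.

Below is a small example that scrapes the content of a forum thread.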

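1. Introduction to HtmlParser

htmlparser is a library written in pure Java for parsing HTML, used mainly to transform or extract HTML. It is a good choice for analyzing fetched web pages; unfortunately, reference documentation is scarce.
Project home page: http://htmlparser.sourceforge.net/
API documentation: http://htmlparser.sourceforge.net/javadoc/index.html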

2. Setting up the Maven project

Add the required dependencies.

pom.xml

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.fancy</groupId>
        <artifactId>htmlParser</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <dependencies>
            <dependency>
                <groupId>org.htmlparser</groupId>
                <artifactId>htmlparser</artifactId>
                <version>2.1</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.12</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
    </project>

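2.1 Creating a parser

Use a Parser to fetch and analyze a web page.

The parser does not execute the asynchronous requests a page makes. After fetching the page, it parses the whole page into a DOM tree stored as nodes and tags of various kinds, and we can then use filters to select the nodes we want.

htmlparser ships with the following node types: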

org.htmlparser
Interface Node

All Superinterfaces:

Cloneable

All Known Subinterfaces:

Remark, Tag, Text

All Known Implementing Classes:

AbstractNode, AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, RemarkNode, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TextNode, TitleTag

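After a page is parsed, what you get are these nodes and the parent-child containment relationships between them.

Every node provides the methods below. Many node types add more of their own; LinkTag, for example, has methods for getting the link's URL, checking the URL's protocol type, and so on.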

Method Summary
`void`	`accept(NodeVisitor visitor)` Apply the visitor to this node.
`Object`	`clone()` Allow cloning of nodes.
`void`	`collectInto(NodeList list,NodeFilter filter)` Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
`void`	`doSemanticAction()` Perform the meaning of this tag.
`NodeList`	`getChildren()` Get the children of this node.
`int`	`getEndPosition()` Gets the ending position of the node.
`Node`	`getFirstChild()` Get the first child of this node.
`Node`	`getLastChild()` Get the last child of this node.
`Node`	`getNextSibling()` Get the next sibling to this node.
`Page`	`getPage()` Get the page this node came from.
`Node`	`getParent()` Get the parent of this node.
`Node`	`getPreviousSibling()` Get the previous sibling to this node.
`int`	`getStartPosition()` Gets the starting position of the node.
`String`	`getText()` Returns the text of the node.
`void`	`setChildren(NodeList children)` Set the children of this node.
`void`	`setEndPosition(int position)` Sets the ending position of the node.
`void`	`setPage(Page page)` Set the page this node came from.
`void`	`setParent(Node node)` Sets the parent of this node.
`void`	`setStartPosition(int position)` Sets the starting position of the node.
`void`	`setText(String text)` Sets the string contents of the node.
`String`	`toHtml()` Return the HTML for this node.
`String`	`toHtml(boolean verbatim)` Return the HTML for this node.
`String`	`toPlainTextString()` A string representation of the node.
`String`	`toString()` Return the string representation of the node.

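Node filters can select nodes by node type or by the parent-child relationships between nodes, and you can also define your own. Several filters can be combined into a composite filter for multi-condition filtering,

for example with AndFilter, NotFilter, OrFilter, and XorFilter.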

Class Summary
AndFilter	Accepts nodes matching all of its predicate filters (AND operation).
CssSelectorNodeFilter	A NodeFilter that accepts nodes based on whether they match a CSS2 selector.
HasAttributeFilter	This class accepts all tags that have a certain attribute, and optionally, with a certain value.
HasChildFilter	This class accepts all tags that have a child acceptable to the filter.
HasParentFilter	This class accepts all tags that have a parent acceptable to another filter.
HasSiblingFilter	This class accepts all tags that have a sibling acceptable to another filter.
IsEqualFilter	This class accepts only one specific node.
LinkRegexFilter	This class accepts tags of class LinkTag that contain a link matching a given regex pattern.
LinkStringFilter	This class accepts tags of class LinkTag that contain a link matching a given pattern string.
NodeClassFilter	This class accepts all tags of a given class.
NotFilter	Accepts all nodes not acceptable to it's predicate filter.
OrFilter	Accepts nodes matching any of its predicates filters (OR operation).
RegexFilter	This filter accepts all string nodes matching a regular expression.
StringFilter	This class accepts all string nodes containing the given string.
TagNameFilter	This class accepts all tags matching the tag name.
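
To make the AND/OR/NOT/XOR semantics of composite filters concrete, here is a minimal sketch. It deliberately does not call htmlparser: `java.util.function.Predicate` (Java 8+, newer than the JDK used for this project) stands in for `NodeFilter`, plain strings stand in for tag text, and the class name `FilterCompositionSketch` and the sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Stand-in model of filter composition; not htmlparser code.
class FilterCompositionSketch {

    // Accepts tag text that contains the given fragment.
    static Predicate<String> contains(String fragment) {
        return text -> text.contains(fragment);
    }

    // XOR: accepts when exactly one of the two filters accepts.
    static Predicate<String> xor(Predicate<String> a, Predicate<String> b) {
        return text -> a.test(text) ^ b.test(text);
    }

    // Keeps only the items the filter accepts, preserving order.
    static List<String> select(List<String> items, Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (String item : items) {
            if (filter.test(item)) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical tag texts, not taken from a real page.
        List<String> tags = Arrays.asList(
                "div id=\"r_123456\" class=\"cell\"",
                "div class=\"cell\"",
                "span id=\"r_654321\"",
                "a href=\"/t/1\"");

        Predicate<String> isDiv = text -> text.startsWith("div");
        Predicate<String> hasReplyId = contains("id=\"r_");

        System.out.println(select(tags, isDiv.and(hasReplyId)));  // AND
        System.out.println(select(tags, isDiv.or(hasReplyId)));   // OR
        System.out.println(select(tags, isDiv.negate()));         // NOT
        System.out.println(select(tags, xor(isDiv, hasReplyId))); // XOR
    }
}
```

In htmlparser itself, the composition is written by passing filters to the composite filter's constructor; the `getNodelist` method later in this post does exactly that with `new OrFilter(filter, replyfilter)`.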

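Scraping a thread from http://www.v2ex.com

First, fetch the page content and analyze the structure of its elements in order to build the filters.

Every reply div has an id of the form r_ followed by six digits, so matching it with a regular expression is recommended. The topic div has the style border-bottom: 0px. Be sure to verify what each filter returns, so that no extra nodes slip in.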
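
The reply-id rule described above can be expressed as a regular expression. Here is a minimal sketch using only `java.util.regex`; the class name `ReplyIdMatcher`, the method `isReplyTag`, and the sample tag strings are assumptions for illustration, and the input is assumed to be tag text of the kind `Node.getText()` returns (tag name plus attributes).

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of htmlparser.
class ReplyIdMatcher {

    // id="r_" followed by exactly six digits, as observed on the page.
    private static final Pattern REPLY_ID = Pattern.compile("id=\"r_\\d{6}\"");

    // True if the tag text carries a reply id such as id="r_123456".
    static boolean isReplyTag(String tagText) {
        return tagText != null && REPLY_ID.matcher(tagText).find();
    }

    public static void main(String[] args) {
        System.out.println(isReplyTag("div id=\"r_123456\" class=\"cell\"")); // true
        System.out.println(isReplyTag("div id=\"r_12\" class=\"cell\""));     // false
        System.out.println(isReplyTag("div id=\"reply-box\""));               // false
    }
}
```

Inside a custom `NodeFilter`, the `accept` method could return `ReplyIdMatcher.isReplyTag(node.getText())` instead of a plain `contains` check, which would also reject ids that merely start with `r_`.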

Create a method that returns the topic node and the reply nodes:

    /**
     * Get the topic node and all reply nodes from the HTML.
     *
     * @param url    address of the thread page
     * @param ENCODE character encoding of the page
     * @return the matching nodes, or null if parsing fails
     */
    protected NodeList getNodelist(String url, String ENCODE) {
        try {
            Parser parser = new Parser(url);
            parser.setEncoding(ENCODE);
            // Filter that accepts the topic div
            NodeFilter filter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().contains("style=\"border-bottom: 0px;\"");
                }
            };
            // Filter that accepts every reply div
            NodeFilter replyfilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().contains("id=\"r_");
                }
            };
            // Combine the two filters
            OrFilter allFilter = new OrFilter(filter, replyfilter);
            return parser.extractAllNodesThatMatch(allFilter);
        } catch (ParserException e) {
            e.printStackTrace();
            return null;
        }
    }

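Now that we have these nodes, the next step is to parse them.

The example code only extracts some of the elements. The rest is legwork: work out the node relationships step by step and locate the target nodes with filters or by walking the DOM tree.

The code below packs the parsed node data into beans: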

    public Forum parse2Thread(String url, String ENCODE) {
        List<Reply> replylist = new ArrayList<Reply>(); // list of replies
        Topic topic = new Topic(); // the topic
        NodeFilter divFilter = new NodeClassFilter(Div.class); // div filter
        NodeFilter headingFilter = new NodeClassFilter(HeadingTag.class); // heading filter
        NodeFilter tagFilter = new NodeClassFilter(TagNode.class); // filter matching any tag
        NodeList nodeList = this.getNodelist(url, ENCODE);
        if (nodeList == null) {
            return null; // getNodelist failed to fetch or parse the page
        }
        // Map the nodes onto the thread entities
        for (int i = 0; i < nodeList.size(); i++) {
            Node node = nodeList.elementAt(i);
            if (node.getText().contains("style=\"border-bottom: 0px;\"")) { // node is the topic
                NodeList list = node.getChildren(); // children of the node
                // header div
                Node headerNode = list.extractAllNodesThatMatch(new NodeClassFilter(Div.class)).elementAt(0);
                // thread title
                Node h1Node = headerNode.getChildren().extractAllNodesThatMatch(headingFilter).elementAt(0);
                topic.setTopicName(h1Node.toPlainTextString());
                // poster information
                NodeList headerChrildrens = headerNode.getChildren();
                topic.setAnn_name(headerChrildrens.elementAt(15).toPlainTextString());
                topic.setTopicDescribe(headerChrildrens.elementAt(16).toPlainTextString());
                // poster avatar URL
                Node frNode = headerChrildrens.extractAllNodesThatMatch(divFilter).elementAt(0);
                ImageTag imgNode = (ImageTag) frNode.getFirstChild().getFirstChild();
                topic.setAnn_img(imgNode.getImageURL());
                // cell div
                Node cellNode = list.extractAllNodesThatMatch(divFilter).elementAt(1);
                Node topic_content = cellNode.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
                Node markdown_body = topic_content.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
                // plain text only for now; links and images are not included
                topic.setTopicBody(markdown_body.toPlainTextString());
            } else if (node.getText().contains("id=\"r_")) { // node is a reply
                Reply reply = new Reply();
                Node tableNode = node.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
                Node trNode = tableNode.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
                // tag nodes of the reply
                NodeList tagList = trNode.getChildren().extractAllNodesThatMatch(tagFilter);
                ImageTag reply_img = (ImageTag) tagList.elementAt(0).getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
                reply.setReply_img(reply_img.getImageURL());
                //nodeList bodyNode = tagList;
                replylist.add(reply);
            }
        }
        System.out.println("-----------entities----------------");
        Forum forum = new Forum(topic, replylist);
        System.out.println(forum.toString());
        return forum; // the original returned null here
    }
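
The `Topic`, `Reply`, and `Forum` beans used by `parse2Thread` are not shown in this post. A minimal sketch of what they might look like follows. The setter names and the `Forum(topic, replylist)` constructor come from the calls above; the field names, the getters, the `toString` formats, and the demo class `ForumBeansDemo` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point; the class name and sample values are hypothetical.
class ForumBeansDemo {
    public static void main(String[] args) {
        Topic topic = new Topic();
        topic.setTopicName("Example title");
        topic.setTopicBody("Example body");

        Reply reply = new Reply();
        reply.setReply_img("http://example.com/avatar.png");

        List<Reply> replies = new ArrayList<Reply>();
        replies.add(reply);

        System.out.println(new Forum(topic, replies));
    }
}

// The opening post. Setter names follow the calls in parse2Thread.
class Topic {
    private String topicName;     // thread title
    private String topicDescribe; // text next to the poster name (meaning assumed)
    private String topicBody;     // plain-text body of the post
    private String ann_name;      // poster name
    private String ann_img;       // poster avatar URL

    public String getTopicName() { return topicName; }
    public void setTopicName(String topicName) { this.topicName = topicName; }
    public String getTopicDescribe() { return topicDescribe; }
    public void setTopicDescribe(String topicDescribe) { this.topicDescribe = topicDescribe; }
    public String getTopicBody() { return topicBody; }
    public void setTopicBody(String topicBody) { this.topicBody = topicBody; }
    public String getAnn_name() { return ann_name; }
    public void setAnn_name(String ann_name) { this.ann_name = ann_name; }
    public String getAnn_img() { return ann_img; }
    public void setAnn_img(String ann_img) { this.ann_img = ann_img; }

    @Override
    public String toString() {
        return "Topic{name=" + topicName + ", author=" + ann_name + ", body=" + topicBody + "}";
    }
}

// One reply; parse2Thread currently fills in only the avatar URL.
class Reply {
    private String reply_img; // replier avatar URL

    public String getReply_img() { return reply_img; }
    public void setReply_img(String reply_img) { this.reply_img = reply_img; }

    @Override
    public String toString() {
        return "Reply{img=" + reply_img + "}";
    }
}

// A thread: one topic plus its replies.
class Forum {
    private final Topic topic;
    private final List<Reply> replies;

    public Forum(Topic topic, List<Reply> replies) {
        this.topic = topic;
        this.replies = replies;
    }

    public Topic getTopic() { return topic; }
    public List<Reply> getReplies() { return replies; }

    @Override
    public String toString() {
        return "Forum{topic=" + topic + ", replies=" + replies + "}";
    }
}
```

In a real project each bean would normally live in its own file as a public class; they are kept in one listing here only to keep the sketch short.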

That covers the parsing. Now write an entry method (here a JUnit test) and try it on a thread:

    @Test
    public void test() throws Exception {
        Html2Domain parse = new Html2DomainImpl();
        parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
    }

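Here is the result of the run:

The output is too long, so the screenshot only shows the thread title and the thread body. If you are interested, run the test yourself. Pay close attention to the URL: thread links on this site seem to expire, so if fetching fails, try a different thread URL.

Project code attached (tested with JDK 1.6 + Eclipse Kepler):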

http://pan.baidu.com/s/1mh9OuDi