Nutch Development (Part 4)

Table of Contents

Nutch Development (Part 4)
- Development environment
- 1. Introduction to the Nutch plugin design
- 2. The plugin directory structure
- 3. build.xml
- 4. ivy.xml
- 5. plugin.xml
- 6. The parse-html plugin
  - HtmlParser
    - setConf(Configuration conf)
    - parse(InputSource input)
    - getParse(Content content)
- 7. The parse-metatags plugin
  - MetaTagsParser
    - The filter method
    - The addIndexedMetatags method
    - Configuring the metadata plugin

Development environment

Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11

Please credit the source when reposting. By 鸭梨的药丸哥

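1. Introduction to the Nutch plugin design

Nutch is highly extensible. Its plugin system is modeled on the Eclipse 2.x plugin system.

Nutch exposes several extension points, each of which is an interface; a plugin is developed by implementing one of them. The extension points Nutch provides are listed below.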

IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).

IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.

HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).

Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.

URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.

URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.

ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.

SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
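To make the idea of an extension point concrete, here is a minimal sketch in plain Java. It deliberately does not use the Nutch API: `SimpleUrlFilter`, `PrefixUrlFilter`, `SuffixRejectFilter`, and `UrlFilterSketch` are hypothetical names invented for illustration. The sketch only mirrors the URLFilter contract, in which a filter returns the URL to accept it or null to reject it, and several filters are applied one after another.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an extension point: one interface, many implementations. */
interface SimpleUrlFilter {
    /** Returns the URL to accept it, or null to reject it. */
    String filter(String url);
}

/** Hypothetical plugin: accepts only URLs that start with a given prefix. */
class PrefixUrlFilter implements SimpleUrlFilter {
    private final String prefix;

    PrefixUrlFilter(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public String filter(String url) {
        return url.startsWith(prefix) ? url : null;
    }
}

/** Hypothetical plugin: rejects URLs that end with a given suffix. */
class SuffixRejectFilter implements SimpleUrlFilter {
    private final String suffix;

    SuffixRejectFilter(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public String filter(String url) {
        return url.endsWith(suffix) ? null : url;
    }
}

class UrlFilterSketch {
    /** Applies the filters in order; the first rejection (null) ends the chain. */
    static String applyAll(List<SimpleUrlFilter> filters, String url) {
        String current = url;
        for (SimpleUrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<SimpleUrlFilter> filters = Arrays.asList(
                new PrefixUrlFilter("https://nutch.apache.org/"),
                new SuffixRejectFilter(".jpg"));
        System.out.println(applyAll(filters, "https://nutch.apache.org/docs.html"));
        System.out.println(applyAll(filters, "https://nutch.apache.org/logo.jpg"));
    }
}
```

In a real plugin the interface would be the extension point provided by Nutch, and the implementation would be registered through plugin.xml rather than instantiated by hand.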

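2. The plugin directory structure

All Nutch plugins share a similar layout, so the parse-html directory serves as a representative example:

/src        # source code
build.xml   # tells Ant how to build the plugin (for example, where the compiled jar goes)
ivy.xml     # Ivy configuration for the plugin (dependency management, the counterpart of Maven's pom.xml)
plugin.xml  # describes the plugin to Nutch (which extension points it implements, the names of the implementing classes, and so on)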

3. build.xml

build.xml tells Ant how to build this plugin.

<project name="parse-html" default="jar-core">

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <!-- the build depends on another plugin -->
    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-nekohtml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <!-- plugins the tests depend on -->
    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

4. ivy.xml

ivy.xml plays the same role as Maven's pom.xml: external dependencies are declared here.

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
    <description>Apache Nutch</description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <!-- add external dependencies here -->
  <dependencies>
    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>
</ivy-module>

5. plugin.xml

<!-- plugin description -->
<plugin
   id="parse-html"
   name="Html Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="parse-html.jar">
         <export name="*"/>
      </library>
      <library name="tagsoup-1.2.1.jar"/>
   </runtime>

   <!-- plugins this plugin imports -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>

   <!-- extension point description -->
   <extension id="org.apache.nutch.parse.html"
              name="HtmlParse"
              point="org.apache.nutch.parse.Parser">

      <!-- id is a unique identifier; class is the implementing class -->
      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
        <!-- parameters -->
        <parameter name="contentType" value="text/html|application/xhtml+xml"/>
        <parameter name="pathSuffix" value=""/>
      </implementation>

   </extension>

</plugin>

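6. The parse-html plugin

HtmlParser

HtmlParser implements the Parser extension point:

public class HtmlParser implements Parser

Methods of the Parser interface:

public ParseResult getParse(Content c)            // parses the fetched content
public void setConf(Configuration configuration)  // receives the Nutch configuration (nutch-default.xml, nutch-site.xml)
public Configuration getConf()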

setConf(Configuration conf)

setConf reads the plugin's settings from the Nutch configuration (nutch-default.xml, overridden by nutch-site.xml). Nutch hands the configuration to the plugin by calling setConf(Configuration conf).

@Override
public void setConf(Configuration conf) {
  this.conf = conf;
  // Create the HtmlParseFilters wrapper; internally it keeps the plugins that
  // implement the extension point in an HtmlParseFilter[] htmlParseFilters array
  this.htmlParseFilters = new HtmlParseFilters(getConf());
  // Name of the parser implementation; defaults to NekoHTML when unset
  this.parserImpl = getConf().get("parser.html.impl", "neko");
  // Default character encoding
  this.defaultCharEncoding = getConf().get(
      "parser.character.encoding.default", "windows-1252");
  // DOM helper
  this.utils = new DOMContentUtils(conf);
  // Caching policy
  this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
      Nutch.CACHING_FORBIDDEN_CONTENT);
}

nutch-default.xml does define parser.html.impl, with the value neko. Even if the property were missing from the configuration, the default passed to get() means NekoHTML would still be used to parse HTML pages.

The build.xml shown earlier pulls in the lib-nekohtml plugin, which is NekoHTML, while ivy.xml declares the tagsoup Ivy dependency, which is TagSoup. Either one can parse HTML pages.

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

parse(InputSource input)

Next, the parse method:

private DocumentFragment parse(InputSource input) throws Exception {
  // Use TagSoup if it is configured; otherwise fall back to NekoHTML
  if ("tagsoup".equalsIgnoreCase(parserImpl))
    return parseTagSoup(input);
  else
    return parseNeko(input);
}
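The lookup-with-default in setConf and the branch in parse can be tried in isolation. The sketch below is only an illustration: it uses java.util.Properties as a stand-in for Hadoop's Configuration, and the class name `ParserSelectionSketch` and its methods are hypothetical. It shows that a missing property, or any value other than "tagsoup", ends up selecting NekoHTML.

```java
import java.util.Properties;

class ParserSelectionSketch {
    /** Mirrors the lookup in setConf: read parser.html.impl, defaulting to "neko". */
    static String parserImpl(Properties conf) {
        return conf.getProperty("parser.html.impl", "neko");
    }

    /** Mirrors the branch in parse(): "tagsoup" in any case selects TagSoup, anything else NekoHTML. */
    static String chooseParser(String parserImpl) {
        return "tagsoup".equalsIgnoreCase(parserImpl) ? "TagSoup" : "NekoHTML";
    }

    public static void main(String[] args) {
        Properties conf = new Properties();                 // property not set
        System.out.println(chooseParser(parserImpl(conf))); // NekoHTML

        conf.setProperty("parser.html.impl", "TagSoup");    // case does not matter
        System.out.println(chooseParser(parserImpl(conf))); // TagSoup
    }
}
```

One consequence of this design is that a misspelled value is not reported as an error; it silently selects NekoHTML.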

getParse(Content content)

Note: the call ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs every plugin that implements the HtmlParseFilter extension point. So when you need to extract data from additional tags in the HTML, you can implement HtmlParseFilter to add your own tag-specific parsing.

public ParseResult getParse(Content content) {
  // HTML meta tags
  HTMLMetaTags metaTags = new HTMLMetaTags();

  // Base URL of the page
  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // Extracted text
  String text = "";
  // Title
  String title = "";
  // Extracted outlinks
  Outlink[] outlinks = new Outlink[0];
  // Metadata
  Metadata metadata = new Metadata();

  // parse the content (root is the resulting DOM tree)
  DocumentFragment root;
  try {
    // Wrap the raw content bytes in a stream
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(
        contentInOctets));

    // Detect the character encoding
    EncodingDetector detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, defaultCharEncoding);

    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Parsing...");
    }
    root = parse(input);
  } catch (IOException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    LOG.error("Error: ", e);
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // get meta directives (extract the meta tags)
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);

  // populate Nutch metadata with HTML meta directives
  metadata.addAll(metaTags.getGeneralTags());

  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }
  // check meta directives
  if (!metaTags.getNoIndex()) { // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting text...");
    }
    // Extract the text inside the tags
    utils.getText(sb, root); // extract text
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting title...");
    }
    // Extract the text of the title tag
    utils.getTitle(sb, root); // extract title
    title = sb.toString().trim();
  }

  if (!metaTags.getNoFollow()) { // okay to follow links
    ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
    URL baseTag = base;
    String baseTagHref = utils.getBase(root);
    if (baseTagHref != null) {
      try {
        baseTag = new URL(base, baseTagHref);
      } catch (MalformedURLException e) {
        baseTag = base;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting links...");
    }
    // Extract the outlinks
    utils.getOutlinks(baseTag, l, root);
    outlinks = l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in "
          + content.getUrl());
    }
  }

  // Build the ParseStatus
  ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
  if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
        Integer.toString(metaTags.getRefreshTime()) });
  }
  // Bundle the parsed data
  ParseData parseData = new ParseData(status, title, outlinks,
      content.getMetadata(), metadata);
  // The parse result
  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
      new ParseImpl(text, parseData));

  // run filters on parse: HtmlParseFilter plugins such as parse-metatags,
  // enabled through configuration
  ParseResult filteredParse = this.htmlParseFilters.filter(content,
      parseResult, metaTags, root);
  if (metaTags.getNoCache()) { // not okay to cache
    for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
      entry.getValue().getData().getParseMeta()
          .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
  }
  return filteredParse;
}
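The kind of work a custom HtmlParseFilter does with the DOM tree it receives can be sketched with the JDK's own DOM classes. This is a hedged illustration rather than a Nutch plugin: `TagTextSketch`, `collect`, and `fragmentOf` are hypothetical names, the input must be well-formed markup because the JDK XML parser stands in for NekoHTML or TagSoup, and a real filter would store the values in the parse metadata instead of a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TagTextSketch {
    /** Depth-first walk that collects the trimmed text of every element named tagName. */
    static void collect(Node node, String tagName, List<String> out) {
        // Compare case-insensitively so upper-cased element names also match
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tagName.equalsIgnoreCase(node.getNodeName())) {
            out.add(node.getTextContent().trim());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, tagName, out);
        }
    }

    /** Helper for the demo only: builds a DocumentFragment from well-formed markup. */
    static DocumentFragment fragmentOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment frag = doc.createDocumentFragment();
            // Append a deep copy so the original document is left untouched
            frag.appendChild(doc.getDocumentElement().cloneNode(true));
            return frag;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment root = fragmentOf(
                "<html><body><h2>First</h2><p>text</p><h2> Second </h2></body></html>");
        List<String> headings = new ArrayList<>();
        collect(root, "h2", headings);
        System.out.println(headings); // [First, Second]
    }
}
```

In an actual plugin, the walk would start from the DocumentFragment passed to the filter method.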

7. The parse-metatags plugin

MetaTagsParser

MetaTagsParser implements the HtmlParseFilter extension point:

public class MetaTagsParser implements HtmlParseFilter

The filter method

public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  // Get the parse for this URL
  Parse parse = parseResult.get(content.getUrl());
  // Get the parse metadata
  Metadata metadata = parse.getData().getParseMeta();

  /*
   * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
   * duplicate metatag values
   */

  // General meta tags as (name, value) pairs
  Metadata generalMetaTags = metaTags.getGeneralTags();
  for (String tagName : generalMetaTags.names()) {
    // Add to the parse metadata, subject to the configuration
    addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
  }

  Properties httpequiv = metaTags.getHttpEquivTags();
  for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames
      .hasMoreElements();) {
    String name = (String) tagNames.nextElement();
    String value = httpequiv.getProperty(name);
    // These are added to the parse metadata as well
    addIndexedMetatags(metadata, name, value);
  }
  return parseResult;
}

The addIndexedMetatags method

This method explains why, when parse-metatags is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.

private void addIndexedMetatags(Metadata metadata, String metatag,
    String value) {
  String lcMetatag = metatag.toLowerCase(Locale.ROOT);
  if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
    }
    metadata.add("metatag." + lcMetatag, value);
  }
}

Configuring the metadata plugin

Compare the configuration below with addIndexedMetatags and it becomes clear why the entries in index.parse.md need the metatag. prefix:

<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <!-- the metadata produced by addIndexedMetatags carries the metatag. prefix -->
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
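The relationship between metatags.names and the keys that end up in the parse metadata can be reproduced with a small stand-alone sketch. Everything here is a simplification under stated assumptions: `MetatagPrefixSketch` and its methods are hypothetical, a plain Map replaces Nutch's multi-valued Metadata, and the way the comma-separated value is split is assumed rather than copied from the plugin. Only the selection rule and the "metatag." prefix follow addIndexedMetatags.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class MetatagPrefixSketch {
    /** Assumed parsing of a metatags.names value such as "description,keywords". */
    static Set<String> parseNames(String value) {
        Set<String> names = new HashSet<>();
        for (String n : value.split(",")) {
            String t = n.trim().toLowerCase(Locale.ROOT);
            if (!t.isEmpty()) {
                names.add(t);
            }
        }
        return names;
    }

    /** Same rule as addIndexedMetatags: keep selected tags under "metatag." + lower-cased name. */
    static void addIndexedMetatag(Map<String, String> parseMeta,
            Set<String> metatagset, String metatag, String value) {
        String lcMetatag = metatag.toLowerCase(Locale.ROOT);
        if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
            parseMeta.put("metatag." + lcMetatag, value);
        }
    }

    public static void main(String[] args) {
        Set<String> metatagset = parseNames("description,keywords");
        Map<String, String> parseMeta = new LinkedHashMap<>();

        addIndexedMetatag(parseMeta, metatagset, "Description", "A sample page");
        addIndexedMetatag(parseMeta, metatagset, "author", "someone"); // not selected, skipped

        // The only key stored is metatag.description, which is the name
        // index.parse.md has to list for the value to be indexed.
        System.out.println(parseMeta.keySet());
    }
}
```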
