Overview

This section describes the motivation behind Web-Harvest and the notions and concepts it uses.

Rationale

The World Wide Web, though by far the largest knowledge base, is rarely regarded as a database in the traditional sense, that is, as a source of information used for further computing. Web-Harvest is inspired by the practical need to have the right data at the right time. Very often, the Web is the only source that publicly provides the wanted information.

Basic concept

The main goal behind Web-Harvest is to make already existing extraction technologies easier to use. Its purpose is not to propose a new method, but to provide a way to easily use and combine the existing ones. Web-Harvest offers a set of processors for data handling and control flow. Each processor can be regarded as a function: it has zero or more input parameters and gives a result after execution. Processors can be combined in a pipeline, forming a chain of execution. For easier manipulation and data reuse, Web-Harvest provides a variable context where named variables are stored. The following diagram describes one pipeline execution:

[Diagram: one pipeline execution]

The result of an extraction is available in the files created during execution, or from the variable context if Web-Harvest is used programmatically.

Configuration language

Every extraction process is defined in one or more configuration files, using a simple XML-based language. Each processor is described by a specific XML element or a structure of XML elements. For illustration, here is an example of a configuration file:
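A configuration matching the description in this section could look like the sketch below. The two templated attributes, type="binary", the variable name urlList, the loop variables link and i, and the processor names http, html-to-xml, xpath, loop and file are taken from the text. The remaining element names (config, var-def, var, list, body), the other attribute names, and the XPath expression //img/@src are assumptions and should be checked against the User manual.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">

    <!-- First pipeline: download the page, clean it to XHTML,
         select the image URLs, and store them in "urlList". -->
    <var-def name="urlList">
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://news.bbc.co.uk"/>
            </html-to-xml>
        </xpath>
    </var-def>

    <!-- Second pipeline: iterate over the URLs and save each image. -->
    <loop item="link" index="i">
        <list>
            <var name="urlList"/>
        </list>
        <body>
            <file action="write" type="binary" path="images/${i}.gif">
                <http url="${sys.fullUrl('http://news.bbc.co.uk', link)}"/>
            </file>
        </body>
    </loop>

</config>
```

Reading each pipeline from the innermost element outwards gives the order of execution: the result of an inner processor is the input of the processor that encloses it.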

This configuration contains two pipelines. The first pipeline performs the following steps:

1.  HTML content at http://news.bbc.co.uk is downloaded,

2.  HTML cleaning is performed on the downloaded content, producing XHTML,

3.  An XPath expression is evaluated, giving the sequence of URLs of the page images,

4.  A new variable named "urlList" is defined, containing the sequence of image URLs.

The second pipeline uses the result of the previous execution to collect all page images. The loop processor iterates over the URL sequence and, for every item:

1.  Downloads the image at the current URL,

2.  Stores the image on the file system.

This example illustrates some procedural-language elements of Web-Harvest, such as variable definition and list iteration, a few data management processors (file and http) and a couple of HTML/XML processing instructions (the html-to-xml and xpath processors).

For a slightly more complex example of image download, where some other features of Web-Harvest are used, see the Examples page. For technical coverage of the supported processors, see the User manual.

Data values

All data produced and consumed during the extraction process in Web-Harvest has three representations: text, binary and list. There is also a special data value, empty, whose textual representation is an empty string, whose binary representation is an empty byte array, and whose list representation is a zero-length list. Which form is used depends on the processor that consumes the data. In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it to XHTML, the loop processor uses the variable urlList as a list in order to iterate over it, and the file processor treats the downloaded images as binary data when saving them to files.

In most cases Web-Harvest chooses the proper representation of the data. In some situations, however, it must be stated explicitly which one to use. One example is the file processor, where the default data type is text and binary content must be specified explicitly with type="binary".
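The difference between the default text type and the explicit binary type can be sketched with two file processors. The attribute action="write", the output paths and the variable imageUrl are assumptions made for illustration; only type="binary" and the processor names come from the text.

```xml
<!-- Default type (text): suitable for saving the cleaned page. -->
<file action="write" path="page.xml">
    <html-to-xml>
        <http url="http://news.bbc.co.uk"/>
    </html-to-xml>
</file>

<!-- Image bytes: the binary representation must be requested explicitly. -->
<!-- "imageUrl" is a hypothetical variable holding the address of one image. -->
<file action="write" type="binary" path="images/picture.gif">
    <http url="${imageUrl}"/>
</file>
```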

Variables

Web-Harvest provides the variable context for storing and using variables. Unlike most programming languages, it imposes no naming convention on variables. Thus, names like arr[1], 100 or #$& are valid. However, if such variables were used in scripts or templates (see the next section), where expressions are dynamically evaluated, an exception would be thrown. It is therefore recommended to follow the usual programming-language naming rules to avoid any difficulties.
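As a small illustration of the naming advice, the fragment below contrasts an unusual name with a conventional one. The element var-def, the attribute action="write", and the names and values used are assumptions chosen for the sketch.

```xml
<!-- Accepted by the variable context, but per the text above a dynamically
     evaluated expression such as ${arr[1]} would throw an exception. -->
<var-def name="arr[1]">first</var-def>

<!-- A conventional identifier can be referenced safely in a template. -->
<var-def name="pageName">index</var-def>
<file action="write" path="out/${pageName}.txt">
    <var name="pageName"/>
</file>
```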

When Web-Harvest is used programmatically (from Java code, not from the command line), the user may initialize the variable context in order to add custom values and functionality. Similarly, after execution, the variable context is available for reading variables from it.

When a user-defined function is called (see the User manual), a separate local variable context is created, as in many programming languages, including Java. The valid way to exchange data between the caller and the called function is through the function parameters.
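A function definition and its call might look like the following sketch. The element names function, return, call and call-param are recalled from the User manual and should be verified there; the function name, the parameter name imageUrl and the example.com address are hypothetical.

```xml
<!-- The function sees only its own local context: "imageUrl" arrives as a parameter. -->
<function name="download-image">
    <return>
        <http url="${imageUrl}"/>
    </return>
</function>

<!-- The caller passes data in through call-param and receives the returned value. -->
<var-def name="logo">
    <call name="download-image">
        <call-param name="imageUrl">http://www.example.com/logo.gif</call-param>
    </call>
</var-def>
```

Variables defined in the caller are not visible inside download-image, which is why the URL is passed as a parameter instead of being read from the outer context.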

Scripting and templating

Before Web-Harvest 0.5, the templating mechanism was based on OGNL (Object-Graph Navigation Language). In version 0.5 OGNL was replaced by BeanShell, and starting from version 1.0 multiple scripting languages are supported, giving developers the freedom to choose their favourite one.

Besides the set of powerful text and XML manipulation processors, Web-Harvest supports real scripting languages whose code can easily be integrated within scraper configurations. The languages currently supported are BeanShell, Groovy and Javascript. BeanShell is probably the closest to Java in syntax and power, but Groovy and Javascript have some other advantages. It is up to the developer to use the preferred language, or even to mix different languages in a single configuration.
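Mixing languages in one configuration could be sketched as follows. The script element, its language attribute, the scriptlang attribute on config, and the sys.defineVariable call are assumptions recalled from the User manual and should be confirmed there; the variable names are hypothetical.

```xml
<config charset="UTF-8" scriptlang="beanshell">

    <var-def name="title">web-harvest</var-def>

    <!-- Uses the configuration-wide language (assumed here to be BeanShell). -->
    <script><![CDATA[
        String upper = title.toString().toUpperCase();
        sys.defineVariable("upperTitle", upper);
    ]]></script>

    <!-- Overrides the language for a single block. -->
    <script language="javascript"><![CDATA[
        var total = 1 + 2;
    ]]></script>

</config>
```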

Templating allows marked parts of the text (text "islands" surrounded with ${ and }) to be evaluated. Evaluation is performed using the chosen scripting language. In Web-Harvest, the attributes of all elements are implicitly passed to the templating engine. In the configuration above, there are two places where the templater does its job:

1.  path="images/${i}.gif" in the file processor, producing file names based on the loop index,

2.  url="${sys.fullUrl('http://news.bbc.co.uk', link)}" in the http processor, where built-in functionality is called to calculate the full URL of the image (see the User manual for all built-in objects).
