I. Overview of the Crawl Workflow
1. The Nutch crawl workflow
When a crawl task is run with the crawl command, the basic steps are as follows:
(1)InjectorJob
First iteration begins
(2)GeneratorJob
(3)FetcherJob
(4)ParserJob
(5)DbUpdaterJob
(6)SolrIndexerJob
Second iteration begins
(2)GeneratorJob
(3)FetcherJob
(4)ParserJob
(5)DbUpdaterJob
(6)SolrIndexerJob
Third iteration begins
...
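The iteration above can be sketched as a shell script driving the individual bin/nutch sub-commands. This is a rough sketch, not the crawl script shipped with Nutch; the crawl id, Solr URL, and iteration count are placeholder values:

```shell
#!/bin/sh
# Sketch of the crawl cycle: inject once, then iterate
# generate -> fetch -> parse -> updatedb -> solrindex.
# CRAWL_ID, SOLR_URL, and ITERATIONS are placeholder values.
CRAWL_ID=test
SOLR_URL=http://localhost:8983/solr/
ITERATIONS=3

bin/nutch inject urls -crawlId $CRAWL_ID              # InjectorJob (runs once)

i=1
while [ $i -le $ITERATIONS ]; do
  bin/nutch generate -topN 50000 -crawlId $CRAWL_ID   # GeneratorJob
  bin/nutch fetch -all -crawlId $CRAWL_ID             # FetcherJob
  bin/nutch parse -all -crawlId $CRAWL_ID             # ParserJob
  bin/nutch updatedb -crawlId $CRAWL_ID               # DbUpdaterJob
  bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID  # SolrIndexerJob
  i=$((i + 1))
done
```

Each loop body line corresponds to one job in the flow above; Section II below runs the same commands one at a time.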

2. Crawl log

When crawling with the crawl command, the console output looks like this:
InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
-finishing thread FetcherThread45, activeThreads=46
-finishing thread FetcherThread44, activeThreads=45
-finishing thread FetcherThread43, activeThreads=44
-finishing thread FetcherThread42, activeThreads=43
-finishing thread FetcherThread41, activeThreads=42
-finishing thread FetcherThread40, activeThreads=41
-finishing thread FetcherThread39, activeThreads=40
-finishing thread FetcherThread38, activeThreads=39
-finishing thread FetcherThread37, activeThreads=38
-finishing thread FetcherThread36, activeThreads=37
-finishing thread FetcherThread35, activeThreads=36
-finishing thread FetcherThread34, activeThreads=35
-finishing thread FetcherThread33, activeThreads=34
-finishing thread FetcherThread32, activeThreads=33
-finishing thread FetcherThread31, activeThreads=32
-finishing thread FetcherThread30, activeThreads=31
-finishing thread FetcherThread29, activeThreads=30
-finishing thread FetcherThread48, activeThreads=29
-finishing thread FetcherThread27, activeThreads=29
-finishing thread FetcherThread26, activeThreads=28
-finishing thread FetcherThread25, activeThreads=27
-finishing thread FetcherThread24, activeThreads=26
-finishing thread FetcherThread23, activeThreads=25
-finishing thread FetcherThread22, activeThreads=24
-finishing thread FetcherThread21, activeThreads=23
-finishing thread FetcherThread20, activeThreads=22
-finishing thread FetcherThread19, activeThreads=21
-finishing thread FetcherThread18, activeThreads=20
-finishing thread FetcherThread17, activeThreads=19
-finishing thread FetcherThread16, activeThreads=18
-finishing thread FetcherThread15, activeThreads=17
-finishing thread FetcherThread14, activeThreads=16
-finishing thread FetcherThread13, activeThreads=15
-finishing thread FetcherThread12, activeThreads=14
-finishing thread FetcherThread11, activeThreads=13
-finishing thread FetcherThread10, activeThreads=12
-finishing thread FetcherThread9, activeThreads=11
-finishing thread FetcherThread8, activeThreads=10
-finishing thread FetcherThread7, activeThreads=9
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread49, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread28, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
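The truncation message in the log above (`Content of size 92777 was truncated to 59561`) comes from the fetcher's cap on downloaded content size, controlled by the `http.content.limit` property. If full pages are needed, the limit can be raised in conf/nutch-site.xml; a sketch (the value 6553600 is an arbitrary example, and -1 disables the limit entirely):

```xml
<!-- conf/nutch-site.xml: raise the maximum downloaded content size. -->
<property>
  <name>http.content.limit</name>
  <value>6553600</value>
</property>
```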
II. Step-by-Step Crawling with Individual Commands
1. InjectorJob
This step injects the URLs in seed.txt into the crawl queue to initialize it.
(1) Basic command
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1

Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
The content of urls/seed.txt is as follows:
http://stackoverflow.com/
(2) Inspecting the injected URLs
The step above creates a new table in HBase named <crawlId>_webpage (here, 334_webpage), and the URL's data is written into this table.
hbase(main):002:0> scan '334_webpage'
ROW                              COLUMN+CELL
 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.3020 seconds
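These binary cell values can be decoded by hand. For example, f:fi is the fetch interval stored as a big-endian 32-bit integer, so \x00'\x8D\x00 is 0x00278D00:

```shell
# f:fi = \x00\x27\x8D\x00 read as a big-endian 32-bit integer.
# 2592000 seconds = 30 days, which is Nutch's default
# db.fetch.interval.default.
printf '%d\n' 0x00278D00
```

Similarly, s:s = ?\x80\x00\x00 is 0x3F800000, the IEEE-754 float 1.0, i.e. the page's initial score.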
(3) About the <crawlId>_webpage table
Every crawl task gets its own <crawlId>_webpage table, and information about every URL, fetched or not, is stored in this table.
If a URL has not been fetched yet, its row holds relatively little; once the URL has been fetched, the fetched content (such as the page body) is stored in the same row.
2. GeneratorJob
(1) Basic command
[jediael@jediael local]$  bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
(2) Command options
[root@jediael local]# bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id
----------------------
Please set the params.
(3) Inspect the database
hbase(main):003:0> scan '334_webpage'
ROW                              COLUMN+CELL
 com.stackoverflow:http/         column=f:bid, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
 com.stackoverflow:http/         column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744
 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y
 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0
 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
1 row(s) in 0.0490 seconds
This step added two columns: f:bid and mk:_gnmrk_.
3. FetcherJob
(1) Basic command
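To watch these per-job markers accumulate, the scan can be restricted to the mk column family. A sketch, assuming the same 334_webpage table and that the hbase shell is on the PATH:

```shell
# In the HBase shell: show only the job-marker columns.
# _injmrk_ is stamped by InjectorJob, _gnmrk_ by GeneratorJob.
echo "scan '334_webpage', {COLUMNS => 'mk'}" | hbase shell
```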
[jediael@jediael local]$  bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
[jediael@jediael local]$  bin/nutch fetch -all -crawlId 334
FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://stackoverflow.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
(2) Inspect the database
See db1.txt.
This step added columns such as f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options, and mk:_ftcmrk_.
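The h: columns hold the raw HTTP response headers, so they can be pulled out for a single row. A sketch, using the row key and table name from the example above:

```shell
# In the HBase shell: fetch only the HTTP response headers
# that FetcherJob stored for the stackoverflow.com row.
echo "get '334_webpage', 'com.stackoverflow:http/', {COLUMN => 'h'}" | hbase shell
```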
4. ParserJob
(1) Basic command
[jediael@jediael local]$ bin/nutch parse  -all -crawlId 334
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: parsing all
Parsing http://stackoverflow.com/
ParserJob: success
(2) Command options
[root@jediael local]# bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed
(3) Inspect the database
See db_parse.txt.
This step added many columns of the form column=ol:http://stackoverflow.com/help, which hold the parsed outlinks; in this example there are 115 of them.
5. DbUpdaterJob
(1) Basic command
[jediael@jediael local]$ bin/nutch updatedb -crawlId 334
DbUpdaterJob: starting
DbUpdaterJob: done
(2) Inspect the database
See db_updatedb.txt.
This step processed the 115 ol: outlink columns above and generated 115 new rows, one row per outlink. Two of them are shown below:
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3944974/silviu-oncioiu column=s:s, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi          column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
 com.stackoverflow:http/users/3974525/laosi          column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
 com.stackoverflow:http/users/3974525/laosi          column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
 com.stackoverflow:http/users/3974525/laosi          column=mk:dist, timestamp=1408954979355, value=1
 com.stackoverflow:http/users/3974525/laosi          column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
 com.stackoverflow:http/users/3974525/laosi          column=s:s, timestamp=1408954979355, value=<\x0Ex5
At this point the data is ready for the next round of fetching.
6. SolrIndexerJob
(1) Basic command
[jediael@jediael local]$  bin/nutch solrindex http://****/solr/  -all -crawlId 334
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.
(2) Command options
[root@jediael local]# bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
(3) Inspect the database
No changes.
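Since the HBase table is unchanged, the result of this step is best verified on the Solr side. A sketch; the host and path in the URL are placeholders for your Solr instance:

```shell
# Ask Solr for everything in the index; after indexing the single
# stackoverflow.com page, the response should report numFound=1.
curl "http://localhost:8983/solr/select?q=*:*&wt=json&indent=true"
```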
