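The components and objects in Heritrix 3.1.0 are tightly interwoven. To describe how a particular feature is implemented I inevitably have to bring in related objects and their implementations, otherwise I cannot explain the feature's logic clearly. At the same time I have to keep the discussion coherent. Take what follows as my best attempt at balancing the two.

This article analyzes the ServerCache class together with the CrawlHost and CrawlServer classes. Understanding what these classes do is a prerequisite for the rest of the analysis.

ServerCache is an abstract class that serves as the crawl-global registry of CrawlHost and CrawlServer objects for a Heritrix 3.1.0 application.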

/**
 * Abstract class for crawl-global registry of CrawlServer (host:port) and
 * CrawlHost (hostname) objects.
 */
public abstract class ServerCache {

    public abstract CrawlHost getHostFor(String host);

    public abstract CrawlServer getServerFor(String serverKey);

    /**
     * Utility for performing an action on every CrawlHost.
     *
     * @param action 1-argument Closure to apply to each CrawlHost
     */
    public abstract void forAllHostsDo(Closure action);

    private static Logger logger =
        Logger.getLogger(ServerCache.class.getName());

    /**
     * Get the {@link CrawlHost} associated with <code>curi</code>.
     * @param uuri CandidateURI we're to return Host for.
     * @return CandidateURI instance that matches the passed Host name.
     */
    public CrawlHost getHostFor(UURI uuri) {
        CrawlHost h = null;
        try {
            if (uuri.getScheme().equals("dns")) {
                h = getHostFor("dns:");
            } else if (uuri.getScheme().equals("whois")) {
                h = getHostFor("whois:");
            } else {
                h = getHostFor(uuri.getReferencedHost());
            }
        } catch (URIException e) {
            logger.log(Level.SEVERE, uuri.toString(), e);
        }
        return h;
    }

    /**
     * Get the {@link CrawlServer} associated with <code>curi</code>.
     * @param uuri CandidateURI we're to get server from.
     * @return CrawlServer instance that matches the passed CandidateURI.
     */
    public CrawlServer getServerFor(UURI uuri) {
        //System.out.println("classname:"+this.getClass().getName());
        CrawlServer cs = null;
        try {
            String key = CrawlServer.getServerKey(uuri);
            // TODOSOMEDAY: make this robust against those rare cases
            // where authority is not a hostname.
            if (key != null) {
                cs = getServerFor(key);
            }
        } catch (URIException e) {
            logger.log(Level.FINE, "No server key obtainable: "+uuri.toString(), e);
        } catch (NullPointerException npe) {
            logger.log(Level.FINE, "No server key obtainable: "+uuri.toString(), npe);
        }
        return cs;
    }

    abstract public Set<String> hostKeys();
}

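This abstract class acts as a template. External classes call the template method CrawlHost getHostFor(UURI uuri) to obtain a CrawlHost object, looked up by the host of the UURI (dns and whois URIs are mapped to the fixed keys "dns:" and "whois:").

The template method CrawlServer getServerFor(UURI uuri) returns a CrawlServer object, looked up by the server key that CrawlServer.getServerKey derives from the UURI.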
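The division of labor between the abstract String lookup and the concrete UURI overload can be sketched in isolation. This is a minimal illustration, not Heritrix code: SketchHost, SketchServerCache and MapServerCache are hypothetical stand-ins, and java.net.URI replaces Heritrix's UURI so that the example runs without the Heritrix jars.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlHost; it only carries its key.
class SketchHost {
    final String name;
    SketchHost(String name) { this.name = name; }
}

// Same shape as ServerCache: subclasses supply the lookup by String key,
// the base class derives that key from a URI.
abstract class SketchServerCache {
    abstract SketchHost getHostFor(String host);

    SketchHost getHostFor(URI uri) {
        String scheme = uri.getScheme();
        if ("dns".equals(scheme)) {
            return getHostFor("dns:");      // every dns: URI shares one pseudo-host
        } else if ("whois".equals(scheme)) {
            return getHostFor("whois:");    // likewise for whois: URIs
        }
        return getHostFor(uri.getHost());   // ordinary URIs are keyed by host name
    }
}

// Minimal in-memory subclass playing the role of DefaultServerCache.
class MapServerCache extends SketchServerCache {
    private final Map<String, SketchHost> hosts = new HashMap<>();

    @Override
    SketchHost getHostFor(String host) {
        if (host == null || host.isEmpty()) {
            return null;
        }
        SketchHost h = hosts.get(host);
        if (h == null) {
            h = new SketchHost(host);
            hosts.put(host, h);
        }
        return h;
    }
}

class TemplateMethodDemo {
    public static void main(String[] args) {
        SketchServerCache cache = new MapServerCache();
        SketchHost a = cache.getHostFor(URI.create("http://example.com/a"));
        SketchHost b = cache.getHostFor(URI.create("http://example.com:8080/b"));
        SketchHost d = cache.getHostFor(URI.create("dns:example.com"));
        System.out.println(a == b);  // same host name, so the same cached object
        System.out.println(d.name);  // the fixed key used for all dns: URIs
    }
}
```

Callers only ever hand in a URI; which key is used and how the object is stored stay hidden behind the two layers.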

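If I tried to say right now what CrawlHost and CrawlServer objects are, neither you nor I would really understand them. We come to understand things by moving from the concrete to the abstract, through how an object behaves, so I will set these two objects aside for the moment.

DefaultServerCache extends the ServerCache abstract class above and provides concrete implementations of its abstract methods for obtaining CrawlHost and CrawlServer objects.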

/**
 * Server and Host cache.
 * @author stack
 * @version $Date: 2011-01-20 23:28:38 +0000 (Thu, 20 Jan 2011) $, $Revision: 7067 $
 */
public class DefaultServerCache extends ServerCache implements Closeable, Serializable {
    private static final long serialVersionUID = 1L;

    @SuppressWarnings("unused")
    private static Logger logger =
        Logger.getLogger(DefaultServerCache.class.getName());

    /**
     * hostname[:port] -> CrawlServer.
     * Set in the initialization.
     */
    protected ObjectIdentityCache<CrawlServer> servers = null;

    /**
     * hostname -> CrawlHost.
     * Set in the initialization.
     */
    protected ObjectIdentityCache<CrawlHost> hosts = null;

    /**
     * Constructor.
     */
    public DefaultServerCache() {
        this(new ObjectIdentityMemCache<CrawlServer>(),
             new ObjectIdentityMemCache<CrawlHost>());
    }

    public DefaultServerCache(ObjectIdentityCache<CrawlServer> servers,
                              ObjectIdentityCache<CrawlHost> hosts) {
        this.servers = servers;
        this.hosts = hosts;
    }

    /**
     * Get the {@link CrawlServer} associated with <code>name</code>.
     * @param serverKey Server name we're to return server for.
     * @return CrawlServer instance that matches the passed server name.
     */
    public CrawlServer getServerFor(final String serverKey) {
        CrawlServer cserver = servers.getOrUse(
                serverKey,
                new Supplier<CrawlServer>() {
                    public CrawlServer get() {
                        String skey = new String(serverKey); // ensure private minimal key
                        return new CrawlServer(skey);
                    }
                });
        return cserver;
    }

    /**
     * Get the {@link CrawlHost} associated with <code>name</code>.
     * @param hostname Host name we're to return Host for.
     * @return CrawlHost instance that matches the passed Host name.
     */
    public CrawlHost getHostFor(final String hostname) {
        if (hostname == null || hostname.length() == 0) {
            return null;
        }
        CrawlHost host = hosts.getOrUse(
                hostname,
                new Supplier<CrawlHost>() {
                    public CrawlHost get() {
                        String hkey = new String(hostname); // ensure private minimal key
                        return new CrawlHost(hkey);
                    }
                });
        return host;
    }

    /**
     * @param serverKey Key to use doing lookup.
     * @return True if a server instance exists.
     */
    public boolean containsServer(String serverKey) {
        return (CrawlServer) servers.get(serverKey) != null;
    }

    /**
     * @param hostKey Key to use doing lookup.
     * @return True if a host instance exists.
     */
    public boolean containsHost(String hostKey) {
        return (CrawlHost) hosts.get(hostKey) != null;
    }

    /**
     * Called when shutting down the cache so we can do clean up.
     */
    public void close() {
        if (this.hosts != null) {
            // If we're using a bdb bigmap, the call to clear will
            // close down the bdb database.
            this.hosts.close();
            this.hosts = null;
        }
        if (this.servers != null) {
            this.servers.close();
            this.servers = null;
        }
    }

    /**
     * NOTE: Should not mutate the CrawlHost instance so retrieved; depending on
     * the hostscache implementation, the change may not be reliably persistent.
     *
     * @see org.archive.modules.net.ServerCache#forAllHostsDo(org.apache.commons.collections.Closure)
     */
    public void forAllHostsDo(Closure c) {
        for(String host : hosts.keySet()) {
            c.execute(hosts.get(host));
        }
    }

    public Set<String> hostKeys() {
        return hosts.keySet();
    }
}

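The methods in this subclass that obtain CrawlHost and CrawlServer objects follow a pattern we have already analyzed; see Heritrix 3.1.0 Source Code Analysis (7). I show the UML model from that article here again for reference.

[UML model from part 7 of this series]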

However, the object-cache manager that DefaultServerCache uses by default is ObjectIdentityMemCache, which keeps the cached objects in memory.
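The behavior that getServerFor and getHostFor rely on, "return the cached object for this key, or create it once from the supplier", can be sketched with a plain concurrent map. This is an assumption-laden illustration, not the real ObjectIdentityMemCache: MemCacheSketch and SketchServer are hypothetical names, and java.util.function.Supplier stands in for the Supplier type used by Heritrix.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical in-memory cache; NOT the real ObjectIdentityMemCache.
class MemCacheSketch<V> {
    private final ConcurrentMap<String, V> map = new ConcurrentHashMap<>();

    // Plain lookup that never creates an entry
    // (the kind of call containsServer/containsHost make).
    V get(String key) {
        return map.get(key);
    }

    // Return the cached value for key, creating it from the supplier
    // only when no value is present yet.
    V getOrUse(String key, Supplier<V> supplier) {
        return map.computeIfAbsent(key, k -> supplier.get());
    }
}

// Hypothetical stand-in for CrawlServer.
class SketchServer {
    final String key;
    SketchServer(String key) { this.key = key; }
}

class MemCacheDemo {
    public static void main(String[] args) {
        MemCacheSketch<SketchServer> servers = new MemCacheSketch<>();
        AtomicInteger created = new AtomicInteger();
        Supplier<SketchServer> factory = () -> {
            created.incrementAndGet();
            return new SketchServer("example.com:8080");
        };

        SketchServer first = servers.getOrUse("example.com:8080", factory);
        SketchServer second = servers.getOrUse("example.com:8080", factory);

        System.out.println(first == second); // both calls yield the same instance
        System.out.println(created.get());   // the supplier ran only once
    }
}
```

Because every caller gets the same instance for a given key, per-host and per-server state has a single home for the whole crawl.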

Both CrawlHost and CrawlServer implement the IdentityCacheable interface, which marks objects that can be held in such a cache.

BdbServerCache extends DefaultServerCache and stores the objects in a BDB database.

/**
 * ServerCache backed by BDB big maps; the usual choice for crawls.
 *
 * @contributor pjack
 * @contributor gojomo
 */
public class BdbServerCache extends DefaultServerCache
implements Lifecycle {
    private static final long serialVersionUID = 1L;

    protected BdbModule bdb;

    @Autowired
    public void setBdbModule(BdbModule bdb) {
        this.bdb = bdb;
    }

    public BdbServerCache() {
    }

    public void start() {
        if(isRunning()) {
            return;
        }
        try {
            this.servers = bdb.getObjectCache("servers", false, CrawlServer.class, CrawlServer.class);
            this.hosts = bdb.getObjectCache("hosts", false, CrawlHost.class, CrawlHost.class);
        } catch (DatabaseException e) {
            throw new IllegalStateException(e);
        }
        isRunning = true;
    }

    boolean isRunning = false;

    public boolean isRunning() {
        return isRunning;
    }

    public void stop() {
        isRunning = false;
        // TODO: release bigmaps?
    }
}
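The start()/stop() pair above follows a simple guarded-lifecycle pattern: the object begins with the in-memory caches set up by the superclass constructor, and start() swaps in the persistent ones unless the component is already running. The sketch below isolates that pattern under stated assumptions: LifecycleCacheSketch and openStore are hypothetical, and an ordinary HashMap stands in for the cache that bdb.getObjectCache would return.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the guarded start()/stop() pattern;
// it does not touch BDB at all.
class LifecycleCacheSketch {
    // Default store, available before start() is called.
    protected Map<String, String> hosts = new HashMap<>();

    private boolean running = false;
    int openCount = 0; // how many times the backing store was (re)opened

    void start() {
        if (running) {
            return; // already started: do not reopen the store
        }
        hosts = openStore("hosts");
        running = true;
    }

    boolean isRunning() {
        return running;
    }

    void stop() {
        running = false; // like the original, the store itself is not released here
    }

    // Stand-in for obtaining a persistent cache by name.
    private Map<String, String> openStore(String name) {
        openCount++;
        return new HashMap<>();
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        LifecycleCacheSketch cache = new LifecycleCacheSketch();
        cache.start();
        cache.start(); // ignored by the guard
        System.out.println(cache.openCount);
        cache.stop();
        cache.start(); // running flag was cleared, so the store is opened again
        System.out.println(cache.openCount);
    }
}
```

In this sketch the guard makes repeated start() calls harmless, while a stop() followed by start() reopens the store, which matches the flag handling visible in the listing above.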

---------------------------------------------------------------------------

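This Heritrix 3.1.0 Source Code Analysis series is my original work.

When reposting, please credit the source: cnblogs (博客园), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049817.html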
