Heritrix 3.1.0 Source Code Analysis (Part 6)

This article analyzes the state and methods of the BdbFrontier object.

BdbFrontier extends WorkQueueFrontier, which in turn extends AbstractFrontier.

BdbFrontier's void start() method is shown below (it is defined in the parent class WorkQueueFrontier):

org.archive.crawler.frontier.BdbFrontier

org.archive.crawler.frontier.WorkQueueFrontier

public void start() {
    if (isRunning()) {
        return;
    }
    uriUniqFilter.setDestination(this);
    super.start();
    try {
        initInternalQueues();
    } catch (Exception e) {
        throw new IllegalStateException(e);
    }
}

This in turn calls the void start() method of the parent class AbstractFrontier:

public void start() {
    if (isRunning()) {
        return;
    }
    if (getRecoveryLogEnabled()) try {
        initJournal(loggerModule.getPath().getFile().getAbsolutePath());
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
    pause();
    startManagerThread();
}

The call to pause() first puts the current object (BdbFrontier) into the State.PAUSE state; void startManagerThread() is then called:

/**
 * Start the dedicated thread with an independent view of the frontier's
 * state.
 */
protected void startManagerThread() {
    managerThread = new Thread(this+".managerThread") {
        public void run() {
            AbstractFrontier.this.managementTasks();
        }
    };
    managerThread.setPriority(Thread.NORM_PRIORITY+1);
    managerThread.start();
}
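The start sequence shown so far (guard against a double start, request PAUSE, then launch a dedicated manager thread) can be tried out in isolation. The following is only a minimal sketch: the class ManagerThreadSketch, its three-value State enum, the finish() helper and the body of managementTasks() are all hypothetical stand-ins, and the isRunning() check based on the thread being alive is an assumption, not something the excerpt above shows.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the frontier's start-up path; not Heritrix code.
class ManagerThreadSketch {
    enum State { RUN, PAUSE, FINISH }

    private volatile State targetState = State.RUN;
    private final CountDownLatch started = new CountDownLatch(1);
    private Thread managerThread;

    // Assumption: "running" means the manager thread exists and is alive.
    boolean isRunning() {
        return managerThread != null && managerThread.isAlive();
    }

    State targetState() {
        return targetState;
    }

    void start() {
        if (isRunning()) {
            return;                       // second start() is a no-op
        }
        targetState = State.PAUSE;        // plays the role of pause()
        startManagerThread();
    }

    private void startManagerThread() {
        managerThread = new Thread("sketch.managerThread") {
            @Override
            public void run() {
                managementTasks();
            }
        };
        managerThread.setPriority(Thread.NORM_PRIORITY + 1);
        managerThread.start();
    }

    // Simplified loop: only waits until FINISH is requested.
    private void managementTasks() {
        started.countDown();
        try {
            while (targetState != State.FINISH) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Waits until the manager thread has entered managementTasks().
    boolean awaitManager(long millis) {
        try {
            return started.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Requests FINISH and waits for the manager thread to exit.
    void finish() {
        targetState = State.FINISH;
        try {
            managerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ManagerThreadSketch frontier = new ManagerThreadSketch();
        frontier.start();
        frontier.start();                 // ignored while running
        System.out.println("manager entered loop: " + frontier.awaitManager(2000));
        frontier.finish();
        System.out.println("running after finish: " + frontier.isRunning());
    }
}
```

The point of the sketch is the ordering: the target state is set before the thread is started, so the manager thread never observes a state other than the one start() requested.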

The Thread managerThread object runs the void managementTasks() method:

/**
 * Main loop of frontier's managerThread. Only exits when State.FINISH
 * is requested (perhaps automatically at URI exhaustion) and reached.
 *
 * General strategy is to try to fill outbound queue, then process an
 * item from inbound queue, and repeat. A HOLD (to be implemented) or
 * PAUSE puts frontier into a stable state that won't be changed
 * asynchronously by worker thread activity.
 */
protected void managementTasks() {
    assert Thread.currentThread() == managerThread;
    try {
        loop: while (true) {
            try {
                State reachedState = null;
                switch (targetState) {
                case EMPTY:
                    reachedState = State.EMPTY;
                case RUN:
                    // enable outbound takes if previously locked
                    while(outboundLock.isWriteLockedByCurrentThread()) {
                        outboundLock.writeLock().unlock();
                    }
                    if(reachedState==null) {
                        reachedState = State.RUN;
                    }
                    reachedState(reachedState);
                    Thread.sleep(1000);
                    if(isEmpty()&&targetState==State.RUN) {
                        requestState(State.EMPTY);
                    } else if (!isEmpty()&&targetState==State.EMPTY) {
                        requestState(State.RUN);
                    }
                    break;
                case HOLD:
                    // TODO; for now treat same as PAUSE
                case PAUSE:
                    // pausing
                    // prevent all outbound takes
                    outboundLock.writeLock().lock();
                    // process all inbound
                    while (targetState == State.PAUSE) {
                        if (getInProcessCount()==0) {
                            reachedState(State.PAUSE);
                        }
                        Thread.sleep(1000);
                    }
                    break;
                case FINISH:
                    // prevent all outbound takes
                    outboundLock.writeLock().lock();
                    // process all inbound
                    while (getInProcessCount()>0) {
                        Thread.sleep(1000);
                    }
                    finalTasks(); // TODO: more cleanup?
                    reachedState(State.FINISH);
                    break loop;
                }
            } catch (RuntimeException e) {
                // log, try to pause, continue
                logger.log(Level.SEVERE,"",e);
                if(targetState!=State.PAUSE && targetState!=State.FINISH) {
                    requestState(State.PAUSE);
                }
            }
        }
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }

    // try to leave in safely restartable state:
    targetState = State.PAUSE;
    while(outboundLock.isWriteLockedByCurrentThread()) {
        outboundLock.writeLock().unlock();
    }
    //TODO: ensure all other structures are cleanly reset on restart
    logger.log(Level.FINE,"ending frontier mgr thread");
}

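The method above loops on targetState and adjusts the lock state of the member variable protected ReentrantReadWriteLock outboundLock = new ReentrantReadWriteLock(true) accordingly: in the EMPTY and RUN branches it releases every write-lock hold so that outbound takes can proceed, and in the HOLD, PAUSE and FINISH branches it acquires the write lock to block all outbound takes.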
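To make the role of outboundLock concrete, here is a small sketch of the locking pattern on its own. It assumes that worker threads acquire the read lock before taking a URI; that side is not part of the code quoted above, so tryTake() and tryTakeFromWorker() are hypothetical, as is the class name OutboundLockSketch. Only blockOutbound() and allowOutbound() mirror calls that actually appear in managementTasks().

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of the outboundLock pattern; not Heritrix code.
class OutboundLockSketch {
    // Fair lock, as in the frontier's field declaration.
    private final ReentrantReadWriteLock outboundLock = new ReentrantReadWriteLock(true);

    // Manager side, PAUSE / HOLD / FINISH branches: block outbound takes.
    // The write lock is reentrant, so repeated loop passes add more holds.
    void blockOutbound() {
        outboundLock.writeLock().lock();
    }

    // Manager side, RUN / EMPTY branches: release every hold, however many
    // times the write lock was acquired by this thread.
    void allowOutbound() {
        while (outboundLock.isWriteLockedByCurrentThread()) {
            outboundLock.writeLock().unlock();
        }
    }

    // Assumed worker side: a take is only possible while the read lock can
    // be acquired, i.e. while no other thread holds the write lock.
    boolean tryTake() {
        if (!outboundLock.readLock().tryLock()) {
            return false;
        }
        try {
            return true;                  // a real worker would dequeue here
        } finally {
            outboundLock.readLock().unlock();
        }
    }

    // Runs tryTake() on a separate thread and reports its result.
    boolean tryTakeFromWorker() {
        final boolean[] result = new boolean[1];
        Thread worker = new Thread(() -> result[0] = tryTake());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        OutboundLockSketch sketch = new OutboundLockSketch();
        System.out.println("running, take allowed: " + sketch.tryTakeFromWorker());
        sketch.blockOutbound();
        System.out.println("paused, take allowed: " + sketch.tryTakeFromWorker());
        sketch.allowOutbound();
        System.out.println("resumed, take allowed: " + sketch.tryTakeFromWorker());
    }
}
```

The while loop in allowOutbound() matters because the manager loop can pass through the PAUSE branch more than once, and each pass adds another reentrant hold on the write lock; a single unlock() would leave workers blocked.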

The void initInternalQueues() method, called afterwards, initializes the queues used by the crawl job:

/**
 * Initializes internal queues.  May decide to keep all queues in memory based on
 * {@link QueueAssignmentPolicy#maximumNumberOfKeys}.  Otherwise invokes
 * {@link #initAllQueues()} to actually set up the queues.
 *
 * Subclasses should invoke this method with recycle set to "true" in
 * a private readObject method, to restore queues after a checkpoint.
 *
 * @param recycle
 * @throws IOException
 * @throws DatabaseException
 */
protected void initInternalQueues() throws IOException, DatabaseException {
    this.initOtherQueues();
    if (workQueueDataOnDisk()
            && preparer.getQueueAssignmentPolicy().maximumNumberOfKeys() >= 0
            && preparer.getQueueAssignmentPolicy().maximumNumberOfKeys() <=
                MAX_QUEUES_TO_HOLD_ALLQUEUES_IN_MEMORY) {
        this.allQueues = new ObjectIdentityMemCache<WorkQueue>(701, .9f, 100);
    } else {
        this.initAllQueues();
    }
}
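The three-part condition in initInternalQueues() is easier to read as a standalone predicate. The sketch below is hypothetical throughout: the class and method names are invented, the threshold value 3000 is a placeholder because the excerpt does not show the value of MAX_QUEUES_TO_HOLD_ALLQUEUES_IN_MEMORY, and reading a negative maximumNumberOfKeys() as "no known bound" is an assumption.

```java
// Hypothetical restatement of the in-memory vs. BDB-backed decision.
class AllQueuesChoiceSketch {
    // Placeholder threshold; the real constant's value is not in the excerpt.
    static final int MAX_IN_MEMORY = 3000;

    // True when the all-queues map would be kept in an in-memory cache,
    // false when initAllQueues() would set up the BDB-backed cache instead.
    static boolean keepAllQueuesInMemory(boolean workQueueDataOnDisk,
                                         int maximumNumberOfKeys) {
        return workQueueDataOnDisk
                && maximumNumberOfKeys >= 0          // assumed: negative = unbounded
                && maximumNumberOfKeys <= MAX_IN_MEMORY;
    }

    public static void main(String[] args) {
        System.out.println(keepAllQueuesInMemory(true, 1));
        System.out.println(keepAllQueuesInMemory(true, -1));
        System.out.println(keepAllQueuesInMemory(true, MAX_IN_MEMORY + 1));
    }
}
```

In other words, the in-memory cache is chosen only when the queue-assignment policy promises a small, bounded number of queue keys; in every other case the code falls back to initAllQueues().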

It first calls the void initOtherQueues() method, which is implemented in the BdbFrontier class:

@Override
protected void initOtherQueues() throws DatabaseException {
    boolean recycle = (recoveryCheckpoint != null);

    // tiny risk of OutOfMemoryError: if giant number of snoozed
    // queues all wake-to-ready at once
    readyClassQueues = new LinkedBlockingQueue<String>();

    inactiveQueuesByPrecedence = new ConcurrentSkipListMap<Integer,Queue<String>>();

    retiredQueues = bdb.getStoredQueue("retiredQueues", String.class, recycle);

    // primary snoozed queues
    snoozedClassQueues = new DelayQueue<DelayedWorkQueue>();
    // just in case: overflow for extreme situations
    snoozedOverflow = bdb.getStoredMap(
            "snoozedOverflow", Long.class, DelayedWorkQueue.class, true, false);

    this.futureUris = bdb.getStoredMap(
            "futureUris", Long.class, CrawlURI.class, true, recoveryCheckpoint!=null);

    // initialize master map in which other queues live
    this.pendingUris = createMultipleWorkQueues();
}

The method above initializes a series of queues; the role of each one will be analyzed in later articles.
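Until those articles, the three in-memory collections can at least be exercised with plain java.util.concurrent types. This is a rough sketch under stated assumptions: the Snoozed class is a hypothetical stand-in for DelayedWorkQueue, the methods deactivate(), activateNext(), snooze() and wakeExpired() are invented for illustration, and treating a lower precedence number as "activate first" is an assumption about how the frontier uses the sorted map.

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the ready / inactive / snoozed collections.
class FrontierQueuesSketch {
    // Stand-in for DelayedWorkQueue: a queue key plus a wake-up time.
    static final class Snoozed implements Delayed {
        final String classKey;
        final long wakeAtMillis;

        Snoozed(String classKey, long wakeAtMillis) {
            this.classKey = classKey;
            this.wakeAtMillis = wakeAtMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(wakeAtMillis - System.currentTimeMillis(),
                    TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                    other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    final BlockingQueue<String> readyClassQueues = new LinkedBlockingQueue<String>();
    final ConcurrentSkipListMap<Integer, Queue<String>> inactiveQueuesByPrecedence =
            new ConcurrentSkipListMap<Integer, Queue<String>>();
    final DelayQueue<Snoozed> snoozedClassQueues = new DelayQueue<Snoozed>();

    // Parks a queue key under its precedence bucket.
    void deactivate(String classKey, int precedence) {
        inactiveQueuesByPrecedence
                .computeIfAbsent(precedence, p -> new ConcurrentLinkedQueue<String>())
                .add(classKey);
    }

    // Moves one key from the lowest-numbered non-empty bucket to ready.
    String activateNext() {
        for (Queue<String> bucket : inactiveQueuesByPrecedence.values()) {
            String key = bucket.poll();
            if (key != null) {
                readyClassQueues.add(key);
                return key;
            }
        }
        return null;
    }

    // Puts a queue key to sleep for the given delay.
    void snooze(String classKey, long delayMillis) {
        snoozedClassQueues.add(
                new Snoozed(classKey, System.currentTimeMillis() + delayMillis));
    }

    // Moves every key whose delay has expired to ready; returns the count.
    int wakeExpired() {
        int woken = 0;
        Snoozed expired;
        while ((expired = snoozedClassQueues.poll()) != null) {
            readyClassQueues.add(expired.classKey);
            woken++;
        }
        return woken;
    }

    public static void main(String[] args) {
        FrontierQueuesSketch queues = new FrontierQueuesSketch();
        queues.deactivate("org,example,", 5);
        queues.deactivate("org,archive,", 1);
        System.out.println("activated: " + queues.activateNext());
        queues.snooze("com,example,", 0);
        System.out.println("woken: " + queues.wakeExpired());
    }
}
```

The sketch also shows why the original comment mentions a small OutOfMemoryError risk: readyClassQueues is unbounded, so a large number of snoozed queues waking at the same moment all land in it at once.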

The void initAllQueues() method initializes the member variable ObjectIdentityCache<WorkQueue> allQueues = null; it is implemented in the BdbFrontier class as follows:

@Override
protected void initAllQueues() throws DatabaseException {
    boolean isRecovery = (recoveryCheckpoint != null);
    this.allQueues = bdb.getObjectCache(
            "allqueues", isRecovery, WorkQueue.class, BdbWorkQueue.class);
    if(isRecovery) {
        // restore simple instance fields
        JSONObject json = recoveryCheckpoint.loadJson(beanName);
        try {
            nextOrdinal.set(json.getLong("nextOrdinal"));
            queuedUriCount.set(json.getLong("queuedUriCount"));
            futureUriCount.set(json.getLong("futureUriCount"));
            succeededFetchCount.set(json.getLong("succeededFetchCount"));
            failedFetchCount.set(json.getLong("failedFetchCount"));
            disregardedUriCount.set(json.getLong("disregardedUriCount"));
            totalProcessedBytes.set(json.getLong("totalProcessedBytes"));
            JSONArray inactivePrecedences = json.getJSONArray("inactivePrecedences");
            // restore all intended inactiveQueues
            for(int i = 0; i < inactivePrecedences.length(); i++) {
                int precedence = inactivePrecedences.getInt(i);
                inactiveQueuesByPrecedence.put(precedence,
                        createInactiveQueueForPrecedence(precedence,true));
            }
        } catch (JSONException e) {
            throw new RuntimeException(e);
        }

        // retired queues already restored with prior data in initOtherQueues

        // restore ready queues (those not already on inactive, retired)
        BufferedReader activeQueuesReader = null;
        try {
            activeQueuesReader = recoveryCheckpoint.loadReader(beanName,"active");
            String line;
            while((line = activeQueuesReader.readLine())!=null) {
                readyClassQueues.add(line);
            }
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        } finally {
            IOUtils.closeQuietly(activeQueuesReader);
        }

        // TODO: restore largestQueues topNset?
    }
}

The ObjectIdentityCache<WorkQueue> allQueues member manages the cache of BdbWorkQueue work queues.
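The last step of the recovery branch in initAllQueues(), reading the "active" checkpoint data line by line into readyClassQueues, can be reproduced with the standard library alone. The sketch below is a simplified illustration, not the Heritrix implementation: ActiveQueueRestoreSketch and restoreReadyQueues() are hypothetical names, a java.io.Reader stands in for recoveryCheckpoint.loadReader(...), and try-with-resources replaces the IOUtils.closeQuietly call. The one-key-per-line format is inferred from the readLine() loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of restoring ready-queue keys from checkpoint text.
class ActiveQueueRestoreSketch {
    // Reads one queue key per line and enqueues them in file order.
    static BlockingQueue<String> restoreReadyQueues(Reader source) {
        BlockingQueue<String> readyClassQueues = new LinkedBlockingQueue<String>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                readyClassQueues.add(line);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return readyClassQueues;
    }

    public static void main(String[] args) {
        // Made-up queue keys, used only as sample input.
        Reader checkpoint = new StringReader("com,example,\norg,archive,\n");
        BlockingQueue<String> ready = restoreReadyQueues(checkpoint);
        System.out.println(ready.size() + " ready queues, first: " + ready.peek());
    }
}
```

Only the queue keys are restored this way; the WorkQueue objects themselves come back through the allQueues cache, which is opened in recovery mode at the top of initAllQueues().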

---------------------------------------------------------------------------

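This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/18/3027677.html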
