1. Official code demo

        Analyzer analyzer = new StandardAnalyzer();
        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open("/tmp/testindex");
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();
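The snippet above only covers the write path. Since this article is about the reading process, here is the search half of the same official demo as a rough sketch (Lucene 6.x/7.x API assumed; QueryParser lives in the lucene-queryparser module, org.apache.lucene.queryparser.classic, and its parse() method throws ParseException):

        // Read back what was just indexed (continuation of the official demo):
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text" in the "fieldname" field:
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("text");
        ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            Document hitDoc = isearcher.doc(hit.doc);
            System.out.println(hitDoc.get("fieldname")); // "This is the text to be indexed."
        }
        ireader.close();
        directory.close();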

2. Classes involved and their relationships

2.1 TokenStream

/**
 * A <code>TokenStream</code> enumerates the sequence of tokens, either from
 * {@link Field}s of a {@link Document} or from query text.
 * <p>
 * This is an abstract class; concrete subclasses are:
 * <ul>
 * <li>{@link Tokenizer}, a <code>TokenStream</code> whose input is a Reader; and
 * <li>{@link TokenFilter}, a <code>TokenStream</code> whose input is another
 * <code>TokenStream</code>.
 * </ul>
 * A new <code>TokenStream</code> API has been introduced with Lucene 2.9. This API
 * has moved from being {@link Token}-based to {@link Attribute}-based. While
 * {@link Token} still exists in 2.9 as a convenience class, the preferred way
 * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
 * <p>
 * <code>TokenStream</code> now extends {@link AttributeSource}, which provides
 * access to all of the token {@link Attribute}s for the <code>TokenStream</code>.
 * Note that only one instance per {@link AttributeImpl} is created and reused
 * for every token. This approach reduces object creation and allows local
 * caching of references to the {@link AttributeImpl}s. See
 * {@link #incrementToken()} for further details.
 * <p>
 * <b>The workflow of the new <code>TokenStream</code> API is as follows:</b>
 * <ol>
 * <li>Instantiation of <code>TokenStream</code>/{@link TokenFilter}s which add/get
 * attributes to/from the {@link AttributeSource}.
 * <li>The consumer calls {@link TokenStream#reset()}.
 * <li>The consumer retrieves attributes from the stream and stores local
 * references to all attributes it wants to access.
 * <li>The consumer calls {@link #incrementToken()} until it returns false
 * consuming the attributes after each call.
 * <li>The consumer calls {@link #end()} so that any end-of-stream operations
 * can be performed.
 * <li>The consumer calls {@link #close()} to release any resource when finished
 * using the <code>TokenStream</code>.
 * </ol>
 * To make sure that filters and consumers know which attributes are available,
 * the attributes must be added during instantiation. Filters and consumers are
 * not required to check for availability of attributes in
 * {@link #incrementToken()}.
 * <p>
 * You can find some example code for the new API in the analysis package level
 * Javadoc.
 * <p>
 * Sometimes it is desirable to capture a current state of a <code>TokenStream</code>,
 * e.g., for buffering purposes (see {@link CachingTokenFilter},
 * TeeSinkTokenFilter). For this usecase
 * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
 * can be used.
 * <p>
 * The {@code TokenStream}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or have at least a final
 * implementation of {@link #incrementToken}! This is checked when Java
 * assertions are enabled.
 */
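The six-step consumer workflow described in that Javadoc maps directly onto code. A minimal illustrative fragment, assuming a StandardAnalyzer and the standard CharTermAttribute/OffsetAttribute from org.apache.lucene.analysis.tokenattributes:

        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("fieldname", "This is the text to be indexed.")) {
            // (1) get local references to the attributes we want to read
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            ts.reset();                       // (2) reset before consuming
            while (ts.incrementToken()) {     // (3)(4) consume token by token until it returns false
                System.out.println(termAtt + " [" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
            }
            ts.end();                         // (5) end-of-stream operations
        }                                     // (6) close() via try-with-resources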

2.2 Analyzer

/**
 * An Analyzer builds TokenStreams, which analyze text.  It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *  {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 *   <li><a href="{@docRoot}/../analyzers-uima/overview-summary.html">UIMA</a>:
 *       Analysis integration with Apache UIMA.
 * </ul>
 */
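As a concrete counterpart to the Foo/Bar placeholders in the Javadoc example, a minimal custom Analyzer can be sketched with classes that ship in lucene-analyzers-common (WhitespaceTokenizer, LowerCaseFilter); the normalize(String, TokenStream) override assumes Lucene 6.2 or later:

        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // Split on whitespace, then lowercase each token.
                Tokenizer source = new WhitespaceTokenizer();
                TokenStream filter = new LowerCaseFilter(source);
                return new TokenStreamComponents(source, filter);
            }

            @Override
            protected TokenStream normalize(String fieldName, TokenStream in) {
                // Only the lowercasing (normalization) step is applied at query time.
                return new LowerCaseFilter(in);
            }
        };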

2.3 Directory

/**
 * A Directory is a flat list of files.  Files may be written once, when they
 * are created.  Once a file is created it may only be opened for read, or
 * deleted.  Random access is permitted both when reading and writing.
 *
 * <p> Java's i/o APIs are not used directly, but rather all i/o is
 * through this API.  This permits things such as: <ul>
 * <li> implementation of RAM-based indices;
 * <li> implementation of indices stored in a database, via JDBC;
 * <li> implementation of an index as a single file;
 * </ul>
 *
 * Directory locking is implemented by an instance of {@link LockFactory}.
 */
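Note that the commented-out line in the demo, FSDirectory.open("/tmp/testindex"), uses the pre-5.0 String signature; since Lucene 5, FSDirectory.open() takes a java.nio.file.Path and returns the implementation best suited to the platform (for example MMapDirectory on 64-bit JVMs). A one-line sketch of the on-disk case:

        // On-disk alternative to RAMDirectory; requires java.nio.file.Paths.
        Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));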

2.4 IndexWriter

/**
 * An <code>IndexWriter</code> creates and maintains an index.
 *
 * <p>The {@link OpenMode} option on {@link IndexWriterConfig#setOpenMode(OpenMode)} determines
 * whether a new index is created, or whether an existing index is opened. Note that you can
 * open an index with {@link OpenMode#CREATE} even while readers are using the index. The old
 * readers will continue to search the "point in time" snapshot they had opened, and won't see
 * the newly created index until they re-open. If {@link OpenMode#CREATE_OR_APPEND} is used
 * IndexWriter will create a new index if there is not already an index at the provided path
 * and otherwise open the existing index.</p>
 *
 * <p>In either case, documents are added with {@link #addDocument(Iterable) addDocument} and
 * removed with {@link #deleteDocuments(Term...)} or {@link #deleteDocuments(Query...)}. A
 * document can be updated with {@link #updateDocument(Term, Iterable) updateDocument} (which
 * just deletes and then adds the entire document). When finished adding, deleting and updating
 * documents, {@link #close() close} should be called.</p>
 *
 * <a name="sequence_numbers"></a>
 * <p>Each method that changes the index returns a {@code long} sequence number, which
 * expresses the effective order in which each change was applied. {@link #commit} also
 * returns a sequence number, describing which changes are in the commit point and which are
 * not. Sequence numbers are transient (not saved into the index in any way) and only valid
 * within a single {@code IndexWriter} instance.</p>
 *
 * <a name="flush"></a>
 * <p>These changes are buffered in memory and periodically flushed to the {@link Directory}
 * (during the above method calls). A flush is triggered when there are enough added documents
 * since the last flush. Flushing is triggered either by RAM usage of the documents (see
 * {@link IndexWriterConfig#setRAMBufferSizeMB}) or the number of added documents (see
 * {@link IndexWriterConfig#setMaxBufferedDocs(int)}). The default is to flush when RAM usage
 * hits {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For best indexing speed you
 * should flush by RAM usage with a large RAM buffer. Additionally, if IndexWriter reaches the
 * configured number of buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms})
 * the deleted terms and queries are flushed and applied to existing segments. In contrast to
 * the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and
 * {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms won't trigger a segment
 * flush. Note that flushing just moves the internal buffered state in IndexWriter into the
 * index, but these changes are not visible to IndexReader until either {@link #commit()} or
 * {@link #close} is called. A flush may also trigger one or more segment merges which by
 * default run with a background thread so as not to block the addDocument calls (see
 * <a href="#mergePolicy">below</a> for changing the {@link MergeScheduler}).</p>
 *
 * <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying
 * to open another <code>IndexWriter</code> on the same directory will lead to a
 * {@link LockObtainFailedException}.</p>
 *
 * <a name="deletionPolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows an optional {@link IndexDeletionPolicy}
 * implementation to be specified. You can use this to control when prior commits are deleted
 * from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes
 * all prior commits as soon as a new commit is done. Creating your own policy can allow you to
 * explicitly keep previous "point in time" commits alive in the index for some time, either
 * because this is useful for your application, or to give readers enough time to refresh to
 * the new commit without having the old commit deleted out from under them. The latter is
 * necessary when multiple computers take turns opening their own {@code IndexWriter} and
 * {@code IndexReader}s against a single shared index mounted via remote filesystems like NFS
 * which do not support "delete on last close" semantics. A single computer accessing an index
 * via NFS is fine with the default deletion policy since NFS clients emulate "delete on last
 * close" locally. That said, accessing an index via NFS will likely result in poor performance
 * compared to a local IO device.</p>
 *
 * <a name="mergePolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows you to separately change the {@link MergePolicy}
 * and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes
 * to the segments in the index. Its role is to select which merges to do, if any, and return a
 * {@link MergePolicy.MergeSpecification} describing the merges. The default is
 * {@link LogByteSizeMergePolicy}. Then, the {@link MergeScheduler} is invoked with the
 * requested merges and it decides when and how to run the merges. The default is
 * {@link ConcurrentMergeScheduler}.</p>
 *
 * <a name="OOME"></a>
 * <p><b>NOTE</b>: if you hit a VirtualMachineError, or disaster strikes during a checkpoint
 * then IndexWriter will close itself. This is a defensive measure in case any internal state
 * (buffered documents, deletions, reference counts) were corrupted. Any subsequent calls will
 * throw an AlreadyClosedException.</p>
 *
 * <a name="thread-safety"></a>
 * <p><b>NOTE</b>: {@link IndexWriter} instances are completely thread safe, meaning multiple
 * threads can call any of its methods, concurrently. If your application requires external
 * synchronization, you should <b>not</b> synchronize on the <code>IndexWriter</code> instance
 * as this may cause deadlock; use your own (non-Lucene) objects instead.</p>
 *
 * <p><b>NOTE</b>: If you call <code>Thread.interrupt()</code> on a thread that's within
 * IndexWriter, IndexWriter will try to catch this (eg, if it's in a wait() or Thread.sleep()),
 * and will then throw the unchecked exception {@link ThreadInterruptedException} and
 * <b>clear</b> the interrupt status on the thread.</p>
 */

/**
 * Clarification: Check Points (and commits)
 * IndexWriter writes new index files to the directory without writing a new segments_N
 * file which references these new files. It also means that the state of
 * the in memory SegmentInfos object is different than the most recent
 * segments_N file written to the directory.
 *
 * Each time the SegmentInfos is changed, and matches the (possibly
 * modified) directory files, we have a new "check point".
 * If the modified/new SegmentInfos is written to disk - as a new
 * (generation of) segments_N file - this check point is also an
 * IndexCommit.
 *
 * A new checkpoint always replaces the previous checkpoint and
 * becomes the new "front" of the index. This allows the IndexFileDeleter
 * to delete files that are referenced only by stale checkpoints
 * (files that were created since the last commit, but are no longer
 * referenced by the "front" of the index). For this, IndexFileDeleter
 * keeps track of the last non commit checkpoint.
 */
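To tie the open-mode, flush, and commit discussion above back to code, here is a sketch of an explicitly configured writer. The setters are real IndexWriterConfig methods; the field names and values are arbitrary examples, and updateDocument returning a sequence number assumes Lucene 6.2 or later:

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // create if missing, otherwise append
        config.setRAMBufferSizeMB(64.0);                                 // flush by RAM usage, preferred for indexing speed

        try (IndexWriter writer = new IndexWriter(directory, config)) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("body", "some text to be indexed", Field.Store.NO));

            // Every index-changing call returns a transient sequence number expressing its effective order.
            long seq = writer.updateDocument(new Term("id", "42"), doc);

            // Buffered changes only become visible to readers after commit() (or close()).
            writer.commit();
        }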

Reposted from: https://www.cnblogs.com/davidwang456/p/9935786.html
