1. Using the Analyzer

An Analyzer enters Lucene through the IndexWriter constructor, which pulls it out of the supplied IndexWriterConfig:

/**
 * Constructs a new IndexWriter per the settings given in <code>conf</code>.
 * If you want to make "live" changes to this writer instance, use
 * {@link #getConfig()}.
 *
 * <p>
 * <b>NOTE:</b> after this writer is created, the given configuration instance
 * cannot be passed to another writer.
 *
 * @param d
 *          the index directory. The index is either created or appended
 *          according to <code>conf.getOpenMode()</code>.
 * @param conf
 *          the configuration settings according to which IndexWriter should
 *          be initialized.
 * @throws IOException
 *           if the directory cannot be read/written to, or if it does not
 *           exist and <code>conf.getOpenMode()</code> is
 *           <code>OpenMode.APPEND</code> or if there is any other low-level
 *           IO error
 */
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  enableTestPoints = isEnableTestPoints();
  conf.setIndexWriter(this); // prevent reuse by other instances
  config = conf;
  infoStream = config.getInfoStream();
  softDeletesEnabled = config.getSoftDeletesField() != null;
  // obtain the write.lock. If the user configured a timeout,
  // we wrap with a sleeper and this might take some time.
  writeLock = d.obtainLock(WRITE_LOCK_NAME);
  boolean success = false;
  try {
    directoryOrig = d;
    directory = new LockValidatingDirectoryWrapper(d, writeLock);
    analyzer = config.getAnalyzer();
    mergeScheduler = config.getMergeScheduler();
    mergeScheduler.setInfoStream(infoStream);
    codec = config.getCodec();
    OpenMode mode = config.getOpenMode();
    final boolean indexExists;
    final boolean create;
    if (mode == OpenMode.CREATE) {
      indexExists = DirectoryReader.indexExists(directory);
      create = true;
    } else if (mode == OpenMode.APPEND) {
      indexExists = true;
      create = false;
    } else {
      // CREATE_OR_APPEND - create only if an index does not exist
      indexExists = DirectoryReader.indexExists(directory);
      create = !indexExists;
    }

    // If index is too old, reading the segments will throw
    // IndexFormatTooOldException.

    String[] files = directory.listAll();

    // Set up our initial SegmentInfos:
    IndexCommit commit = config.getIndexCommit();

    // Set up our initial SegmentInfos:
    StandardDirectoryReader reader;
    if (commit == null) {
      reader = null;
    } else {
      reader = commit.getReader();
    }

    if (create) {
      if (config.getIndexCommit() != null) {
        // We cannot both open from a commit point and create:
        if (mode == OpenMode.CREATE) {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
        } else {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
        }
      }

      // Try to read first.  This is to allow create
      // against an index that's currently open for
      // searching.  In this case we write the next
      // segments_N file with no segments:
      final SegmentInfos sis = new SegmentInfos(Version.LATEST.major);
      if (indexExists) {
        final SegmentInfos previous = SegmentInfos.readLatestCommit(directory);
        sis.updateGenerationVersionAndCounter(previous);
      }
      segmentInfos = sis;
      rollbackSegments = segmentInfos.createBackupSegmentInfos();

      // Record that we have a change (zero out all
      // segments) pending:
      changed();

    } else if (reader != null) {
      // Init from an existing already opened NRT or non-NRT reader:

      if (reader.directory() != commit.getDirectory()) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
      }

      if (reader.directory() != directoryOrig) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
      }

      if (reader.segmentInfos.getLastGeneration() == 0) {
        // TODO: maybe we could allow this?  It's tricky...
        throw new IllegalArgumentException("index must already have an initial commit to open from reader");
      }

      // Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
      segmentInfos = reader.segmentInfos.clone();

      SegmentInfos lastCommit;
      try {
        lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
      } catch (IOException ioe) {
        throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
      }

      if (reader.writer != null) {

        // The old writer better be closed (we have the write lock now!):
        assert reader.writer.closed;

        // In case the old writer wrote further segments (which we are now dropping),
        // update SIS metadata so we remain write-once:
        segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
        lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
      }

      rollbackSegments = lastCommit.createBackupSegmentInfos();
    } else {
      // Init from either the latest commit point, or an explicit prior commit point:

      String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
      if (lastSegmentsFile == null) {
        throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
      }

      // Do not use SegmentInfos.read(Directory) since the spooky
      // retrying it does is not necessary here (we hold the write lock):
      segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);

      if (commit != null) {
        // Swap out all segments, but, keep metadata in
        // SegmentInfos, like version & generation, to
        // preserve write-once.  This is important if
        // readers are open against the future commit
        // points.
        if (commit.getDirectory() != directoryOrig) {
          throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
        }

        SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
        segmentInfos.replace(oldInfos);
        changed();

        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
        }
      }

      rollbackSegments = segmentInfos.createBackupSegmentInfos();
    }

    commitUserData = new HashMap<>(segmentInfos.getUserData()).entrySet();

    pendingNumDocs.set(segmentInfos.totalMaxDoc());

    // start with previous field numbers, but new FieldInfos
    // NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
    globalFieldNumberMap = getFieldNumberMap();

    validateIndexSort();

    config.getFlushPolicy().init(config);
    bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
    docWriter = new DocumentsWriter(flushNotifications, segmentInfos.getIndexCreatedVersionMajor(), pendingNumDocs,
        enableTestPoints, this::newSegmentName,
        config, directoryOrig, directory, globalFieldNumberMap);
    readerPool = new ReaderPool(directory, directoryOrig, segmentInfos, globalFieldNumberMap,
        bufferedUpdatesStream::getCompletedDelGen, infoStream, conf.getSoftDeletesField(), reader);
    if (config.getReaderPooling()) {
      readerPool.enableReaderPooling();
    }

    // Default deleter (for backwards compatibility) is
    // KeepOnlyLastCommitDeleter:

    // Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
    synchronized(this) {
      deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                     config.getIndexDeletionPolicy(),
                                     segmentInfos, infoStream, this,
                                     indexExists, reader != null);

      // We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
      assert create || filesExist(segmentInfos);
    }

    if (deleter.startingCommitDeleted) {
      // Deletion policy deleted the "head" commit point.
      // We have to mark ourself as changed so that if we
      // are closed w/o any further changes we write a new
      // segments_N file.
      changed();
    }

    if (reader != null) {
      // We always assume we are carrying over incoming changes when opening from reader:
      segmentInfos.changed();
      changed();
    }

    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "init: create=" + create + " reader=" + reader);
      messageState();
    }

    success = true;

  } finally {
    if (!success) {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "init: hit exception on init; releasing write lock");
      }
      IOUtils.closeWhileHandlingException(writeLock);
      writeLock = null;
    }
  }
}
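The line that matters for this topic is analyzer = config.getAnalyzer(): the writer never takes an Analyzer directly, it pulls one out of the IndexWriterConfig passed to the constructor. From the caller's side this looks roughly like the sketch below (the index path "/tmp/demo-index" and the field name "body" are made up for illustration):

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class WriterDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();               // the text-to-terms policy
    IndexWriterConfig conf = new IndexWriterConfig(analyzer); // the analyzer travels inside the config
    conf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

    try (Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
         IndexWriter writer = new IndexWriter(dir, conf)) {   // the constructor quoted above runs here
      Document doc = new Document();
      doc.add(new TextField("body", "Lucene is a search library", Field.Store.YES));
      writer.addDocument(doc); // "body" is tokenized by the analyzer held in the config
      writer.commit();
    }
  }
}

Because the analyzer is carried by the config, every analyzed field indexed through this writer goes through the same policy; per-field behavior requires a wrapper such as PerFieldAnalyzerWrapper.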

2. The Definition of Analyzer

/**
 * An Analyzer builds TokenStreams, which analyze text.  It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *  {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 * </ul>
 */
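The FooTokenizer/FooFilter names in this javadoc are placeholders. A concrete version of the same "source tokenizer plus filter chain" pattern, assuming the stock WhitespaceTokenizer and LowerCaseFilter (their package locations shift slightly between Lucene versions), might look like this:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter; // org.apache.lucene.analysis.core in older versions
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Splits on whitespace, then lowercases: the same source -> filter chain
// the javadoc example describes, with real components substituted in.
public class LowercaseWhitespaceAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream filter = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, filter);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // Lowercasing is normalization, so it should also apply to query terms.
    // (Note: the actual normalize() takes the field name as its first argument,
    // unlike the abbreviated javadoc example above.)
    return new LowerCaseFilter(in);
  }
}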

As the module list shows, Lucene ships different Analyzer implementations for different languages and domains (e.g. Kuromoji for Japanese, Smart Chinese for Simplified Chinese).
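To see what terms a given Analyzer actually extracts, you can consume its TokenStream by hand; the reset()/incrementToken()/end()/close() sequence below is the standard consumer workflow (the field name "body" and the sample sentence are arbitrary):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDump {
  // Prints each term the analyzer produces for the given text.
  static void dump(Analyzer analyzer, String text) throws Exception {
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                      // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();                        // records end-of-stream state
    }                                  // try-with-resources calls close()
  }

  public static void main(String[] args) throws Exception {
    dump(new StandardAnalyzer(), "The Quick Brown Fox");
    // Swapping in, e.g., SmartChineseAnalyzer from analyzers-smartcn would
    // segment Chinese text into words rather than single characters.
  }
}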

Among these modules, common factors out the shared base classes on which the concrete Analyzer implementations are built.

Reposted from: https://www.cnblogs.com/davidwang456/p/10058895.html
