Lucene Source Code Analysis (7): The Analyzer
1. Using the Analyzer
The Analyzer enters Lucene through the IndexWriter constructor, which reads it from the supplied IndexWriterConfig:
```java
/**
 * Constructs a new IndexWriter per the settings given in <code>conf</code>.
 * If you want to make "live" changes to this writer instance, use
 * {@link #getConfig()}.
 *
 * <p>
 * <b>NOTE:</b> after this writer is created, the given configuration instance
 * cannot be passed to another writer.
 *
 * @param d
 *          the index directory. The index is either created or appended
 *          according <code>conf.getOpenMode()</code>.
 * @param conf
 *          the configuration settings according to which IndexWriter should
 *          be initialized.
 * @throws IOException
 *           if the directory cannot be read/written to, or if it does not
 *           exist and <code>conf.getOpenMode()</code> is
 *           <code>OpenMode.APPEND</code> or if there is any other low-level
 *           IO error
 */
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  enableTestPoints = isEnableTestPoints();
  conf.setIndexWriter(this); // prevent reuse by other instances
  config = conf;
  infoStream = config.getInfoStream();
  softDeletesEnabled = config.getSoftDeletesField() != null;
  // obtain the write.lock. If the user configured a timeout,
  // we wrap with a sleeper and this might take some time.
  writeLock = d.obtainLock(WRITE_LOCK_NAME);
  boolean success = false;
  try {
    directoryOrig = d;
    directory = new LockValidatingDirectoryWrapper(d, writeLock);
    analyzer = config.getAnalyzer();
    mergeScheduler = config.getMergeScheduler();
    mergeScheduler.setInfoStream(infoStream);
    codec = config.getCodec();
    OpenMode mode = config.getOpenMode();
    final boolean indexExists;
    final boolean create;
    if (mode == OpenMode.CREATE) {
      indexExists = DirectoryReader.indexExists(directory);
      create = true;
    } else if (mode == OpenMode.APPEND) {
      indexExists = true;
      create = false;
    } else {
      // CREATE_OR_APPEND - create only if an index does not exist
      indexExists = DirectoryReader.indexExists(directory);
      create = !indexExists;
    }

    // If index is too old, reading the segments will throw
    // IndexFormatTooOldException.

    String[] files = directory.listAll();

    // Set up our initial SegmentInfos:
    IndexCommit commit = config.getIndexCommit();

    // Set up our initial SegmentInfos:
    StandardDirectoryReader reader;
    if (commit == null) {
      reader = null;
    } else {
      reader = commit.getReader();
    }

    if (create) {
      if (config.getIndexCommit() != null) {
        // We cannot both open from a commit point and create:
        if (mode == OpenMode.CREATE) {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
        } else {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
        }
      }

      // Try to read first. This is to allow create
      // against an index that's currently open for
      // searching. In this case we write the next
      // segments_N file with no segments:
      final SegmentInfos sis = new SegmentInfos(Version.LATEST.major);
      if (indexExists) {
        final SegmentInfos previous = SegmentInfos.readLatestCommit(directory);
        sis.updateGenerationVersionAndCounter(previous);
      }
      segmentInfos = sis;
      rollbackSegments = segmentInfos.createBackupSegmentInfos();

      // Record that we have a change (zero out all
      // segments) pending:
      changed();

    } else if (reader != null) {
      // Init from an existing already opened NRT or non-NRT reader:

      if (reader.directory() != commit.getDirectory()) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
      }

      if (reader.directory() != directoryOrig) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
      }

      if (reader.segmentInfos.getLastGeneration() == 0) {
        // TODO: maybe we could allow this? It's tricky...
        throw new IllegalArgumentException("index must already have an initial commit to open from reader");
      }

      // Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
      segmentInfos = reader.segmentInfos.clone();

      SegmentInfos lastCommit;
      try {
        lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
      } catch (IOException ioe) {
        throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
      }

      if (reader.writer != null) {

        // The old writer better be closed (we have the write lock now!):
        assert reader.writer.closed;

        // In case the old writer wrote further segments (which we are now dropping),
        // update SIS metadata so we remain write-once:
        segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
        lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
      }

      rollbackSegments = lastCommit.createBackupSegmentInfos();
    } else {
      // Init from either the latest commit point, or an explicit prior commit point:

      String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
      if (lastSegmentsFile == null) {
        throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
      }

      // Do not use SegmentInfos.read(Directory) since the spooky
      // retrying it does is not necessary here (we hold the write lock):
      segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);

      if (commit != null) {
        // Swap out all segments, but, keep metadata in
        // SegmentInfos, like version & generation, to
        // preserve write-once. This is important if
        // readers are open against the future commit
        // points.
        if (commit.getDirectory() != directoryOrig) {
          throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
        }

        SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
        segmentInfos.replace(oldInfos);
        changed();

        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
        }
      }

      rollbackSegments = segmentInfos.createBackupSegmentInfos();
    }

    commitUserData = new HashMap<>(segmentInfos.getUserData()).entrySet();

    pendingNumDocs.set(segmentInfos.totalMaxDoc());

    // start with previous field numbers, but new FieldInfos
    // NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
    globalFieldNumberMap = getFieldNumberMap();

    validateIndexSort();

    config.getFlushPolicy().init(config);
    bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
    docWriter = new DocumentsWriter(flushNotifications, segmentInfos.getIndexCreatedVersionMajor(), pendingNumDocs,
        enableTestPoints, this::newSegmentName,
        config, directoryOrig, directory, globalFieldNumberMap);
    readerPool = new ReaderPool(directory, directoryOrig, segmentInfos, globalFieldNumberMap,
        bufferedUpdatesStream::getCompletedDelGen, infoStream, conf.getSoftDeletesField(), reader);
    if (config.getReaderPooling()) {
      readerPool.enableReaderPooling();
    }

    // Default deleter (for backwards compatibility) is
    // KeepOnlyLastCommitDeleter:

    // Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
    synchronized(this) {
      deleter = new IndexFileDeleter(files, directoryOrig, directory,
          config.getIndexDeletionPolicy(),
          segmentInfos, infoStream, this,
          indexExists, reader != null);

      // We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
      assert create || filesExist(segmentInfos);
    }

    if (deleter.startingCommitDeleted) {
      // Deletion policy deleted the "head" commit point.
      // We have to mark ourself as changed so that if we
      // are closed w/o any further changes we write a new
      // segments_N file.
      changed();
    }

    if (reader != null) {
      // We always assume we are carrying over incoming changes when opening from reader:
      segmentInfos.changed();
      changed();
    }

    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "init: create=" + create + " reader=" + reader);
      messageState();
    }

    success = true;

  } finally {
    if (!success) {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "init: hit exception on init; releasing write lock");
      }
      IOUtils.closeWhileHandlingException(writeLock);
      writeLock = null;
    }
  }
}
```
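Note that the constructor never builds an analyzer itself; the line `analyzer = config.getAnalyzer()` simply takes whatever instance was set on the IndexWriterConfig. A minimal usage sketch, assuming lucene-core (roughly Lucene 8.x, where `ByteBuffersDirectory` exists) on the classpath; the field name `body` and the sample text are illustrative:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class AnalyzerUsageDemo {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      Analyzer analyzer = new StandardAnalyzer();
      // The analyzer travels inside the config; the writer reads it
      // via conf.getAnalyzer() in the constructor shown above.
      IndexWriterConfig conf = new IndexWriterConfig(analyzer);
      try (IndexWriter writer = new IndexWriter(dir, conf)) {
        Document doc = new Document();
        // TextField content is tokenized through the configured analyzer
        doc.add(new TextField("body", "Lucene analyzes text at index time", Field.Store.NO));
        writer.addDocument(doc);
        writer.commit();
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        System.out.println("docs indexed: " + reader.numDocs());
      }
    }
  }
}
```

This also illustrates the javadoc's NOTE above: `conf.setIndexWriter(this)` binds the config to the writer, so an IndexWriterConfig instance cannot be reused for a second writer.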
2. The definition of the Analyzer
```java
/**
 * An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 * </ul>
 */
```
As the module list shows, Lucene provides dedicated Analyzer implementations for different languages and domains.
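The TokenStreamComponents contract described in the javadoc can be exercised directly: `tokenStream(field, text)` returns the reused component chain, which must be `reset()` before and `end()`/`close()`d after iteration. A small sketch, assuming `StandardAnalyzer` is available (lucene-core in recent versions, analyzers-common in older ones); the field name and sample text are made up:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamDemo {
  public static void main(String[] args) throws Exception {
    List<String> terms = new ArrayList<>();
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "The Quick Brown Fox")) {
      // Attributes are the per-token view onto the stream
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                    // mandatory before the first incrementToken()
      while (ts.incrementToken()) {
        terms.add(term.toString()); // StandardAnalyzer lowercases terms
      }
      ts.end();                      // records end-of-stream state (final offset etc.)
    }
    System.out.println(terms);
  }
}
```

Forgetting `reset()` is a classic mistake and throws an IllegalStateException in modern Lucene versions.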
Among these, the common module defines the shared base classes from which the concrete Analyzers derive, as the class diagram below shows.
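A concrete subclass in the spirit of the javadoc's Foo/Bar example can be written with real building blocks. This is a minimal sketch, assuming lucene-core (Lucene 7+/8.x, where `LowerCaseFilter` lives in `org.apache.lucene.analysis` and `normalize` takes a field name); the class name is made up:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimpleLowercaseAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // source produces raw tokens; filters are chained on top of it
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream filter = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, filter);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // only normalization-style filters belong here (applied to query terms)
    return new LowerCaseFilter(in);
  }

  public static void main(String[] args) throws Exception {
    try (Analyzer a = new SimpleLowercaseAnalyzer();
         TokenStream ts = a.tokenStream("f", "Hello Lucene WORLD")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term); // hello / lucene / world
      }
      ts.end();
    }
  }
}
```

Splitting tokenization from normalization this way lets `normalize` lowercase query terms without re-running the whitespace tokenizer.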
Reposted from: https://www.cnblogs.com/davidwang456/p/10058895.html