1. Source modules

core: Lucene core library

analyzers-common: Analyzers for indexing content in different languages and domains.
analyzers-icu: Analysis integration with ICU (International Components for Unicode).
analyzers-kuromoji: Japanese Morphological Analyzer
analyzers-morfologik: Analyzer for dictionary stemming, built-in Polish dictionary
analyzers-nori: Korean Morphological Analyzer
analyzers-opennlp: OpenNLP Library Integration
analyzers-phonetic: Analyzer for indexing phonetic signatures (for sounds-alike search)
analyzers-smartcn: Analyzer for indexing Chinese
analyzers-stempel: Analyzer for indexing Polish
backward-codecs: Codecs for older versions of Lucene.
benchmark: System for benchmarking Lucene
classification: Classification module for Lucene
codecs: Lucene codecs and postings formats.
demo: Simple example code
expressions: Dynamically computed values to sort/facet/search on based on a pluggable grammar.
facet: Faceted indexing and search capabilities
grouping: Collectors for grouping search results.
highlighter: Highlights search keywords in results
join: Index-time and Query-time joins for normalized content
memory: Single-document in-memory index implementation
misc: Index tools and other miscellaneous code
queries: Filters and Queries that add to core Lucene
queryparser: Query parsers and parsing framework
replicator: Files replication utility
sandbox: Various third party contributions and new ideas
spatial: Geospatial search
spatial3d: 3D spatial planar geometry APIs
spatial-extras: Geospatial search
suggest: Auto-suggest and Spellchecking support
test-framework: Framework for testing Lucene-based applications

The core module contains the following API packages:

org.apache.lucene
Top-level package.
org.apache.lucene.analysis
Text analysis.
org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizer StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
org.apache.lucene.analysis.tokenattributes
General-purpose attributes for text analysis.
org.apache.lucene.codecs
Codecs API: API for customization of the encoding and structure of the index.
org.apache.lucene.codecs.blocktree
BlockTree terms dictionary.
org.apache.lucene.codecs.compressing
StoredFieldsFormat that allows cross-document and cross-field compression of stored fields.
org.apache.lucene.codecs.lucene50
Components from the Lucene 5.0 index format. See org.apache.lucene.codecs.lucene50 for an overview of the index format.
org.apache.lucene.codecs.lucene60
Components from the Lucene 6.0 index format.
org.apache.lucene.codecs.lucene62
Components from the Lucene 6.2 index format. See org.apache.lucene.codecs.lucene70 for an overview of the current index format.
org.apache.lucene.codecs.lucene70
Lucene 7.0 file format.
org.apache.lucene.codecs.perfield
Postings format that can delegate to different formats per-field.
org.apache.lucene.document
The logical representation of a Document for indexing and searching.
org.apache.lucene.geo
Geospatial Utility Implementations for Lucene Core
org.apache.lucene.index
Code to maintain and access indices.
org.apache.lucene.search
Code to search indices.
org.apache.lucene.search.similarities
This package contains the various ranking models that can be used in Lucene.
org.apache.lucene.search.spans
The calculus of spans.
org.apache.lucene.store
Binary i/o API, used for all index data.
org.apache.lucene.util
Some utility classes.
org.apache.lucene.util.automaton
Finite-state automaton for regular expressions.
org.apache.lucene.util.bkd
Block KD-tree, implementing the generic spatial data structure described in this paper.
org.apache.lucene.util.fst
Finite state transducers
org.apache.lucene.util.graph
Utility classes for working with token streams as graphs.
org.apache.lucene.util.mutable
Comparable object wrappers
org.apache.lucene.util.packed
Packed integer arrays and streams.

2. Term definitions

2.1 Term

org.apache.lucene.index.Term

/**
  A Term represents a word from text.  This is the unit of search.  It is
  composed of two elements, the text of the word, as a string, and the name of
  the field that the text occurred in.

  Note that terms may represent more than words from text fields, but also
  things like dates, email addresses, urls, etc.  */
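The two-part structure described above can be sketched in plain Java. This is a toy model, not Lucene's `org.apache.lucene.index.Term`: the class name `SimpleTerm` and its field-then-text ordering are assumptions made for illustration. It only shows that the same text in two different fields yields two different terms.

```java
import java.util.Objects;

// Toy model only: NOT org.apache.lucene.index.Term.
final class SimpleTerm implements Comparable<SimpleTerm> {
    final String field; // name of the field the text occurred in
    final String text;  // the word itself (or a date, e-mail address, URL, ...)

    SimpleTerm(String field, String text) {
        this.field = Objects.requireNonNull(field);
        this.text = Objects.requireNonNull(text);
    }

    // Two terms are equal only if both the field name and the text match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleTerm)) return false;
        SimpleTerm other = (SimpleTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    // Ordering assumed for this sketch: by field name first, then by text.
    @Override
    public int compareTo(SimpleTerm other) {
        int byField = field.compareTo(other.field);
        return byField != 0 ? byField : text.compareTo(other.text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }
}
```

With this model, `new SimpleTerm("title", "lucene")` and `new SimpleTerm("body", "lucene")` are distinct, which is why a search always names a field as well as a word.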

2.2 Field

org.apache.lucene.document.Field

/**
 * Expert: directly create a field for a document.  Most
 * users should use one of the sugar subclasses:
 * <ul>
 *    <li>{@link TextField}: {@link Reader} or {@link String} indexed for full-text search
 *    <li>{@link StringField}: {@link String} indexed verbatim as a single token
 *    <li>{@link IntPoint}: {@code int} indexed for exact/range queries.
 *    <li>{@link LongPoint}: {@code long} indexed for exact/range queries.
 *    <li>{@link FloatPoint}: {@code float} indexed for exact/range queries.
 *    <li>{@link DoublePoint}: {@code double} indexed for exact/range queries.
 *    <li>{@link SortedDocValuesField}: {@code byte[]} indexed column-wise for sorting/faceting
 *    <li>{@link SortedSetDocValuesField}: {@code SortedSet<byte[]>} indexed column-wise for sorting/faceting
 *    <li>{@link NumericDocValuesField}: {@code long} indexed column-wise for sorting/faceting
 *    <li>{@link SortedNumericDocValuesField}: {@code SortedSet<long>} indexed column-wise for sorting/faceting
 *    <li>{@link StoredField}: Stored-only value for retrieving in summary results
 * </ul>
 *
 * <p> A field is a section of a Document. Each field has three
 * parts: name, type and value. Values may be text
 * (String, Reader or pre-analyzed TokenStream), binary
 * (byte[]), or numeric (a Number).  Fields are optionally stored in the
 * index, so that they may be returned with hits on the document.
 *
 * <p>
 * NOTE: the field type is an {@link IndexableFieldType}.  Making changes
 * to the state of the IndexableFieldType will impact any
 * Field it is used in.  It is strongly recommended that no
 * changes be made after Field instantiation.
 */

2.3 Document

org.apache.lucene.document.Document

/** Documents are the unit of indexing and search.
 *
 * A Document is a set of fields.  Each field has a name and a textual value.
 * A field may be {@link org.apache.lucene.index.IndexableFieldType#stored() stored} with the document, in which
 * case it is returned with search hits on the document.  Thus each document
 * should typically contain one or more stored fields which uniquely identify
 * it.
 *
 * <p>Note that fields which are <i>not</i> {@link org.apache.lucene.index.IndexableFieldType#stored() stored} are
 * <i>not</i> available in documents retrieved from the index, e.g. with {@link
 * ScoreDoc#doc} or {@link IndexReader#document(int)}.
 */
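The difference between an indexed field and a stored field is easy to miss, so here is a minimal plain-Java sketch of it. Everything in it (`ToyIndex`, `ToyField`, the whitespace-and-punctuation tokenizer, the `"field:token"` key format) is hypothetical and far simpler than Lucene's real data structures. It only demonstrates the rule stated in the Javadoc: a field that is indexed but not stored can be searched, yet it is absent from the document returned for a hit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: it mimics the indexed/stored distinction, not Lucene's index format.
final class ToyIndex {

    // One field of a document: a name, a textual value, and two independent flags.
    static final class ToyField {
        final String name;
        final String value;
        final boolean indexed; // searchable through the postings map
        final boolean stored;  // returned when the document is retrieved

        ToyField(String name, String value, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:token" -> ids of the documents containing that token in that field
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // document id -> stored field values only
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Adds a document (a collection of fields) and returns its id.
    int addDocument(List<ToyField> fields) {
        int docId = storedFields.size();
        Map<String, String> stored = new LinkedHashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Naive analysis: lower-case, then split on non-word characters.
                for (String token : f.value.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!token.isEmpty()) {
                        postings.computeIfAbsent(f.name + ":" + token, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedFields.add(stored);
        return docId;
    }

    // Returns the ids of documents whose given field contains the token.
    Set<Integer> search(String field, String token) {
        return postings.getOrDefault(field + ":" + token.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Returns only the stored fields of a document, as a hit would.
    Map<String, String> document(int docId) {
        return storedFields.get(docId);
    }
}
```

In this sketch, a document with a stored `id` field and an indexed-only `body` field can be found by searching `body`, but the retrieved document carries only `id`. That is the reason the Javadoc recommends storing at least one field that uniquely identifies each document.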

2.4 Segment

org.apache.lucene.index.SegmentInfo

/**
 * Information about a segment such as its name, directory, and files related
 * to the segment.
 *
 * @lucene.experimental
 */

2.5 FSDirectory

org.apache.lucene.store.FSDirectory

/**
 * Base class for Directory implementations that store index
 * files in the file system.
 * <a name="subclasses"></a>
 * There are currently three core
 * subclasses:
 *
 * <ul>
 *
 *  <li>{@link SimpleFSDirectory} is a straightforward
 *       implementation using Files.newByteChannel.
 *       However, it has poor concurrent performance
 *       (multiple threads will bottleneck) as it
 *       synchronizes when multiple threads read from the
 *       same file.
 *
 *  <li>{@link NIOFSDirectory} uses java.nio's
 *       FileChannel's positional io when reading to avoid
 *       synchronization when reading from the same file.
 *       Unfortunately, due to a Windows-only <a
 *       href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734">Sun
 *       JRE bug</a> this is a poor choice for Windows, but
 *       on all other platforms this is the preferred
 *       choice. Applications using {@link Thread#interrupt()} or
 *       {@link Future#cancel(boolean)} should use
 *       {@code RAFDirectory} instead. See {@link NIOFSDirectory} java doc
 *       for details.
 *
 *  <li>{@link MMapDirectory} uses memory-mapped IO when
 *       reading. This is a good choice if you have plenty
 *       of virtual memory relative to your index size, eg
 *       if you are running on a 64 bit JRE, or you are
 *       running on a 32 bit JRE but your index sizes are
 *       small enough to fit into the virtual memory space.
 *       Java has currently the limitation of not being able to
 *       unmap files from user code. The files are unmapped, when GC
 *       releases the byte buffers. Due to
 *       <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038">
 *       this bug</a> in Sun's JRE, MMapDirectory's {@link IndexInput#close}
 *       is unable to close the underlying OS file handle. Only when
 *       GC finally collects the underlying objects, which could be
 *       quite some time later, will the file handle be closed.
 *       This will consume additional transient disk usage: on Windows,
 *       attempts to delete or overwrite the files will result in an
 *       exception; on other platforms, which typically have a &quot;delete on
 *       last close&quot; semantics, while such operations will succeed, the bytes
 *       are still consuming space on disk.  For many applications this
 *       limitation is not a problem (e.g. if you have plenty of disk space,
 *       and you don't rely on overwriting files on Windows) but it's still
 *       an important limitation to be aware of. This class supplies a
 *       (possibly dangerous) workaround mentioned in the bug report,
 *       which may fail on non-Sun JVMs.
 * </ul>
 *
 * <p>Unfortunately, because of system peculiarities, there is
 * no single overall best implementation.  Therefore, we've
 * added the {@link #open} method, to allow Lucene to choose
 * the best FSDirectory implementation given your
 * environment, and the known limitations of each
 * implementation.  For users who have no reason to prefer a
 * specific implementation, it's best to simply use {@link
 * #open}.  For all others, you should instantiate the
 * desired implementation directly.
 *
 * <p><b>NOTE:</b> Accessing one of the above subclasses either directly or
 * indirectly from a thread while it's interrupted can close the
 * underlying channel immediately if at the same time the thread is
 * blocked on IO. The channel will remain closed and subsequent access
 * to the index will throw a {@link ClosedChannelException}.
 * Applications using {@link Thread#interrupt()} or
 * {@link Future#cancel(boolean)} should use the slower legacy
 * {@code RAFDirectory} from the {@code misc} Lucene module instead.
 *
 * <p>The locking implementation is by default {@link
 * NativeFSLockFactory}, but can be changed by
 * passing in a custom {@link LockFactory} instance.
 *
 * @see Directory
 */
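The Javadoc above contrasts positional `FileChannel` reads (the approach it attributes to NIOFSDirectory) with memory-mapped reads (MMapDirectory). The sketch below shows both JDK mechanisms side by side using only `java.nio`. It is not Lucene code: the class `ReadStrategies`, its method names, and the temp-file helper are hypothetical, and both readers assume the requested range lies inside the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration of two JDK read mechanisms; not part of Lucene.
final class ReadStrategies {

    // Positional read: read(buffer, position) leaves the channel's own position
    // untouched, so concurrent readers do not need to synchronize on it.
    static String readPositional(Path file, long offset, int length) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, position);
                if (n < 0) break; // end of file reached
                position += n;
            }
            buffer.flip();
            return StandardCharsets.UTF_8.decode(buffer).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Memory-mapped read: the file region is mapped into virtual memory and
    // read like an in-memory buffer. The mapping stays valid after the channel
    // is closed and is released only when the buffer is garbage collected.
    static String readMapped(Path file, long offset, int length) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            mapped.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes the text to a new temporary file.
    static Path writeTempFile(String text) {
        try {
            Path file = Files.createTempFile("toy-segment", ".bin");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both methods return the same bytes for the same range. The practical differences are the ones the Javadoc lists: mapping depends on available virtual address space and cannot be released from user code, while positional reads go through the channel on every call.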

3. Worked example

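    // Imports needed by this snippet (Lucene 7.5.0 core and queryparser modules, JUnit 4)
    import static org.junit.Assert.assertEquals;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead (FSDirectory.open takes a Path):
    //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    // search(Query, int): the older overload taking a Filter argument no longer exists in 7.x
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();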

References

[1] http://lucene.apache.org/core/7_5_0/

[2] http://lucene.apache.org/core/7_5_0/core/index.html

Reposted from: https://www.cnblogs.com/davidwang456/p/9934176.html
