
In the previous post you learned how to get a copy of Lucene.net and where to look for more information. As you may have noticed, the documentation is far from complete and not easy to read. So in this post I’ll write about the main concepts behind Lucene.net and the main steps in developing a solution based on it.

Some of the main concepts

Before looking at the development phases, it’s important to have a look at the main actors of Lucene.net.

Directory

The directory is where Lucene indexes are stored: it can be a physical folder on the filesystem (FSDirectory) or an area in memory where files are stored (RAMDirectory). The index structure is compatible with all ports of Lucene, so you could index with .NET and search with Java, or the other way around (obviously, using the filesystem directory).

IndexWriter

This component is responsible for managing the index: it creates the index, adds documents to it and optimizes it.

Analyzer

This is where the complexity of indexing resides. In short, the analyzer contains the policy for extracting index terms from the text. Several analyzers are available, both in the core library and in the contrib project, and the Java version has even more analyzers that have not been ported to .NET yet.

Probably the analyzer you’ll use the most is the StandardAnalyzer, which tokenizes the text based on European-language grammars, sets everything to lowercase and removes English stopwords.
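
To make the idea concrete, here is a toy sketch in Python of the three steps just described (tokenize, lowercase, drop stopwords). It is not Lucene.net code: the `analyze` function and the tiny `STOPWORDS` set are made up for illustration, and the real StandardAnalyzer uses a proper grammar-based tokenizer and a longer stopword list.

```python
import re

# A tiny, illustrative stopword list (an assumption, not Lucene's actual list).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def analyze(text):
    """Split text into word tokens, lowercase them, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]

terms = analyze("The Analyzer is the heart of Lucene.net")
print(terms)
```

The terms that come out of this step, not the original words, are what ends up in the index.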

Another interesting analyzer is the SnowballAnalyzer, which works exactly like the standard one but adds one more step at the end: the stemming phase, using the Snowball stemming language. Stemming is the process of reducing inflected words to their root. For example, if you are looking for “developing”, you are probably also interested in “developed”, “develop” or “developer”. During the indexing phase, the stemming process normalizes all these inflected words to their root, “develop”, and it does the same when querying the index (if you search for “development” it will search for “develop”). Obviously this is tied to the language of the text, so the Snowball analyzer comes with many different “grammars” for that.
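
The effect of stemming can be sketched with a deliberately naive suffix stripper in Python. This is not the Snowball algorithm, which applies much more careful, language-specific rules; the `toy_stem` function and its suffix list are assumptions chosen only so that the example words above collapse to the same root.

```python
# Suffixes to strip, checked in order (an illustrative list, not Snowball's rules).
SUFFIXES = ("ment", "ing", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["developing", "developed", "developer", "development", "develop"]
roots = {toy_stem(word) for word in words}
print(roots)
```

Because the same function runs at index time and at query time, a search for any of these words matches documents containing any of the others.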

Document and Fields

A document is a single entity that is put into the index. It contains many fields which are, like in a database, the individual pieces of information that make up a document. Different fields can be indexed using different algorithms and analyzers. For example, you might just want to store the document id without being able to search on it, but you want to be able to search by tags as single keywords and, finally, you want to index the body of the blog post for full-text search (thus using the Analyzer and the tokenizers).
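
The blog post example can be modelled with a small Python sketch. The `Field` class and its `stored`, `indexed` and `analyzed` flags are hypothetical names used for illustration; they mirror the three choices described above, not the actual Lucene.net Field API.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    stored: bool = True     # keep the original value so it can be returned with a result
    indexed: bool = True    # make the field searchable
    analyzed: bool = True   # tokenize the value before indexing it

def terms_for(field):
    """Return the terms that would be put into the index for one field."""
    if not field.indexed:
        return []                       # stored only, not searchable
    if not field.analyzed:
        return [field.value]            # indexed as one single keyword
    return field.value.lower().split()  # stand-in for a real analyzer

post = [
    Field("id", "42", indexed=False),                     # store only
    Field("tag", "Lucene.net", analyzed=False),           # single keyword
    Field("body", "Full text search with Lucene", stored=False),
]
searchable = {field.name: terms_for(field) for field in post}
print(searchable)
```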

Since this is an important topic, I’ll write a more in-depth post in the future.

Searcher and IndexReader

The searcher is the component that, with the help of the IndexReader, scans the index files and returns results based on the query supplied.

QueryParser

The query parser is responsible for parsing a string of text to create a query object. It evaluates the Lucene query syntax and uses an analyzer (which should be the same one you used to index the text) to tokenize the individual terms.
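
As a rough illustration of what “parsing plus analysis” means, here is a toy Python parser that understands only the `field:term` form and a default field. The names `parse_query` and `analyze` are made up for this sketch; the real Lucene query syntax supports much more (phrases, boolean operators, wildcards and so on).

```python
def analyze(text):
    """Stand-in analyzer: it must be the same one used at index time."""
    return text.lower().split()

def parse_query(query, default_field="body"):
    """Turn 'field:term other' into a list of (field, term) clauses."""
    clauses = []
    for part in query.split():
        field, separator, text = part.partition(":")
        if not separator:
            # No explicit field given: fall back to the default field.
            field, text = default_field, part
        for term in analyze(text):
            clauses.append((field, term))
    return clauses

clauses = parse_query("title:Lucene Concepts")
print(clauses)
```

Note how “Lucene” is lowercased by the analyzer: if the index had been built with a different analyzer, the query term might never match what was indexed.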

The main development steps

Now let’s have a brief overview of the logical steps involved in integrating Lucene.net into your applications:

1 – Initialize Directory and IndexWriter

The first step is initializing the Directory and the IndexWriter. In a web application like Subtext, this is most likely done at application startup, with the instance then stored in a global variable somewhere (or accessed through a Singleton), since only one IndexWriter can write to a Directory at a time.

When you create the IndexWriter, you can supply the analyzer that will be used by default to index all the text.
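
The “one shared writer” idea can be sketched in Python as follows. `ToyIndexWriter` and `get_writer` are hypothetical names, and the writer here only keeps analyzed documents in a list; the point is the pattern of creating the instance once, with its default analyzer, and handing out the same instance to every caller.

```python
import threading

class ToyIndexWriter:
    """Hypothetical stand-in for a writer; it only records analyzed documents."""
    def __init__(self, default_analyzer):
        self.default_analyzer = default_analyzer
        self.documents = []

    def add_document(self, text):
        self.documents.append(self.default_analyzer(text))

_writer = None
_writer_lock = threading.Lock()

def get_writer():
    """Create the shared writer on first use and reuse it afterwards."""
    global _writer
    with _writer_lock:
        if _writer is None:
            _writer = ToyIndexWriter(
                default_analyzer=lambda text: text.lower().split()
            )
        return _writer

get_writer().add_document("Hello Lucene")
get_writer().add_document("Second post")
print(len(get_writer().documents))
```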

2 – Add Documents to the Index

Each document is made up of various Fields. You have to create a Document with all the Fields that must be indexed, plus the ones you need in order to link the result back to the real document being indexed (for example, the id of the post).

Once the Document is created, you add it to the index with the IndexWriter.

At this point, you can either add more documents or close the IndexWriter. The index is saved to the Directory and can be re-opened later to add more Documents or to perform queries on it.
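
What “adding a document to the index” boils down to can be sketched with a toy inverted index in Python: a map from each term to the ids of the documents that contain it. The class below is an illustration under that assumption, with made-up names, and not how Lucene.net lays out its index files.

```python
from collections import defaultdict

class ToyIndexWriter:
    """Builds a term -> set of document ids map (a minimal inverted index)."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.closed = False

    def add_document(self, doc_id, text):
        if self.closed:
            raise RuntimeError("writer is closed")
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def close(self):
        self.closed = True

writer = ToyIndexWriter()
writer.add_document(1, "Lucene main concepts")
writer.add_document(2, "Your first Lucene application")
writer.close()
print(sorted(writer.postings["lucene"]))
```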

3 – Create the Query

Once you have all your documents in the index, it’s time to do some queries.

You can create the query either via the QueryParser or by building a Query object directly through the API.

4 – Pass the Query to the IndexSearcher

Once you have the Query object, you pass it to the Search method of an IndexSearcher.

One caveat is that the IndexSearcher sees the index only as it was at the time the searcher was opened. So in order to search over the most recent set of documents you have to re-open the IndexSearcher. But re-opening takes time and resources, so in a web application you might want to cache it somehow and re-open it periodically.
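
The snapshot behaviour is easy to reproduce in a toy Python model, where the searcher copies the index contents when it is created. `ToyIndex` and `ToySearcher` are hypothetical names; the real IndexSearcher does not copy the index, but the visible effect is similar.

```python
import copy

class ToyIndex:
    def __init__(self):
        self.postings = {}

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

class ToySearcher:
    """Sees the index only as it was when the searcher was opened."""
    def __init__(self, index):
        self._snapshot = copy.deepcopy(index.postings)

    def search(self, term):
        return sorted(self._snapshot.get(term.lower(), set()))

index = ToyIndex()
index.add(1, "Lucene concepts")
searcher = ToySearcher(index)

index.add(2, "Lucene in practice")    # added after the searcher was opened
stale = searcher.search("lucene")     # document 2 is not visible yet

searcher = ToySearcher(index)         # "re-open" the searcher
fresh = searcher.search("lucene")
print(stale, fresh)
```

In a web application you would keep the searcher around between requests and replace it on a timer or after a batch of writes, rather than re-opening it for every query.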

5 – Iterate over the results

The Search method returns the results inside a Hits object, which contains all the documents that match the query, ordered by Score. The score is computed by a fairly complex formula that estimates how relevant each document found is to your query. For more information refer to the Lucene website: Scoring.
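
To give a feeling for what a relevance score does, here is a heavily simplified TF-IDF sketch in Python. It is not Lucene's scoring formula, which has more factors; the sample documents and the `score` and `search` functions are made up for illustration. The idea shown is only that a term counts more when it is frequent in a document and rare across the index.

```python
import math

docs = {
    1: "lucene lucene search",
    2: "search engines",
    3: "cooking recipes",
}

def score(term, text, all_docs):
    """Simplified TF-IDF: term frequency times inverse document frequency."""
    words = text.split()
    tf = words.count(term) / len(words)
    df = sum(1 for other in all_docs.values() if term in other.split())
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(all_docs) / df) + 1.0
    return tf * idf

def search(term, all_docs):
    """Return (doc_id, score) pairs for matching documents, best first."""
    hits = [(doc_id, score(term, text, all_docs)) for doc_id, text in all_docs.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

results = search("search", docs)
for doc_id, value in results:
    print(doc_id, round(value, 3))
```

Document 2 ranks first here because “search” is one of its two words, while in document 1 it is only one of three.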

6 – Close everything

Once you are done with everything, close the IndexWriter, the IndexSearcher and the Directory object. In a web application this is typically performed in the application shutdown event.

Next

You just read about the main concepts behind Lucene.net. In a future post I’ll show how to integrate Lucene.net into a sample console application that puts together all the concepts discussed here.

Tags:  Lucene.net


posted on Monday, August 31, 2009 12:11 PM

Related Links

Lucene.net: your first application (9/2/2009)
Dissecting Lucene.net storage: Documents and Fields (9/4/2009)
Lucene - or how I stopped worrying, and learned to love unstructured data (9/8/2009)
How Subtext’s Lucene.net index is structured (9/10/2009)
Lucene.net is powering Subtext 2.5 search (2/26/2010)

Reposted from: https://my.oschina.net/u/138995/blog/178778
