Lucene-Analyzer

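A Lucene text analyzer splits a piece of text into tokens. Search engines retrieve documents by matching tokens, so the quality of the analyzer directly determines both the precision and the speed of search.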

1. A simple demo

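    // Imports assume the Lucene 6.x package layout; the class name is illustrative.
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.junit.Test;

    public class AnalyzerDemoTest {

        private static final String[] examples = {
                "The quick brown 1234 fox jumped over the lazy dog!",
                "XY&Z 15.6 Corporation - xyz@example.com",
                "北京市北京大学" };

        private static final Analyzer[] ANALYZERS = new Analyzer[] {
                new WhitespaceAnalyzer(),    // splits on whitespace
                new SimpleAnalyzer(),        // splits on non-letters
                new StopAnalyzer(),          // splits on non-letters and removes stop words
                new StandardAnalyzer(),      // Unicode text segmentation
                new CJKAnalyzer(),           // Chinese, Japanese and Korean text
                new SmartChineseAnalyzer()   // Simplified Chinese
        };

        @Test
        public void testAnalyzer() throws IOException {
            for (int i = 0; i < ANALYZERS.length; i++) {
                String simpleName = ANALYZERS[i].getClass().getSimpleName();
                for (int j = 0; j < examples.length; j++) {
                    // TokenStream is the intermediate representation passed between analysis
                    // components; it reads text from a Reader. Tokenizer and TokenFilter
                    // both extend TokenStream.
                    TokenStream contents = ANALYZERS[i].tokenStream("contents", new StringReader(examples[j]));
                    // Register attributes to inspect the details of each token.
                    // CharTermAttribute holds the token text.
                    CharTermAttribute termAttribute = contents.addAttribute(CharTermAttribute.class);
                    // OffsetAttribute holds the token's start and end offsets in the original text.
                    OffsetAttribute offsetAttribute = contents.addAttribute(OffsetAttribute.class);
                    // TypeAttribute holds the token's lexical type; the default is "word".
                    TypeAttribute typeAttribute = contents.addAttribute(TypeAttribute.class);
                    contents.reset();
                    System.out.println("  " + simpleName + " analyzing : " + examples[j]);
                    while (contents.incrementToken()) {
                        String s1 = termAttribute.toString();
                        int i1 = offsetAttribute.startOffset(); // start offset
                        int i2 = offsetAttribute.endOffset();   // end offset
                        System.out.println("    " + s1 + "[" + i1 + "," + i2 + ":" + typeAttribute.type() + "]" + "  ");
                    }
                    // After incrementToken() returns false, call end() and close():
                    // end() lets the stream perform its end-of-stream work, and close()
                    // releases the resources used during analysis.
                    contents.end();
                    contents.close();
                    System.out.println();
                }
            }
        }
    }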

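2. Understanding the attributes of a TokenStream

After calling tokenStream(), you can register several attributes on the stream to inspect the details of each token. For example, CharTermAttribute holds the token text and TypeAttribute holds the token type.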

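CharTermAttribute: the text of the token itself.
PositionIncrementAttribute: the position of the current token relative to the previous one, that is, how many positions apart they are. In "text for attribute" the stop word "for" is removed, so getPositionIncrement() for "attribute" returns 2. When no stop word lies between two tokens, the value is the default, 1.
OffsetAttribute: the start and end character offsets of the token in the original text.
TypeAttribute: the lexical type of the token. The default is "word"; other values include <ALPHANUM> <APOSTROPHE> <ACRONYM> <COMPANY> <EMAIL> <HOST> <NUM> <CJ> <ACRONYM_DEP>.
FlagsAttribute: similar to TypeAttribute. If you need to attach extra information to a token and have it travel through the analysis chain, pass it via flags.
PayloadAttribute: stores a payload (key metadata) at each index position. It is useful for scoring when payload-based queries are used.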
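
To make position increments and offsets concrete, here is a minimal plain-JDK sketch. It does not use Lucene: the class `PositionIncrementSketch`, the record-like `Tok` type and the small stop-word set are all hypothetical names chosen for illustration. It only mimics how removing a stop word leaves a gap that shows up as an increment greater than 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: none of these types are part of Lucene.
class PositionIncrementSketch {

    /** Hypothetical token: term text, position increment, and character offsets. */
    static final class Tok {
        final String term;
        final int increment;
        final int start;
        final int end;

        Tok(String term, int increment, int start, int end) {
            this.term = term;
            this.increment = increment;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + term + " increment:" + increment + " start:" + start + " end:" + end + "]";
        }
    }

    // Assumed stop words for this sketch only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "for", "is", "this"));

    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // stop words dropped since the last emitted token
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                skipped++; // leave a gap instead of emitting the token
                continue;
            }
            out.add(new Tok(term, skipped + 1, start, i));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : analyze("text for attribute")) {
            System.out.println(t);
        }
    }
}
```

For the input "text for attribute", the second emitted token carries an increment of 2 because one stop word was skipped before it, while its offsets still point at its location in the original string.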

    @Test
    public void testAttribute() throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        String input = "This is a test text for attribute! Just add-some word.";
        TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(input));
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
        PayloadAttribute payloadAttribute = tokenStream.addAttribute(PayloadAttribute.class);
        // Note: tokenizers typically clear all attributes before producing each token,
        // so a payload set here is usually reset by the time it is printed below.
        payloadAttribute.setPayload(new BytesRef("Just"));
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.print("[" + charTermAttribute
                    + " increment:" + positionIncrementAttribute.getPositionIncrement()
                    + " start:" + offsetAttribute.startOffset()
                    + " end:" + offsetAttribute.endOffset()
                    + " type:" + typeAttribute.type()
                    + " payload:" + payloadAttribute.getPayload() + "]\n");
        }
        tokenStream.end();
        tokenStream.close();
    }

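3. Lucene's Tokenizer and TokenFilter

An analyzer consists of one tokenizer and a chain of filters. The Tokenizer takes the data from a Reader and turns it into a TokenStream. A TokenFilter operates on a TokenStream: it processes the output of the Tokenizer or of the previous TokenFilter. To extend or modify the behavior of an existing tokenizer, write a custom TokenFilter.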
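
The chaining described above is essentially the decorator pattern. The following plain-JDK sketch shows the shape of it under simplifying assumptions: `FilterChainSketch`, the `Stage` interface and the three stage factories are hypothetical names, not Lucene classes, and a token is reduced to a bare string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a pull-style chain in the spirit of Tokenizer -> TokenFilter.
class FilterChainSketch {

    /** Hypothetical stream stage: returns the next token, or null when exhausted. */
    interface Stage {
        String next();
    }

    // Source stage: splits the input on whitespace.
    static Stage whitespaceTokenizer(String text) {
        String trimmed = text.trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        int[] cursor = {0};
        return () -> cursor[0] < parts.length ? parts[cursor[0]++] : null;
    }

    // Filter stage: rewrites each token pulled from the wrapped stage.
    static Stage lowerCase(Stage input) {
        return () -> {
            String t = input.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    // Filter stage: drops tokens shorter than min characters.
    static Stage minLength(Stage input, int min) {
        return () -> {
            String t;
            while ((t = input.next()) != null) {
                if (t.length() >= min) {
                    return t;
                }
            }
            return null;
        };
    }

    static List<String> drain(Stage stage) {
        List<String> out = new ArrayList<>();
        for (String t = stage.next(); t != null; t = stage.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Stage chain = minLength(lowerCase(whitespaceTokenizer("I am a Java coder")), 2);
        System.out.println(drain(chain));
    }
}
```

Each filter only knows about the stage it wraps, which is why a new filter can be added to a chain without touching the tokenizer or the other filters.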

A custom TokenFilter must implement the abstract method incrementToken():

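    // Imports assume the Lucene 6.x package layout.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.junit.Test;

    public class TestTokenFilter {

        @Test
        public void test() throws IOException {
            String text = "Hi, Dr Wang, Mr Liu asks if you stay with Mrs Liu yesterday!";
            Analyzer analyzer = new WhitespaceAnalyzer();
            CourtesyTitleFilter filter = new CourtesyTitleFilter(analyzer.tokenStream("text", text));
            CharTermAttribute charTermAttribute = filter.addAttribute(CharTermAttribute.class);
            filter.reset();
            while (filter.incrementToken()) {
                System.out.print(charTermAttribute + " ");
            }
            filter.end();
            filter.close();
        }
    }

    /**
     * Custom filter that expands courtesy titles.
     */
    class CourtesyTitleFilter extends TokenFilter {

        private final Map<String, String> courtesyTitleMap = new HashMap<>();
        private final CharTermAttribute termAttribute;

        protected CourtesyTitleFilter(TokenStream input) {
            super(input);
            termAttribute = addAttribute(CharTermAttribute.class);
            courtesyTitleMap.put("Dr", "doctor");
            courtesyTitleMap.put("Mr", "mister");
            courtesyTitleMap.put("Mrs", "miss");
        }

        @Override
        public final boolean incrementToken() throws IOException {
            // Pull the next token from the wrapped stream; stop when it is exhausted.
            if (!input.incrementToken()) {
                return false;
            }
            // Replace the term in place when it is a known courtesy title.
            String small = termAttribute.toString();
            if (courtesyTitleMap.containsKey(small)) {
                termAttribute.setEmpty().append(courtesyTitleMap.get(small));
            }
            return true;
        }
    }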

The output is:

    Hi, doctor Wang, mister Liu asks if you stay with miss Liu yesterday!

4. A custom Analyzer with an extended stop-word list

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    // Lucene imports are omitted here: CharArraySet and StopFilter changed packages
    // between Lucene releases, so import them from the package matching your version.

    class StopAnalyzerExtend extends Analyzer {

        // Stop-word dictionary
        private CharArraySet stopWordSet;

        public CharArraySet getStopWordSet() {
            return this.stopWordSet;
        }

        public void setStopWordSet(CharArraySet stopWordSet) {
            this.stopWordSet = stopWordSet;
        }

        public StopAnalyzerExtend() {
            super();
            setStopWordSet(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        }

        /**
         * @param stops additional stop words
         */
        public StopAnalyzerExtend(List<String> stops) {
            this();
            // StopAnalyzer defines
            //   ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
            // so the default set cannot be modified, and adding to it directly
            // throws an UnsupportedOperationException. Work on a copy instead.
            // stopWordSet = getStopWordSet();
            stopWordSet = CharArraySet.copy(getStopWordSet());
            stopWordSet.addAll(StopFilter.makeStopSet(stops));
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new LowerCaseTokenizer();
            return new TokenStreamComponents(source, new StopFilter(source, stopWordSet));
        }

        public static void main(String[] args) throws IOException {
            List<String> strings = Arrays.asList("小鬼子", "美国佬");
            Analyzer analyzer = new StopAnalyzerExtend(strings);
            String content = "小鬼子 and 美国佬 are playing together!";
            TokenStream tokenStream = analyzer.tokenStream("myfield", content);
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                // The custom stop words have already been filtered out.
                // Prints: playing   together
                System.out.println(charTermAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
        }
    }
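
The copy-before-extend step in the constructor is the same pattern you would use with any unmodifiable collection. Below is a minimal plain-JDK sketch of it; `StopSetCopySketch` and its tiny default stop-word set are hypothetical stand-ins, with `java.util.Set` playing the role of `CharArraySet`.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustration only: java.util.Set stands in for Lucene's CharArraySet.
class StopSetCopySketch {

    // Assumed default stop words, wrapped so that they cannot be modified.
    static final Set<String> DEFAULT_STOP_WORDS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("and", "are", "the")));

    /** Returns a new, modifiable set holding the defaults plus the extra words. */
    static Set<String> extend(Collection<String> extra) {
        Set<String> copy = new HashSet<>(DEFAULT_STOP_WORDS);
        copy.addAll(extra);
        return copy;
    }

    /** Returns true only if the shared default set accepted a direct modification. */
    static boolean canAddDirectly() {
        try {
            DEFAULT_STOP_WORDS.add("playing");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("direct add allowed: " + canAddDirectly());
        System.out.println("extended set: " + extend(Arrays.asList("foo", "bar")));
    }
}
```

Copying also keeps the shared default set untouched, so other analyzers that rely on it are not affected by the extension.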

5. A custom Analyzer that filters tokens by length

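    // Imports assume the Lucene 6.x package layout.
    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    class LongFilterAnalyzer extends Analyzer {

        private int len;

        public int getLen() {
            return this.len;
        }

        public void setLen(int len) {
            this.len = len;
        }

        public LongFilterAnalyzer() {
            super();
        }

        public LongFilterAnalyzer(int len) {
            super();
            setLen(len);
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            final Tokenizer source = new WhitespaceTokenizer();
            // Drop tokens shorter than len or longer than 20 characters.
            TokenStream tokenStream = new LengthFilter(source, len, 20);
            return new TokenStreamComponents(source, tokenStream);
        }

        public static void main(String[] args) {
            // Drop tokens shorter than 2 characters (both bounds are inclusive).
            Analyzer analyzer = new LongFilterAnalyzer(2);
            String words = "I am a java coder! Testingtestingtesting!";
            TokenStream stream = analyzer.tokenStream("myfield", words);
            try {
                CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(termAtt.toString());
                }
                stream.end();
                stream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Every token shorter than two characters is filtered out, and so is the 22-character token "Testingtestingtesting!", which exceeds the maximum length of 20.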

6. PerFieldAnalyzerWrapper: a different Analyzer for each field

PerFieldAnalyzerWrapper can be used like any other Analyzer, for both indexing and query analysis.

    @Test
    public void testPerFieldAnalyzerWrapper() throws IOException, ParseException {
        Map<String, Analyzer> fields = new HashMap<>();
        fields.put("partnum", new KeywordAnalyzer());
        // The partnum field uses KeywordAnalyzer; every other field falls back to SimpleAnalyzer.
        PerFieldAnalyzerWrapper perFieldAnalyzerWrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), fields);

        Directory directory = new RAMDirectory();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(perFieldAnalyzerWrapper);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        Document document = new Document();
        FieldType fieldType = new FieldType();
        fieldType.setStored(true);
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        document.add(new Field("partnum", "Q36", fieldType));
        document.add(new Field("description", "Illidium Space Modulator", fieldType));
        indexWriter.addDocument(document);
        indexWriter.close();

        IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));

        // A TermQuery finds the document directly, because no analysis is applied to the term.
        TopDocs search = indexSearcher.search(new TermQuery(new Term("partnum", "Q36")), 10);
        Assert.assertEquals(1, search.totalHits);

        // With QueryParser, the same PerFieldAnalyzerWrapper must be used.
        // Otherwise, as shown here, nothing is found.
        Query description = new QueryParser("description", new SimpleAnalyzer()).parse("partnum:Q36 AND SPACE");
        search = indexSearcher.search(description, 10);
        Assert.assertEquals(0, search.totalHits);
        // Prints +partnum:q +description:space, because SimpleAnalyzer strips
        // non-letter characters and lowercases the letters.
        System.out.println("SimpleAnalyzer :" + description.toString());

        // With PerFieldAnalyzerWrapper the document is found.
        // "partnum:Q36 AND SPACE" means: Q36 occurs in partnum and SPACE occurs in description.
        description = new QueryParser("description", perFieldAnalyzerWrapper).parse("partnum:Q36 AND SPACE");
        search = indexSearcher.search(description, 10);
        Assert.assertEquals(1, search.totalHits);
        // Prints +partnum:Q36 +description:space
        System.out.println("(SimpleAnalyzer,KeywordAnalyzer) :" + description.toString());
    }
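
Why the query fails without the wrapper becomes clearer when per-field dispatch is reduced to its core. The plain-JDK sketch below is only an approximation of the idea: `PerFieldSketch`, `PerField`, `keyword` and `simple` are hypothetical names, and the two functions merely imitate the behavior described above (whole value kept as one token versus splitting on non-letters and lowercasing).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Illustration only: not Lucene's PerFieldAnalyzerWrapper.
class PerFieldSketch {

    // Keyword-style analysis: the whole value is one token, unchanged.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Simple-style analysis: split on non-letters and lowercase each piece.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    /** Picks the analysis function registered for a field, else the fallback. */
    static final class PerField {
        private final Function<String, List<String>> fallback;
        private final Map<String, Function<String, List<String>>> perField;

        PerField(Function<String, List<String>> fallback,
                 Map<String, Function<String, List<String>>> perField) {
            this.fallback = fallback;
            this.perField = perField;
        }

        List<String> analyze(String field, String text) {
            return perField.getOrDefault(field, fallback).apply(text);
        }
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("partnum", PerFieldSketch::keyword);
        PerField wrapper = new PerField(PerFieldSketch::simple, perField);

        System.out.println(wrapper.analyze("partnum", "Q36"));
        System.out.println(wrapper.analyze("description", "Illidium Space Modulator"));
        // The same value analyzed by the fallback loses its digits and case.
        System.out.println(wrapper.analyze("description", "Q36"));
    }
}
```

The indexed term and the query term only match when both sides go through the same function for the same field, which is the reason the query parser has to be given the wrapper as well.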

Reference: http://www.codepub.cn/2016/05/23/Lucene-6-0-in-action-4-The-text-analyzer/

Reposted from: https://www.cnblogs.com/dengzy/p/6057197.html
