15. Analyzer: Extending the Chinese Analyzers

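Chapter 5 already introduced the analyzers listed below but gave no examples for some of them. This chapter walks through an example for each analyzer that has not been demonstrated yet.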

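paoding: the latest release of Paoding (庖丁解牛) at https://code.google.com/p/paoding/ supports Lucene 3.0 at most. Its latest published code is dated 2008-06-03 and the last svn commit was made in 2010, so it is outdated and not considered here.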

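mmseg4j: the latest version has moved from https://code.google.com/p/mmseg4j/ to https://github.com/chenlb/mmseg4j-solr and supports Lucene 4.10. The latest commit on GitHub is from June 2014. Between 2009 and 2014 there were 18 releases, roughly three major or minor releases per year, so the project has been fairly active. It uses the MMSeg algorithm.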

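IK-analyzer: the latest version is at https://code.google.com/p/ik-analyzer/ and supports Lucene 4.10. Since version 1.0 was released in December 2006, IKAnalyzer has gone through four major versions. It started as a Chinese word segmentation component built around the open source project Lucene, combining dictionary-based segmentation with grammar analysis. From version 3.0 on, IK became a general-purpose segmentation component for Java, independent of Lucene, while still shipping a default implementation optimized for Lucene. The 2012 version added a simple segmentation disambiguation algorithm, marking IK's shift from pure dictionary segmentation toward simulated semantic segmentation. It has not been updated since December 2012, however. This analyzer is not covered again here; interested readers can refer to chapter 5.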

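ansj_seg: the latest version is at https://github.com/NLPchina/ansj_seg. The only tag is version 1.1, and there were six major or minor updates between 2012 and 2014. On 2014-10-10 the author stated: "I may not have the energy to maintain ansj_seg in the future", and the project is now managed by "nlp_china". It was updated in November 2014. The project does not state whether it supports Lucene. Its segmentation is based on the CRF (conditional random field) algorithm.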

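imdict-chinese-analyzer: the latest version is at https://code.google.com/p/imdict-chinese-analyzer/ and the last update was in May 2009. Judging from the downloaded source, it does not support Lucene 4.10. It is based on the HMM (hidden Markov model) algorithm.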

Jcseg: the latest version is at git.oschina.net/lionsoul/jcseg and supports Lucene 4.10. The author is quite active. It uses the MMSeg algorithm.
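Both mmseg4j and Jcseg are described above as using the MMSeg algorithm. As a rough illustration of the simplest idea behind it, forward maximum matching, here is a minimal sketch. It is not the mmseg4j or Jcseg implementation: the class name, the tiny dictionary, and the maxWordLen parameter are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward maximum matching. Hypothetical helper, NOT the mmseg4j/Jcseg code.
class SimpleMaxMatchSketch {

    /** Splits text by always taking the longest dictionary word at the current position. */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            out.add(text.substring(pos, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical four-word dictionary, for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        // Greedy matching takes "研究生" first, which leaves "命" stranded.
        System.out.println(String.join("|", segment("研究生命起源", dict, 3)));
    }
}
```

The greedy choice is cheap but can split a sentence badly, as the example input shows. That weakness is the usual motivation for the more elaborate matching modes real segmenters offer.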

Using the MMSeg analyzer

First, add the required jar as a dependency:

<!-- mmseg4j analyzer -->
<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-core</artifactId>
    <version>1.10.0</version>
</dependency>

The implementation:

package mmseg;

import com.chenlb.mmseg4j.*;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

/**
 * Created by kangz on 2016/12/19.
 */
public class MMsegAnalyzerTest {

    public static void main(String[] args) throws IOException {
        String txt = "那个好看的笑容里面全是悲伤白富美，他在行尸走肉的活着，他的故事悲伤的像一场没有结局的黑白电影，他是她小说里的主角， 她懂他，他爱过她，她不知道自己是爱他的的外表，还是爱他的故事，还是爱他身上的那个自己。";

        // Dictionary directory (not used below, because the default dictionary is loaded)
        File file = new File("D:\\LucentTest\\luceneIndex2");

        // Create the dictionary instance. Unlike older versions, it cannot be created
        // directly with new. By default it reads words.dic from the jar (the content of
        // that file can be modified). A dictionary directory can also be passed in,
        // either as a File or as a String.
        Dictionary dic = Dictionary.getInstance();

        Seg seg = null;
        // seg = new SimpleSeg(dic); // simple mode
        seg = new ComplexSeg(dic);   // complex mode

        MMSeg mmSeg = new MMSeg(new StringReader(txt), seg);
        Word word = null;
        // next() returns null once the input is exhausted
        while ((word = mmSeg.next()) != null) {
            System.out.print(word + "|");
        }
    }
}
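The SimpleSeg and ComplexSeg lines in the code above select between two MMSeg matching modes. To show why a complex mode can give a different result, here is a toy sketch of the idea as the MMSeg algorithm is commonly described: look ahead over chunks of up to three words, prefer the chunk with the greatest total length, then the largest average word length, then the smallest variance of word lengths, and emit only the first word of the winning chunk. It is not the mmseg4j implementation. The class name and dictionary are hypothetical, and the frequency-based fourth rule is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy "complex" maximum matching (rules 1-3 only). Hypothetical, NOT the mmseg4j code.
class ComplexMaxMatchSketch {

    /** Candidate words at pos: the single character plus every dictionary hit. */
    static List<String> candidates(String text, int pos, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        words.add(text.substring(pos, pos + 1));
        int limit = Math.min(text.length(), pos + maxWordLen);
        for (int end = pos + 2; end <= limit; end++) {
            String w = text.substring(pos, end);
            if (dict.contains(w)) words.add(w);
        }
        return words;
    }

    /** Collects every chunk of up to three consecutive words starting at pos. */
    static void collect(String text, int pos, Set<String> dict, int maxWordLen,
                        List<String> current, List<List<String>> out) {
        if (current.size() == 3 || pos >= text.length()) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (String w : candidates(text, pos, dict, maxWordLen)) {
            current.add(w);
            collect(text, pos + w.length(), dict, maxWordLen, current, out);
            current.remove(current.size() - 1);
        }
    }

    /** Returns {total length, word count, sum of squared word lengths}. */
    static int[] stats(List<String> chunk) {
        int total = 0, squares = 0;
        for (String w : chunk) {
            total += w.length();
            squares += w.length() * w.length();
        }
        return new int[] {total, chunk.size(), squares};
    }

    /** True if chunk a is strictly preferred over chunk b. */
    static boolean better(List<String> a, List<String> b) {
        int[] x = stats(a), y = stats(b);
        if (x[0] != y[0]) return x[0] > y[0]; // rule 1: greater total length
        if (x[1] != y[1]) return x[1] < y[1]; // rule 2: same total, fewer words = larger average
        return x[2] < y[2];                   // rule 3: same total and count, smaller squares = smaller variance
    }

    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            List<List<String>> chunks = new ArrayList<>();
            collect(text, pos, dict, maxWordLen, new ArrayList<>(), chunks);
            List<String> best = chunks.get(0);
            for (List<String> chunk : chunks) {
                if (better(chunk, best)) best = chunk;
            }
            String word = best.get(0); // emit only the first word, then look ahead again
            result.add(word);
            pos += word.length();
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical four-word dictionary, for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(String.join("|", segment("研究生命起源", dict, 3)));
    }
}
```

With this toy dictionary a purely greedy longest-match would start with "研究生", while the chunk rules favor the more even split. Real implementations add further rules and optimizations.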

Using the Jcseg analyzer

First, add the required jars as dependencies:

<!-- Jcseg analyzer -->
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-core</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-analyzer</artifactId>
    <version>2.0.1</version>
</dependency>

Test code for integrating Jcseg with Lucene

Copy the lexicon directory and the jcseg.properties file from the Jcseg source package into src/main/resources, then set lexicon.path = src/main/resources/lexicon in jcseg.properties.

Create a new class:

package lexicon;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import org.lionsoul.jcseg.analyzer.v5x.JcsegAnalyzer5X;
import org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig;

import java.io.File;
import java.nio.file.Paths;

/**
 * Created by kangz on 2016/12/19.
 */
public class LexiconAnalyzersTest {

    @Test
    public void test() throws Exception {
        // Create the analyzer
        Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);

        // Optional (used to change the default settings): get the task config instance
        JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
        JcsegTaskConfig config = jcseg.getTaskConfig();
        // Append synonyms; requires jcseg.loadsyn=1 in jcseg.properties
        config.setAppendCJKSyn(true);
        // Append pinyin; requires jcseg.loadpinyin=1 in jcseg.properties
        config.setAppendCJKPinyin(true);
        // For more options, see org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig

        /** ------------------------------------------------------------------------ **/

        // Open the index at the given location. If you do not know which Directory
        // subclass to choose, FSDirectory.open() is the recommended way to open it.
        Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
        // Create the IndexWriterConfig. In this Lucene version the constructor takes
        // only the analyzer (older versions also took the Lucene version).
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        // Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // Clear the previous index. Note: this deletes everything, use with care.
        indexWriter.deleteAll();

        // Directory that holds the original documents
        File path = new File("D:\\LucentTest\\luceneFile");
        for (File file : path.listFiles()) {
            if (file.isDirectory()) continue;
            // Read the file information
            String fileName = file.getName();                       // file name
            String fileContent = FileUtils.readFileToString(file);  // file content
            String filePath = file.getPath();                       // file path
            long fileSize = FileUtils.sizeOf(file);                 // file size

            // Create the document
            Document document = new Document();
            // Create the fields. TextField takes three arguments: the field name,
            // the field value, and whether to store it (Store.YES or Store.NO).
            Field nameField = new TextField("name", fileName, Field.Store.YES);
            Field contentField = new TextField("content", fileContent, Field.Store.YES);
            // LongPoint is indexed for range queries but not stored
            Field sizeField = new LongPoint("size", fileSize);
            // StoredField is stored but not indexed
            Field pathField = new StoredField("path", filePath);
            // Add the fields to the document
            document.add(nameField);
            document.add(contentField);
            document.add(pathField);
            document.add(sizeField);
            // Write the document to the index
            indexWriter.addDocument(document);
        }
        indexWriter.close();
    }

    // Run a query
    @Test
    public void testTermQuery() throws Exception {
        // Open the index for reading
        Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));
        // Create the IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Create the query. A TermQuery is not analyzed, so it matches only if the
        // analyzer produced exactly this term at index time.
        Query query = new TermQuery(new Term("content", "全文检索"));
        // Execute the query
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // Fetch the document
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println("Score: " + scoreDoc.score);
            // System.out.println(document.get("content"));
            System.out.println(document.get("path"));
        }
        indexReader.close();
    }
}

Using the Ansj analyzer

First, add the required jars.

Configuring Ansj in a Maven project

Following the official manual, add the dependencies below to pom.xml:

<!-- Ansj analyzer -->
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.0.2</version>
</dependency>
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_lucene5_plug</artifactId>
    <version>5.0.3.0</version>
</dependency>

Test code for integrating Ansj with Lucene

Official reference documentation for Ansj in Lucene: http://nlpchina.github.io/ansj_seg/

Download the ZIP archive from https://github.com/NLPchina/ansj_seg and extract it. Copy the library folder and the library.properties file into src/main/resources of the Maven project, then change library.properties to the following:

#redress dic file path
ambiguityLibrary=src/main/resources/library/ambiguity.dic
#path of userLibrary this is default library
userLibrary=src/main/resources/library/default.dic
#path of crfModel
crfModel=src/main/resources/library/crf.model
#set real name
isRealName=true
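The paths in library.properties above are relative, so they presumably resolve against the JVM working directory, which is usually the project root when running from Maven or an IDE. If segmentation seems to ignore the dictionaries, a quick sanity check like the following sketch can confirm that each configured file exists. The class and method names are made up for this example and are not part of Ansj.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper that checks path-valued entries of a properties file. Not part of Ansj.
class LibraryPathCheck {

    /** Returns the keys that are absent or whose value is not an existing path under baseDir. */
    static List<String> missingPaths(Reader config, Path baseDir, String... keys) throws IOException {
        Properties props = new Properties();
        props.load(config);
        List<String> missing = new ArrayList<>();
        for (String key : keys) {
            String value = props.getProperty(key);
            if (value == null || !Files.exists(baseDir.resolve(value.trim()))) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Same path entries as in the library.properties shown above.
        String config = "ambiguityLibrary=src/main/resources/library/ambiguity.dic\n"
                + "userLibrary=src/main/resources/library/default.dic\n"
                + "crfModel=src/main/resources/library/crf.model\n";
        // Relative paths are checked against the current working directory.
        Path workingDir = Paths.get("").toAbsolutePath();
        List<String> missing = missingPaths(new StringReader(config), workingDir,
                "ambiguityLibrary", "userLibrary", "crfModel");
        System.out.println("Working directory: " + workingDir);
        System.out.println("Missing entries: " + missing);
    }
}
```

An empty list means every configured file was found relative to the working directory. Any key that is printed points at a path that needs fixing.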

The code is as follows:

package ansj;

import org.ansj.library.UserDefineLibrary;
import org.ansj.lucene5.AnsjAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.Date;

/**
 * Created by kangz on 2016/12/19.
 */
public class AnsjAnalyzerTest {

    /**
     * A simple test of AnsjAnalyzer's performance and basic usage.
     *
     * @throws IOException
     */
    @Test
    public void test() throws IOException {
        Analyzer ca = new AnsjAnalyzer(AnsjAnalyzer.TYPE.index);
        Reader sentence = new StringReader("全文检索是将整本书java、整篇文章中的任意内容信息查找出来的检索，java。它可以根据需要获得全文中有关章、节、段、句、词等信息，计算机程序通过扫描文章中的每一个词");
        TokenStream ts = ca.tokenStream("sentence", sentence);
        System.out.println("start: " + (new Date()));
        long before = System.currentTimeMillis();
        // The TokenStream contract requires reset() before the first incrementToken()
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.getAttribute(CharTermAttribute.class));
        }
        ts.end();
        ts.close();
        long now = System.currentTimeMillis();
        System.out.println("time: " + (now - before) / 1000.0 + " s");
    }

    @Test
    public void indexTest() throws IOException, ParseException {
        Analyzer analyzer = new AnsjAnalyzer(AnsjAnalyzer.TYPE.index);
        Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));
        // Add a custom word to the user dictionary
        UserDefineLibrary.insertWord("蛇药片", "n", 1000);
        // Create the IndexWriterConfig
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        // Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // Create a document
        Document document = new Document();
        Field nameField = new TextField("text", "季德胜蛇药片 10片*6板 ", Field.Store.YES);
        // Note: boost() only returns the current boost, it does not change it
        nameField.boost();
        document.add(nameField);
        // Write the document to the index
        indexWriter.addDocument(document);
        indexWriter.commit();
        indexWriter.close();
        System.out.println("Index created");
        search(analyzer, directory, "\"季德胜蛇药片\"");
    }

    // Wraps the index query
    private void search(Analyzer queryAnalyzer, Directory directory, String queryStr) throws IOException, ParseException {
        IndexSearcher isearcher;
        DirectoryReader directoryReader = DirectoryReader.open(directory);
        // Search the index
        isearcher = new IndexSearcher(directoryReader);
        QueryParser tq = new QueryParser("text", queryAnalyzer);
        Query query = tq.parse(queryStr);
        System.out.println(query);
        TopDocs hits = isearcher.search(query, 5);
        System.out.println(queryStr + ": found " + hits.totalHits + " record(s)!");
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int docId = hits.scoreDocs[i].doc;
            Document document = isearcher.doc(docId);
            System.out.println(toHighlighter(queryAnalyzer, query, document));
        }
    }

    // Highlights the query matches in the stored "text" field
    private String toHighlighter(Analyzer analyzer, Query query, Document doc) {
        String field = "text";
        try {
            SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<font color=\"red\">", "</font>");
            Highlighter highlighter = new Highlighter(simpleHtmlFormatter, new QueryScorer(query));
            TokenStream tokenStream1 = analyzer.tokenStream("text", new StringReader(doc.get(field)));
            String highlighterStr = highlighter.getBestFragment(tokenStream1, doc.get(field));
            // Fall back to the raw field value if nothing was highlighted
            return highlighterStr == null ? doc.get(field) : highlighterStr;
        } catch (IOException | InvalidTokenOffsetsException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
        return null;
    }
}

Main reference: http://codepub.cn/2016/03/23/Maven-project-integrating-Lucene-Chinese-Segmentation-tools-Jcseg-and-Ansj/

