Evaluating the segmentation quality of the word, ansj, mmseg4j, and ik-analyzer segmenters (repost)

Reposted from: http://yangshangchuan.iteye.com/blog/2056537 (code available for download there)

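The word segmenter (word分词) is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.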

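The evaluation of the word segmenter covers the following 7 segmentation algorithms:

Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching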
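
As a rough illustration of what dictionary-based maximum matching does, here is a minimal textbook sketch of the forward and reverse variants. It is not the word segmenter's own implementation: the class name `MaxMatchSketch`, the tiny dictionary, and the `MAX_LEN` limit are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {
    // Hypothetical upper bound on word length; a real dictionary defines its own.
    static final int MAX_LEN = 4;

    /** Forward maximum matching: at each position take the longest dictionary word. */
    public static List<String> forward(String text, Set<String> dict) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(MAX_LEN, text.length() - start);
            // Shrink the window until it is a dictionary word or a single character.
            while (len > 1 && !dict.contains(text.substring(start, start + len))) {
                len--;
            }
            words.add(text.substring(start, start + len));
            start += len;
        }
        return words;
    }

    /** Reverse maximum matching: the same idea, scanning from the end of the text. */
    public static List<String> reverse(String text, Set<String> dict) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_LEN, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            words.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(words); // words were collected back to front
        return words;
    }

    public static void main(String[] args) {
        // Made-up dictionary for illustration only.
        Set<String> dict = Set.of("研究", "研究生", "生命", "命", "的", "起源");
        String text = "研究生命的起源";
        System.out.println(forward(text, dict)); // [研究生, 命, 的, 起源]
        System.out.println(reverse(text, dict)); // [研究, 生命, 的, 起源]
    }
}
```

The two directions disagree on this sentence, which is the kind of ambiguity that a bidirectional algorithm has to resolve.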

All of the bidirectional algorithms use ngram to resolve ambiguity, so the evaluation is run separately with bigram and with trigram.
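
The post does not describe how the ngram score is computed, so the following is only a sketch under stated assumptions: each candidate segmentation is scored by summing made-up bigram scores over its adjacent word pairs, and the higher-scoring candidate wins. The class `BigramChoiceSketch`, the scoring rule, and the score table are hypothetical.

```java
import java.util.List;
import java.util.Map;

public class BigramChoiceSketch {
    /** Sums the scores of all adjacent word pairs; unseen pairs contribute 0. */
    public static double score(List<String> words, Map<String, Double> bigramScores) {
        double total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            String pair = words.get(i) + " " + words.get(i + 1);
            total += bigramScores.getOrDefault(pair, 0.0);
        }
        return total;
    }

    /** Returns the candidate with the higher score; ties go to the first candidate. */
    public static List<String> choose(List<String> a, List<String> b,
                                      Map<String, Double> bigramScores) {
        return score(b, bigramScores) > score(a, bigramScores) ? b : a;
    }

    public static void main(String[] args) {
        // Made-up bigram scores for illustration only.
        Map<String, Double> bigrams = Map.of("研究 生命", 3.0, "的 起源", 2.0);
        List<String> forward = List.of("研究生", "命", "的", "起源");
        List<String> reverse = List.of("研究", "生命", "的", "起源");
        // forward scores 2.0, reverse scores 5.0, so the reverse result is chosen
        System.out.println(choose(forward, reverse, bigrams));
    }
}
```

A trigram variant would score word triples instead of pairs in the same way.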

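The test text has 2,533,709 lines and 28,374,490 characters in total. The standard text corresponds to the test text line by line, and the words in the standard text are separated by spaces. The evaluation criterion is strict equality: a line counts as correct only if its segmentation is identical to the standard line. The core evaluation code is as follows:

Java code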

/**
 * Evaluates segmentation quality.
 * @param resultText path to the file containing the actual segmentation results
 * @param standardText path to the file containing the standard segmentation results
 * @return the evaluation result
 */
public static EvaluationResult evaluation(String resultText, String standardText) {
    int perfectLineCount = 0;
    int wrongLineCount = 0;
    int perfectCharCount = 0;
    int wrongCharCount = 0;
    try (BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText), "utf-8"));
         BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText), "utf-8"))) {
        String result;
        while ((result = resultReader.readLine()) != null) {
            result = result.trim();
            String standard = standardReader.readLine();
            if (standard == null) {
                // The standard file has fewer lines than the result file: stop comparing
                break;
            }
            standard = standard.trim();
            if (result.equals("")) {
                continue;
            }
            if (result.equals(standard)) {
                // The segmentation result is identical to the standard
                perfectLineCount++;
                perfectCharCount += standard.replaceAll("\\s+", "").length();
            } else {
                // The segmentation result differs from the standard
                wrongLineCount++;
                wrongCharCount += standard.replaceAll("\\s+", "").length();
            }
        }
    } catch (IOException ex) {
        LOGGER.error("Segmentation evaluation failed: ", ex);
    }
    int totalLineCount = perfectLineCount + wrongLineCount;
    int totalCharCount = perfectCharCount + wrongCharCount;
    EvaluationResult er = new EvaluationResult();
    er.setPerfectCharCount(perfectCharCount);
    er.setPerfectLineCount(perfectLineCount);
    er.setTotalCharCount(totalCharCount);
    er.setTotalLineCount(totalLineCount);
    er.setWrongCharCount(wrongCharCount);
    er.setWrongLineCount(wrongLineCount);
    return er;
}
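
To make the strictness of this criterion concrete, the sketch below applies the same counting rule to two in-memory lists instead of files. The class name `StrictLineMetricSketch` and the sample sentences are made up for illustration; only the counting rule mirrors the method above.

```java
import java.util.List;

public class StrictLineMetricSketch {
    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    public static int[] evaluate(List<String> resultLines, List<String> standardLines) {
        int perfectLines = 0, wrongLines = 0, perfectChars = 0, wrongChars = 0;
        int n = Math.min(resultLines.size(), standardLines.size());
        for (int i = 0; i < n; i++) {
            String result = resultLines.get(i).trim();
            String standard = standardLines.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines are not counted at all
            }
            // Characters are counted on the standard line, ignoring the separators.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                perfectLines++;
                perfectChars += chars;
            } else {
                wrongLines++;
                wrongChars += chars;
            }
        }
        return new int[] {perfectLines, wrongLines, perfectChars, wrongChars};
    }

    public static void main(String[] args) {
        List<String> standard = List.of("研究 生命 的 起源", "中文 分词");
        List<String> result = List.of("研究生 命 的 起源", "中文 分词");
        int[] c = evaluate(result, standard);
        // One wrong boundary makes all 7 characters of the first line count as wrong.
        System.out.println("perfect lines=" + c[0] + " wrong lines=" + c[1]
                + " perfect chars=" + c[2] + " wrong chars=" + c[3]);
    }
}
```

Because a single misplaced boundary marks the whole line as wrong, the rates produced by this metric are much lower than word-level precision or recall would be.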

Java code

/**
 * Result of a Chinese word segmentation evaluation
 * @author 杨尚川
 */
public class EvaluationResult implements Comparable<EvaluationResult> {
    // These two fields are used by toString() but were missing from the excerpt;
    // the type name SegmentationAlgorithm is assumed from the word project.
    private SegmentationAlgorithm segmentationAlgorithm;
    private float segSpeed;
    private int totalLineCount;
    private int perfectLineCount;
    private int wrongLineCount;
    private int totalCharCount;
    private int perfectCharCount;
    private int wrongCharCount;
    public void setSegmentationAlgorithm(SegmentationAlgorithm segmentationAlgorithm) {
        this.segmentationAlgorithm = segmentationAlgorithm;
    }
    public void setSegSpeed(float segSpeed) {
        this.segSpeed = segSpeed;
    }
    public float getLinePerfectRate() {
        return perfectLineCount / (float) totalLineCount * 100;
    }
    public float getLineWrongRate() {
        return wrongLineCount / (float) totalLineCount * 100;
    }
    public float getCharPerfectRate() {
        return perfectCharCount / (float) totalCharCount * 100;
    }
    public float getCharWrongRate() {
        return wrongCharCount / (float) totalCharCount * 100;
    }
    public int getTotalLineCount() {
        return totalLineCount;
    }
    public void setTotalLineCount(int totalLineCount) {
        this.totalLineCount = totalLineCount;
    }
    public int getPerfectLineCount() {
        return perfectLineCount;
    }
    public void setPerfectLineCount(int perfectLineCount) {
        this.perfectLineCount = perfectLineCount;
    }
    public int getWrongLineCount() {
        return wrongLineCount;
    }
    public void setWrongLineCount(int wrongLineCount) {
        this.wrongLineCount = wrongLineCount;
    }
    public int getTotalCharCount() {
        return totalCharCount;
    }
    public void setTotalCharCount(int totalCharCount) {
        this.totalCharCount = totalCharCount;
    }
    public int getPerfectCharCount() {
        return perfectCharCount;
    }
    public void setPerfectCharCount(int perfectCharCount) {
        this.perfectCharCount = perfectCharCount;
    }
    public int getWrongCharCount() {
        return wrongCharCount;
    }
    public void setWrongCharCount(int wrongCharCount) {
        this.wrongCharCount = wrongCharCount;
    }
    @Override
    public String toString() {
        return segmentationAlgorithm.name() + " (" + segmentationAlgorithm.getDes() + "):"
            + "\n"
            + "Segmentation speed: " + segSpeed + " chars/ms"
            + "\n"
            + "Perfect line rate: " + getLinePerfectRate() + "%"
            + " Wrong line rate: " + getLineWrongRate() + "%"
            + " Total lines: " + totalLineCount
            + " Perfect lines: " + perfectLineCount
            + " Wrong lines: " + wrongLineCount
            + "\n"
            + "Perfect char rate: " + getCharPerfectRate() + "%"
            + " Wrong char rate: " + getCharWrongRate() + "%"
            + " Total chars: " + totalCharCount
            + " Perfect chars: " + perfectCharCount
            + " Wrong chars: " + wrongCharCount;
    }
    @Override
    public int compareTo(EvaluationResult other) {
        // Sort in descending order of perfect line rate
        return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
    }
}
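
As a small sanity check of the rate formula and the descending sort order, the sketch below recomputes perfect line rates from two pairs of counts taken from the trigram results in this post. The class `RateSketch` and its `rank` helper are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RateSketch {
    /** Same formula as getLinePerfectRate(): part / total, as a percentage. */
    public static float percent(int part, int total) {
        return part / (float) total * 100;
    }

    /** Returns the names ordered from the highest to the lowest perfect line rate. */
    public static List<String> rank(Map<String, Integer> perfectLines, int totalLines) {
        List<String> names = new ArrayList<>(perfectLines.keySet());
        names.sort(Comparator.comparingDouble(
                (String name) -> percent(perfectLines.get(name), totalLines)).reversed());
        return names;
    }

    public static void main(String[] args) {
        int totalLines = 2533709;
        // Perfect line counts from the trigram run reported in this post.
        Map<String, Integer> perfectLines = Map.of(
                "MinimumMatching", 933769,
                "BidirectionalMaximumMinimumMatching", 1402476);
        for (String name : rank(perfectLines, totalLines)) {
            System.out.printf("%s: %.2f%%%n", name, percent(perfectLines.get(name), totalLines));
        }
    }
}
```

The best-performing algorithm is printed first, which matches the order in which the result listings in this post are presented.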

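Evaluation results for the word segmenter using trigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Perfect line rate: 55.352688% Wrong line rate: 44.647312% Total lines: 2533709 Perfect lines: 1402476 Wrong lines: 1131233
Perfect char rate: 46.23227% Wrong char rate: 53.76773% Total chars: 28374490 Perfect chars: 13118171 Wrong chars: 15256319
BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Perfect line rate: 50.16934% Wrong line rate: 49.83066% Total lines: 2533709 Perfect lines: 1271145 Wrong lines: 1262564
Perfect char rate: 40.692997% Wrong char rate: 59.307003% Total chars: 28374490 Perfect chars: 11546430 Wrong chars: 16828060
ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Perfect line rate: 46.723125% Wrong line rate: 53.27688% Total lines: 2533709 Perfect lines: 1183828 Wrong lines: 1349881
Perfect char rate: 36.67598% Wrong char rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868
MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Perfect line rate: 46.661713% Wrong line rate: 53.338287% Total lines: 2533709 Perfect lines: 1182272 Wrong lines: 1351437
Perfect char rate: 36.72861% Wrong char rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934
BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Perfect line rate: 45.863907% Wrong line rate: 54.136093% Total lines: 2533709 Perfect lines: 1162058 Wrong lines: 1371651
Perfect char rate: 35.942123% Wrong char rate: 64.05788% Total chars: 28374490 Perfect chars: 10198395 Wrong chars: 18176095
ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Perfect line rate: 41.776066% Wrong line rate: 58.223934% Total lines: 2533709 Perfect lines: 1058484 Wrong lines: 1475225
Perfect char rate: 31.678978% Wrong char rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742
MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Perfect line rate: 36.853836% Wrong line rate: 63.146164% Total lines: 2533709 Perfect lines: 933769 Wrong lines: 1599940
Perfect char rate: 26.859812% Wrong char rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156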

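Evaluation results for the word segmenter using bigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Perfect line rate: 55.31531% Wrong line rate: 44.68469% Total lines: 2533709 Perfect lines: 1401529 Wrong lines: 1132180
Perfect char rate: 45.834396% Wrong char rate: 54.165604% Total chars: 28374490 Perfect chars: 13005277 Wrong chars: 15369213
BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Perfect line rate: 52.007233% Wrong line rate: 47.992767% Total lines: 2533709 Perfect lines: 1317712 Wrong lines: 1215997
Perfect char rate: 42.424194% Wrong char rate: 57.575806% Total chars: 28374490 Perfect chars: 12037649 Wrong chars: 16336841
BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Perfect line rate: 46.766422% Wrong line rate: 53.23358% Total lines: 2533709 Perfect lines: 1184925 Wrong lines: 1348784
Perfect char rate: 36.52718% Wrong char rate: 63.47282% Total chars: 28374490 Perfect chars: 10364401 Wrong chars: 18010089
ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Perfect line rate: 46.723125% Wrong line rate: 53.27688% Total lines: 2533709 Perfect lines: 1183828 Wrong lines: 1349881
Perfect char rate: 36.67598% Wrong char rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868
MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Perfect line rate: 46.661713% Wrong line rate: 53.338287% Total lines: 2533709 Perfect lines: 1182272 Wrong lines: 1351437
Perfect char rate: 36.72861% Wrong char rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934
ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Perfect line rate: 41.776066% Wrong line rate: 58.223934% Total lines: 2533709 Perfect lines: 1058484 Wrong lines: 1475225
Perfect char rate: 31.678978% Wrong char rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742
MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Perfect line rate: 36.853836% Wrong line rate: 63.146164% Total lines: 2533709 Perfect lines: 933769 Wrong lines: 1599940
Perfect char rate: 26.859812% Wrong char rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156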

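The evaluation results for Ansj 0.9 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 495.9188 chars/ms
Perfect line rate: 58.609295% Wrong line rate: 41.390705% Total lines: 2533709 Perfect lines: 1484989 Wrong lines: 1048720
Perfect char rate: 50.97614% Wrong char rate: 49.023857% Total chars: 28374490 Perfect chars: 14464220 Wrong chars: 13910270
Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 350.7527 chars/ms
Perfect line rate: 58.60353% Wrong line rate: 41.396465% Total lines: 2533709 Perfect lines: 1484843 Wrong lines: 1048866
Perfect char rate: 50.75546% Wrong char rate: 49.244545% Total chars: 28374490 Perfect chars: 14401602 Wrong chars: 13972888
Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 532.65424 chars/ms
Perfect line rate: 54.028584% Wrong line rate: 45.97142% Total lines: 2533709 Perfect lines: 1368927 Wrong lines: 1164782
Perfect char rate: 46.84512% Wrong char rate: 53.15488% Total chars: 28374490 Perfect chars: 13292064 Wrong chars: 15082426
Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 564.6103 chars/ms
Perfect line rate: 53.510803% Wrong line rate: 46.489197% Total lines: 2533709 Perfect lines: 1355808 Wrong lines: 1177901
Perfect char rate: 46.355087% Wrong char rate: 53.644913% Total chars: 28374490 Perfect chars: 13153019 Wrong chars: 15221471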

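The evaluation results for Ansj 1.4 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 581.7306 chars/ms
Perfect line rate: 58.60302% Wrong line rate: 41.39698% Total lines: 2533709 Perfect lines: 1484830 Wrong lines: 1048879
Perfect char rate: 50.968987% Wrong char rate: 49.031013% Total chars: 28374490 Perfect chars: 14462190 Wrong chars: 13912300
Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 138.81165 chars/ms
Perfect line rate: 58.1515% Wrong line rate: 41.8485% Total lines: 2533687 Perfect lines: 1473377 Wrong lines: 1060310
Perfect char rate: 49.806484% Wrong char rate: 50.19352% Total chars: 28374398 Perfect chars: 14132290 Wrong chars: 14242108
Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 627.68475 chars/ms
Perfect line rate: 55.3174% Wrong line rate: 44.6826% Total lines: 2533709 Perfect lines: 1401582 Wrong lines: 1132127
Perfect char rate: 48.177986% Wrong char rate: 51.822014% Total chars: 28374490 Perfect chars: 13670258 Wrong chars: 14704232
Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 715.55176 chars/ms
Perfect line rate: 50.89444% Wrong line rate: 49.10556% Total lines: 2533709 Perfect lines: 1289517 Wrong lines: 1244192
Perfect char rate: 42.965115% Wrong char rate: 57.034885% Total chars: 28374490 Perfect chars: 12191132 Wrong chars: 16183358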

The Ansj evaluation program is as follows:

Java code

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
/**
 * Evaluates the segmentation quality of the Ansj segmenter
 * @author 杨尚川
 */
public class AnsjEvaluation {
    public static void main(String[] args) throws Exception {
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Print the evaluation results, best perfect line rate first
        Collections.sort(list);
        System.out.println("");
        for (EvaluationResult r : list) {
            System.out.println(r + "\n");
        }
    }
    private static float seg(final String input, final String output, final String type) throws Exception {
        float rate = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input), "utf-8"));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "utf-8"))) {
            long size = Files.size(Paths.get(input));
            System.out.println("size:" + size);
            System.out.println("File size: " + (float) size / 1024 / 1024 + " MB");
            int textLength = 0;
            int progress = 0;
            long start = System.currentTimeMillis();
            String line = null;
            while ((line = reader.readLine()) != null) {
                if ("".equals(line.trim())) {
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                switch (type) {
                    case "BaseAnalysis":
                        for (Term term : BaseAnalysis.parse(line)) {
                            writer.write(term.getName() + " ");
                        }
                        break;
                    case "ToAnalysis":
                        for (Term term : ToAnalysis.parse(line)) {
                            writer.write(term.getName() + " ");
                        }
                        break;
                    case "NlpAnalysis":
                        try {
                            for (Term term : NlpAnalysis.parse(line)) {
                                writer.write(term.getName() + " ");
                            }
                        } catch (Exception e) {
                            // A failure on one line must not abort the whole run
                            System.err.println("NlpAnalysis failed on a line: " + e.getMessage());
                        }
                        break;
                    case "IndexAnalysis":
                        for (Term term : IndexAnalysis.parse(line)) {
                            writer.write(term.getName() + " ");
                        }
                        break;
                }
                writer.write("\n");
                progress += line.length();
                if (progress > 500000) {
                    progress = 0;
                    // Rough estimate: 2.99 approximates the number of UTF-8 bytes per character
                    System.out.println("Segmentation progress: " + (int) (textLength * 2.99 / size * 100) + "%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength / (float) cost;
            System.out.println("Character count: " + textLength);
            System.out.println("Segmentation time: " + cost + " ms");
            System.out.println("Segmentation speed: " + rate + " chars/ms");
        }
        return rate;
    }
    /**
     * Evaluates segmentation quality.
     * @param resultText path to the file containing the actual segmentation results
     * @param standardText path to the file containing the standard segmentation results
     * @return the evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount = 0;
        int wrongLineCount = 0;
        int perfectCharCount = 0;
        int wrongCharCount = 0;
        try (BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText), "utf-8"));
             BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText), "utf-8"))) {
            String result;
            while ((result = resultReader.readLine()) != null) {
                result = result.trim();
                String standard = standardReader.readLine();
                if (standard == null) {
                    // The standard file has fewer lines than the result file: stop comparing
                    break;
                }
                standard = standard.trim();
                if (result.equals("")) {
                    continue;
                }
                if (result.equals(standard)) {
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount += standard.replaceAll("\\s+", "").length();
                } else {
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount += standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount + wrongLineCount;
        int totalCharCount = perfectCharCount + wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult> {
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;
        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate() {
            return perfectLineCount / (float) totalLineCount * 100;
        }
        public float getLineWrongRate() {
            return wrongLineCount / (float) totalLineCount * 100;
        }
        public float getCharPerfectRate() {
            return perfectCharCount / (float) totalCharCount * 100;
        }
        public float getCharWrongRate() {
            return wrongCharCount / (float) totalCharCount * 100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString() {
            return analyzer + ":"
                + "\n"
                + "Segmentation speed: " + segSpeed + " chars/ms"
                + "\n"
                + "Perfect line rate: " + getLinePerfectRate() + "%"
                + " Wrong line rate: " + getLineWrongRate() + "%"
                + " Total lines: " + totalLineCount
                + " Perfect lines: " + perfectLineCount
                + " Wrong lines: " + wrongLineCount
                + "\n"
                + "Perfect char rate: " + getCharPerfectRate() + "%"
                + " Wrong char rate: " + getCharWrongRate() + "%"
                + " Total chars: " + totalCharCount
                + " Perfect chars: " + perfectCharCount
                + " Wrong chars: " + wrongCharCount;
        }
        @Override
        public int compareTo(EvaluationResult other) {
            // Sort in descending order of perfect line rate
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
        }
    }
}

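The evaluation results for MMSeg4j 1.9.1 are as follows:

MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Perfect line rate: 38.817604% Wrong line rate: 61.182396% Total lines: 2533688 Perfect lines: 983517 Wrong lines: 1550171
Perfect char rate: 29.604435% Wrong char rate: 70.39557% Total chars: 28374428 Perfect chars: 8400089 Wrong chars: 19974339
MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Perfect line rate: 37.570095% Wrong line rate: 62.429905% Total lines: 2533688 Perfect lines: 951909 Wrong lines: 1581779
Perfect char rate: 28.455273% Wrong char rate: 71.54473% Total chars: 28374428 Perfect chars: 8074021 Wrong chars: 20300407
MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Perfect line rate: 34.27573% Wrong line rate: 65.72427% Total lines: 2533688 Perfect lines: 868440 Wrong lines: 1665248
Perfect char rate: 25.20896% Wrong char rate: 74.79104% Total chars: 28374428 Perfect chars: 7152898 Wrong chars: 21221530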

The MMSeg4j 1.9.1 evaluation program is as follows:

Java code

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
/**
 * Evaluates the segmentation quality of the MMSeg4j segmenter
 * @author 杨尚川
 */
public class MMSeg4jEvaluation {
    public static void main(String[] args) throws Exception {
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);
        // Print the evaluation results, best perfect line rate first
        Collections.sort(list);
        System.out.println("");
        for (EvaluationResult r : list) {
            System.out.println(r + "\n");
        }
    }
    private static float seg(final String input, final String output, final Seg seg) throws Exception {
        float rate = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input), "utf-8"));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "utf-8"))) {
            long size = Files.size(Paths.get(input));
            System.out.println("size:" + size);
            System.out.println("File size: " + (float) size / 1024 / 1024 + " MB");
            int textLength = 0;
            int progress = 0;
            long start = System.currentTimeMillis();
            String line = null;
            while ((line = reader.readLine()) != null) {
                if ("".equals(line.trim())) {
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, seg));
                writer.write("\n");
                progress += line.length();
                if (progress > 500000) {
                    progress = 0;
                    // Rough estimate: 2.99 approximates the number of UTF-8 bytes per character
                    System.out.println("Segmentation progress: " + (int) (textLength * 2.99 / size * 100) + "%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength / (float) cost;
            System.out.println("Character count: " + textLength);
            System.out.println("Segmentation time: " + cost + " ms");
            System.out.println("Segmentation speed: " + rate + " chars/ms");
        }
        return rate;
    }
    // Segments one line and returns the words joined by single spaces
    private static String seg(String text, Seg seg) throws IOException {
        StringBuilder result = new StringBuilder();
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
        Word word = null;
        while ((word = mmSeg.next()) != null) {
            result.append(word.getString()).append(" ");
        }
        return result.toString().trim();
    }
    /**
     * Evaluates segmentation quality.
     * @param resultText path to the file containing the actual segmentation results
     * @param standardText path to the file containing the standard segmentation results
     * @return the evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount = 0;
        int wrongLineCount = 0;
        int perfectCharCount = 0;
        int wrongCharCount = 0;
        try (BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText), "utf-8"));
             BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText), "utf-8"))) {
            String result;
            while ((result = resultReader.readLine()) != null) {
                result = result.trim();
                String standard = standardReader.readLine();
                if (standard == null) {
                    // The standard file has fewer lines than the result file: stop comparing
                    break;
                }
                standard = standard.trim();
                if (result.equals("")) {
                    continue;
                }
                if (result.equals(standard)) {
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount += standard.replaceAll("\\s+", "").length();
                } else {
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount += standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount + wrongLineCount;
        int totalCharCount = perfectCharCount + wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult> {
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;
        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate() {
            return perfectLineCount / (float) totalLineCount * 100;
        }
        public float getLineWrongRate() {
            return wrongLineCount / (float) totalLineCount * 100;
        }
        public float getCharPerfectRate() {
            return perfectCharCount / (float) totalCharCount * 100;
        }
        public float getCharWrongRate() {
            return wrongCharCount / (float) totalCharCount * 100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString() {
            return analyzer + ":"
                + "\n"
                + "Segmentation speed: " + segSpeed + " chars/ms"
                + "\n"
                + "Perfect line rate: " + getLinePerfectRate() + "%"
                + " Wrong line rate: " + getLineWrongRate() + "%"
                + " Total lines: " + totalLineCount
                + " Perfect lines: " + perfectLineCount
                + " Wrong lines: " + wrongLineCount
                + "\n"
                + "Perfect char rate: " + getCharPerfectRate() + "%"
                + " Wrong char rate: " + getCharWrongRate() + "%"
                + " Total chars: " + totalCharCount
                + " Perfect chars: " + perfectCharCount
                + " Wrong chars: " + wrongCharCount;
        }
        @Override
        public int compareTo(EvaluationResult other) {
            // Sort in descending order of perfect line rate
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
        }
    }
}

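The evaluation results for ik-analyzer 2012_u6 are as follows:

IKAnalyzer (smart segmentation):
Segmentation speed: 178.3516 chars/ms
Perfect line rate: 37.55943% Wrong line rate: 62.440567% Total lines: 2533686 Perfect lines: 951638 Wrong lines: 1582048
Perfect char rate: 27.978464% Wrong char rate: 72.02154% Total chars: 28374416 Perfect chars: 7938726 Wrong chars: 20435690
IKAnalyzer (fine-grained segmentation):
Segmentation speed: 182.97859 chars/ms
Perfect line rate: 18.872742% Wrong line rate: 81.12726% Total lines: 2533686 Perfect lines: 478176 Wrong lines: 2055510
Perfect char rate: 10.936535% Wrong char rate: 89.06347% Total chars: 28374416 Perfect chars: 3103178 Wrong chars: 25271238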

The ik-analyzer 2012_u6 evaluation program is as follows:

Java code

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
/**
 * Evaluates the segmentation quality of the IKAnalyzer segmenter
 * @author 杨尚川
 */
public class IKAnalyzerEvaluation {
    public static void main(String[] args) throws Exception {
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (smart segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (fine-grained segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        // Print the evaluation results, best perfect line rate first
        Collections.sort(list);
        System.out.println("");
        for (EvaluationResult r : list) {
            System.out.println(r + "\n");
        }
    }
    private static float seg(final String input, final String output, final boolean useSmart) throws Exception {
        float rate = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input), "utf-8"));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), "utf-8"))) {
            long size = Files.size(Paths.get(input));
            System.out.println("size:" + size);
            System.out.println("File size: " + (float) size / 1024 / 1024 + " MB");
            int textLength = 0;
            int progress = 0;
            long start = System.currentTimeMillis();
            String line = null;
            while ((line = reader.readLine()) != null) {
                if ("".equals(line.trim())) {
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, useSmart));
                writer.write("\n");
                progress += line.length();
                if (progress > 500000) {
                    progress = 0;
                    // Rough estimate: 2.99 approximates the number of UTF-8 bytes per character
                    System.out.println("Segmentation progress: " + (int) (textLength * 2.99 / size * 100) + "%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength / (float) cost;
            System.out.println("Character count: " + textLength);
            System.out.println("Segmentation time: " + cost + " ms");
            System.out.println("Segmentation speed: " + rate + " chars/ms");
        }
        return rate;
    }
    // Segments one line and returns the words joined by single spaces
    private static String seg(String text, boolean useSmart) throws IOException {
        StringBuilder result = new StringBuilder();
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme word = null;
        while ((word = ik.next()) != null) {
            result.append(word.getLexemeText()).append(" ");
        }
        return result.toString().trim();
    }
    /**
     * Evaluates segmentation quality.
     * @param resultText path to the file containing the actual segmentation results
     * @param standardText path to the file containing the standard segmentation results
     * @return the evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount = 0;
        int wrongLineCount = 0;
        int perfectCharCount = 0;
        int wrongCharCount = 0;
        try (BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText), "utf-8"));
             BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText), "utf-8"))) {
            String result;
            while ((result = resultReader.readLine()) != null) {
                result = result.trim();
                String standard = standardReader.readLine();
                if (standard == null) {
                    // The standard file has fewer lines than the result file: stop comparing
                    break;
                }
                standard = standard.trim();
                if (result.equals("")) {
                    continue;
                }
                if (result.equals(standard)) {
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount += standard.replaceAll("\\s+", "").length();
                } else {
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount += standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount + wrongLineCount;
        int totalCharCount = perfectCharCount + wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult> {
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;
        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate() {
            return perfectLineCount / (float) totalLineCount * 100;
        }
        public float getLineWrongRate() {
            return wrongLineCount / (float) totalLineCount * 100;
        }
        public float getCharPerfectRate() {
            return perfectCharCount / (float) totalCharCount * 100;
        }
        public float getCharWrongRate() {
            return wrongCharCount / (float) totalCharCount * 100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString() {
            return analyzer + ":"
                + "\n"
                + "Segmentation speed: " + segSpeed + " chars/ms"
                + "\n"
                + "Perfect line rate: " + getLinePerfectRate() + "%"
                + " Wrong line rate: " + getLineWrongRate() + "%"
                + " Total lines: " + totalLineCount
                + " Perfect lines: " + perfectLineCount
                + " Wrong lines: " + wrongLineCount
                + "\n"
                + "Perfect char rate: " + getCharPerfectRate() + "%"
                + " Wrong char rate: " + getCharWrongRate() + "%"
                + " Total chars: " + totalCharCount
                + " Perfect chars: " + perfectCharCount
                + " Wrong chars: " + wrongCharCount;
        }
        @Override
        public int compareTo(EvaluationResult other) {
            // Sort in descending order of perfect line rate
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
        }
    }
}

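The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachment to the original post. For the word segmenter, simply run the evaluation.bat script in the project root directory.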
