Document Categorizer

The document categorizer classifies text into predefined categories. It is based on the maximum entropy framework.
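To make "maximum entropy" a little more concrete, here is a minimal sketch in plain Java of how such a model turns feature weights into category probabilities: the weights of the tokens seen in the document are summed per category, exponentiated, and normalized. This is not OpenNLP's implementation. The class name `MaxEntSketch`, the method `probabilities`, and the weights are all made up for illustration; a real model learns its weights during training.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of OpenNLP.
class MaxEntSketch {

    /**
     * Computes p(category | tokens) as a normalized exponential of the summed
     * per-token weights. Tokens without a weight for a category contribute 0.
     */
    static Map<String, Double> probabilities(Map<String, Map<String, Double>> weights,
                                             String[] tokens) {
        Map<String, Double> result = new LinkedHashMap<>();
        double total = 0.0;
        for (Map.Entry<String, Map<String, Double>> entry : weights.entrySet()) {
            double sum = 0.0;
            for (String token : tokens) {
                sum += entry.getValue().getOrDefault(token, 0.0);
            }
            double score = Math.exp(sum);
            result.put(entry.getKey(), score);
            total += score;
        }
        // Normalize the scores so that they sum to 1.
        for (Map.Entry<String, Double> entry : result.entrySet()) {
            entry.setValue(entry.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up weights: category -> (token -> weight).
        Map<String, Map<String, Double>> weights = new LinkedHashMap<>();
        weights.put("c1", Map.of("a", 1.0, "b", 1.0, "c", 1.0));
        weights.put("c2", Map.of("x", 1.0, "y", 1.0, "z", 1.0));
        System.out.println(probabilities(weights, new String[]{"x", "y", "z"}));
    }
}
```

With these assumed weights, a document made of the tokens x, y, z gets a higher probability for c2 than for c1, because only c2 has weights for those tokens.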

Model training

The program below trains a categorizer from a handful of in-memory samples, saves the model to disk, and then evaluates it.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import opennlp.tools.cmdline.doccat.DoccatFineGrainedReportListener;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerEvaluator;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

public class DocumentCategorizerTrain {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;
        // Path where the trained model is saved
        String modelPath = modelResourcesDir + "en-documentCategorizer-my.bin";

        // Build the training samples in memory: a category label plus the document tokens
        ObjectStream<DocumentSample> sampleStream = ObjectStreamUtils.createObjectStream(
                new DocumentSample("c1", new String[]{"a", "b", "c"}),
                new DocumentSample("c1", new String[]{"a", "b", "c", "gg", "rr"}),
                new DocumentSample("c1", new String[]{"a", "b", "c", "ee", "rr"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z"}),
                new DocumentSample("c2", new String[]{"x", "y", "z", "nn", "kk"}),
                new DocumentSample("c2", new String[]{"x", "y", "z", "ff", "cc"}));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 200);
        params.put(TrainingParameters.CUTOFF_PARAM, 0);
        DoccatFactory factory = new DoccatFactory();

        // Train the model; pass params so that the iteration and cutoff settings above take effect
        DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);

        // Save the model; try-with-resources flushes and closes the stream
        try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(new File(modelPath)))) {
            model.serialize(modelOut);
        }

        // Evaluate the model. Training has consumed the stream, so rewind it first.
        // The evaluation reuses the training samples, so the accuracy is optimistic.
        sampleStream.reset();
        DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(
                new DocumentCategorizerME(model), new DoccatFineGrainedReportListener());
        evaluator.evaluate(sampleStream);
        System.out.println("Accuracy: " + evaluator.getAccuracy());
    }
}

Document classification

The program below loads the saved model and classifies a tokenized document.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class DocumentCategoriesPredit {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;
        String modelPath = modelResourcesDir + "en-documentCategorizer-my.bin";

        // Load the model
        DoccatModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new DoccatModel(modelIn);
        }

        // Create the categorizer from the model
        DocumentCategorizerME docCategorizer = new DocumentCategorizerME(model);

        // Classify the document; the result is an array of probabilities, one per category
        double[] bProbs = docCategorizer.categorize(new String[]{"x", "y", "z"});
        System.out.println("Best category: " + docCategorizer.getBestCategory(bProbs));
        System.out.println("All categories: " + docCategorizer.getAllResults(bProbs));
        for (int i = 0; i < bProbs.length; i++) {
            System.out.println("Category: " + docCategorizer.getCategory(i) + "; probability: " + bProbs[i]);
        }
    }
}
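The classification program prints a best category alongside one probability per category. The relationship between the two can be sketched in plain Java: the probability array is index-aligned with the category names, and the best category is the one at the index of the largest probability. The class `BestCategorySketch` and its method are hypothetical helpers written for this illustration, not OpenNLP API, and the probabilities in `main` are invented.

```java
// Hypothetical illustration only; not part of OpenNLP.
class BestCategorySketch {

    /** Returns the category whose probability is the largest (the first one on ties). */
    static String bestCategory(String[] categories, double[] probs) {
        if (categories.length == 0 || categories.length != probs.length) {
            throw new IllegalArgumentException("categories and probs must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        // Invented values standing in for the output of a categorizer.
        String[] categories = {"c1", "c2"};
        double[] probs = {0.2, 0.8};
        System.out.println("Best category: " + bestCategory(categories, probs));
    }
}
```

Keeping the whole array around, rather than only the winner, is useful when a caller wants to reject low-confidence predictions with a threshold.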
