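[NLP] Using the OpenNLP Lemmatizer
Lemmatizer
Lemmatization maps each word, given its part-of-speech (POS) tag, back to its base (dictionary) form, the lemma. The lemmatizer therefore works on text that has already been tagged by a POS tagger. For example:
Input: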
Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP
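Alternatively, the input can have three columns (this is the layout of the training data): the first column is the original word, the second is the POS tag, and the third is the lemma.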
He PRP he
reckons VBZ reckon
the DT the
current JJ current
accounts NNS account
deficit NN deficit
will MD will
narrow VB narrow
to TO to
only RB only
# # #
1.8 CD 1.8
millions CD million
in IN in
September NNP september
. . O
Output (for the tagged sentence above):
Rockwell NNP rockwell
International NNP international
Corp. NNP corp.
's POS 's
Tulsa NNP tulsa
unit NN unit
said VBD say
it PRP it
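The `word_TAG` line shown in the input example has to be split into parallel token and tag arrays before it can be handed to `LemmatizerME.lemmatize(tokens, postags)`. A minimal sketch of that split follows. The class `WordTagParser` and its `parse` method are hypothetical helpers, not part of OpenNLP, and the sketch assumes that the tag is whatever follows the last underscore of each whitespace-separated item.

```java
import java.util.Arrays;

// Hypothetical helper (not part of OpenNLP): splits "word_TAG" items into parallel arrays.
class WordTagParser {

    // Returns {tokens, tags}. Assumption: the tag follows the LAST underscore of each item.
    static String[][] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[][] { new String[0], new String[0] };
        }
        String[] items = trimmed.split("\\s+");
        String[] tokens = new String[items.length];
        String[] tags = new String[items.length];
        for (int i = 0; i < items.length; i++) {
            int sep = items[i].lastIndexOf('_');
            // Reject items with no underscore, an empty word, or an empty tag.
            if (sep <= 0 || sep == items[i].length() - 1) {
                throw new IllegalArgumentException("Expected word_TAG but got: " + items[i]);
            }
            tokens[i] = items[i].substring(0, sep);
            tags[i] = items[i].substring(sep + 1);
        }
        return new String[][] { tokens, tags };
    }

    public static void main(String[] args) {
        String[][] parsed = parse(
            "Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP");
        System.out.println(Arrays.toString(parsed[0])); // tokens
        System.out.println(Arrays.toString(parsed[1])); // POS tags
    }
}
```

The two arrays returned by `parse` have the same length and the same order, which is what `lemmatize` expects.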
Model training
```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerEvaluator;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class LemmatizerTrain {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;
        String fileResourcesDir = rootDir + "resources" + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;

        // Path to the training data
        String filePath = fileResourcesDir + "lemmatizer.txt";
        // Path where the trained model is saved
        String modelPath = modelResourcesDir + "lemmatizer-my.bin";

        // Read the training data line by line
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File(filePath));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);

        // Group the lines into LemmaSample objects; closing sampleStream also closes lineStream
        try (ObjectStream<LemmaSample> sampleStream = new LemmaSampleStream(lineStream)) {
            LemmatizerFactory factory = new LemmatizerFactory();

            // Train the model
            LemmatizerModel model =
                LemmatizerME.train("en", sampleStream, TrainingParameters.defaultParams(), factory);

            // Save the model
            try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelPath))) {
                model.serialize(modelOut);
            }

            // Evaluate the model. Training has consumed the stream, so rewind it first.
            // Note: this measures accuracy on the training data itself, not on held-out data.
            sampleStream.reset();
            LemmatizerEvaluator evaluator = new LemmatizerEvaluator(new LemmatizerME(model));
            evaluator.evaluate(sampleStream);
            System.out.println("Word accuracy: " + evaluator.getWordAccuracy());
        }
    }
}
```
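Before training, it can help to sanity-check the training file. The sketch below assumes the layout that `LemmaSampleStream` is understood to read: one token per line, with word, POS tag, and lemma separated by tabs, and an empty line between sentences. `LemmaDataChecker` and `findBadLines` are hypothetical names, not OpenNLP API, and the tab-separator assumption should be checked against the OpenNLP version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): reports malformed training lines.
class LemmaDataChecker {

    // Returns the 1-based numbers of non-empty lines that do not consist of
    // exactly three non-empty, tab-separated fields (word, POS tag, lemma).
    static List<Integer> findBadLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.isEmpty()) {
                continue; // empty line: assumed to be a sentence boundary
            }
            String[] parts = line.split("\t", -1); // -1 keeps trailing empty fields
            boolean ok = parts.length == 3;
            for (String part : parts) {
                if (part.isEmpty()) {
                    ok = false;
                }
            }
            if (!ok) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "He\tPRP\the",
            "reckons VBZ reckon", // spaces instead of tabs: reported as malformed
            "",
            "the\tDT\tthe");
        System.out.println(findBadLines(lines));
    }
}
```

For a real file, the lines could be loaded with `Files.readAllLines(Paths.get(filePath), StandardCharsets.UTF_8)` and passed to `findBadLines` before calling `LemmatizerME.train`.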
Lemmatization
```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmatizerPredit {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;
        String modelPath = modelResourcesDir + "lemmatizer-my.bin";

        // Load the model; the stream can be closed once the model has been read
        LemmatizerModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new LemmatizerModel(modelIn);
        }

        // Instantiate the lemmatizer
        LemmatizerME lemmatizer = new LemmatizerME(model);

        // Tokens and their POS tags: parallel arrays of equal length
        String[] tokens = new String[] {
            "Rockwell", "International", "Corp.", "'s", "Tulsa", "unit", "said", "it",
            "signed", "a", "tentative", "agreement", "extending", "its", "contract",
            "with", "Boeing", "Co.", "to", "provide", "structural", "parts", "for",
            "Boeing", "'s", "747", "jetliners", "." };
        String[] postags = new String[] {
            "NNP", "NNP", "NNP", "POS", "NNP", "NN", "VBD", "PRP",
            "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN",
            "IN", "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN",
            "NNP", "POS", "CD", "NNS", "." };

        // Lemmatize and print one lemma per line
        String[] lemmas = lemmatizer.lemmatize(tokens, postags);
        for (String lemma : lemmas) {
            System.out.println(lemma);
        }
    }
}
```
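The lemmatization program prints only the lemmas. To get the three-column word / POS tag / lemma layout shown in the output example, the three arrays can be joined line by line. This is a minimal sketch: `LemmaOutputFormatter` is a hypothetical helper, not part of OpenNLP, and the choice of a tab as column separator is an assumption. The lemma values in `main` are copied from the output listing earlier in this article, not produced by a model.

```java
// Hypothetical helper (not part of OpenNLP): joins tokens, tags and lemmas into columns.
class LemmaOutputFormatter {

    // Builds one "word<TAB>tag<TAB>lemma" line per token.
    static String format(String[] tokens, String[] tags, String[] lemmas) {
        if (tokens.length != tags.length || tokens.length != lemmas.length) {
            throw new IllegalArgumentException("tokens, tags and lemmas must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            sb.append(tokens[i]).append('\t')
              .append(tags[i]).append('\t')
              .append(lemmas[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = { "unit", "said", "it" };
        String[] tags = { "NN", "VBD", "PRP" };
        String[] lemmas = { "unit", "say", "it" }; // taken from the example output above
        System.out.print(format(tokens, tags, lemmas));
    }
}
```

In the lemmatization program, `format(tokens, postags, lemmas)` could replace the final loop.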