A recent project required parsing the sentences of a given text: using POS tags and constituent information to find the dependencies between words and phrases, and then building each sentence's parse tree from those dependencies. This involves the Stanford Parser and the shallow parser (chunker) in OpenNLP. Both parsers are implemented in Java and are called through an API that outputs a syntactic analysis of a sentence. Below is a summary of what the two parsers do and how to call them from Java.

1 Shallow Parser

The main job of a shallow parser is to identify the phrases in a sentence: noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), adverb phrases (ADVP), and so on. A sample program follows:

package edu.pku.yangliu.nlp.pdt;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/** A shallow parser based on OpenNLP
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail yang.liu@pku.edu.cn
 */
public class ShallowParser {

    private static ShallowParser instance = null;
    private static POSModel model;
    private static ChunkerModel cModel;

    // Singleton pattern
    public static ShallowParser getInstance() throws InvalidFormatException, IOException {
        if (ShallowParser.instance == null) {
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            InputStream is = new FileInputStream("en-chunker.bin");
            ChunkerModel cModel = new ChunkerModel(is);
            ShallowParser.instance = new ShallowParser(model, cModel);
        }
        return ShallowParser.instance;
    }

    public ShallowParser(POSModel model, ChunkerModel cModel) {
        ShallowParser.model = model;
        ShallowParser.cModel = cModel;
    }

    /** Chunk a sentence and return a map from word index to phrase label
     *  <wordIndex, phraseLabel>.
     *  Notice: there should be a space before and after ",", "(", ")" etc.
     * @param input the input sentence
     * @return HashMap<Integer, String>
     */
    public HashMap<Integer, String> chunk(String input) throws IOException {
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(input));
        perfMon.start();
        String line;
        String[] whitespaceTokenizerLine = null;
        String[] tags = null;
        while ((line = lineStream.read()) != null) {
            whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            tags = tagger.tag(whitespaceTokenizerLine);
            POSSample posTags = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(posTags.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();

        // Chunker
        ChunkerME chunkerME = new ChunkerME(cModel);
        String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
        HashMap<Integer, String> phraseLablesMap = new HashMap<Integer, String>();
        Integer wordCount = 1;
        Integer phLableCount = 0;
        for (String phLable : result) {
            if (phLable.equals("O")) phLable += "-Punctuation"; // the tag of the final punctuation is "O"
            if (phLable.split("-")[0].equals("B")) phLableCount++;
            phLable = phLable.split("-")[1] + phLableCount;
            //if (phLable.equals("ADJP")) phLable = "NP"; // Notice: ADJP included in NP
            //if (phLable.equals("ADVP")) phLable = "VP"; // Notice: ADVP included in VP
            System.out.println(wordCount + ":" + phLable);
            phraseLablesMap.put(wordCount, phLable);
            wordCount++;
        }
        //Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
        //for (Span phLable : span)
        //    System.out.println(phLable.toString());
        return phraseLablesMap;
    }

    /** Just for testing */
    public static void main(String[] args) throws IOException {
        // Notice: there should be a space before and after ",", "(", ")" etc.
        String input = "We really enjoyed using the Canon PowerShot SD500 .";
        //String input = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
        ShallowParser swParser = ShallowParser.getInstance();
        swParser.chunk(input);
    }
}

Note that the paths to the POS model and the chunker model must be configured correctly; both model files can be downloaded from the OpenNLP website.
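Since both loaders simply open the files relative to the working directory, checking for the model files up front gives a much clearer error than the stack trace you otherwise get deep inside the loader. A minimal sketch (the class name `ModelCheck` and helper `requireModel` are illustrative, not part of the original code):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ModelCheck {
    // Returns a readable path for the model file, or fails with a helpful message.
    static Path requireModel(String fileName) {
        Path p = Path.of(fileName);
        if (!Files.isReadable(p)) {
            throw new IllegalStateException(
                "Model file not found: " + p.toAbsolutePath()
                + " -- download it from the OpenNLP site and place it there.");
        }
        return p;
    }

    public static void main(String[] args) {
        // Fail fast before constructing the parser.
        Path pos = requireModel("en-pos-maxent.bin");
        Path chunker = requireModel("en-chunker.bin");
        System.out.println("Models OK: " + pos + ", " + chunker);
    }
}
```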

Output:

Loading POS Tagger model ... done (1.563s)
Average: 9.3 sent/s
Total: 1 sent
Runtime: 0.107s
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4

As the output shows, the shallow parser first prints the POS tags, then finds two noun phrases (NP1 and NP4), one verb phrase (VP3), and one adverb phrase (ADVP2) in the sentence.
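The numbering logic inside chunk() can be isolated and tested without any models: the chunker emits BIO tags ("B-NP" starts a phrase, "I-NP" continues it, "O" is outside any phrase), and each "B-" tag bumps a phrase counter. A self-contained sketch of that decoding step (the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BioDecode {
    // Convert chunker BIO tags (e.g. "B-NP", "I-NP", "O") into numbered phrase
    // labels like "NP1", "VP3", matching the map built in ShallowParser.chunk().
    static Map<Integer, String> decode(String[] bioTags) {
        Map<Integer, String> labels = new LinkedHashMap<>();
        int phraseCount = 0;
        int word = 1;
        for (String tag : bioTags) {
            if (tag.equals("O")) tag = "O-Punctuation"; // treat O as a punctuation "phrase"
            String[] parts = tag.split("-");
            if (parts[0].equals("B")) phraseCount++;    // a new phrase starts here
            labels.put(word++, parts[1] + phraseCount);
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] tags = {"B-NP", "B-ADVP", "B-VP", "I-VP",
                         "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        System.out.println(decode(tags));
        // {1=NP1, 2=ADVP2, 3=VP3, 4=VP3, 5=NP4, 6=NP4, 7=NP4, 8=NP4, 9=Punctuation4}
    }
}
```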

2 Stanford Parser

The Stanford Parser extracts the dependency relations between the words of a sentence and outputs them in Stanford Dependencies format, which can be rendered as a directed graph, a tree, and other forms. Sample code follows:

package edu.pku.yangliu.nlp.pdt;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.List;

import opennlp.tools.util.InvalidFormatException;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

/** Parse sentences with the Stanford parser
 * @author yangliu
 * @blog http://blog.csdn.net/yangliuy
 * @mail yang.liu@pku.edu.cn
 */
public class StanfordParser {

    private static StanfordParser instance = null;
    private static LexicalizedParser lp;

    // Singleton pattern
    public static StanfordParser getInstance() {
        if (StanfordParser.instance == null) {
            LexicalizedParser lp = LexicalizedParser.loadModel(
                    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                    "-retainTmpSubcategories");
            StanfordParser.instance = new StanfordParser(lp);
        }
        return StanfordParser.instance;
    }

    public StanfordParser(LexicalizedParser lp) {
        StanfordParser.lp = lp;
    }

    /** Parse the sentences in a file
     * @param SentFilename the input file
     */
    public void DPFromFile(String SentFilename) {
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        for (List<HasWord> sentence : new DocumentPreprocessor(SentFilename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            List<TypedDependency> tdl = (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
            System.out.println(tdl);
            System.out.println();
        }
    }

    /** Parse a sentence given as a String
     * @param sent the input sentence
     * @return List<TypedDependency> the typed dependency list
     */
    public List<TypedDependency> DPFromString(String sent) {
        TokenizerFactory<CoreLabel> tokenizerFactory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sent)).tokenize();
        Tree parse = lp.apply(rawWords);
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        // Use typedDependenciesCollapsedTree() so that dependencies which do not
        // preserve the tree structure are omitted
        return (List<TypedDependency>) gs.typedDependenciesCollapsedTree();
    }
}

The main function is as follows:

/** Just for testing
 * @param args
 * @throws IOException
 * @throws InvalidFormatException
 */
public static void main(String[] args) throws InvalidFormatException, IOException {
    // Notice: there should be a space before and after ",", "(", ")" etc.
    String sent = "We really enjoyed using the Canon PowerShot SD500 .";
    //String sent = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
    //String sent = "It has an exterior design that combines form and function more elegantly than any point-and-shoot we've ever tested . ";
    //String sent = "A Digic II-powered image-processing system enables the SD500 to snap a limitless stream of 7-megapixel photos at a respectable clip , its start-up time is tops in its class , and it delivers decent photos when compared to its competition . ";
    //String sent = "I've had it for about a month and it is simply the best point-and-shoot your money can buy . ";
    StanfordParser sdPaser = StanfordParser.getInstance();
    List<TypedDependency> tdl = sdPaser.DPFromString(sent);
    for (TypedDependency oneTdl : tdl) {
        System.out.println(oneTdl);
    }
    ShallowParser swParser = ShallowParser.getInstance();
    HashMap<Integer, String> phraseLablesMap = swParser.chunk(sent);
    WDTree wdtree = new WDTree();
    WDTreeNode root = wdtree.bulidWDTreeFromList(tdl, phraseLablesMap);
    wdtree.printWDTree(root);
}

The dependency relations between words, the POS tags, and the resulting sentence parse tree are shown below:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.1 sec].
nsubj(enjoyed-3, We-1)
advmod(enjoyed-3, really-2)
root(ROOT-0, enjoyed-3)
xcomp(enjoyed-3, using-4)
det(SD500-8, the-5)
nn(SD500-8, Canon-6)
nn(SD500-8, PowerShot-7)
dobj(using-4, SD500-8)
Loading POS Tagger model ... done (1.492s)
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
Average: 200.0 sent/s
Total: 1 sent
Runtime: 0.0050s
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4
children of ROOT-0_ (phLable:null):
enjoyed-3_  rel:root phLable:VP3   children of enjoyed-3_ (phLable:VP3):
We-1_  rel:nsubj phLable:NP1   really-2_  rel:advmod phLable:ADVP2   using-4_  rel:xcomp phLable:VP3   children of using-4_ (phLable:VP3):
SD500-8_  rel:dobj phLable:NP4   children of SD500-8_ (phLable:NP4):
the-5_  rel:det phLable:NP4   Canon-6_  rel:nn phLable:NP4   PowerShot-7_  rel:nn phLable:NP4   
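The WDTree/WDTreeNode classes used in main() are not listed in the post, but the tree printed above can be reproduced from the typed-dependency strings alone: parse each `rel(governor, dependent)` entry, group dependents by governor, and walk down from ROOT-0. A rough, self-contained sketch (DepTree and its methods are illustrative names; it omits the phrase labels):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepTree {
    // Parse "nsubj(enjoyed-3, We-1)" into {rel, governor, dependent} and group
    // dependents under their governor, as WDTree does with the TypedDependency list.
    static Map<String, List<String[]>> build(List<String> deps) {
        Map<String, List<String[]>> children = new LinkedHashMap<>();
        for (String d : deps) {
            String rel = d.substring(0, d.indexOf('('));
            String[] args = d.substring(d.indexOf('(') + 1, d.length() - 1).split(",\\s*");
            children.computeIfAbsent(args[0], k -> new ArrayList<>())
                    .add(new String[]{rel, args[1]});
        }
        return children;
    }

    // Print the tree depth-first starting from the given node.
    static void print(Map<String, List<String[]>> children, String node, int depth) {
        for (String[] c : children.getOrDefault(node, List.of())) {
            System.out.println("  ".repeat(depth) + c[1] + "  rel:" + c[0]);
            print(children, c[1], depth + 1); // recurse into the dependent's own children
        }
    }

    public static void main(String[] args) {
        List<String> deps = List.of(
            "root(ROOT-0, enjoyed-3)",
            "nsubj(enjoyed-3, We-1)",
            "advmod(enjoyed-3, really-2)",
            "xcomp(enjoyed-3, using-4)",
            "dobj(using-4, SD500-8)",
            "det(SD500-8, the-5)",
            "nn(SD500-8, Canon-6)",
            "nn(SD500-8, PowerShot-7)");
        print(build(deps), "ROOT-0", 0);
    }
}
```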
