A Brief Introduction to Apache OpenNLP

This article is a brief introduction to OpenNLP.

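OpenNLP is currently a top-level Apache project. It is a pure Java natural language processing toolkit that supports most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

OpenNLP tasks generally rely on a trained model to produce their results, so the typical parameters of a task are a model, an input file, and an output file. Two kinds of interface are provided: a command-line interface and an API.

Sentence detection and tokenization are the foundation. Later tasks such as named entity extraction, part-of-speech tagging, chunking, and parsing mostly build on their output.

The tasks are described one by one below.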

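1. Sentence Detector

Sentence detection splits text mainly at punctuation marks, but the English period serves several purposes (for example in abbreviations such as "Mr." or "Nov."), which makes sentence boundaries hard to find with fixed rules. That is why sentence detection is learned from training data.

The output of sentence detection places each sentence on its own line.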

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,

    was named a director of this British industrial conglomerate.

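In the training data each line represents one sentence, and paragraphs are separated by an empty line. Paragraphs of a dozen or more sentences are recommended.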
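To see why a trained model is used rather than a fixed rule, here is a minimal sketch of a deliberately naive splitter that breaks after every period followed by whitespace. The function name `naive_split` is hypothetical, and this is not how OpenNLP detects sentences; it only illustrates the ambiguity of the English period on one of the example sentences.

```python
import re


def naive_split(text):
    """Split after every period that is followed by whitespace (deliberately naive)."""
    return re.split(r"(?<=\.)\s+", text.strip())


sample = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."
pieces = naive_split(sample)

# The abbreviation "Mr." is wrongly treated as the end of a sentence,
# so this single sentence comes back as more than one piece.
print(len(pieces), pieces)
```

A learned detector can use the surrounding context to decide whether a given period really ends a sentence, which a rule like this cannot do.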

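2. Tokenization

Tokenization splits text into words, punctuation marks, numbers, and so on. In the output, tokens are separated by spaces; punctuation marks are kept and have a space on both sides. Sentence detection must be run before tokenization (one sentence per line).

Three tokenizers are implemented: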

· WhitespaceTokenizer - A whitespace tokenizer; non-whitespace sequences are identified as tokens.

· SimpleTokenizer - A character class tokenizer; sequences of the same character class are tokens.

· LearnableTokenizer - A maximum entropy tokenizer; it detects token boundaries based on a probability model.
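The difference between the first two strategies can be sketched in a few lines. This is only an approximation written for illustration, assuming four character classes (whitespace, letters, digits, everything else); the function names are hypothetical and OpenNLP's own SimpleTokenizer may treat some characters differently.

```python
from itertools import groupby


def char_class(ch):
    """Assign a character to one of four coarse classes (an assumption of this sketch)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def whitespace_tokenize(text):
    """Every run of non-whitespace characters is one token."""
    return text.split()


def char_class_tokenize(text):
    """Every run of characters of the same class is one token; whitespace is dropped."""
    return ["".join(run) for cls, run in groupby(text, key=char_class) if cls != "space"]


sentence = "director Nov. 29."
print(whitespace_tokenize(sentence))
print(char_class_tokenize(sentence))
```

The whitespace version leaves punctuation attached to the neighbouring word, while the character class version separates it. Neither can tell that the period in "Nov." belongs to the abbreviation, which is the case a learned tokenizer is meant to handle.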

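In the training data, a token boundary that is not already marked by whitespace is marked with <SPLIT>. English words are naturally separated by spaces. How should Chinese be handled? Presumably every boundary would have to be marked with <SPLIT>.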

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.

Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.

Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>,

    was named a nonexecutive director of this British industrial conglomerate<SPLIT>.
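As a small illustration of how this markup encodes token boundaries, the sketch below recovers the token sequence from one annotated line by treating both whitespace and <SPLIT> as boundaries. The helper name `tokens_from_split_markup` is hypothetical; it only mirrors the format described above and is not OpenNLP's reader.

```python
def tokens_from_split_markup(line):
    """Treat <SPLIT> and whitespace alike as token boundaries."""
    return line.replace("<SPLIT>", " ").split()


annotated = "Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>."
tokens = tokens_from_split_markup(annotated)
print(tokens)
```

Note that "Mr." and "N.V." stay whole because no <SPLIT> was placed before their periods, whereas the sentence-final period becomes its own token.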

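3. Name Finder (named entity recognition)

The goal of Named Entity Recognition is to extract entities such as person names, times, and company names from text. It can also be used to extract other kinds of terms. Raw text must be split into sentences and tokenized before named entities can be recognized.

The annotated result looks like this: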

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,

    was named a director of this British industrial conglomerate .

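Training data is best annotated after sentence detection and tokenization. The markers <START:person> and <END> must also have a space on both sides; in other words, each marker is treated as a token of its own. Otherwise an error is reported (not only one outcome).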
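The following sketch shows why the markers have to be separate tokens: a reader can then walk over the whitespace-separated tokens and turn each marker pair into a span over the remaining tokens. The function `parse_name_sample` and the (start, end, type) span representation are assumptions made for this illustration, not OpenNLP's API.

```python
import re

START = re.compile(r"<START:([^>]+)>")


def parse_name_sample(line):
    """Return (tokens, spans); each span is (start, end, type) with end exclusive."""
    tokens, spans = [], []
    open_start, open_type = None, None
    for tok in line.split():
        match = START.fullmatch(tok)
        if match:
            # Remember where the entity begins and what type it has.
            open_start, open_type = len(tokens), match.group(1)
        elif tok == "<END>" and open_start is not None:
            spans.append((open_start, len(tokens), open_type))
            open_start, open_type = None, None
        else:
            tokens.append(tok)
    return tokens, spans


line = "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. ."
tokens, spans = parse_name_sample(line)
print(tokens)
print(spans)
```

If a marker were glued to a word (for example "<START:person>Vinken"), this reader would see an ordinary token and no entity at all, which is the kind of malformed input the spacing rule guards against.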

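4. Document Categorizer

OpenNLP classifies documents using the maximum entropy framework. For example, for someone interested in gross margin, the sample text below would be classified under GMDecrease. Categories depend on the specific requirement, so no pre-trained model is provided.

The training data looks like this: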

GMDecrease Major acquisitions that have a lower gross margin than the existing network also \

           had a negative impact on the overall gross margin, but it should improve following \

           the implementation of its integration strategies .

GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \

           to obligations towards dealers .

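The training data is a piece of text that starts with its category: each line is one document, and the first word on the line is the category. Line breaks inside a document are marked with a backslash, as in the example above.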
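A minimal sketch of reading this layout follows. It first joins lines that end with a backslash and then splits off the first word as the category. The function name `parse_doccat` is hypothetical, and whether OpenNLP's own reader accepts backslash continuations is not confirmed here; this sketch joins them explicitly so that each document really is a single line before it is split.

```python
def parse_doccat(raw):
    """Return a list of (category, text) pairs from category-first training lines."""
    documents, buffer = [], ""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines between entries
        if line.endswith("\\"):
            # Continuation: drop the backslash and keep collecting.
            buffer += line[:-1].strip() + " "
            continue
        buffer += line
        category, text = buffer.split(None, 1)
        documents.append((category, text))
        buffer = ""
    return documents


sample = r"""
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
"""
docs = parse_doccat(sample)
print(docs)
```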

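5. Part-of-Speech Tagger

The part-of-speech tagger marks each token with its corresponding word type, based on the token itself and the context of the token. A token may have several possible POS tags, depending on the token and its context. The OpenNLP POS Tagger uses a probability model to predict the correct POS tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves both tagging accuracy and runtime performance.

Training data requirements: each line is a tokenized sentence, each token is joined to its tag by an underscore "_", and tokens are separated by spaces. Paragraphs (documents) are separated by an empty line.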

About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.

That_DT sounds_VBZ good_JJ ._.
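The word_TAG layout can be read back into (token, tag) pairs with a short sketch like the one below. Splitting at the last underscore is a choice made here so that a token which itself contains an underscore is still handled; the function name is hypothetical.

```python
def parse_tagged_sentence(line):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, tag = item.rsplit("_", 1)  # split at the LAST underscore only
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged_sentence("About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.")
print(pairs)
```

Punctuation is tagged with itself in this sample, so ",_," yields the pair (",", ",").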

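A tag dictionary has two benefits: tokens listed in the dictionary cannot be assigned inappropriate tags, and the beam search algorithm has fewer possibilities to consider and therefore searches faster. The dictionary is expressed as XML and is stored in the POSDictionary class. See the documentation and the source code for details.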

<?xml version="1.0" encoding="UTF-8"?>

<dictionary>

  <entry tags="NNP">

    <token>Mary</token>

  </entry>

  <entry tags="VBD">

    <token>had</token>

  </entry>

  <entry tags="DT">

    <token>a</token>

  </entry>

  <entry tags="JJ">

    <token>little</token>

  </entry>

  <entry tags="NN JJ">

    <token>lamb</token>

  </entry>

  <entry tags="PRP$">

    <token>His</token>

  </entry>

  <entry tags="NN">

    <token>fleece</token>

  </entry>

  <entry tags="VBD">

    <token>was</token>

  </entry>

  <entry tags="JJ">

    <token>white</token>

  </entry>

</dictionary>
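To make the effect of such a dictionary concrete, here is a minimal sketch that loads entries of the shape shown above and uses them to restrict the candidate tags for a token. The function names and the exact-match lookup are assumptions of this sketch; OpenNLP's POSDictionary may differ, for example in how it treats letter case.

```python
import xml.etree.ElementTree as ET


def load_tag_dictionary(xml_text):
    """Map each <token> to the list of tags in its entry's 'tags' attribute."""
    table = {}
    for entry in ET.fromstring(xml_text).findall("entry"):
        tags = entry.get("tags", "").split()
        for token in entry.findall("token"):
            table[token.text] = tags
    return table


def candidate_tags(token, table, full_tagset):
    """Tokens in the dictionary get only their listed tags; others get the full set."""
    return table.get(token, full_tagset)


xml_text = """<dictionary>
  <entry tags="NN JJ"><token>lamb</token></entry>
  <entry tags="VBD"><token>had</token></entry>
</dictionary>"""

table = load_tag_dictionary(xml_text)
full_tagset = ["NN", "JJ", "VBD", "DT", "NNP"]
print(candidate_tags("lamb", table, full_tagset))
print(candidate_tags("sheep", table, full_tagset))
```

With fewer candidates per token, a search over tag sequences has fewer branches to score, which is where the speed-up mentioned above comes from.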

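6. Lemmatizer

Lemmatization reduces a word in any of its forms to its base form (which still expresses the full meaning), whereas stemming extracts the stem or root of a word (which does not necessarily express the full meaning).

In effect, it maps the various forms of a word (plural, progressive, past tense, and so on) back to the base form, as in the following examples: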

said VBD say

signed VBD sign

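Lemmatization is performed with statistical and dictionary-based methods. The input must be tokenized and part-of-speech tagged.

The training data has three columns separated by whitespace. Each word is on its own line, and sentences are separated by an empty line. The first column is the word as it appears in the sentence, the second column is its part-of-speech tag, and the third column is its lemma.

Training data for a simple sentence: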

He        PRP  he

reckons   VBZ  reckon

the       DT   the

current   JJ   current

accounts  NNS  account

deficit   NN   deficit

will      MD   will

narrow    VB   narrow

to        TO   to

only      RB   only

#         #    #

1.8       CD   1.8

millions  CD   million

in        IN   in

September NNP  september

.         .    O
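The three-column layout lends itself to a simple reader, sketched below together with the most basic dictionary-style lookup keyed on (word, tag). Both function names are hypothetical, and the lookup is only an illustration of the dictionary-based idea, not OpenNLP's lemmatizer.

```python
def read_lemma_samples(text):
    """Return a list of sentences; each sentence is a list of (word, tag, lemma)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag, lemma = line.split()
        current.append((word, tag, lemma))
    if current:
        sentences.append(current)
    return sentences


def build_lemma_table(sentences):
    """Map (word, tag) to lemma, as a plain dictionary lookup."""
    return {(w, t): l for sent in sentences for (w, t, l) in sent}


data = """He        PRP  he
reckons   VBZ  reckon
accounts  NNS  account

signed    VBD  sign
"""
sentences = read_lemma_samples(data)
table = build_lemma_table(sentences)
print(table[("accounts", "NNS")])
```

Keying on the tag as well as the word matters because the same surface form can have different lemmas under different parts of speech.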

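7. Chunker

Text chunking divides a text into syntactically related groups of words, such as noun groups and verb groups, but does not specify their internal structure or their role in the main sentence.

The training data has three columns, with each word on its own line and an empty line after each sentence. The first column is the word, the second column is its part-of-speech tag, and the third column is the chunk tag. A chunk tag usually has two parts: in B-CHUNK the B marks the first word of a chunk, and in I-CHUNK the I marks a word inside the chunk. For example, I-NP marks a non-initial word of a noun phrase, and B-VP marks the first word of a verb phrase. The tag O marks a token that is outside any chunk.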

He        PRP  B-NP

reckons   VBZ  B-VP

the       DT   B-NP

current   JJ   I-NP

account   NN   I-NP

deficit   NN   I-NP

will      MD   B-VP

narrow    VB   I-VP

to        TO   B-PP

only      RB   B-NP

#         #    I-NP

1.8       CD   I-NP

billion   CD   I-NP

in        IN   B-PP

September NNP  B-NP

.         .    O
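The B-/I-/O tags can be turned back into chunks with a short grouping pass, sketched here under the assumption that rows are (word, POS tag, chunk tag) triples. The function name `group_chunks` is hypothetical.

```python
def group_chunks(rows):
    """Group (word, pos, chunk_tag) rows into a list of (chunk_type, [words])."""
    chunks = []
    for word, _pos, tag in rows:
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "I" and chunks and chunks[-1][0] == chunk_type:
            # Continue the chunk that is currently open.
            chunks[-1][1].append(word)
        elif prefix in ("B", "I"):
            # B- starts a chunk; a stray I- with nothing to continue also starts one.
            chunks.append((chunk_type, [word]))
        else:
            # O: the token is outside any chunk.
            chunks.append(("O", [word]))
    return chunks


rows = [
    ("He", "PRP", "B-NP"),
    ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),
    ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"),
    ("deficit", "NN", "I-NP"),
    (".", ".", "O"),
]
chunks = group_chunks(rows)
print(chunks)
```

The B prefix is what lets two adjacent chunks of the same type be told apart; without it, "the current account deficit" and a following noun phrase would merge into one.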

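8. Parser

For reasons that are not clear to me, the documentation says that it is for demonstration and testing only; the original wording is "The tool is only intended for demonstration and testing." OpenNLP provides a chunking parser and a tree insert parser, but the latter is still experimental and is not recommended for production use.

The training data is in Penn Treebank format, one sentence per line.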

(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))

(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))

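For details of the Penn Treebank annotation format, see the Penn Treebank website. The parser model also contains a POS tagger model. Training requires a large amount of annotated data to achieve good parsing accuracy.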
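To show how the bracketed notation encodes a tree, the sketch below parses one line into nested (label, children) tuples and reads the words back from the leaves. The functions `parse_tree` and `leaves` are written for this illustration only and assume well-formed, balanced input; they are not part of OpenNLP.

```python
import re


def parse_tree(line):
    """Parse '(LABEL child child ...)' into a nested (label, children) tuple."""
    tokens = re.findall(r"\(|\)|[^\s()]+", line)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Collect the leaf words of a parsed tree from left to right."""
    words = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_tree("(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))")
print(tree[0], leaves(tree))
```

Each innermost pair such as (DT Some) is a POS tag with its word, which is why a parser model trained on this data can also carry a POS tagger model.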

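9. Coreference Resolution

Coreference resolution is a basic task in natural language processing (NLP). Its goal is to automatically identify the noun phrases and pronouns that refer to the same entity and to group them together. The documentation does not describe the specific interfaces.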