Getting Started with Stanford CoreNLP and Java (for Python Programmers)

Hello there! I’m back, and I want this to be the first in a series of posts on Stanford’s CoreNLP library. In this article I will focus on installing the library and introducing its basic features for Java newbies like myself. I will first go through the installation steps and a couple of tests from the command line. Later I will walk you through two very simple Java programs that you will be able to incorporate easily into your Python NLP pipeline. You can find the complete code on GitHub!

CoreNLP is a toolkit with which you can build a fairly complete NLP pipeline in only a few lines of code. The library includes pre-built methods for all the main NLP tasks, such as Part of Speech (POS) tagging, Named Entity Recognition (NER), Dependency Parsing and Sentiment Analysis. It also supports languages other than English, specifically Arabic, Chinese, German, French and Spanish.

I am a big fan of the library, mainly because of HOW COOL its Sentiment Analysis model is ❤ (I will talk more about it in the next post). However, I can see why most people would rather use other libraries like NLTK or spaCy, as CoreNLP can be a bit of an overkill. The reality is that CoreNLP can be much more computationally expensive than other libraries, and for shallow NLP tasks its results are not even significantly better. Plus it’s written in Java, and getting started with it is a bit of a pain for Python users (it is doable, though, as you will see below, and it also has a Python API if you can’t be bothered).

CoreNLP Pipeline and Basic Annotators

The basic building block of CoreNLP is the CoreNLP pipeline. The pipeline takes an input text, processes it and outputs the results in the form of a CoreDocument object. A CoreNLP pipeline can be customised and adapted to the needs of your NLP project: a properties object lets you add, remove or edit annotators.

That was a lot of jargon, so let’s break it down with an example. All the information and figures were taken from the official CoreNLP page.

In the figure above we have a basic CoreNLP pipeline, the one that is run by default when you first run the CoreNLP pipeline class without changing anything. At the far left, the input text enters the pipeline; this will usually be a plain .txt file. The pipeline itself is composed of 6 annotators. These annotators process the input text sequentially, and the intermediate output of one annotator is sometimes used as input by another. If we wanted to change this pipeline by adding or removing annotators, we would use the properties object. The final output is a set of annotations in the form of a CoreDocument object.
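To make the role of the properties object a little more concrete, here is a minimal sketch that uses only the standard java.util.Properties class, so it runs without CoreNLP installed. The class and method names (PipelineConfigSketch, buildProperties) are made up for illustration. The "annotators" key and the short annotator names are the ones I understand CoreNLP to use, so treat them as an assumption and check them against the official documentation for your version.

```java
import java.util.Properties;

class PipelineConfigSketch {

    // Short annotator names for a six-step pipeline (assumed CoreNLP spellings).
    static final String SIX_ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,depparse";

    /** Builds a Properties object that requests the given comma-separated annotators. */
    public static Properties buildProperties(String annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", annotators);
        return props;
    }

    public static void main(String[] args) {
        // The full six-annotator configuration.
        Properties full = buildProperties(SIX_ANNOTATORS);
        // A lighter configuration: NER and dependency parsing removed.
        Properties light = buildProperties("tokenize,ssplit,pos,lemma");

        System.out.println(full.getProperty("annotators"));
        System.out.println(light.getProperty("annotators"));
        // With the CoreNLP jars on the classpath, one of these objects would be
        // passed to the pipeline constructor, e.g. new StanfordCoreNLP(full).
    }
}
```

Adding or removing an annotator then comes down to editing one comma-separated string, which is why the properties object is the natural place to customise the pipeline.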

We will be working with this basic pipeline throughout the article. The nature of these objects will become clearer later on when we look at an example. For the moment, let’s note down what each annotator does:

Annotator 1: Tokenization → turns raw text into tokens.
Annotator 2: Sentence Splitting → divides raw text into sentences.
Annotator 3: Part of Speech (POS) Tagging → assigns a part-of-speech label to every token, such as whether it is a verb or a noun.
Annotator 4: Lemmatization → converts every word into its lemma, its dictionary form. For example, the word “was” is mapped to “be”.
Annotator 5: Named Entity Recognition (NER) → recognises when an entity (a person, country, organization, etc.) is named in a text. It also recognises numerical entities such as dates.
Annotator 6: Dependency Parsing → parses the text and identifies the dependencies between words.
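If the first two annotators feel abstract, the sketch below shows the idea of tokenization and sentence splitting using Java’s built-in java.text.BreakIterator. To be clear, this is a stand-in from the standard library and not CoreNLP’s tokenizer or sentence splitter, which are more sophisticated; the class and method names (AnnotatorIdeaSketch, tokens, sentences) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnnotatorIdeaSketch {

    /** Cuts the text at every boundary the iterator reports, dropping whitespace-only pieces. */
    static List<String> split(BreakIterator iterator, String text) {
        List<String> pieces = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }

    /** Rough equivalent of step 1: raw text to tokens (punctuation becomes its own token). */
    public static List<String> tokens(String text) {
        return split(BreakIterator.getWordInstance(Locale.ENGLISH), text);
    }

    /** Rough equivalent of step 2: raw text to sentences. */
    public static List<String> sentences(String text) {
        return split(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    public static void main(String[] args) {
        System.out.println(tokens("the quick brown fox jumped over the lazy dog"));
        System.out.println(sentences("The fox jumped. The dog slept."));
    }
}
```

The later annotators (POS tagging, lemmatization, NER, dependency parsing) rely on trained models, so there is no comparable one-liner for them in the standard library.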

Lastly, all the outputs from the 6 annotators are organised into a CoreDocument. This is basically a data object that contains the annotation information in a structured way. CoreDocuments make our lives easier since, as you will see later on, they store all the information so that we can access it through a simple API.

Installation

You will need to have Java installed. You can download the latest version here. To download CoreNLP I followed the official guide:

1. Download the CoreNLP zip file using curl or wget

curl -O -L http://nlp.stanford.edu/software/stanford-corenlp-latest.zip

2. Unzip the file

unzip stanford-corenlp-latest.zip

3. Move into the newly created directory

cd stanford-corenlp-4.1.0

Let’s now go through a couple of examples to make sure everything works.

Example using the command line and an input.txt file

For this example, we will first open the terminal and create a test file to use as input. The code was adapted from CoreNLP’s official site. You can use the following command:

echo "the quick brown fox jumped over the lazy dog" > test.txt

echo writes the sentence "the quick brown fox jumped over the lazy dog" to the test.txt file (the > operator redirects its output there).

Let’s now run a default coreNLP pipeline on the test sentence.

java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat xml -file test.txt

This is a java command that loads and runs the CoreNLP pipeline from the class edu.stanford.nlp.pipeline.StanfordCoreNLP. Since we have not passed any settings to that class, it will use its defaults. The pipeline will take the test.txt file as input and will output an XML file.
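If you would rather launch that command from code than type it, here is a minimal sketch that assembles the same arguments as a list and, only when asked, hands them to the standard ProcessBuilder. The class and method names (CoreNlpCommandSketch, buildCommand) and the --run switch are made up for illustration; the flags themselves are exactly the ones from the command above. It assumes you start it from inside the unzipped CoreNLP directory, since the "*" classpath refers to the jars in the current directory.

```java
import java.util.List;

class CoreNlpCommandSketch {

    /** Assembles the command-line arguments for a default pipeline run. */
    public static List<String> buildCommand(String inputFile, String outputFormat) {
        return List.of(
                "java",
                "-cp", "*",      // all jars in the current directory (expanded by the JVM, no shell needed)
                "-mx3g",         // memory setting copied from the command in the article
                "edu.stanford.nlp.pipeline.StanfordCoreNLP",
                "-outputFormat", outputFormat,
                "-file", inputFile);
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("test.txt", "xml");
        System.out.println(String.join(" ", command));

        // Only start the (slow) pipeline when explicitly requested.
        if (args.length > 0 && args[0].equals("--run")) {
            Process process = new ProcessBuilder(command).inheritIO().start();
            System.exit(process.waitFor());
        }
    }
}
```

Keeping the arguments in a list rather than a single string sidesteps shell quoting, which matters here because typographic quotes around the asterisk will break the command in a terminal.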

Once you run the command the pipeline will start annotating the text. You will notice it takes a while… (around 20 seconds for a 9-word-sentence
