Named Entity Recognition Study Notes (spaCy/OpenNLP..)

spaCy
- Environment
- Implementation
NLTK
- Environment
- Implementation
Stanford NLP
- Environment
- Implementation
NER works
- spaCy
  - Install
  - Run
  - Results
  - Entity types
- NLTK
  - Install
  - Run
  - Results
  - Entity types
- [Stanford NLP](https://nlp.stanford.edu/software/CRF-NER.shtml)
  - Install
  - Run
  - Results
  - Entity types
- BERT-NER
  - Install
  - Run
  - Results
  - Entity types

spaCy

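API documentation

Environment

Only the commands that are not easy to look up are listed here:

Installing en_core_web_sm: the only method that worked for me was to download the package locally and then run pip install followed by the local path. (conda reported a successful install, but it did not work.)
Installing textacy: python -m pip install textacy
However, verb_phrases = textacy.extract.matches(doc, patterns=patterns) then failed with TypeError: 'module' object is not callable, which means that matches is a module rather than a function in the installed version.
The cause turned out to be an API difference in the newer release of the library. After reading the library source, replacing the old call with the new one below fixed it.
Old: verb_phrases = textacy.extract.matches(doc, patterns=patterns)
New: verb_phrases = textacy.corpus.extract.matches.token_matches(doclike=doc, patterns=patterns)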
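A version-tolerant lookup could sidestep this TypeError on either layout. The following is only a minimal sketch: the helper name resolve_token_matcher is hypothetical, and the stand-in objects at the bottom merely imitate the old and new layouts so the code runs without textacy installed.

```python
from types import ModuleType, SimpleNamespace


def resolve_token_matcher(extract):
    """Return a callable that matches token patterns, whichever layout `extract` has.

    `extract` is meant to be the textacy.extract module; it is passed in as an
    argument so the sketch can be exercised without textacy.
    """
    matches = extract.matches
    if isinstance(matches, ModuleType):
        # Newer layout: `matches` is a submodule that holds token_matches().
        return matches.token_matches
    # Older layout: `matches` is itself the function.
    return matches


# Stand-ins that imitate the two layouts (hypothetical, for illustration only).
old_style = SimpleNamespace(matches=lambda doc, patterns: ["old"])
new_matches = ModuleType("matches")
new_matches.token_matches = lambda doclike, patterns: ["new"]
new_style = SimpleNamespace(matches=new_matches)

print(resolve_token_matcher(old_style)("doc", patterns=[]))  # ['old']
print(resolve_token_matcher(new_style)("doc", patterns=[]))  # ['new']
```

With textacy installed, the call would be resolve_token_matcher(textacy.extract)(doc, patterns=patterns); the document is passed positionally because the first parameter is named differently in the two versions.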

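Implementation

Following the reference blog post, sections 2.4-2.8 ran successfully, including noun and verb recognition.

NLTK

Environment

Error: NLTK: Resource punkt not found. Please use the NLTK Downloader to obtain the resource
Fix: download the packages folder from gitee, and remember to unzip each zip archive into a directory.

Implementation

Code for NLTK + Stanford NLP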

Stanford NLP

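Environment

Download the files as described in the article and change the paths to your local paths.

Implementation

Code for NLTK + Stanford NLP
Generating the NER tags turned out to be very slow. Improvement: use sn.tag_sents(); see this post.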
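The speed-up most likely comes from launching the Java process once for the whole batch instead of once per sentence. A minimal sketch of the batching pattern follows; tag_in_one_batch and CountingTagger are hypothetical names, whitespace splitting stands in for a real tokenizer, and the fake tagger only exists so the example runs without Java.

```python
def tag_in_one_batch(tagger, sentences):
    """Tag every sentence with a single tag_sents() call.

    `tagger` is any object with NLTK's tag_sents(list_of_token_lists) method,
    e.g. the StanfordNERTagger instance called `sn` in these notes.
    """
    tokenized = [sentence.split() for sentence in sentences]
    return tagger.tag_sents(tokenized)


class CountingTagger:
    """Hypothetical stand-in that counts calls and tags every token as 'O'."""

    def __init__(self):
        self.calls = 0

    def tag_sents(self, token_lists):
        self.calls += 1
        return [[(token, "O") for token in tokens] for tokens in token_lists]


fake = CountingTagger()
result = tag_in_one_batch(fake, ["Obama visited Paris", "WHO met in Geneva"])
print(fake.calls)  # 1
print(result[0])   # [('Obama', 'O'), ('visited', 'O'), ('Paris', 'O')]
```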

NER works

Most of the modules below can be installed with pip install followed by the module name.

spaCy

Install

spacy
pandas
en_core_web_sm: neither pip nor conda worked for me. Download the newest package here and run pip install followed by the local path.

Run

python spacy_NER.py

Results

After the Run step, the NER results are written to spacy_NER_result.csv.

Entity types

spaCy defines 18 entity types, but I use only 11 of them because they are more relevant to our program.

type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']

ENT_TYPE_(18 in total)	DESCRIPTION
CARDINAL	Numerals that do not fall under another type.
DATE	Absolute or relative dates or periods.
EVENT	Named hurricanes, battles, wars, sports events, etc.
FAC	Buildings, airports, highways, bridges, etc.
GPE	Geopolitical entity, i.e. countries, cities, states.
LANGUAGE	Any named language.
LAW	Named documents made into laws.
LOC	Non-GPE locations, mountain ranges, bodies of water.
MONEY	Monetary values, including unit.
NORP	Nationalities or religious or political groups.
ORDINAL	“first”, “second”, etc.
ORG	Companies, agencies, institutions.
PERCENT	Percentage, including “%”.
PERSON	People, including fictional.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
QUANTITY	Measurements, as of weight or distance.
TIME	Times smaller than a day.
WORK_OF_ART	Titles of books, songs, etc.
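The type_list filter could be applied to spaCy's output roughly as sketched below. This is not the actual spacy_NER.py: the function names and the CSV column names are assumptions, and a stand-in object replaces the spaCy Doc so the example runs without a model installed.

```python
import csv
import io
from types import SimpleNamespace

type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP',
             'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']


def keep_entities(doc, wanted=type_list):
    """Return (text, label) pairs for entities whose label is in `wanted`.

    `doc` needs an `ents` attribute whose items expose `.text` and `.label_`,
    which is what a spaCy Doc provides.
    """
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in wanted]


def write_rows(rows, handle):
    """Write the pairs as CSV with a header row (column names are assumed)."""
    writer = csv.writer(handle)
    writer.writerow(["entity", "type"])
    writer.writerows(rows)


# Stand-in for nlp("...") so the sketch runs without a model installed.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Apple", label_="ORG"),
    SimpleNamespace(text="2019", label_="DATE"),
    SimpleNamespace(text="Paris", label_="GPE"),
])
rows = keep_entities(fake_doc)
buffer = io.StringIO()
write_rows(rows, buffer)
print(rows)  # [('Apple', 'ORG'), ('Paris', 'GPE')]
```

With spaCy installed, spacy.load("en_core_web_sm")("some text") can be passed in place of fake_doc, and an open file handle in place of the in-memory buffer.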

NLTK

Install

pandas
re (part of the Python standard library, no installation needed)
nltk: if you hit the error Resource punkt not found. Please use the NLTK Downloader to obtain the resource, you can follow this
- download the packages folder from github or gitee, and rename it to nltk_data
- the error message lists the paths that were searched; pick one of them, put the nltk_data folder there, and unzip the archives inside it, e.g. 'D:\nltk_data'
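Before rerunning the script, it can help to confirm that the unzipped resource really sits under one of the searched paths. The check below is a minimal sketch using only the standard library; find_unzipped_resource is a hypothetical helper, tokenizers/punkt is the usual location of punkt inside nltk_data, and a temporary folder stands in for a real path such as D:\nltk_data.

```python
import os
import tempfile


def find_unzipped_resource(search_paths, resource="tokenizers/punkt"):
    """Return the first search path that holds `resource` as a directory, else None."""
    for base in search_paths:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.isdir(candidate):
            return base
    return None


# Demonstration: one folder still holds only the zip, the other is unzipped.
with tempfile.TemporaryDirectory() as root:
    zipped_only = os.path.join(root, "zipped")
    unzipped = os.path.join(root, "unzipped")
    os.makedirs(os.path.join(zipped_only, "tokenizers"))
    open(os.path.join(zipped_only, "tokenizers", "punkt.zip"), "w").close()
    os.makedirs(os.path.join(unzipped, "tokenizers", "punkt"))
    found = find_unzipped_resource([zipped_only, unzipped])
    found_name = os.path.basename(found)

print(found_name)  # unzipped
```

In practice the list of search paths would be the one printed in the error message, or nltk.data.path.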

Run

python NLTK_NER.py

Results

After the Run step, the NER results are written to NLTK_NER_result.csv.

Entity types

NLTK defines 9 entity types, but I use only 5 of them because they are more relevant to our program.

type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']

ENT_TYPE_(9 in total)	DESCRIPTION
ORGANIZATION	Georgia-Pacific Corp., WHO
PERSON	Eddy Bonte, President Obama
LOCATION	Murray River, Mount Everest
DATE	June, 2008-06-29
TIME	two fifty am, 1:30 p.m
MONEY	175 million Canadian Dollars, GBP 10.40
PERCENT	twenty pct, 18.75 %
FACILITY	Washington Monument, Stonehenge
GPE	South East Asia, Midlothian (geopolitical entities)
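NLTK's ne_chunk returns a tree in which each named entity is a labelled subtree and every other token is a plain (word, POS tag) tuple. A minimal sketch of pulling the wanted types out of such a tree follows; entities_from_tree is a hypothetical helper, and FakeTree is a stand-in for nltk.Tree so the example runs without NLTK or its data files.

```python
def entities_from_tree(tree, wanted):
    """Collect (text, label) pairs from a chunk tree like the one nltk.ne_chunk returns.

    Named-entity chunks are subtrees with a label(); everything else is a
    plain (word, pos_tag) tuple and is skipped.
    """
    found = []
    for node in tree:
        if hasattr(node, "label") and node.label() in wanted:
            text = " ".join(word for word, _pos in node)
            found.append((text, node.label()))
    return found


class FakeTree(list):
    """Hypothetical minimal stand-in for nltk.Tree: a list with a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']
sentence = FakeTree("S", [
    FakeTree("PERSON", [("Eddy", "NNP"), ("Bonte", "NNP")]),
    ("visited", "VBD"),
    FakeTree("GPE", [("Midlothian", "NNP")]),
])
entities = entities_from_tree(sentence, type_list)
print(entities)  # [('Eddy Bonte', 'PERSON'), ('Midlothian', 'GPE')]
```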

Stanford NLP

Install

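re (part of the Python standard library, no installation needed)
nltk.tag (part of nltk)
os (part of the Python standard library, no installation needed)
pandas
Stanford NER: download it from the Download index and unzip it to a local path, e.g. 'D://stanford-ner-2020-11-17'
java: use your local java path

Make sure your local paths are correct, because they are used to load the NER model.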

import os
from nltk.tag import StanfordNERTagger

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_261\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER (model file first, then the jar)
sn = StanfordNERTagger('D://stanford-ner-2020-11-17/classifiers/english.muc.7class.distsim.crf.ser.gz', path_to_jar='D://stanford-ner-2020-11-17/stanford-ner.jar')

Run

python Stanford_NLP_NER.py

Results

After the Run step, the NER results are written to Stanford_NLP_NER_result.csv.

Entity types

The Stanford NLP 7-class model defines 7 entity types, but I use only 3 of them because they are more relevant to our program.

type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']

ENT_TYPE_(7 in total, except facility & GPE)	DESCRIPTION
Location	Murray River, Mount Everest
Person	Eddy Bonte, President Obama
Organization	Georgia-Pacific Corp., WHO
Money	175 million Canadian Dollars, GBP 10.40
Percent	twenty pct, 18.75 %
Date	June, 2008-06-29
Time	two fifty am, 1:30 p.m
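The Stanford tagger labels each token separately, so a multi-word name such as Mount Everest comes back as two LOCATION tokens. One simple way to rebuild whole entities is to join adjacent tokens that share a tag. The sketch below assumes that approach; merge_tagged_tokens is a hypothetical helper and the tagged list is made-up sample data, not real tagger output.

```python
from itertools import groupby


def merge_tagged_tokens(tagged, wanted):
    """Join runs of adjacent tokens that share a wanted tag into one entity."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']
tagged = [("President", "O"), ("Obama", "PERSON"), ("climbed", "O"),
          ("Mount", "LOCATION"), ("Everest", "LOCATION"), ("in", "O"),
          ("June", "DATE")]
merged = merge_tagged_tokens(tagged, type_list)
print(merged)  # [('Obama', 'PERSON'), ('Mount Everest', 'LOCATION')]
```

A known limitation of this approach: two different entities of the same type that appear directly next to each other are merged into one.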

BERT-NER

Install

# Kaggle
!git clone -b dev https://github.com/kamalkraj/BERT-NER.git
!pip3 install -r /kaggle/working/BERT-NER/requirements.txt
# Local
git clone -b dev https://github.com/kamalkraj/BERT-NER.git
pip3 install -r BERT-NER/requirements.txt

Run

# Kaggle
!python /kaggle/working/BERT-NER/run_ner.py --data_dir=/kaggle/working/BERT-NER/data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
# Local (run from inside the BERT-NER folder)
python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1

Since I don't have a GPU, I ran it on Kaggle and obtained out_base successfully.

If you use the default parameters, you can simply download the pretrained models BERT_BASE and BERT_LARGE instead of training them yourself.

Then define a model and get the NER outputs.

# BERT_NER.py
from bert import Ner  # bert.py is part of the BERT-NER repo
model_large = Ner("D:/pythonProject/BERT-NER-dev/out_large/") # local path

python BERT_NER.py

Results

After the Run step, the NER results are written to BERT_NER_result.csv.

Entity types

BERT-NER uses 11 labels, including the special tokens [CLS] and [SEP].

["O", "B-MISC", "I-MISC",  "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]"]

PER: person
LOC: location
ORG: organization
MISC: miscellaneous (consisting of diverse things or members)

BIO labels:

O: outside any entity
B-X: beginning of an X phrase
I-X: inside (continuation of) an X phrase

B-PER: "a person name begins here"

I-PER: "a person name continues"

O: "no name here"

NP: Noun Phrase
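To turn per-token BIO labels back into whole entities, the tags can be decoded in one pass. The following is a minimal sketch under the label scheme listed above; decode_bio is a hypothetical helper and the token/tag lists are made-up sample data. Tags without a B- or I- prefix (O, [CLS], [SEP]) simply close any open span.

```python
def decode_bio(tokens, tags):
    """Turn parallel token/tag lists into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if current_words and not continues:
            # The open span ends here; store it before handling this token.
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if prefix == "B" or (prefix == "I" and not current_words):
            # "B-X" always opens a span; a stray "I-X" is treated as an opener too.
            current_words, current_type = [token], entity_type
        elif continues:
            current_words.append(token)
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


tokens = ["Steve", "Jobs", "went", "to", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
decoded = decode_bio(tokens, tags)
print(decoded)  # [('Steve Jobs', 'PER'), ('Paris', 'LOC')]
```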
