Named Entity Recognition Study Notes (spaCy/OpenNLP, etc.)
- spaCy
  - Environment
  - Implementation
- NLTK
  - Environment
  - Implementation
- Stanford NLP
  - Environment
  - Implementation
- NER works
- Spacy
- Install
- Run
- Results
- Entity types
- NLTK
- Install
- Run
- Results
- Entity types
- [Stanford NLP](https://nlp.stanford.edu/software/CRF-NER.shtml)
- Install
- Run
- Results
- Entity types
- BERT-NER
- Install
- Run
- Results
- Entity types
spaCy
API documentation
Environment
Only listing the commands that are not trivially searchable:
- Downloading en_core_web_sm: the only approach that worked for me was downloading the package locally and then running pip install + the local path. (conda reported it as installed, but the model still failed to load.)
- Downloading textacy: python -m pip install textacy
However, verb_phrases = textacy.extract.matches(doc, patterns=patterns) raised TypeError: 'module' object is not callable, meaning the function could not be found.
This turned out to be an API change in newer textacy releases; after reading the library's source code, replacing the old call with the new one fixed it.
Old: verb_phrases = textacy.extract.matches(doc, patterns=patterns)
New: verb_phrases = textacy.corpus.extract.matches.token_matches(doclike=doc, patterns=patterns)
Implementation
Following the reference blog, sections 2.4–2.8 run successfully, including noun and verb recognition.
NLTK
Environment
- NLTK error: Resource punkt not found. Please use the NLTK Downloader to obtain the resource
Fix: download the packages folder from gitee, and remember to unzip the zip files into directories.
Implementation
Shared code for NLTK + Stanford NLP
Stanford NLP
Environment
Download everything as described in the reference article and change the paths to local ones.
Implementation
Shared code for NLTK + Stanford NLP
Generating NER results turned out to be very slow; it can be sped up by batching sentences with sn.tag_sents() (see this reference).
NER works
For most of the modules, just use pip install + xxx to download.
Spacy
Install
spacy
pandas
en_core_web_sm: both pip and conda don't work. Download the newest package here and run pip install + local path
Run
python spacy_NER.py
Results
After step Run, you'll get spacy_NER_result.csv as NER results.
Entity types
There are 18 types in spaCy, but I only use 11 of them since they are more relevant to our program.
type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']
ENT_TYPE_(18 in total) | DESCRIPTION |
---|---|
CARDINAL | Numerals that do not fall under another type. |
DATE | Absolute or relative dates or periods. |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
FAC | Buildings, airports, highways, bridges, etc. |
GPE | Geopolitical entity, i.e. countries, cities, states. |
LANGUAGE | Any named language. |
LAW | Named documents made into laws. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
MONEY | Monetary values, including unit. |
NORP | Nationalities or religious or political groups. |
ORDINAL | “first”, “second”, etc. |
ORG | Companies, agencies, institutions. |
PERCENT | Percentage, including “%”. |
PERSON | People, including fictional. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
QUANTITY | Measurements, as of weight or distance. |
TIME | Times smaller than a day. |
WORK_OF_ART | Titles of books, songs, etc. |
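The filtering step described above (keeping only the 11 types in type_list) can be sketched as follows. Since running the real model requires en_core_web_sm, the hardcoded (text, label) pairs below stand in for what `[(ent.text, ent.label_) for ent in doc.ents]` would return:

```python
# Sketch of filtering spaCy entities down to type_list.
type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP',
             'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']

# Hardcoded stand-in for spaCy's doc.ents output.
ents = [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY'),
        ('Tim Cook', 'PERSON'), ('2021', 'DATE')]

# Keep only the entity types relevant to the project.
filtered = [(text, label) for text, label in ents if label in type_list]
print(filtered)  # MONEY and DATE entities are dropped
```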
NLTK
Install
- pandas
- re
- nltk: for the error Resource punkt not found. Please use the NLTK Downloader to obtain the resource, you can fix it manually:
  - download the folder packages from github or gitee, and rename it to nltk_data
  - the terminal will output several searched paths; just choose one of them and unzip the nltk_data folder there, like 'D:\nltk_data'
Run
python NLTK_NER.py
Results
After step Run, you'll get NLTK_NER_result.csv as NER results.
Entity types
There are 9 types in NLTK, but I only use 5 of them since they are more relevant to our program.
type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']
ENT_TYPE_(9 in total) | DESCRIPTION |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty am, 1:30 p.m |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE | Geopolitical entity: South East Asia, Midlothian |
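NLTK's ne_chunk returns a Tree whose subtrees carry the entity labels above, so extracting entities means walking the tree and joining each subtree's words. To keep this runnable without the NLTK models installed, each subtree below is mocked as a (label, [(word, pos), ...]) pair while plain tokens stay as (word, pos) tuples:

```python
# Sketch of flattening ne_chunk-style output into (entity, type) pairs,
# filtered by type_list. The `chunked` structure is a hardcoded stand-in
# for an nltk.Tree produced by nltk.ne_chunk(nltk.pos_tag(tokens)).
type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']

chunked = [
    ('PERSON', [('Mark', 'NNP')]),                       # NE subtree
    ('ORGANIZATION', [('Georgia', 'NNP'), ('Pacific', 'NNP')]),
    ('works', 'VBZ'),                                    # plain (word, pos) token
    ('in', 'IN'),
    ('GPE', [('Atlanta', 'NNP')]),
]

entities = []
for label, payload in chunked:
    if isinstance(payload, list) and label in type_list:  # a named-entity subtree
        entities.append((' '.join(word for word, _ in payload), label))
print(entities)
```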
Stanford NLP
Install
re
nltk.tag
os
pandas
stanford-ner: download from the Download index and unzip it to a local path, like 'D://stanford-ner-2020-11-17'
java: use your local java path
please make sure your local paths are correct, because they are used to load the NER model
import os
from nltk.tag import StanfordNERTagger

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_261\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER (update both paths to your local install)
sn = StanfordNERTagger('D://stanford-ner-2020-11-17/classifiers/english.muc.7class.distsim.crf.ser.gz', path_to_jar='D://stanford-ner-2020-11-17/stanford-ner.jar')
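StanfordNERTagger.tag returns flat (token, label) pairs with no B-/I- prefixes, so consecutive tokens sharing a label still need to be merged into one entity. A minimal sketch of that post-processing step, with a hardcoded list standing in for sn.tag(tokens) (running the real tagger requires Java and the model files):

```python
# Merge consecutive tokens with the same non-'O' label into one entity.
# `tagged` is a hardcoded stand-in for sn.tag(sentence_tokens).
tagged = [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('visited', 'O'),
          ('New', 'LOCATION'), ('York', 'LOCATION')]

entities, current, current_label = [], [], None
for token, label in tagged:
    if label == current_label and label != 'O':
        current.append(token)                      # entity continues
    else:
        if current_label not in (None, 'O'):       # close previous entity
            entities.append((' '.join(current), current_label))
        current, current_label = [token], label
if current_label not in (None, 'O'):               # flush trailing entity
    entities.append((' '.join(current), current_label))
print(entities)
```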
Run
python Stanford_NLP_NER.py
Results
After step Run, you'll get Stanford_NLP_NER_result.csv as NER results.
Entity types
There are 7 types in Stanford NLP, but I only use 3 of them since they are more relevant to our program.
type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']
ENT_TYPE_(7 in total; no FACILITY or GPE) | DESCRIPTION |
---|---|
Location | Murray River, Mount Everest |
Person | Eddy Bonte, President Obama |
Organization | Georgia-Pacific Corp., WHO |
Money | 175 million Canadian Dollars, GBP 10.40 |
Percent | twenty pct, 18.75 % |
Date | June, 2008-06-29 |
Time | two fifty am, 1:30 p.m |
BERT-NER
Install
# Kaggle
!git clone -b dev https://github.com/kamalkraj/BERT-NER.git
!pip3 install -r /kaggle/working/BERT-NER/requirements.txt
# Local
git clone -b dev https://github.com/kamalkraj/BERT-NER.git
pip3 install -r BERT-NER/requirements.txt
Run
# Kaggle
!python /kaggle/working/BERT-NER/run_ner.py --data_dir=/kaggle/working/BERT-NER/data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
# Local
python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
Since I don't have a GPU, I run it on Kaggle and get out_base successfully.
If you use the default parameters, you can just download the pretrained models BERT_BASE and BERT_LARGE.
Then define a model and get NER outputs.
# BERT_NER.py
model_large = Ner("D:/pythonProject/BERT-NER-dev/out_large/") # local path
python BERT_NER.py
Results
After step Run, you'll get BERT_NER_result.csv as NER results.
Entity types
There are 11 types in BERT-NER.
["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]"]
- PER: person
- LOC: location
- ORG: organization
- MISC: miscellaneous (consisting of diverse things or members)
BIO labels:
- O
- B-X: beginning of an X phrase
- I-X: inside (continuation) of an X phrase
B-PER: "a person name begins here"
I-PER: "a person name continues"
O: "no named entity here"
- NP: Noun Phrase
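The BIO scheme above can be decoded into entity spans with a small loop; the tokens/tags below are hardcoded stand-ins for BERT-NER's per-token output (the [CLS]/[SEP] markers carry no entity):

```python
# Sketch: decoding BIO tags into (entity, type) spans.
# B-X starts an X entity, I-X continues it, O is outside any entity.
tokens = ['[CLS]', 'Jim', 'bought', 'shares', 'of', 'Acme', 'Corp', '[SEP]']
tags   = ['[CLS]', 'B-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', '[SEP]']

entities, current, current_type = [], [], None
for token, tag in zip(tokens, tags):
    if tag.startswith('B-'):                   # a new entity begins
        if current:
            entities.append((' '.join(current), current_type))
        current, current_type = [token], tag[2:]
    elif tag.startswith('I-') and current_type == tag[2:]:
        current.append(token)                  # the entity continues
    else:                                      # O, [CLS], [SEP], or stray I- tag
        if current:
            entities.append((' '.join(current), current_type))
        current, current_type = [], None
if current:                                    # flush trailing entity
    entities.append((' '.join(current), current_type))
print(entities)
```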