Named Entity Recognition Study Notes (spaCy/OpenNLP..)

  • spaCy
    • Environment
    • Implementation
  • NLTK
    • Environment
    • Implementation
  • Stanford NLP
    • Environment
    • Implementation
  • NER works
    • spaCy
      • Install
      • Run
      • Results
      • Entity types
    • NLTK
      • Install
      • Run
      • Results
      • Entity types
    • [Stanford NLP](https://nlp.stanford.edu/software/CRF-NER.shtml)
      • Install
      • Run
      • Results
      • Entity types
    • BERT-NER
      • Install
      • Run
      • Results
      • Entity types

spaCy

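API documentation

Environment

Only the commands that a quick search does not turn up are listed here:

  1. Download en_core_web_sm: the only method that worked for me was to download the package locally and then run pip install + local path. (conda showed it as installed, but it did not work.)
  2. Install textacy: python -m pip install textacy
    However, the line verb_phrases = textacy.extract.matches(doc, patterns=patterns) raised TypeError: ‘module’ object is not callable, which means that textacy.extract.matches is a module in this version, not a function.
    The cause is that the function layout differs in the newer version of the library. After reading the library source code, I changed the old call to the new call below and it worked.
    Old version: verb_phrases = textacy.extract.matches(doc, patterns=patterns)
    New version: verb_phrases = textacy.corpus.extract.matches.token_matches(doclike=doc, patterns=patterns)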
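The version difference described above can be handled in one place. This is only a minimal sketch, not something textacy itself provides: `resolve_token_matcher` and `extract_verb_phrases` are hypothetical helper names. The sketch assumes that the old layout exposes `extract.matches` as a function, that the new layout exposes `extract.matches` as a module containing `token_matches` (as the error message suggests), and that both accept `patterns` as a keyword argument.

```python
def resolve_token_matcher(extract_module):
    """Return the token-matching callable for either textacy layout.

    `extract_module` is expected to be `textacy.extract`. This helper is
    hypothetical and not part of textacy.
    """
    matches = getattr(extract_module, "matches")
    if callable(matches):
        # Older layout: extract.matches is itself the matching function.
        return matches
    # Newer layout: extract.matches is a module that holds token_matches.
    return matches.token_matches


def extract_verb_phrases(doc, patterns):
    """Run the token matcher on a spaCy doc; textacy is imported lazily."""
    from textacy import extract  # requires textacy to be installed

    matcher = resolve_token_matcher(extract)
    return list(matcher(doc, patterns=patterns))
```

With this in place the calling code does not need to change when textacy is upgraded or downgraded.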

Implementation

Following the reference blog post, sections 2.4-2.8 ran successfully, including noun and verb recognition.

NLTK

Environment

  1. Error: NLTK: Resource punkt not found. Please use the NLTK Downloader to obtain the resource
    Fix: download the packages from gitee, and remember to unzip each zip file into a directory.

Implementation

Code for NLTK + Stanford NLP

Stanford NLP

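Environment

Download the files as described in the article and change the paths to your local paths.

Implementation

Code for NLTK + Stanford NLP
Generating the NER tags turned out to be very slow. The improvement is to use sn.tag_sents(); see this post.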
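The speed-up comes from batching: in NLTK, each `tag()` call on a Stanford tagger launches the Java tagger once, while `tag_sents()` handles all sentences in a single launch. A minimal sketch, assuming `sn` is the `StanfordNERTagger` loaded with your local paths; `tag_in_one_batch` is a hypothetical helper name, and whitespace splitting stands in for a real tokenizer.

```python
def tag_in_one_batch(tagger, texts):
    """Tag many texts with one tag_sents() call instead of one tag() per text."""
    # Whitespace tokenization keeps the sketch dependency-free; a real
    # tokenizer such as nltk.word_tokenize could be used instead.
    tokenized = [text.split() for text in texts]
    # tag_sents() takes a list of token lists and returns, for each one,
    # a list of (token, tag) pairs.
    return tagger.tag_sents(tokenized)


# Usage with the real tagger (needs Java and the Stanford NER files):
#   tagged = tag_in_one_batch(sn, ["Obama visited Paris", "WHO is in Geneva"])
```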

NER works

Most of the modules can be installed with pip install + module name.

spaCy

Install

  • spacy

  • pandas

  • en_core_web_sm: neither pip nor conda worked for me. Download the newest package here and run pip install + local path

Run

python spacy_NER.py

Results

After the Run step, the NER results are saved to spacy_NER_result.csv.

Entity types

spaCy has 18 entity types, but I only use 11 of them because they are more relevant to our program.

type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']

| ENT_TYPE_ (18 in total) | DESCRIPTION |
| --- | --- |
| CARDINAL | Numerals that do not fall under another type. |
| DATE | Absolute or relative dates or periods. |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| FAC | Buildings, airports, highways, bridges, etc. |
| GPE | Geopolitical entity, i.e. countries, cities, states. |
| LANGUAGE | Any named language. |
| LAW | Named documents made into laws. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| MONEY | Monetary values, including unit. |
| NORP | Nationalities or religious or political groups. |
| ORDINAL | “first”, “second”, etc. |
| ORG | Companies, agencies, institutions. |
| PERCENT | Percentage, including “%”. |
| PERSON | People, including fictional. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| QUANTITY | Measurements, as of weight or distance. |
| TIME | Times smaller than a day. |
| WORK_OF_ART | Titles of books, songs, etc. |
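To illustrate how the 11 selected types can be applied, here is a minimal sketch. The script spacy_NER.py is not shown in these notes, so `filter_entities` and `extract_entities` are hypothetical names and this is not necessarily how the script works. `spacy.load`, `doc.ents`, `ent.text` and `ent.label_` are standard spaCy usage. spaCy is imported lazily so that the filtering step can be tried without the model installed.

```python
TYPE_LIST = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC',
             'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']


def filter_entities(entities, type_list=TYPE_LIST):
    """Keep only (text, label) pairs whose label is in type_list."""
    allowed = set(type_list)
    return [(text, label) for text, label in entities if label in allowed]


def extract_entities(text, model_name="en_core_web_sm"):
    """Run spaCy NER on text and return the filtered (text, label) pairs."""
    import spacy  # requires spaCy and the en_core_web_sm package

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return filter_entities((ent.text, ent.label_) for ent in doc.ents)


# The filter works on plain tuples, so it can be checked without spaCy:
sample = [("Apple", "ORG"), ("2020", "DATE"), ("Paris", "GPE"), ("$5", "MONEY")]
print(filter_entities(sample))  # [('Apple', 'ORG'), ('Paris', 'GPE')]
```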

NLTK

Install

  • pandas
  • re
  • nltk: if you get the error Resource punkt not found. Please use the NLTK Downloader to obtain the resource, you can follow this
    • download the packages folder from GitHub or Gitee, and rename it to nltk_data
    • the error message lists the paths that were searched; choose one of them and unzip the nltk_data folder there, like ‘D:\nltk_data’
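A common reason for the error to persist is that the archive was copied but not unzipped. The check below is a small sketch using only the standard library. It assumes the usual nltk_data layout, in which punkt lives under tokenizers/punkt, and `punkt_is_unpacked` is a hypothetical helper name.

```python
from pathlib import Path


def punkt_is_unpacked(nltk_data_dir):
    """Return True if <nltk_data_dir>/tokenizers/punkt exists as a directory."""
    return (Path(nltk_data_dir) / "tokenizers" / "punkt").is_dir()


# Example: check the folder chosen above (the path is only an illustration).
print(punkt_is_unpacked(r"D:\nltk_data"))
```

If the folder is in a location NLTK does not search, it can be added at runtime with `nltk.data.path.append(...)`.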

Run

python NLTK_NER.py

Results

After the Run step, the NER results are saved to NLTK_NER_result.csv.

Entity types

NLTK has 9 entity types, but I only use 5 of them because they are more relevant to our program.

type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']

| ENT_TYPE_ (9 in total) | EXAMPLES |
| --- | --- |
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty am, 1:30 p.m |
| MONEY | 175 million Canadian Dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75 % |
| FACILITY | Washington Monument, Stonehenge |
| GPE | geopolitical entity: South East Asia, Midlothian |
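As a rough illustration of how these types can be pulled out of NLTK's chunker output, here is a minimal sketch. NLTK_NER.py is not shown in these notes, so `entities_from_tree` and `extract_entities` are hypothetical names. `nltk.word_tokenize`, `nltk.pos_tag` and `nltk.ne_chunk` are the standard NLTK calls, and NLTK is imported lazily so that the tree-walking logic can be tried on its own.

```python
TYPE_LIST = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']


def entities_from_tree(tree, type_list=TYPE_LIST):
    """Collect (entity text, label) pairs from an ne_chunk result.

    Plain tokens appear as (word, pos) tuples; named entities appear as
    subtrees that expose label() and leaves().
    """
    found = []
    for node in tree:
        if hasattr(node, "label") and node.label() in type_list:
            words = [word for word, _pos in node.leaves()]
            found.append((" ".join(words), node.label()))
    return found


def extract_entities(text):
    """Tokenize, POS-tag and chunk text with NLTK, then filter by type."""
    import nltk  # requires nltk plus the punkt, tagger and chunker resources

    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return entities_from_tree(tree)
```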

Stanford NLP

Install

  • re

  • nltk.tag

  • os

  • pandas

  • Stanford NER: download it from the Download index, and unzip it to a local path, like ‘D://stanford-ner-2020-11-17’

  • java: use your local java path

Make sure your local paths are correct, because they are used to load the NER model.

import os
from nltk.tag import StanfordNERTagger

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_261\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER (classifier model first, then the jar)
sn = StanfordNERTagger('D://stanford-ner-2020-11-17/classifiers/english.muc.7class.distsim.crf.ser.gz', path_to_jar='D://stanford-ner-2020-11-17/stanford-ner.jar')

Run

python Stanford_NLP_NER.py

Results

After the Run step, the NER results are saved to Stanford_NLP_NER_result.csv.

Entity types

Stanford NLP has 7 entity types, but I only use 3 of them because they are more relevant to our program.

type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']

| ENT_TYPE_ (7 in total, except facility & GPE) | EXAMPLES |
| --- | --- |
| Location | Murray River, Mount Everest |
| Person | Eddy Bonte, President Obama |
| Organization | Georgia-Pacific Corp., WHO |
| Money | 175 million Canadian Dollars, GBP 10.40 |
| Percent | twenty pct, 18.75 % |
| Date | June, 2008-06-29 |
| Time | two fifty am, 1:30 p.m |
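The Stanford tagger labels each token separately and does not mark entity boundaries, so a multi-word name such as Mount Everest comes back as two LOCATION tokens. The sketch below merges consecutive tokens that share a tag and keeps only the three selected types. `merge_tagged_tokens` is a hypothetical helper, not part of NLTK or Stanford NER, and it has a known limitation: two different entities of the same type that sit next to each other are merged into one.

```python
from itertools import groupby

TYPE_LIST = ['LOCATION', 'PERSON', 'ORGANIZATION']


def merge_tagged_tokens(tagged, type_list=TYPE_LIST):
    """Merge runs of consecutive tokens that share a tag in type_list."""
    entities = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag in type_list:
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


# Token-level tags in the (word, tag) shape returned by StanfordNERTagger.tag():
tagged = [("Eddy", "PERSON"), ("Bonte", "PERSON"), ("climbed", "O"),
          ("Mount", "LOCATION"), ("Everest", "LOCATION"), ("in", "O"),
          ("June", "DATE")]
print(merge_tagged_tokens(tagged))
# [('Eddy Bonte', 'PERSON'), ('Mount Everest', 'LOCATION')]
```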

BERT-NER

Install

# Kaggle
!git clone -b dev https://github.com/kamalkraj/BERT-NER.git
!pip3 install -r /kaggle/working/BERT-NER/requirements.txt
# Local
git clone -b dev https://github.com/kamalkraj/BERT-NER.git
pip3 install -r BERT-NER/requirements.txt

Run

# Kaggle
!python /kaggle/working/BERT-NER/run_ner.py --data_dir=/kaggle/working/BERT-NER/data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
# Local
python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1

Since I don’t have a GPU, I ran it on Kaggle and obtained out_base successfully.


If you use the default parameters, you can simply download the pretrained models BERT_BASE and BERT_LARGE.

Then define a model and get NER outputs.

# BERT_NER.py
from bert import Ner  # bert.py ships with the BERT-NER repository
model_large = Ner("D:/pythonProject/BERT-NER-dev/out_large/") # local path

python BERT_NER.py

Results

After the Run step, the NER results are saved to BERT_NER_result.csv.

Entity types

BERT-NER uses 11 labels.

["O", "B-MISC", "I-MISC",  "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]"]
  1. PER: person
  2. LOC: location
  3. ORG: organization
  4. MISC: miscellaneous (consisting of diverse things or members)

BIO labels:

  • O: outside any entity
  • B-X: beginning of an X phrase
  • I-X: inside (continuation of) an X phrase

B-PER: “a person name begins here”

I-PER: “a person name continues”

O: “no name here”
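To make the BIO scheme concrete, the following sketch groups tagged words back into whole entities. It is an illustration only: `decode_bio` is a hypothetical helper and not part of BERT-NER. An I- tag that does not continue an open entity of the same type is treated as the start of a new entity, which is one common way to handle malformed sequences.

```python
def decode_bio(words, tags):
    """Group words into (phrase, type) entities from BIO tags."""
    entities, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A new entity starts here; close the previous one if any.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [word], etype
        elif prefix == "I":
            # Same type as the open entity: the phrase continues.
            current.append(word)
        else:
            # "O" (or special tokens such as [CLS]/[SEP]) closes any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


words = ["Steve", "Jobs", "went", "to", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(words, tags))  # [('Steve Jobs', 'PER'), ('New York', 'LOC')]
```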

  • NP: Noun Phrase
