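[AdaSeq Basics] A Roundup of 30+ NER Datasets Covering Multiple Industries and Multimodal Named Entity Recognition
Introduction
Named entity recognition (NER) is a fundamental NLP task that has long attracted wide attention from both academia and industry. This article summarizes commonly used Chinese, English, multilingual, and multimodal NER datasets.
Details of the datasets are available at: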
https://github.com/modelscope/AdaSeq/blob/master/docs/datasets.md
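1. Chinese Datasets
We start with commonly used Chinese NER datasets. The corpora come from news, e-commerce, entertainment, medicine, Weibo, academic literature, and other sources.
MSRA Named Entity Recognition Dataset
Description: This dataset contains a training set (46364) and a test set (4365). Entity types are location (LOC), person (NAME), and organization (ORG). The data comes from the news domain.
Language: Chinese
Train/dev/test size: 46364/-/4365
Number of entity types: 3
Paper: https://aclanthology.org/W06-0115.pdf
Download: https://tianchi.aliyun.com/dataset/144307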
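Resume Named Entity Recognition Dataset
Description: This dataset contains a training set (3821), a validation set (463), and a test set (477). Entity types are nationality (CONT), educational background (EDU), location (LOC), person (NAME), organization (ORG), major (PRO), ethnicity (RACE), and job title (TITLE). The text is fairly well-formed, and NER models typically reach an F1 above 90% on it.
Language: Chinese
Train/dev/test size: 3821/463/477
Number of entity types: 9
Paper: https://aclanthology.org/P18-1144.pdf
Download: https://tianchi.aliyun.com/dataset/144345
GitHub: https://github.com/jiesutd/LatticeLSTM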
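The F1 figure quoted for NER datasets is usually computed over whole entities rather than individual tokens. The following is a minimal sketch of micro-averaged, exact-match entity F1. The span representation (sentence index, start, end, type) and the sample spans are assumptions made for illustration and are not taken from the Resume dataset or from AdaSeq.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span format: (sentence index, start, end, entity type)
Span = Tuple[int, int, int, str]


def entity_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Micro-averaged F1 where a prediction counts only on an exact span and type match."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    if not gold_set or not pred_set:
        return 0.0
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the second prediction has the right boundaries but the wrong type.
gold = [(0, 0, 2, "NAME"), (0, 5, 9, "ORG"), (1, 3, 5, "TITLE")]
pred = [(0, 0, 2, "NAME"), (0, 5, 9, "LOC"), (1, 3, 5, "TITLE")]
print(round(entity_f1(gold, pred), 4))
```

Because a span with the wrong type counts as both a false positive and a false negative, two of three entities correct gives precision and recall of 2/3 each.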
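Weibo Named Entity Recognition Dataset
Description: This dataset contains a training set (1350), a validation set (269), and a test set (270). Entity types are geo-political entity (GPE.NAM), location (LOC.NAM), organization (ORG.NAM), and person (PER.NAM), plus the corresponding nominal mentions (labels ending in NOM). The data comes from a social media platform, so the language is fairly informal and varied.
Language: Chinese
Train/dev/test size: 1350/269/270
Number of entity types: 4
Paper: https://aclanthology.org/D15-1064.pdf
Download: https://tianchi.aliyun.com/dataset/144312
GitHub: https://github.com/hltcoe/golden-horse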
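Weibo labels combine an entity type with a mention kind (NAM for named, NOM for nominal), so it is often convenient to split them before counting or filtering. The sketch below assumes the tags carry a BIO prefix, for example "B-PER.NAM"; that prefix format and the helper name parse_weibo_tag are assumptions for illustration, so check them against the files you actually download.

```python
from typing import NamedTuple, Optional


class WeiboTag(NamedTuple):
    position: str                # "B", "I", or "O"
    entity_type: Optional[str]   # e.g. "GPE", "LOC", "ORG", "PER"
    mention: Optional[str]       # "NAM" (named) or "NOM" (nominal)


def parse_weibo_tag(tag: str) -> WeiboTag:
    """Split a tag such as 'B-PER.NAM' into its position, entity type, and mention kind."""
    if tag == "O":
        return WeiboTag("O", None, None)
    position, label = tag.split("-", 1)
    entity_type, mention = label.split(".", 1)
    return WeiboTag(position, entity_type, mention)


# Made-up tag sequence: keep only the named (NAM) mentions.
tags = ["B-PER.NAM", "I-PER.NAM", "O", "B-PER.NOM", "B-GPE.NAM"]
named_only = [t for t in tags if parse_weibo_tag(t).mention == "NAM"]
print(named_only)
```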
OntoNotes Release 4.0
Description: OntoNotes Release 4.0 consists of 2.4 million words: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation, and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation, and 300k words of English web text.
Language: English, Mandarin Chinese, Arabic
Train/dev/test size: 15724/4301/4346
Download: https://catalog.ldc.upenn.edu/LDC2011T03
OntoNotes Release 5.0
Description: OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
Language: English
Train/dev/test size: 59924/8528/8262
Paper: https://aclanthology.org/W13-3516.pdf
Download: https://catalog.ldc.upenn.edu/LDC2013T19
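CLUENER2020: Chinese Fine-Grained Named Entity Recognition
Description: This dataset was built by selecting a subset of THUCTC, the open-source text classification dataset from Tsinghua University, and annotating it with fine-grained named entities. The original data comes from Sina News RSS.
Language: Chinese
Train/dev/test size: 10748/1343/1345
Number of entity types: 10
Paper: https://arxiv.org/ftp/arxiv/papers/2001/2001.04351.pdf
Download: https://tianchi.aliyun.com/dataset/144362
GitHub: https://github.com/CLUEbenchmark/CLUENER2020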
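People's Daily NER Dataset
Description: This NER dataset was generated from the 1998 and 2014 editions of the People's Daily corpus and covers three common entity types: person (PER), location (LOC), and organization (ORG).
Language: Chinese
Number of entity types: 3
Download: https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao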
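CMeEE: Chinese Medical Named Entity Recognition Dataset
Description: CMeEE, short for Chinese Medical Entity Extraction dataset, comes from CBLUE, a well-known benchmark for Chinese medical NLP. The dataset covers nine categories of medical entities, including 504 common pediatric diseases, 7,085 body parts, 12,907 clinical manifestations, and 4,354 medical procedures. It contains 15,000 training examples, 5,000 validation examples, and 3,000 test examples. CMeEE comes in two versions: CMeEE and CMeEE-V2 (which corrects some annotation errors in CMeEE). Researchers should download it from the CBLUE project page: https://tianchi.aliyun.com/dataset/95414
Language: Chinese
Train/dev/test size: 15000/5000/3000
Number of entity types: 9
Paper: https://aclanthology.org/2022.acl-long.544/
Download: https://tianchi.aliyun.com/dataset/144495
GitHub: https://github.com/CBLUEbenchmark/CBLUE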
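Yidu-S4K: Yidu Cloud Structured 4K Dataset
Description: The Yidu-S4K dataset comes from CCKS 2019 Evaluation Task 1, "Named Entity Recognition for Chinese Electronic Medical Records".
Language: Chinese
Train/dev/test size: 1000/-/379
Number of entity types: 6
Download: https://tianchi.aliyun.com/dataset/144419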
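Youku NER Dataset / Entertainment NER Dataset
Description: This is an open NER dataset for the entertainment domain, with 3 coarse-grained and 9 fine-grained entity categories. It is provided jointly by Alibaba DAMO Academy and the Singapore University of Technology and Design.
Language: Chinese
Train/dev/test size: 8,001/1,000/1,001
Number of entity types: 9
Paper: https://aclanthology.org/N19-1079.pdf
Download: https://tianchi.aliyun.com/dataset/108771
GitHub: https://github.com/allanj/ner_incomplete_annotation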
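E-Commercial NER Dataset / E-commerce NER Dataset
Description: This is an open NER dataset for the e-commerce domain, with 4 coarse-grained and 9 fine-grained entity categories. It is provided jointly by Alibaba DAMO Academy and the Singapore University of Technology and Design.
Language: Chinese
Train/dev/test size: 6,000/998/1,000
Number of entity types: 9
Paper: https://aclanthology.org/N19-1079.pdf
Download: https://tianchi.aliyun.com/dataset/108758
GitHub: https://github.com/allanj/ner_incomplete_annotation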
Chinese-Literature-NER-RE-Dataset
Description: A discourse-level named entity recognition and relation extraction dataset for Chinese literature text.
Language: Chinese
Number of entity types: 7
Paper: https://arxiv.org/pdf/1711.07010.pdf
Download: https://tianchi.aliyun.com/dataset/144431
GitHub: https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset
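2. English and Multilingual Datasets
Next we cover commonly used NER datasets in English and other languages, including multimodal NER data:
CoNLL 2002 Named Entity Recognition Dataset
Description: CoNLL 2002 and CoNLL 2003 are among the datasets most commonly used by NER developers and researchers. Together they cover four languages: Spanish and Dutch (CoNLL 2002), and English and German (CoNLL 2003). Each language's dataset covers four entity types: person, location, organization, and miscellaneous (MISC).
Language: Spanish, Dutch
Number of entity types: 4
Paper: https://aclanthology.org/W02-2024.pdf
Download: https://www.cnts.ua.ac.be/conll2002/ner/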
CoNLL 2003 Named Entity Recognition Dataset
Description: Same as above.
Language: English, German
Number of entity types: 4
Paper: https://aclanthology.org/W03-0419.pdf
Download: https://www.clips.uantwerpen.be/conll2003/ner/
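CoNLL-style files are commonly distributed as one token per line with whitespace-separated columns, blank lines between sentences, and the NER tag in the last column. The reader below is a minimal sketch under that assumption. The sample lines are made up for illustration, and read_conll is a hypothetical helper rather than part of AdaSeq or the shared-task tooling.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group column-formatted lines into sentences, keeping the first and last columns."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in a four-column layout: token, POS tag, chunk tag, NER tag.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
print(len(sentences), sentences[1])
```

Taking the last column as the tag keeps the reader usable for files with fewer columns, such as two-column token/tag files.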
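WNUT16 Named Entity Recognition Dataset
Description: This dataset contains a training set (2394), a validation set (1000), and a test set (3850). Entity types are company, facility, loc, movie, musicartist, other, person, product, sportsteam, and tvshow.
Language: English
Train/dev/test size: 2394/1000/3850
Number of entity types: 10
Paper: https://aclanthology.org/W16-3919.pdf
Download: https://tianchi.aliyun.com/dataset/144348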
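WNUT17 Named Entity Recognition Dataset
Description: This dataset contains a training set (3394), a validation set (1009), and a test set (1287). Entity types are corporation, creative-work, group, location, person, and product.
Language: English
Train/dev/test size: 3394/1009/1287
Number of entity types: 6
Paper: https://aclanthology.org/W17-4418.pdf
Download: https://tianchi.aliyun.com/dataset/144349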
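conllpp Named Entity Recognition Dataset
Description: This dataset contains a training set (14041), a validation set (3250), and a test set (3453). Entity types are location (LOC), miscellaneous (MISC), organization (ORG), and person (PER). conllpp is a corrected version of the CoNLL 2003 dataset.
Language: English
Train/dev/test size: 14041/3250/3453
Number of entity types: 4
Paper: https://aclanthology.org/D19-1519.pdf
Download: https://tianchi.aliyun.com/dataset/144414
GitHub: https://github.com/ZihanWangKi/CrossWeigh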
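CrossNER Named Entity Recognition Dataset
Description: CrossNER is an English NER dataset spanning several domains (literature, politics, music, science, and artificial intelligence). It is used mainly as a testbed for low-resource NER.
Language: English
Paper: https://ojs.aaai.org/index.php/AAAI/article/view/17587/17394
Download: https://tianchi.aliyun.com/dataset/144418
GitHub: https://github.com/zliucr/CrossNER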
BioCreative V CDR task corpus
Description: The BioCreative V CDR task corpus is manually annotated for chemicals, diseases, and chemical-induced disease (CID) relations. It contains the titles and abstracts of 1500 PubMed articles and is split into equally sized train, validation, and test sets.
Language: English
Train/dev/test size: 4560/4581/4797
Number of entity types: 2
Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/
Download: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/
NCBI disease corpus
Description: The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.
Language: English
Train/dev/test size: 5424/923/940
Number of entity types: 1
Paper: https://pubmed.ncbi.nlm.nih.gov/24393765/
Download: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
MIT-Movie Named Entity Recognition Dataset
Description: The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format in the movie domain.
Language: English
Train/dev/test size: 6816/1000/1953
Number of entity types: 12
Paper: https://groups.csail.mit.edu/sls/publications/2013/Liu_ICASSP-2013.pdf
Download: https://tianchi.aliyun.com/dataset/145106
MIT-Restaurant Named Entity Recognition Dataset
Description: The MIT Restaurant Corpus is an entity recognition corpus in BIO format in the restaurant domain.
Language: English
Train/dev/test size: 6900/760/1521
Number of entity types: 9
Paper: https://groups.csail.mit.edu/sls/publications/2013/Liu_ICASSP-2013.pdf
Download: https://tianchi.aliyun.com/dataset/145105
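Both MIT corpora are described as BIO-tagged, meaning each token carries B-X (beginning of an entity of type X), I-X (inside one), or O (outside any entity). The following is a minimal sketch of turning a BIO tag sequence into entity spans. The tokens and label names in the example are made up for illustration, and the lenient handling of a stray I- tag (treated as the start of a new entity) is a design choice of this sketch, not a rule of the corpora.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)


def bio_to_spans(tags: Sequence[str]) -> List[Span]:
    """Convert BIO tags into (start, end, type) spans with an exclusive end index."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


tokens = ["movies", "starring", "brad", "pitt", "from", "1995"]
tags = ["O", "O", "B-ACTOR", "I-ACTOR", "O", "B-YEAR"]
for begin, end, entity_type in bio_to_spans(tags):
    print(entity_type, " ".join(tokens[begin:end]))
```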
ACE 2004 Multilingual Training Corpus
Description: This corpus represents the complete set of English, Arabic, and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation created by LDC with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.
Language: English, Arabic, and Chinese
Paper: http://www.lrec-conf.org/proceedings/lrec2004/pdf/5.pdf
Download: https://catalog.ldc.upenn.edu/LDC2005T09
ACE 2005 Multilingual Training Corpus
Description: ACE 2005 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. This represents the complete set of training data in those languages for the 2005 Automatic Content Extraction (ACE) technology evaluation. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. The data was annotated by LDC with support from the ACE Program and additional assistance from LDC.
Language: English, Arabic, and Chinese
Download: https://catalog.ldc.upenn.edu/LDC2006T06
KBP2017 Named Entity Recognition Dataset
Description: The Entity Discovery and Linking (EDL) track aims to extract entity mentions from a source collection of textual documents in multiple languages, and link them to a reference knowledge base; an EDL system is also required to cluster mentions for those entities that don't have corresponding KB entries.
Language: English
Number of entity types: 5
Paper: https://tac.nist.gov/publications/2017/additional.papers/TAC2017.KBP_Entity_Discovery_and_Linking_overview.proceedings.pdf
Download: https://catalog.ldc.upenn.edu/LDC2019T19
Task website: https://tac.nist.gov/2017/KBP/
JNLPBA Biomedical Named Entity Recognition Dataset
Description: The BioNLP / JNLPBA Shared Task 2004 involves the identification and classification of technical terms referring to concepts of interest to biologists in the domain of molecular biology. The task was organized by the GENIA Project based on the annotations of the GENIA Term corpus (version 3.02).
Language: English
Train/dev/test size: 2000/-/404
Number of entity types: 5
Paper: https://dl.acm.org/doi/10.5555/1567594.1567610
Download: https://tianchi.aliyun.com/dataset/144943
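Few-NERD
Description: Few-NERD is a large-scale, multi-granularity, manually annotated NER dataset with 8 coarse-grained types, 66 fine-grained types, more than 180,000 sentences, and more than 490,000 entities. It defines three tasks: standard supervised NER (Few-NERD (SUP)), few-shot NER across coarse-grained types (Few-NERD (INTRA)), and few-shot NER that does not cross coarse-grained types (Few-NERD (INTER)). Few-NERD was built by researchers from Tsinghua University and Alibaba.
Language: English
Train/dev/test size: 131767/18824/37548
Number of entity types: 8 / 66
Paper: https://aclanthology.org/2021.acl-long.248.pdf
Download: https://tianchi.aliyun.com/dataset/102048
GitHub: https://github.com/thunlp/Few-NERD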
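With two levels of granularity, it helps to be able to map a fine-grained label back to its coarse-grained type, for example to check how the 66 fine types distribute over the 8 coarse ones. The sketch below assumes labels are written as "coarse-fine" (such as "person-actor"), with "O" for non-entities. That naming scheme and the sample labels are assumptions for illustration, so verify them against the downloaded files.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split an assumed 'coarse-fine' label into (coarse, fine); 'O' maps to ('O', 'O')."""
    if label == "O":
        return ("O", "O")
    # Split on the first hyphen only, so fine types may themselves contain "-" or "/".
    coarse, _, fine = label.partition("-")
    return (coarse, fine or coarse)


def coarse_counts(labels: Iterable[str]) -> Counter:
    """Count entity labels per coarse-grained type, ignoring 'O'."""
    return Counter(split_label(label)[0] for label in labels if label != "O")


# Made-up label sequence for illustration.
labels = ["person-actor", "O", "art-film", "person-politician", "location-GPE"]
print(coarse_counts(labels))
```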
Financial NER Dataset
Description: The dataset is generated using CoNLL 2003 data and financial documents obtained from U.S. Securities and Exchange Commission (SEC) filings.
Language: English
Train/dev/test size: (document level) 5/-/3
Number of entity types: 4
Paper: https://aclanthology.org/U15-1010/
Download: https://tianchi.aliyun.com/dataset/145092
Broad Twitter Corpus (BTC)
Description: The Broad Twitter Corpus is a named entity-annotated dataset of tweets, collected in order to capture temporal, spatial, and social diversity. Its annotations have high agreement and quality, and it has about 12000 entity annotations of the types Person, Location, and Organization.
Language: English
Train/dev/test size: 6338/1001/2000
Number of entity types: 3
Paper: https://aclanthology.org/C16-1111.pdf
Download: https://tianchi.aliyun.com/dataset/145001
GitHub: https://github.com/GateNLP/broad_twitter_corpus
Temporal Twitter Corpus (TTC)
Description: It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.
Language: English
Train/dev/test size: 10000/500/1500
Number of entity types: 3
Paper: https://aclanthology.org/2020.acl-main.680.pdf
Download: https://tianchi.aliyun.com/dataset/144438
GitHub: https://github.com/shrutirij/temporal-twitter-corpus
Tweebank-NER
Description: Social media data such as Twitter messages (“tweets”) pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tweebank-NER is an English NER corpus based on Tweebank V2 (TB2).
Language: English
Train/dev/test size: 1,639/710/1,201
Number of entity types: 4
Paper: https://aclanthology.org/2022.lrec-1.780.pdf
Download: https://tianchi.aliyun.com/dataset/145049
GitHub: https://github.com/mit-ccc/TweebankNLP
TweetNER7
Description: TweetNER7 is an NER dataset on Twitter with 7 entity labels annotated over 11,382 tweets from September 2019 to August 2021.
Language: English
Number of entity types: 7
Paper: https://aclanthology.org/2022.aacl-main.25.pdf
Download: https://tianchi.aliyun.com/dataset/145052
Hugging Face: https://huggingface.co/datasets/tner/tweetner7/tree/main/dataset
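3. Multimodal NER Datasets
Next we cover commonly used multimodal NER datasets:
Multimodal Twitter-15 NER Dataset
Description: A multimodal NER dataset from the social media domain, built from tweets and their accompanying images.
Language: English
Train/dev/test size: 4000/1000/3257
Number of entity types: 4
Paper: https://ojs.aaai.org/index.php/AAAI/article/view/11962/11821
Download: https://tianchi.aliyun.com/dataset/145058
GitHub: https://github.com/jinlanfu/NERmultimodal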
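Multimodal Twitter-17 NER Dataset
Description: Similar to Twitter-15, this is a multimodal NER dataset from the social media domain, built from tweets and their accompanying images. Papers on multimodal NER usually report experiments on both datasets.
Language: English
Train/dev/test size: 4000/1000/3257
Number of entity types: 4
Paper: https://aclanthology.org/2020.acl-main.306.pdf
Download: https://github.com/jefferyYu/UMT
GitHub: https://github.com/jefferyYu/UMT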
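Multimodal SNAP NER Dataset
Description: Multimodal NER data from SNAP. Entity types are person, location, organization, and miscellaneous (MISC).
Language: English
Number of entity types: 4
Paper: https://aclanthology.org/P18-1185.pdf
Download: https://github.com/jefferyYu/UMT
GitHub: https://github.com/jefferyYu/UMT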
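WikiDiverse Dataset
Description: A multimodal entity recognition and entity linking dataset. Its construction was guided by several considerations. First, after reviewing existing entity linking datasets and analyzing factors such as image-text alignment and the difficulty of entity disambiguation, the authors chose "image-caption" pairs from WikiNews as the raw data and Wikipedia as the corresponding knowledge graph. Second, they collected image-text pairs on sports, politics, entertainment, disasters, technology, crime, economy, education, health, and weather; removed low-quality, pornographic, and violent or terrorist content; and normalized image formats (some images were GIFs or other formats), to ensure broad coverage and high quality. Finally, they annotated the data on a crowdsourcing platform under detailed annotation guidelines, focusing in particular on the entity types person, organization, location, country, event, work (including books, paintings, and so on), and other.
Language: English
Train/dev/test size: 6312/755/757
Paper: https://aclanthology.org/2022.acl-long.328.pdf
Download: https://tianchi.aliyun.com/dataset/145103
GitHub: https://github.com/wangxw5/wikidiverse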
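4. Multilingual NER Datasets
Next we cover commonly used multilingual NER datasets:
MultiCoNER Dataset
Description: MultiCoNER is a large multilingual dataset (11 languages) for named entity recognition. It is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities (such as movie titles), and long-tail entity distributions.
Language: Bangla, Chinese, Dutch, English, Farsi, German, Hindi, Korean, Russian, Spanish, Turkish
Number of entity types: 6
Paper: https://aclanthology.org/2022.coling-1.334/
Download: https://tianchi.aliyun.com/dataset/145100
Task website: https://multiconer.github.io/multiconer_1/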
Summary List of NER Datasets
| Language | Dataset | Size | #Types | Description | Paper | Download |
|---|---|---|---|---|---|---|
| Chinese | msra | 46364/-/4365 | 3 | | Levow | damo/msra_ner |
| Chinese | resume | 3821/463/477 | 9 | | Zhang & Yang | damo/resume_ner |
| Chinese | weibo | 1350/269/270 | 4 | | Peng & Dredze | damo/weibo_ner |
| Chinese | ontonotes-v4-zh | 15724/4301/4346 | | | - | ldc/ontonotes-v4 |
| Chinese | cluener2020 | 10748/1343/1345 | 10 | | Xu et al., 2020 | github/cluener2020 |
| Chinese | people_dairy1998 | | 3 | | | github/ChineseNLPCorpus |
| Chinese | people_dairy2014 | | 3 | | | baidu-pan password: 1fa3 |
| Chinese | cmeee | 15000/5000/3000 | | CMeEE dataset in CBLUE benchmark | Zhang et al., 2022 | github/cblue |
| Chinese | yidu-s4k | | | | - | openkg/yidu-s4k |
| Chinese | ecommerce | | | | Jie et al., 2019 | github/ner_incomplete_annotation/ecommerce |
| Chinese | dlner | | | | Xu et al., 2017 | github/dlner |
| Dutch | conll2002-nl | 15796/2895/5196 | 4 | | Tjong Kim Sang, 2002 | |
| English | wnut2016 | 2394/1000/3850 | | Noisy User-generated Text | Strauss et al., 2016 | damo/wnut16 |
| English | wnut2017 | 3394/1009/1287 | | | Derczynski et al., 2017 | damo/wnut17 |
| English | conll2003-en | 14041/3250/3453 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| English | conllpp | 14041/3250/3453 | 4 | corrected version of the conll03-en NER dataset | Wang et al., 2019 | damo/conllpp_ner |
| English | ontonotes-v5-en | 59924/8528/8262(TBD) | | | Pradhan et al., 2013 | ldc/ontonotes-v5 |
| English | ai | 100/350/431 | | | Liu et al., 2020 | damo/cross_ner |
| English | literature | 100/400/416 | | | Liu et al., 2020 | damo/cross_ner |
| English | music | 100/541/465 | | | Liu et al., 2020 | damo/cross_ner |
| English | politics | 200/541/651 | | | Liu et al., 2020 | damo/cross_ner |
| English | science | 200/450/543 | | | Liu et al., 2020 | damo/cross_ner |
| English | bc5cdr | 4560/4581/4797 | | | Li et al., 2016 | |
| English | ncbi | 5424/923/940 | | | Doğan et al., 2014 | |
| English | mit-movie | 6816/1000/1953(TBD) | | | Liu et al., 2013 | mit/movie |
| English | mit-restaurant | 6900/760/1521 | | | Liu et al., 2013 | mit/restaurant |
| English | ace2004-en | | 7 | nested ner | Doddington et al., 2005 | ldc/ace04 |
| English | ace2005-en | | 7 | nested ner | - | ldc/ace05 |
| English | kbp2017 | | | nested ner | - | - |
| English | genia | | | nested ner | Ohta et al., 2002 | |
| English | few-nerd | 131767/18824/37548 | 8 / 66 | a few-shot ner dataset | Ding et al., 2021 | |
| English | wikigold | | | | Balasuriya et al., 2009 | |
| English | bionlp2014 | | | | Collier & Kim, 2004 | |
| English | fin | | | | Alvarado et al., 2015 | |
| English | btc | 6338/1001/2000 | 3 | | Derczynski et al., 2016 | |
| English | ttc | | | | Rijhwani & Preoțiuc-Pietro | github/ttc |
| English | tweebank | | | | Jiang et al., 2022 | github/tweebank |
| English | tweetner7 | | | | Ushio et al., 2022 | huggingface/tweetner7 |
| German | conll2003-de | 12152/2866/3005 | 4 | | Tjong Kim Sang & De Meulder, 2003 | |
| Spanish | conll2002-es | 8302/1919/1517 | 4 | | Tjong Kim Sang, 2002 | |
| English | twitter2015 | | | multi-modal | Zhang et al., 2018 | |
| English | snap | | | multi-modal | Lu et al., 2018 | github/UMT |
| English | twitter2017 | | | multi-modal | Yu et al., 2020 | github/UMT |
| English | wiki-diverse | | | constructed from wiki-diverse (a multi-modal entity typing dataset) | Wang et al., 2022 | github/wikidiverse |
| 11 langs | multiconer2022 | - | 6 | dataset of SemEval 2022 Task 11 (English, Spanish, Dutch, Russian, Turkish, Korean, Farsi, German, Chinese, Hindi, and Bangla) | Malmasi et al., 2022 | aws/multiconer |
| 282 langs | wikiann | - | | silver-standard data | Pan et al., 2017 | github/wikiann |
| 9 langs | wikiner | - | | silver-standard data | Nothman et al., 2013 | |
| 9 langs | wikineural | - | | silver-standard data | Tedeschi et al., 2021 | |
| 10 langs | multinerd | - | | silver-standard data | Tedeschi & Navigli, 2022 | |
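The Size column writes split sizes as train/dev/test, uses "-" for a missing split, and sometimes carries thousands separators or a "(TBD)" suffix. The following is a minimal sketch of normalizing such a cell into integers. The helper name parse_size is hypothetical and the handling rules are assumptions based only on the cell formats that appear in this article.

```python
import re
from typing import Optional, Tuple


def parse_size(cell: str) -> Tuple[Optional[int], ...]:
    """Parse a size cell such as '46364/-/4365' into integers, with None for missing splits."""
    # Drop parenthesised notes such as "(TBD)" and thousands separators.
    cleaned = re.sub(r"\(.*?\)", "", cell).replace(",", "").strip()
    parts = [part.strip() for part in cleaned.split("/")]
    return tuple(int(part) if part.isdigit() else None for part in parts)


for cell in ["46364/-/4365", "59924/8528/8262(TBD)", "8,001/1,000/1,001"]:
    train, dev, test = parse_size(cell)
    total = sum(n for n in (train, dev, test) if n is not None)
    print(cell, "->", (train, dev, test), "total:", total)
```

A cell that is empty or contains only "-" yields a single None, so callers should check the tuple length before unpacking three values.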
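Acknowledgements
This list is maintained over the long term by the DAMO Academy NLP team and the Tianchi data science team. The datasets can be used for model training with AdaSeq, a unified framework for sequence understanding.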
https://github.com/modelscope/AdaSeq/blob/master/README_zh.md