Entity tagging assigns a label to each token of a sentence, as in the following format:

John   lives in New   York
B-PER  O     O  B-LOC I-LOC

Our data is split into two files, sentences.txt and labels.txt:

#sentences.txt
John lives in New York
Where is John ?

#labels.txt
B-PER O O B-LOC I-LOC
O O B-PER O
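
Because the two files are aligned line by line and token by token, it can be worth checking that alignment before training. The following is a minimal sketch, assuming whitespace-separated tokens; `check_alignment` is a hypothetical helper and not part of the original code.

```python
def check_alignment(sentence_lines, label_lines):
    """Return (line_number, n_tokens, n_labels) for every misaligned pair."""
    if len(sentence_lines) != len(label_lines):
        raise ValueError("sentences and labels differ in number of lines")
    mismatches = []
    pairs = zip(sentence_lines, label_lines)
    for n, (sentence, labels) in enumerate(pairs, start=1):
        n_tokens, n_labels = len(sentence.split()), len(labels.split())
        if n_tokens != n_labels:
            mismatches.append((n, n_tokens, n_labels))
    return mismatches


sentences = ["John lives in New York", "Where is John ?"]
labels = ["B-PER O O B-LOC I-LOC", "O O B-PER O"]
print(check_alignment(sentences, labels))  # [] means every line is aligned
```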

We assume that running build_vocab.py creates the vocabularies in /data, producing two files:

#words.txt
John
lives
in
...

#tags.txt
B-PER
B-LOC
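
The contents of build_vocab.py are not shown here. As a rough sketch of what such a script could do, the snippet below collects the distinct tokens of each file in order of first appearance. The function name `build_vocab`, the `min_count` threshold, and appending PAD and UNK at the end of the word list are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(lines, min_count=1, specials=()):
    """Collect tokens seen at least min_count times, then append specials."""
    counts = Counter(token for line in lines for token in line.split())
    vocab = [tok for tok, c in counts.items() if c >= min_count]
    return vocab + [s for s in specials if s not in counts]


sentences = ["John lives in New York", "Where is John ?"]
labels = ["B-PER O O B-LOC I-LOC", "O O B-PER O"]

words = build_vocab(sentences, specials=("PAD", "UNK"))  # would go to words.txt
tags = build_vocab(labels)                               # would go to tags.txt
print(words)
print(tags)
```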

Loading the text files

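In NLP applications, words are replaced by numeric indices.
Suppose our vocabulary is {'is':1, 'John':2, 'Where':3, '.':4, '?':5}; then "Where is John ?" is represented as [3,1,2,5].
We read the words.txt vocabulary and assign an index to each word.
words.txt contains two special tokens: UNK, which stands for words that are not in the vocabulary, and PAD, which is used to pad sentences.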

vocab = {}
with open(words_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i

tags.txt is handled the same way, giving the tag_map used below.
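
The tag loading step might look like the following sketch. `load_index_map` is a hypothetical helper name, and the four-line tags.txt written to a temporary directory is demo content so that the snippet runs on its own.

```python
import os
import tempfile


def load_index_map(path):
    """Map each line of a vocabulary file to its zero-based line index."""
    with open(path, encoding="utf-8") as f:
        return {line: i for i, line in enumerate(f.read().splitlines())}


# Demo only: write a small tags.txt, then load it back as tag_map.
with tempfile.TemporaryDirectory() as tmp:
    tags_path = os.path.join(tmp, "tags.txt")
    with open(tags_path, "w", encoding="utf-8") as f:
        f.write("B-PER\nB-LOC\nI-LOC\nO\n")
    tag_map = load_index_map(tags_path)

print(tag_map)  # {'B-PER': 0, 'B-LOC': 1, 'I-LOC': 2, 'O': 3}
```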

Next, read the text and convert it to indices:

train_sentences = []
train_labels = []

with open(train_sentences_file) as f:
    for sentence in f.read().splitlines():
        # replace each token by its index if it is in vocab,
        # else use the index of UNK
        s = [vocab[token] if token in vocab else vocab['UNK']
             for token in sentence.split(' ')]
        train_sentences.append(s)

with open(train_labels_file) as f:
    for sentence in f.read().splitlines():
        # replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        train_labels.append(l)
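
To see the UNK fallback in action without any files, here is a small worked example on the vocabulary from the text. The index 6 for UNK is an assumption added for the demo, and `encode` is a hypothetical helper name.

```python
# Example vocabulary from the text, plus an assumed index for UNK.
vocab = {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5, 'UNK': 6}


def encode(sentence, vocab):
    """Replace each token by its index, falling back to UNK."""
    unk = vocab['UNK']
    return [vocab.get(token, unk) for token in sentence.split(' ')]


print(encode("Where is John ?", vocab))  # [3, 1, 2, 5]
print(encode("Where is Mary ?", vocab))  # [3, 1, 6, 5], 'Mary' is unknown
```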

Batch preprocessing

Sentences may differ in length, so the shorter ones need to be padded with PAD.
Let's say we have a batch of sentences batch_sentences that is a Python list of lists, with its corresponding batch_tags, which has a tag for each token in batch_sentences.

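1. First, find the longest sentence in each batch and pad the shorter sentences with PAD.
2. Then initialize the batch with shape (num_sentences, batch_max_len). Because the Embedding layer requires input of type long, convert the result to a LongTensor.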

import numpy as np
import torch

# compute the length of the longest sentence in the batch
batch_max_len = max([len(s) for s in batch_sentences])

# prepare a numpy array with the data, initializing the data with 'PAD'
# and all labels with -1; initializing labels to -1 differentiates tokens
# with tags from 'PAD' tokens
batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

# copy the data to the numpy array
for j in range(len(batch_sentences)):
    cur_len = len(batch_sentences[j])
    batch_data[j][:cur_len] = batch_sentences[j]
    batch_labels[j][:cur_len] = batch_tags[j]

# since all data are indices, convert them to torch tensors of type long
# (since PyTorch 0.4, tensors no longer need to be wrapped in Variable)
batch_data = torch.from_numpy(batch_data).long()
batch_labels = torch.from_numpy(batch_labels).long()
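
As an alternative to filling a numpy array by hand, the same padding can be sketched with `torch.nn.utils.rnn.pad_sequence`. This is a different technique from the one above, not the original author's code. The index values in the toy batch, including 0 for PAD, are assumptions for the demo, and `pad_batch` is a hypothetical helper name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch(batch_sentences, batch_tags, pad_idx, label_pad=-1):
    """Pad token indices with pad_idx and labels with label_pad."""
    data = pad_sequence([torch.tensor(s) for s in batch_sentences],
                        batch_first=True, padding_value=pad_idx)
    labels = pad_sequence([torch.tensor(t) for t in batch_tags],
                          batch_first=True, padding_value=label_pad)
    return data, labels


# Toy batch: two sentences of length 5 and 4 (indices are made up).
batch_sentences = [[2, 7, 8, 9, 10], [3, 1, 2, 5]]
batch_tags = [[0, 3, 3, 1, 2], [3, 3, 0, 3]]

data, labels = pad_batch(batch_sentences, batch_tags, pad_idx=0)
print(data)    # second row ends with the PAD index 0
print(labels)  # second row ends with -1
```

Both tensors come out with shape (num_sentences, batch_max_len) and dtype long, which is what the Embedding layer expects.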

Recurrent neural network

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, params):
        super(Net, self).__init__()

        # maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)

        # the LSTM takes the embedded sentence
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)

        # fc layer transforms the output to give the final output layer
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

    def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim

        # run the LSTM along the sentences of length batch_max_len
        s, _ = self.lstm(s)     # dim: batch_size x batch_max_len x lstm_hidden_dim

        # reshape the tensor so that each row contains one token
        # (contiguous() guards against a non-contiguous LSTM output)
        s = s.contiguous().view(-1, s.shape[2])  # dim: batch_size*batch_max_len x lstm_hidden_dim

        # apply the fully connected layer and obtain the output for each token
        s = self.fc(s)          # dim: batch_size*batch_max_len x num_tags

        return F.log_softmax(s, dim=1)   # dim: batch_size*batch_max_len x num_tags
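
The dimension comments can be checked by running the three layers on a dummy batch. The sketch below uses the layers directly rather than the Net class, and all sizes (vocabulary of 20, embedding of 8, hidden size of 16, 4 tags, batch of 2 sentences of length 5) are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Arbitrary demo sizes, not taken from the original post.
vocab_size, embedding_dim, lstm_hidden_dim, number_of_tags = 20, 8, 16, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, batch_first=True)
fc = nn.Linear(lstm_hidden_dim, number_of_tags)

batch = torch.randint(0, vocab_size, (2, 5))   # batch_size=2, batch_max_len=5
s = embedding(batch)                           # 2 x 5 x 8
s, _ = lstm(s)                                 # 2 x 5 x 16
s = s.contiguous().view(-1, s.shape[2])        # 10 x 16, one row per token
log_probs = F.log_softmax(fc(s), dim=1)        # 10 x 4

print(log_probs.shape)
```

Each row of `log_probs` holds the log-probabilities over tags for one token, so exponentiating a row and summing it gives 1.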

Custom loss function

The main purpose is to remove the influence of the PAD tokens.

import torch

def loss_fn(outputs, labels):
    # reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    # mask out 'PAD' tokens
    mask = (labels >= 0).float()

    # the number of tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask).item())

    # pick the values corresponding to labels and multiply by mask
    outputs = outputs[range(outputs.shape[0]), labels]*mask

    # cross entropy loss for all non 'PAD' tokens
    return -torch.sum(outputs)/num_tokens
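
One way to sanity-check the masked loss is to compare it with PyTorch's built-in `F.nll_loss`, which skips targets equal to `ignore_index` and averages over the rest. The sketch below restates the loss so that it runs on its own; the random log-probabilities and the toy labels are made up for the demo.

```python
import torch
import torch.nn.functional as F


def loss_fn(outputs, labels):
    """Masked negative log-likelihood; labels equal to -1 mark PAD."""
    labels = labels.view(-1)
    mask = (labels >= 0).float()
    num_tokens = int(mask.sum().item())
    picked = outputs[torch.arange(outputs.shape[0]), labels] * mask
    return -picked.sum() / num_tokens


torch.manual_seed(0)
labels = torch.tensor([[0, 3, 3, 1, 2], [3, 3, 0, 3, -1]])      # -1 is padding
outputs = F.log_softmax(torch.randn(labels.numel(), 4), dim=1)  # 10 x 4

custom = loss_fn(outputs, labels)
builtin = F.nll_loss(outputs, labels.view(-1), ignore_index=-1)
print(custom.item(), builtin.item())
```

If the two values agree, the hand-written mask does the same job as `ignore_index`, and the built-in call could be used in its place.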

References:
https://cs230.stanford.edu/blog/namedentity/#goals-of-this-tutorial
https://github.com/cs230-stanford/cs230-code-examples
