TokenEmbedder: Custom Embedder

class GloVeEmbedding(TokenEmbedder):

function

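Usage is the same as Embedding; the code was written by closely following Embedding.
Each word is split into individual characters, and a char vector is the average of the vectors of the words that contain it: char vector = sum(word vectors) / count(char), i.e. the sum of the word vectors divided by the number of times the character occurs.
In the original Embedding implementation the weights are initialized randomly. Only words present in the GloVe file are replaced with their GloVe vectors; words absent from it keep their random vectors.
For words absent from the file, GloVeEmbedding uses the sum of their char vectors as the word vector. This avoids random word vectors and makes the vectors of words missing from the pretrained file as reasonable as possible.
In practice: using the BiDAF framework with num_epochs=1, the initial results are not necessarily better, but train_start_acc is somewhat higher. This is only a test of a custom Embedding.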
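The char-vector scheme described above can be sketched in isolation with NumPy. This is a minimal illustration, not the code from this post: the helper names `build_char_vectors` and `compose_oov_vector` and the toy 2-dimensional vectors are assumptions made for the example, and, unlike the full code, an empty token is treated here as "not composable".

```python
import numpy as np


def build_char_vectors(word_vectors):
    """Average, per character, the vectors of the words containing it.

    A character that occurs twice in a word contributes that word's
    vector twice and is counted twice, as in the scheme described above.
    """
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        for ch in word:
            sums[ch] = sums.get(ch, 0) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}


def compose_oov_vector(token, char_vectors):
    """Sum the char vectors of `token`; return None if it cannot be composed."""
    if not token or any(ch not in char_vectors for ch in token):
        return None
    return np.sum([char_vectors[ch] for ch in token], axis=0)


# Toy "pretrained" vectors (made-up numbers, not real GloVe values).
toy_vectors = {
    "ab": np.array([1.0, 0.0]),
    "bc": np.array([0.0, 1.0]),
}
char_vectors = build_char_vectors(toy_vectors)

# "cab" is not in the toy file, but all of its characters are known.
oov_vector = compose_oov_vector("cab", char_vectors)
print(char_vectors["b"], oov_vector)
```

Here `b` occurs in both words, so its vector is the mean of the two word vectors, and the vector for the unseen token `cab` is the sum of the vectors for `c`, `a` and `b`. A token containing a character never seen in the file (for example `abz`) still falls back to random initialization.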

code

import logging

import numpy
import torch
from allennlp.common import Tqdm
from allennlp.common.checks import ConfigurationError
from allennlp.data import Vocabulary
from allennlp.modules.token_embedders.embedding import EmbeddingsTextFile
from allennlp.modules.token_embedders.token_embedder import TokenEmbedder

logger = logging.getLogger(__name__)


@TokenEmbedder.register("glove_embedding")  # matches "type" under "tokens" in the JSON config
class GloVeEmbedding(TokenEmbedder):
    # GloVeEmbedding is identical to Embedding, except that it calls the function that
    # reads the pretrained file, and that function has to be rewritten.
    # (The class body is not shown in this note; it follows Embedding.)
    ...


# Modified _read_embeddings_from_text_file: 3 changes in total.
def _read_embeddings_from_text_file(
    file_uri: str, embedding_dim: int, vocab: Vocabulary, namespace: str = "tokens"
) -> torch.FloatTensor:
    """
    Read pre-trained word vectors from an eventually compressed text file, possibly contained
    inside an archive with multiple files. The text file is assumed to be utf-8 encoded with
    space-separated fields: [word] [dim 1] [dim 2] ...

    Lines that contain more numerical tokens than `embedding_dim` raise a warning and are skipped.

    The remainder of the docstring is identical to `_read_pretrained_embeddings_file`.
    """
    tokens_to_keep = set(vocab.get_index_to_token_vocabulary(namespace).values())
    vocab_size = vocab.get_vocab_size(namespace)
    char_embeddings = {}  # Change 1: add char embeddings, char -> (vector sum, count)
    embeddings = {}

    # First we read the embeddings from the file, only keeping vectors for the words we need.
    logger.info("Reading pretrained embeddings from file")

    with EmbeddingsTextFile(file_uri) as embeddings_file:
        for line in Tqdm.tqdm(embeddings_file):
            token = line.split(" ", 1)[0]
            if token in tokens_to_keep:
                fields = line.rstrip().split(" ")
                if len(fields) - 1 != embedding_dim:
                    # Sometimes there are funny unicode parsing problems that lead to different
                    # fields lengths (e.g., a word with a unicode space character that splits
                    # into more than one column).  We skip those lines.  Note that if you have
                    # some kind of long header, this could result in all of your lines getting
                    # skipped.  It's hard to check for that here; you just have to look in the
                    # embedding_misses_file and at the model summary to make sure things look
                    # like they are supposed to.
                    logger.warning(
                        "Found line with wrong number of dimensions (expected: %d; actual: %d): %s",
                        embedding_dim,
                        len(fields) - 1,
                        line,
                    )
                    continue

                vector = numpy.asarray(fields[1:], dtype="float32")
                # Change 2: for every character occurrence in the token, accumulate the
                # word vector and increment that character's count.
                for char in list(token):
                    if char in char_embeddings:
                        char_embeddings[char] = (
                            char_embeddings[char][0] + vector,
                            char_embeddings[char][1] + 1,
                        )
                    else:
                        char_embeddings[char] = (vector, 1)
                embeddings[token] = vector

    if not embeddings:
        raise ConfigurationError(
            "No embeddings of correct dimension found; you probably "
            "misspecified your embedding_dim parameter, or didn't "
            "pre-populate your Vocabulary"
        )

    # char vector = vector sum / number of occurrences
    char_embeddings = {
        char: char_embeddings[char][0] / char_embeddings[char][1] for char in char_embeddings
    }
    chars = set(char_embeddings.keys())

    all_embeddings = numpy.asarray(list(embeddings.values()))
    embeddings_mean = float(numpy.mean(all_embeddings))
    embeddings_std = float(numpy.std(all_embeddings))
    # Now we initialize the weight matrix for an embedding layer, starting with random vectors,
    # then filling in the word vectors we just read.
    logger.info("Initializing pre-trained embedding layer")
    embedding_matrix = torch.FloatTensor(vocab_size, embedding_dim).normal_(
        embeddings_mean, embeddings_std
    )
    num_tokens_found = 0
    index_to_token = vocab.get_index_to_token_vocabulary(namespace)
    for i in range(vocab_size):
        token = index_to_token[i]

        # If we don't have a pre-trained vector for this word, we'll just leave this row alone,
        # so the word has a random initialization.
        if token in embeddings:
            embedding_matrix[i] = torch.FloatTensor(embeddings[token])
            num_tokens_found += 1
        elif len(set(token) - chars) == 0:
            # Change 3: if the token can be built from known characters, use the sum of its
            # char vectors as the word vector and count it as found in the pretrained file.
            embedding_matrix[i] = torch.FloatTensor(
                [char_embeddings[char] for char in list(token)]
            ).sum(dim=-2)
            num_tokens_found += 1
        else:
            logger.debug(
                "Token %s was not found in the embedding file. Initialising randomly.", token
            )

    logger.info(
        "Pretrained embeddings were found for %d out of %d tokens", num_tokens_found, vocab_size
    )

    return embedding_matrix

.json

{
  "token_embedders": {
    "tokens": {
      "type": "glove_embedding",
      "trainable": false,
      "embedding_dim": 300,
      "pretrained_file": "path/glove.6B.300d.txt"
    }
  }
}
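To make the link between the `"type"` key in the config and the `register("glove_embedding")` decorator concrete, here is a toy, dependency-free sketch of a name-to-class registry. It is a simplified illustration only, not AllenNLP's actual implementation: `MiniRegistry`, `ToyGloVeEmbedding` and `from_config` are hypothetical names invented for this example.

```python
class MiniRegistry:
    """Toy registry mapping a string name to a class (illustrative only)."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def by_name(cls, name):
        return cls._registry[name]


@MiniRegistry.register("glove_embedding")
class ToyGloVeEmbedding:
    """Stand-in for the real embedder; it only stores its parameters."""

    def __init__(self, embedding_dim, trainable=True, pretrained_file=None):
        self.embedding_dim = embedding_dim
        self.trainable = trainable
        self.pretrained_file = pretrained_file


def from_config(config):
    """Pop "type", look up the registered class, pass the rest as kwargs."""
    params = dict(config)
    embedder_cls = MiniRegistry.by_name(params.pop("type"))
    return embedder_cls(**params)


# Same keys as the "tokens" entry in the JSON above.
tokens_config = {
    "type": "glove_embedding",
    "trainable": False,
    "embedding_dim": 300,
    "pretrained_file": "path/glove.6B.300d.txt",
}
embedder = from_config(tokens_config)
print(type(embedder).__name__, embedder.embedding_dim)
```

The point of the sketch is that the lookup only works if the module containing the decorated class has been imported, so that the decorator has run before the config is read.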

command line

allennlp train -s allennlp_model/model -f --include-package allennlp_model.embedder.glove_embedder allennlp_model/run_glove_embedder.json

--include-package path1.path2.py_file_name (a dotted module path, without the .py extension)

My Github

