Embedding Encoding

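1. Purpose

Embedding is a way of encoding words as low-dimensional vectors. The vectors are optimized during neural-network training, so they can express how words relate to one another.

With one-hot encoding, each word's code is extremely sparse, which makes training inefficient. Embedding addresses this problem well.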

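For example, with a vocabulary of 5 letters, one-hot encoding represents each letter as a 5-element vector containing a single 1, whereas an embedding can represent it as a dense vector of, say, 2 trained values.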
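As a minimal sketch of the difference between the two encodings, the following pure-Python snippet encodes a letter both ways. The values in `embedding_table` are invented for illustration (a real layer learns them during training), and the helper names `one_hot`, `embed`, and `one_hot_times_table` are hypothetical.

```python
# Vocabulary of five letters, mapped to integer ids.
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}


def one_hot(index, size):
    """Return a one-hot list: all zeros except a 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical 5 x 2 embedding table (one 2-dimensional row per letter).
# These numbers are made up; a real Embedding layer trains them.
embedding_table = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.4, 0.6],
    [0.5, 0.5],
]


def embed(index):
    """Embedding lookup: pick the table row that belongs to this id."""
    return embedding_table[index]


def one_hot_times_table(index):
    """Same result computed as (one-hot vector) x (table)."""
    vec = one_hot(index, len(embedding_table))
    dim = len(embedding_table[0])
    return [sum(vec[r] * embedding_table[r][c] for r in range(len(vec)))
            for c in range(dim)]


c_id = w_to_id['c']
print(one_hot(c_id, 5))  # sparse: 5 numbers, only one of them non-zero
print(embed(c_id))       # dense: 2 numbers
```

The lookup and the one-hot product give the same vector, which is why an embedding can be read as a one-hot input followed by a trainable weight matrix, without ever materializing the sparse vector.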

2. Function overview

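Embedding(vocabulary size, embedding dimension)
The data fed into the Embedding layer must have the shape [number of samples, number of time steps the recurrent kernel is unrolled over].

In other words, the training data must be reshaped into this form.

Embedding is also generally used only as the first layer of the model.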
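To make the shape requirement concrete, here is a small pure-Python sketch of what happens to the shapes (no TensorFlow needed). The helpers `reshape_2d` and `lookup` and the table values are hypothetical stand-ins, not Keras API; they only mimic the id-to-row lookup that the layer performs with trainable weights.

```python
def reshape_2d(flat_ids, steps):
    """Group a flat list of ids into rows of `steps` ids each,
    giving the shape [num_samples, steps]."""
    if len(flat_ids) % steps != 0:
        raise ValueError("length must be a multiple of steps")
    return [flat_ids[i:i + steps] for i in range(0, len(flat_ids), steps)]


def lookup(batch, table):
    """Replace every id with its row of `table`:
    [num_samples, steps] -> [num_samples, steps, embedding_dim]."""
    return [[table[idx] for idx in sample] for sample in batch]


vocab_size, embedding_dim = 5, 2
# Hypothetical table; row i is the vector for id i.
table = [[float(i), float(-i)] for i in range(vocab_size)]

ids = [0, 1, 2, 3, 4]              # five samples, one letter each
batch = reshape_2d(ids, steps=1)   # shape [5, 1]
embedded = lookup(batch, table)    # shape [5, 1, 2]

print(len(batch), len(batch[0]))
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

The output gains one trailing axis of size `embedding_dim`, which is the 3-D input a recurrent layer expects.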

3. Examples

Example 1: predicting the next letter from one letter

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
import matplotlib.pyplot as plt
import os

input_word = "abcde"
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}  # maps each letter to a numeric id

x_train = [w_to_id['a'], w_to_id['b'], w_to_id['c'], w_to_id['d'], w_to_id['e']]
y_train = [w_to_id['b'], w_to_id['c'], w_to_id['d'], w_to_id['e'], w_to_id['a']]

# Shuffle inputs and labels with the same seed so the pairs stay aligned.
np.random.seed(7)
np.random.shuffle(x_train)
np.random.seed(7)
np.random.shuffle(y_train)
tf.random.set_seed(7)

# Reshape x_train to the shape Embedding expects: [number of samples, number of time steps].
# The whole dataset is fed in, so the number of samples is len(x_train);
# one letter in, one prediction out, so the number of time steps is 1.
x_train = np.reshape(x_train, (len(x_train), 1))
y_train = np.array(y_train)

model = tf.keras.Sequential([
    Embedding(5, 2),
    SimpleRNN(3),
    Dense(5, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['sparse_categorical_accuracy'])

# Note: the .ckpt/.index checkpoint format assumes TF 2.x with the legacy Keras;
# Keras 3 expects a weights-only path ending in .weights.h5.
checkpoint_save_path = "./checkpoint/run_embedding_1pre1.ckpt"

if os.path.exists(checkpoint_save_path + '.index'):
    print('-------------load the model-----------------')
    model.load_weights(checkpoint_save_path)

# fit is given no validation set, so no validation accuracy is computed;
# the best model is saved according to the training loss.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_save_path,
                                                 save_weights_only=True,
                                                 save_best_only=True,
                                                 monitor='loss')

history = model.fit(x_train, y_train, batch_size=32, epochs=100, callbacks=[cp_callback])

model.summary()

# print(model.trainable_variables)
# Dump the trained parameters to a text file.
with open('./weights.txt', 'w') as file:
    for v in model.trainable_variables:
        file.write(str(v.name) + '\n')
        file.write(str(v.shape) + '\n')
        file.write(str(v.numpy()) + '\n')

###############################################    show   ###############################################

# Plot the training accuracy and loss curves.
acc = history.history['sparse_categorical_accuracy']
loss = history.history['loss']

plt.subplot(1, 2, 1)
plt.plot(acc, label='Training Accuracy')
plt.title('Training Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(loss, label='Training Loss')
plt.title('Training Loss')
plt.legend()
plt.show()

############### predict #############

preNum = int(input("input the number of test alphabet:"))
for i in range(preNum):
    alphabet1 = input("input test alphabet:")
    alphabet = [w_to_id[alphabet1]]
    # Reshape alphabet to the shape Embedding expects: [number of samples, number of time steps].
    # One sample is fed in for this check, so the number of samples is 1;
    # one letter in, one prediction out, so the number of time steps is 1.
    alphabet = np.reshape(alphabet, (1, 1))
    result = model.predict(alphabet)
    # argmax returns a tensor of shape (1,); take its single element before converting to int.
    pred = int(tf.argmax(result, axis=1)[0])
    tf.print(alphabet1 + '->' + input_word[pred])
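The prediction loop above reduces to two steps: take the index of the largest softmax probability, then map that index back to a letter. A minimal pure-Python sketch of that decoding, where `fake_output` is a made-up probability vector standing in for one row of `model.predict` output and `decode` is a hypothetical helper name:

```python
input_word = "abcde"


def decode(probabilities, alphabet):
    """Return the letter whose position has the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return alphabet[best]


# Made-up softmax output for the input 'a': most of the mass is on index 1.
fake_output = [0.05, 0.80, 0.05, 0.05, 0.05]
print('a->' + decode(fake_output, input_word))  # prints a->b
```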


Example 2: predicting the next letter from four letters

Code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
import matplotlib.pyplot as plt
import os

input_word = "abcdefghijklmnopqrstuvwxyz"
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4,
           'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9,
           'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14,
           'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19,
           'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25}  # maps each letter to a numeric id

training_set_scaled = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                       11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                       21, 22, 23, 24, 25]

x_train = []
y_train = []

# Slide a window of 4 letters over the alphabet; the label is the letter that follows.
for i in range(4, 26):
    x_train.append(training_set_scaled[i - 4:i])
    y_train.append(training_set_scaled[i])

# Shuffle inputs and labels with the same seed so the pairs stay aligned.
np.random.seed(7)
np.random.shuffle(x_train)
np.random.seed(7)
np.random.shuffle(y_train)
tf.random.set_seed(7)

# Reshape x_train to the shape Embedding expects: [number of samples, number of time steps].
# The whole dataset is fed in, so the number of samples is len(x_train);
# four letters in, one prediction out, so the number of time steps is 4.
x_train = np.reshape(x_train, (len(x_train), 4))
y_train = np.array(y_train)

model = tf.keras.Sequential([
    Embedding(26, 2),
    SimpleRNN(10),
    Dense(26, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['sparse_categorical_accuracy'])

# Note: the .ckpt/.index checkpoint format assumes TF 2.x with the legacy Keras;
# Keras 3 expects a weights-only path ending in .weights.h5.
checkpoint_save_path = "./checkpoint/rnn_embedding_4pre1.ckpt"

if os.path.exists(checkpoint_save_path + '.index'):
    print('-------------load the model-----------------')
    model.load_weights(checkpoint_save_path)

# fit is given no validation set, so no validation accuracy is computed;
# the best model is saved according to the training loss.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_save_path,
                                                 save_weights_only=True,
                                                 save_best_only=True,
                                                 monitor='loss')

history = model.fit(x_train, y_train, batch_size=32, epochs=100, callbacks=[cp_callback])

model.summary()

# Dump the trained parameters to a text file.
with open('./weights.txt', 'w') as file:
    for v in model.trainable_variables:
        file.write(str(v.name) + '\n')
        file.write(str(v.shape) + '\n')
        file.write(str(v.numpy()) + '\n')

###############################################    show   ###############################################

# Plot the training accuracy and loss curves.
acc = history.history['sparse_categorical_accuracy']
loss = history.history['loss']

plt.subplot(1, 2, 1)
plt.plot(acc, label='Training Accuracy')
plt.title('Training Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(loss, label='Training Loss')
plt.title('Training Loss')
plt.legend()
plt.show()

################# predict ##################

preNum = int(input("input the number of test alphabet:"))
for i in range(preNum):
    alphabet1 = input("input test alphabet:")
    alphabet = [w_to_id[a] for a in alphabet1]
    # Reshape alphabet to the shape Embedding expects: [number of samples, number of time steps].
    # One sample is fed in for this check, so the number of samples is 1;
    # four letters in, one prediction out, so the number of time steps is 4.
    alphabet = np.reshape(alphabet, (1, 4))
    result = model.predict(alphabet)
    # argmax returns a tensor of shape (1,); take its single element before converting to int.
    pred = int(tf.argmax(result, axis=1)[0])
    tf.print(alphabet1 + '->' + input_word[pred])

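Note that the shape of the input data changes here: each sample now spans 4 time steps, so x_train has shape (22, 4) instead of the (5, 1) used in Example 1, and each prediction input must be reshaped to (1, 4).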
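The new shape follows directly from how the samples are built. A minimal sketch of the sliding-window construction from Example 2, pulled out into a standalone helper (the name `make_windows` is introduced here for illustration and does not appear in the original code):

```python
def make_windows(sequence, window):
    """Slide a window over `sequence`: each sample is `window`
    consecutive ids, and its label is the id that follows them."""
    xs, ys = [], []
    for i in range(window, len(sequence)):
        xs.append(sequence[i - window:i])
        ys.append(sequence[i])
    return xs, ys


ids = list(range(26))               # a..z as ids 0..25
x, y = make_windows(ids, window=4)

print(len(x), len(x[0]))            # number of samples, time steps per sample
print(x[0], '->', y[0])             # first sample and its label
```

With 26 letters and a window of 4, the loop produces 26 - 4 = 22 samples, which is where the first dimension of x_train comes from.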


Link:

https://www.icourse163.org/learn/PKU-1002536002?tid=1452937471#/learn/content?type=detail&id=1233970430&cid=1253438622&replay=true
