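Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, we must first come up with a strategy for converting strings to numbers (that is, "vectorizing" the text) before feeding it to the model. This section looks at three strategies for doing so.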

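One-hot encodings

As a first idea, we could "one-hot" encode each word in the vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the unique words) of this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector whose length equals the vocabulary size, then place a 1 at the index that corresponds to the word. The figure below illustrates this approach.

To create a vector that encodes the whole sentence, we can concatenate the one-hot vectors of its words.

Key point: this approach is inefficient. A one-hot encoded vector is sparse (meaning most of its entries are zero). Suppose the vocabulary has 10,000 words. One-hot encoding each word produces a vector in which 99.99% of the elements are zero.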
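The one-hot scheme described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name one_hot and the choice to sort the vocabulary alphabetically are assumptions made for this example, not part of any library.

```python
# Minimal one-hot encoding sketch in plain Python (no libraries needed).
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Vocabulary: the sorted unique words -> ['cat', 'mat', 'on', 'sat', 'the']
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a zero vector of vocabulary length with a single 1 (hypothetical helper)."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


# One one-hot vector per word in the sentence.
encoded = [one_hot(word) for word in tokens]
for word, vector in zip(tokens, encoded):
    print(f"{word:>3} -> {vector}")

# Concatenating the per-word vectors gives one vector for the whole sentence.
sentence_vector = [value for vector in encoded for value in vector]
print(len(sentence_vector))  # 6 words * 5 vocabulary entries = 30

# Sparsity for a 10,000-word vocabulary: all entries but one are zero.
vocab_size = 10_000
zero_fraction = (vocab_size - 1) / vocab_size
print(f"{zero_fraction:.2%} of the entries are zero")  # 99.99%
```

Even for this six-word sentence, only 6 of the 30 entries carry information, which is the inefficiency the key point refers to.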

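Encode each word with a unique number

A second approach is to encode each word with a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", and so on. We could then encode the sentence "The cat sat on the mat" as a dense vector such as [5, 1, 4, 3, 5, 2]. This approach is efficient: instead of a sparse vector, we now have a dense one in which every element carries a value.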
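As a rough sketch of this integer encoding, the snippet below numbers the alphabetically sorted vocabulary starting from 1, which reproduces the vector given in the text. The variable names are illustrative only.

```python
# Integer-encode a sentence: each unique word gets its own number.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Number the sorted vocabulary from 1: cat=1, mat=2, on=3, sat=4, the=5
vocab = sorted(set(tokens))
word_to_id = {word: i for i, word in enumerate(vocab, start=1)}

encoded = [word_to_id[word] for word in tokens]
print(encoded)  # [5, 1, 4, 3, 5, 2]
```

The numbering is arbitrary: "cat" (1) and "mat" (2) are adjacent only because of alphabetical order, not because the words are related.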

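However, this approach has two drawbacks:

  • The integer encoding is arbitrary (it does not capture any relationship between words).
  • An integer encoding can be hard for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.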

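Word embeddings

Word embeddings give us an efficient, dense representation in which similar words have similar encodings. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating-point values (the length of the vector is a parameter you specify). Rather than being set manually, the values are trainable parameters (weights the model learns during training, in the same way it learns the weights of a dense layer). Word embeddings with 8 dimensions are common for small datasets, and sizes go up to 1024 dimensions for large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but it takes more data to learn.

The diagram above shows a word embedding. Each word is represented as a 4-dimensional vector of floating-point values. An embedding can also be thought of as a "lookup table": once the weights have been learned, we encode each word by looking up its dense vector in the table.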
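The "lookup table" view can be illustrated with NumPy. In this sketch the table is filled with random values that merely stand in for learned weights, and the names embedding_table and word_to_index are assumptions for the example.

```python
import numpy as np

# Vocabulary and word -> row index mapping for the lookup table.
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word, 4 floating-point values per row.
# In a real model these values are learned; random values stand in here.
embedding_dim = 4
rng = np.random.default_rng(seed=0)
embedding_table = rng.uniform(
    -0.05, 0.05, size=(len(vocab), embedding_dim)
).astype(np.float32)

# Encoding a sentence is just a row lookup for each word.
tokens = "the cat sat on the mat".split()
indices = np.array([word_to_index[word] for word in tokens])
sentence_vectors = embedding_table[indices]

print(sentence_vectors.shape)  # (6, 4): six words, four dimensions each
```

Both occurrences of "the" map to the same row, so they receive identical vectors.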


Now let's look at the code

import io
import os
import re
import shutil
import string
import tensorflow as tf

from datetime import datetime
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D
# This import path matches the TF 2.5 build used here; newer releases
# expose the layer as tf.keras.layers.TextVectorization.
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print(tf.__version__)
"""
Output: 2.5.0-dev20201226
"""

Download the IMDB dataset

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and extract the archive into the current directory.
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
"""
Output:
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 138s 2us/step
['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']
"""

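The train folder contains two subfolders of movie reviews, pos and neg, whose reviews are labeled positive and negative respectively. You can use the data in these two folders to train a binary classification model.

train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
"""
Output: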
['labeledBow.feat','neg','pos','unsup','unsupBow.feat','urls_neg.txt','urls_pos.txt','urls_unsup.txt']
"""

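Before creating the dataset, remove the extra unsup folder. The loader infers one class per subdirectory, so leaving unsup in place would add a third class.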

remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, create a tf.data.Dataset with tf.keras.preprocessing.text_dataset_from_directory. Use the data in the train folder to build the training and validation datasets, holding out 20% for validation (validation_split=0.2).

batch_size = 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)
"""
Output:
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
"""

Take a look at a few reviews and their labels from the training dataset (1: positive, 0: negative).

for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(label_batch[i].numpy(), text_batch.numpy()[i])
"""
Output:
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.'
0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk."
0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. 
Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers."
0 b'I saw this movie at an actual movie theater (probably the $2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective $6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'
"""

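Configure the dataset for performance

These are two important methods to use when loading data, so that I/O does not become blocking:

  • .cache(): keeps the data in memory after it is loaded from disk. This ensures the dataset does not become a bottleneck while training the model. If the dataset is too large to fit in memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
  • .prefetch(): overlaps data preprocessing with model execution during training.

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)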

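Using the Embedding layer

The Embedding layer can be understood as a lookup table that maps integer indices (each standing for a specific word) to dense vectors (their embeddings). The dimensionality of the embedding is a parameter you can experiment with to find what works for your problem, in the same way you would experiment with the number of neurons in a Dense layer.

# Embed a 1,000-word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)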

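When you create an Embedding layer, its weights are randomly initialized, just like those of any other layer. During training they are gradually adjusted via backpropagation. Once trained, the learned embeddings roughly encode similarities between words (as learned for the specific problem your model is trained on).

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table.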

result = embedding_layer(tf.constant([1,2,3]))
result.numpy()
"""
Output:
array([[-0.01827962,  0.033703  ,  0.02065292,  0.00335936, -0.00998179],
       [ 0.00618695, -0.02138543, -0.01288087,  0.03814398, -0.02176479],
       [-0.02900024,  0.03794893, -0.03229412,  0.04951945,  0.03212232]],
      dtype=float32)
"""

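For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed the embedding layer above batches with shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).

The returned tensor has one more axis than the input, and the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N), where N is the embedding dimension (5 for the layer above).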

result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape
"""
Output: TensorShape([2, 3, 5])
"""

When given a batch of sequences as input, an embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).
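The shape rule can be checked without TensorFlow by treating the layer as plain table indexing. The sketch below is a NumPy stand-in, not the Keras implementation; the function embed and the random table are assumptions for illustration.

```python
import numpy as np

# Stand-in for Embedding(1000, 5): a table with one 5-dimensional row per index.
vocab_size, embedding_dim = 1000, 5
rng = np.random.default_rng(seed=0)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))


def embed(batch):
    """Replace every integer in the batch with its row of the table (hypothetical helper)."""
    return table[np.asarray(batch)]


# A (2, 3) batch of indices gains a trailing axis of size embedding_dim.
print(embed([[0, 1, 2], [3, 4, 5]]).shape)  # (2, 3, 5)

# The same table serves batches with different sizes and sequence lengths.
for shape in [(32, 10), (64, 15)]:
    batch = rng.integers(0, vocab_size, size=shape)
    print(shape, "->", embed(batch).shape)
```

The output shape is always the input shape with the embedding dimension appended, which matches the (samples, sequence_length, embedding_dimensionality) form described above.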


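Code: available by replying "05" to the WeChat official account 【明天依旧可好】.

Note: this article is adapted from the official website, with some content trimmed and some comments and changes added.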
