by Cole Murray

Building an image caption generator with Deep Learning in Tensorflow

In my last tutorial, you learned how to create a facial recognition pipeline in Tensorflow with convolutional neural networks. In this tutorial, you’ll learn how a convolutional neural network (CNN) and Long Short Term Memory (LSTM) can be combined to create an image caption generator and generate captions for your own images.

Overview

Introduction to Image Captioning Model Architecture
Captions as a Search Problem
Creating Captions in Tensorflow

Prerequisites

Basic understanding of Convolutional Neural Networks
Basic understanding of LSTM
Basic understanding of Tensorflow

Introduction to image captioning model architecture

Combining a CNN and LSTM

In 2014, researchers from Google released a paper, Show And Tell: A Neural Image Caption Generator. At the time, this architecture was state-of-the-art on the MSCOCO dataset. It utilized a CNN + LSTM to take an image as input and output a caption.

Using a CNN for image embedding

A convolutional neural network can be used to create a dense feature vector. This dense vector, also called an embedding, can be used as feature input into other algorithms or networks.

For an image caption model, this embedding becomes a dense representation of the image and will be used as the initial state of the LSTM.

LSTM

An LSTM is a recurrent neural network architecture commonly used for problems with temporal dependencies. Its memory cell state carries information about previous steps forward, which better informs the current prediction.

An LSTM consists of three main components: a forget gate, input gate, and output gate. Each of these gates is responsible for altering updates to the cell’s memory state.
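
To make the role of each gate concrete, here is a minimal sketch of a single LSTM time-step using scalar values in plain Python. It follows the standard LSTM formulation rather than any code from this project; the `params` layout and the weight values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time-step for a scalar input, hidden state, and cell state.

    `params` maps each gate name to hypothetical (w_x, w_h, bias) weights.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # candidate memory content

    c = f * c_prev + i * g   # updated cell (memory) state
    h = o * math.tanh(c)     # updated hidden state, used for the prediction
    return h, c


# Hypothetical weights, for illustration only.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, 0.2, 0.0),
    "output": (0.3, 0.3, 0.0),
    "candidate": (0.8, 0.1, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, params=params)
```

In a real network the scalars become vectors and the weights become matrices, but the gating logic is the same.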

For a deeper understanding of LSTMs, visit Chris Olah’s post.

Prediction with image as initial state

In a sentence language model, an LSTM is predicting the next word in a sentence. Similarly, in a character language model, an LSTM is trying to predict the next character, given the context of previously seen characters.

In an image caption model, you will create an embedding of the image. This embedding will then be fed as initial state into an LSTM. This becomes the first previous state to the language model, influencing the next predicted words.

At each time-step, the LSTM considers the previous cell state and outputs a prediction for the most probable next value in the sequence. This process is repeated until the end token is sampled, signaling the end of the caption.

Captions as a search problem

Generating a caption can be viewed as a graph search problem. Here, the nodes are words. The edges are the probability of moving from one node to another. Finding the optimal path involves maximizing the total probability of a sentence.

Sampling and choosing the most probable next value is a greedy approach to generating a caption. It is computationally efficient, but can lead to a sub-optimal result.
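
The following sketch illustrates why the greedy approach can miss the best sentence. The word graph and its probabilities are invented for this example; a real model would produce them from the LSTM. Log probabilities are summed instead of multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math

START, END = "<S>", "</S>"

# Hypothetical transition probabilities between words (the edges of the graph).
GRAPH = {
    START: {"a": 0.6, "the": 0.4},
    "a": {"man": 0.55, "dog": 0.45},
    "the": {"dog": 0.9, "man": 0.1},
    "man": {END: 1.0},
    "dog": {END: 1.0},
}


def greedy_path(graph):
    """Always follow the single most probable edge."""
    path, log_prob = [START], 0.0
    while path[-1] != END:
        edges = graph[path[-1]]
        word = max(edges, key=edges.get)
        log_prob += math.log(edges[word])
        path.append(word)
    return path, log_prob


def best_path(graph):
    """Exhaustive depth-first search; only feasible for a tiny graph like this."""
    best = (None, -math.inf)
    stack = [([START], 0.0)]
    while stack:
        path, log_prob = stack.pop()
        if path[-1] == END:
            if log_prob > best[1]:
                best = (path, log_prob)
            continue
        for word, p in graph[path[-1]].items():
            stack.append((path + [word], log_prob + math.log(p)))
    return best


greedy_words, greedy_log_prob = greedy_path(GRAPH)
best_words, best_log_prob = best_path(GRAPH)
print(greedy_words, math.exp(greedy_log_prob))
print(best_words, math.exp(best_log_prob))
```

Here the greedy path starts with "a" because it is the most probable first word, yet the sentence starting with "the" has the higher total probability (0.36 versus 0.33).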

Given the size of the vocabulary, enumerating every possible sentence to find the optimal one is infeasible in both time and space. This rules out an exhaustive search algorithm such as Depth First Search or Breadth First Search for finding the optimal path.

Beam Search

Beam search is a breadth-first search algorithm that explores the most promising nodes. It generates all possible next paths, keeping only the top N best candidates at each iteration.

As the number of nodes to expand from is fixed, this algorithm is space-efficient and allows more potential candidates than a best-first search.
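
A minimal sketch of beam search over a toy word graph is shown below. The `TOY_MODEL` probabilities, the `<S>` and `</S>` token names, and the function signature are assumptions made for this example; they are not taken from the project's caption generator, which would query the LSTM instead of a lookup table.

```python
import heapq
import math

START, END = "<S>", "</S>"

# Hypothetical next-word probabilities keyed by the previous word.
TOY_MODEL = {
    START: {"a": 0.6, "the": 0.4},
    "a": {"man": 0.55, "dog": 0.45},
    "the": {"dog": 0.9, "man": 0.1},
    "man": {END: 1.0},
    "dog": {END: 1.0},
}


def beam_search(model, beam_size=2, max_length=10):
    """Return finished (log_prob, words) pairs, most probable first.

    Sentences still unfinished after max_length steps are dropped in this sketch.
    """
    beam = [(0.0, [START])]  # partial sentences still being extended
    finished = []
    for _ in range(max_length):
        if not beam:
            break
        # Expand every partial sentence by every possible next word.
        candidates = []
        for log_prob, words in beam:
            for word, p in model[words[-1]].items():
                candidates.append((log_prob + math.log(p), words + [word]))
        # Keep only the top N candidates for the next iteration.
        beam = []
        top = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        for log_prob, words in top:
            if words[-1] == END:
                finished.append((log_prob, words))
            else:
                beam.append((log_prob, words))
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished


for log_prob, words in beam_search(TOY_MODEL, beam_size=2):
    print(" ".join(words[1:-1]), math.exp(log_prob))
```

With `beam_size=1` this reduces to the greedy approach and returns "a man"; with `beam_size=2` the more probable "the dog" survives the pruning and is ranked first.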

Review

Up to this point, you’ve learned about creating a model architecture to generate a sentence, given an image. This is done by utilizing a CNN to create a dense embedding and feeding this as initial state to an LSTM. Additionally, you’ve learned how to generate better sentences with beam search.

In the next section, you’ll learn to generate captions from a pre-trained model in Tensorflow.

Creating captions in Tensorflow

# Project Structure

├── Dockerfile
├── bin
│   └── download_model.py
├── etc
│   ├── show-and-tell-2M.zip
│   ├── show-and-tell.pb
│   └── word_counts.txt
├── imgs
│   └── trading_floor.jpg
├── medium_show_and_tell_caption_generator
│   ├── __init__.py
│   ├── caption_generator.py
│   ├── inference.py
│   ├── model.py
│   └── vocabulary.py
└── requirements.txt

Environment setup

Here, you’ll use Docker to install Tensorflow.

Docker is a container platform that simplifies deployment. It solves the problem of installing software dependencies onto different server environments. If you are new to Docker, you can read more here. To install Docker, run:

curl https://get.docker.com | sh

After installing Docker, you’ll create two files: a requirements.txt for the Python dependencies and a Dockerfile to create your Docker environment.

To build this image, run:

$ docker build -t colemurray/medium-show-and-tell-caption-generator -f Dockerfile .

# On MBP, ~ 3mins
# Image can be pulled from dockerhub below

If you would like to avoid building from source, the image can be pulled from dockerhub using:

docker pull colemurray/medium-show-and-tell-caption-generator # Recommended

Download the model

Below, you’ll download the model graph and pre-trained weights. These weights come from a training run of 2 million iterations on the MSCOCO dataset.

To download, run:

docker run -e PYTHONPATH=$PYTHONPATH:/opt/app -v $PWD:/opt/app \
    -it colemurray/medium-show-and-tell-caption-generator \
    python3 /opt/app/bin/download_model.py \
    --model-dir /opt/app/etc

Next, create a model class. This class is responsible for loading the graph, creating image embeddings, and running an inference step on the model.

Download the vocabulary

When training an LSTM, it is standard practice to tokenize the input. For a sentence model, this means mapping each unique word to a unique numeric id. This allows the model to utilize a softmax classifier for prediction.
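
As a small illustration of both ideas, the sketch below maps the words of a sentence to numeric ids and turns a vector of scores into a probability distribution with softmax. The example sentence and the `logits` values are made up; in the real model the logits come from the LSTM output and have one entry per vocabulary word.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Tokenize: give each unique word the next unused id.
sentence = "a dog on a beach"
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id))
token_ids = [word_to_id[word] for word in sentence.split()]

# Hypothetical scores for the next word, one per vocabulary id.
logits = [2.0, 1.0, 0.5, 0.1]
probs = softmax(logits)
predicted_id = probs.index(max(probs))
```

Because every word has an id, predicting the next word becomes a classification problem over the ids, which is what the softmax classifier solves.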

Below, you’ll download the vocabulary used for the pre-trained model and create a class to load it into memory. Here, the line number represents the numeric id of the token.

# File structure
# token num_of_occurrences

# on 213612
# of 202290
# the 196219
# in 182598

curl -o etc/word_counts.txt https://raw.githubusercontent.com/ColeMurray/medium-show-and-tell-caption-generator/master/etc/word_counts.txt

To store this vocabulary in memory, you’ll create a class responsible for mapping words to ids.
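
One way such a class could look is sketched below. It is not the project's vocabulary.py; the class name, method names, the zero-based ids, and the handling of unknown words with an `<UNK>` placeholder are all assumptions made for illustration. It only relies on the file format described above, where each line holds a token followed by its count.

```python
class Vocabulary:
    """Maps words to ids and back, using the (zero-based) line number as the id."""

    def __init__(self, lines, unk_word="<UNK>"):
        # Each non-empty line is expected to look like "token count".
        self.id_to_word = [line.split()[0] for line in lines if line.strip()]
        self.word_to_id = {word: i for i, word in enumerate(self.id_to_word)}
        # Assumption: unknown words share one extra id past the end of the file.
        self.unk_word = unk_word
        self.unk_id = len(self.id_to_word)

    def to_id(self, word):
        return self.word_to_id.get(word, self.unk_id)

    def to_word(self, word_id):
        if 0 <= word_id < len(self.id_to_word):
            return self.id_to_word[word_id]
        return self.unk_word


# Sample lines in the same "token count" format as word_counts.txt.
sample_lines = ["on 213612", "of 202290", "the 196219", "in 182598"]
vocab = Vocabulary(sample_lines)
print(vocab.to_id("the"), vocab.to_word(3), vocab.to_id("zebra"))
```

To load the downloaded file instead of the sample lines, an open file handle can be passed directly, since iterating over it yields one line at a time.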

Creating a caption generator

To generate captions, first you’ll create a caption generator. This caption generator utilizes beam search to improve the quality of sentences generated.

At each iteration, the generator passes the previous LSTM state (initially the image embedding) and the previous sequence to the model, which produces the next softmax vector.

The top N most probable candidates are kept and utilized in the next inference step. This process continues until either the max sentence length is reached or all sentences have generated the end-of-sentence token.

Next, you’ll load the Show and Tell model and use it with the above caption generator to create candidate sentences. These sentences will be printed along with their log probability.

Results

To generate captions, you’ll need to pass in one or more images to the script.

docker run -v $PWD:/opt/app \
    -e PYTHONPATH=$PYTHONPATH:/opt/app \
    -it colemurray/medium-show-and-tell-caption-generator \
    python3 /opt/app/medium_show_and_tell_caption_generator/inference.py \
    --model_path /opt/app/etc/show-and-tell.pb \
    --input_files /opt/app/imgs/trading_floor.jpg \
    --vocab_file /opt/app/etc/word_counts.txt

You should see output:

Captions for image trading_floor.jpg:
  0) a group of people sitting at tables in a room . (p=0.000306)
  1) a group of people sitting around a table with laptops . (p=0.000140)
  2) a group of people sitting at a table with laptops . (p=0.000069)
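
The p values in this output are small because each one is the product of the probabilities of every word in the caption. If the generator tracks a log probability per sentence, a line in this format can be produced by exponentiating it, as in the sketch below. The `format_caption` helper is hypothetical and is not taken from inference.py.

```python
import math


def format_caption(index, words, log_prob):
    """Render one candidate caption with its probability, as in the output above."""
    return "%d) %s (p=%f)" % (index, " ".join(words), math.exp(log_prob))


# Hypothetical candidate: a word list and the sum of its per-word log probabilities.
words = ["a", "group", "of", "people", "."]
line = format_caption(0, words, math.log(0.000306))
print(line)
```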

Conclusion

In this tutorial, you learned:

how a convolutional neural network and LSTM can be combined to generate captions for an image
how to utilize the beam search algorithm to consider multiple captions and select the most probable sentence

Complete code here.

Next Steps:

Try with your own images
Read the Show and Tell paper
Create an API to serve captions

Call to Action:

If you enjoyed this tutorial, follow and recommend!

Interested in learning more about Deep Learning / Machine Learning? Check out my other tutorials:

- Building a Facial Recognition Pipeline with Deep Learning in Tensorflow

- Deep Learning CNN’s in Tensorflow with GPUs

- Deep Learning with Keras on Google Compute Engine

- Recommendation Systems with Apache Spark on Google Compute Engine

Other places you can find me:

- Twitter: https://twitter.com/_ColeMurray

Translated from: https://www.freecodecamp.org/news/building-an-image-caption-generator-with-deep-learning-in-tensorflow-a142722e9b1f/
