While most articles about deep learning focus on modeling, only a few cover how to deploy such models to production. Some of them say “production”, but they often simply take the unoptimized model and embed it in a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.

The “production” approach

If you search for how to deploy TensorFlow, Keras or PyTorch models into production, you will find a lot of good tutorials, but sometimes you come across very simple examples that claim to be production-ready. These examples often take the raw Keras model, put it behind a Flask web server and package everything into a Docker container, so predictions are served from Python. The code for these “production” Flask web servers looks like this:

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")  # loaded once, when the script starts


@app.route("/", methods=["POST"])
def index():
    data = request.json
    # preprocess() is not shown in the example; it must be defined elsewhere
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})

Furthermore, they often show how to containerize the Flask server and bundle it with your model in a Docker image. These approaches also claim that they scale easily by increasing the number of Docker instances.

Now let us recap what happens here and why it is not “production” grade.

Not optimizing models

First, the model is usually used as is, which means the Keras model from the example was simply exported with model.save(). The saved file contains not only the weights but also training state such as the optimizer and its variables, which is needed to continue training but not for inference. Also, the model is neither pruned nor quantized. As a result, unoptimized models have higher latency, need more compute and are larger in file size.

Example with EfficientNet-B5:

Keras model (.h5): 454 MB
Optimized TensorFlow model (no quantization): 222 MB

Using Flask and the Python API

The next problem is that plain Python and Flask are used to load the model and serve predictions. This causes a lot of problems.

First, let’s look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded once when the script starts, but other tutorials move this part into the predict function. That loads the model every single time you make a prediction. Please do not do that.
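
The difference between the two variants can be sketched without TensorFlow at all. In the minimal example below, load_model_from_disk is a hypothetical stand-in for an expensive call such as keras.models.load_model, and LOAD_LOG only exists to count how often that call runs; none of these names come from the tutorials discussed here.

```python
from functools import lru_cache

LOAD_LOG = []  # one entry per (simulated) expensive model load


def load_model_from_disk(path):
    """Hypothetical stand-in for an expensive call such as keras.models.load_model."""
    LOAD_LOG.append(path)
    # A real loader would read hundreds of megabytes from disk here.
    return {"path": path}


def predict_reloading(path, x):
    """Anti-pattern: the model is loaded again on every single request."""
    model = load_model_from_disk(path)
    return (model["path"], x)


@lru_cache(maxsize=None)
def get_model(path):
    """Load each model path at most once per process and reuse it afterwards."""
    return load_model_from_disk(path)


def predict_cached(path, x):
    """Reuses the already loaded model for every request."""
    model = get_model(path)
    return (model["path"], x)
```

Calling predict_reloading three times adds three entries to LOAD_LOG, while any number of predict_cached calls for the same path adds only one.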

That being said, let’s look at Flask. Flask includes a powerful and easy-to-use web server for development. On the official website, you can read the following:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.


That said, you can use Flask as a WSGI app in, for example, Google App Engine. However, many tutorials use neither Google App Engine nor NGINX; they just run Flask’s built-in server as is and put it into a Docker container. And even when they use NGINX or another web server, they usually turn off multithreading completely.

Let’s look a bit deeper into the problem here. If you use TensorFlow, it handles the compute resources (CPU, GPU) for you. If you load a model and call predict, TensorFlow uses those resources to make the prediction. While this happens, the resource is in use, that is, locked. As long as your web server serves only a single request at a time, you are fine, because the model was loaded in this thread and predict is called from this thread. But once you allow more than one request at a time, your web server stops working, because you simply cannot access a TensorFlow model from different threads. So in this setup you cannot process more than one request at once. That doesn’t really sound scalable, right?
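
To make the consequence concrete, here is a small sketch that does not use TensorFlow. It assumes a hypothetical SerializedModel whose predict call can only run for one request at a time (simulated with a lock and a sleep), and it shows that requests arriving together are still processed one after another. It illustrates the effect of one-at-a-time serving, not TensorFlow’s internal behavior.

```python
import threading
import time


class SerializedModel:
    """Hypothetical model wrapper that allows only one prediction at a time."""

    def __init__(self, seconds_per_prediction):
        self.seconds_per_prediction = seconds_per_prediction
        self._lock = threading.Lock()
        self._active = 0
        self.max_active = 0  # highest number of predictions seen running at once

    def predict(self, x):
        with self._lock:  # every other request waits here
            self._active += 1
            self.max_active = max(self.max_active, self._active)
            time.sleep(self.seconds_per_prediction)  # stands in for real inference
            self._active -= 1
            return x * 2


def serve_concurrently(model, inputs):
    """Send all inputs at once from separate threads; return results and elapsed seconds."""
    results = [None] * len(inputs)

    def worker(i, x):
        results[i] = model.predict(x)

    start = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(i, x)) for i, x in enumerate(inputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start
```

With four simultaneous requests, the elapsed time is roughly four times the single prediction time, so the last caller waits for everyone ahead of it.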

Example:

Flask development web server: 1 simultaneous request
TensorFlow Extended (TFX) model server: parallelism configurable

Scaling “low-load” instances with Docker

OK, the web server does not scale, but what about scaling the number of web servers? In a lot of examples, this is the solution to the scaling problem of single instances. There is not much to say about it; it works, sure. But scaling this way wastes money, resources and energy. It’s like having a truck, loading a single parcel, and getting another truck once there are more parcels, instead of using the existing truck smarter.

Example latency:

Flask serving as shown above: ~2 s per image
TensorFlow model server (no batching, no GPU): ~250 ms per image
TensorFlow model server (no batching, GPU): ~100 ms per image
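
These latencies translate directly into the number of instances you need. The sketch below is simple arithmetic under two assumptions that are not from the measurements above: each instance processes one image at a time, and the target load of 20 images per second used in the note after the code is a made-up figure.

```python
# Latencies from the example above, in milliseconds per image.
LATENCIES_MS = {
    "flask": 2000,
    "model_server_cpu": 250,
    "model_server_gpu": 100,
}


def images_per_second(latency_ms):
    """Throughput of one instance that processes one image at a time."""
    return 1000 / latency_ms


def instances_needed(target_images_per_second, latency_ms):
    """Smallest number of one-at-a-time instances that sustains the target load."""
    # Integer ceiling division of (target * latency_ms) by 1000.
    return (target_images_per_second * latency_ms + 999) // 1000
```

For the hypothetical load of 20 images per second, instances_needed gives 40 Flask containers, 5 CPU model servers or 2 GPU model servers.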

Not using GPUs/TPUs

GPUs made deep learning possible because they can run operations massively in parallel. When Docker containers are used to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is much slower on CPU machines, so latency is a big problem. Even with powerful CPU instances, you will not achieve results comparable to those of small GPU instances.

Just a side note: in general, it is possible to use GPUs in Docker if the host has the correct driver installed. Docker is completely fine for scaling up instances, but scale up the correct instances.

Example costs:

2 CPU instances (16 cores, 32 GB RAM, a1.4xlarge): 0.816 $/h
1 GPU instance (32 GB RAM, 4 cores, Tesla M60, g3s.xlarge): 0.75 $/h
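
To compare these prices per prediction rather than per hour, you also need a latency. The sketch below assumes, purely for illustration, that the model server latencies quoted earlier in this post (~250 ms on CPU, ~100 ms on GPU) apply to these instance types and that every instance is kept fully busy. Neither assumption is confirmed by the figures above, so treat the result as a rough estimate.

```python
def cost_per_1000_images(price_per_hour, instances, latency_ms):
    """Dollar cost of 1,000 predictions when every instance is kept fully busy.

    price_per_hour is the total hourly price for all instances together.
    """
    images_per_hour = instances * 3_600_000 / latency_ms
    return price_per_hour / images_per_hour * 1000


# Assumed pairing of the prices above with the latencies quoted earlier.
cpu_cost = cost_per_1000_images(0.816, instances=2, latency_ms=250)
gpu_cost = cost_per_1000_images(0.75, instances=1, latency_ms=100)
```

Under these assumptions, the single GPU instance comes out cheaper per 1,000 images than the two CPU instances, at roughly $0.021 versus $0.028.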

It’s already solved

As you can see, loading a trained model and putting it into Flask Docker containers is not an elegant solution. If you want deep learning in production, start with the model, then think about servers and finally about scaling instances.

Optimize the model

Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time several times over, so it is worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but if you come from a Keras model, it requires you to convert your model into either an estimator or a TensorFlow graph (SavedModel format). TensorFlow itself has a tutorial for this. To optimize further, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.
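
To give an intuition for what pruning and quantization do, here is a toy sketch in plain Python. It is not the TensorFlow tooling: it works on a flat list of weights, the pruning rule (drop the smallest magnitudes) and the quantization scheme (one shared scale mapped to 8-bit integers) are simple textbook variants, and all function names are made up for this example.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)
    pruned = list(weights)
    # Indices of the k weights closest to zero.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]
```

Pruning makes the weight list sparse, and quantization stores each weight in one byte instead of four, at the price of a rounding error of at most half the scale per weight.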

Use model servers

Once you have an optimized model, you can look at the different model servers meant for deep learning models in production. For TensorFlow and Keras, TensorFlow Extended (TFX) offers the TensorFlow model server. There are also others such as TensorRT, Clipper, MLflow and DeepDetect.

The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models, with no downtime when deploying a new version, while still being able to use the old version. In addition to the gRPC API, it also has an optional REST API endpoint. The throughput is much higher than with a Flask API, as it is written in C++ and uses multithreading. Additionally, you can even enable batching, where the server groups multiple single predictions into one batch for very high load settings. And finally, you can put it into a Docker container and scale even more.
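
As a rough idea of how a client talks to the REST endpoint, here is a minimal sketch using only the Python standard library. The host, the model name and the input shape are placeholders. Port 8501 is the REST port conventionally used with the official Docker image, so check how your own server was started.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Build (but do not send) a predict request for the TensorFlow Serving REST API."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"  # pin a specific model version
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request, timeout=10):
    """Send the request and return the "predictions" field of the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

Leaving out version lets the server pick the version it currently serves, while passing a version number keeps a client on the old model during a rollout.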

Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI; with TensorFlow 2 it’s called tensorflow2_model_server.

Use GPU instances

And lastly, I would recommend using GPUs or TPUs for inference environments. With such accelerators, latency is much lower and throughput much higher, while saving energy and money. Note that an accelerator is only utilized if your software stack can make use of the power of GPUs (optimized model + model server). In AWS, you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).

Originally posted on digital-thnking.de


Translated from: https://towardsdatascience.com/how-to-not-deploy-keras-tensorflow-models-4fa60b487682
