I recently ran into some model performance problems. This post covers what I learned about JIT (just-in-time) compilation while looking into them:

Overview

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. The code lives in tensorflow/compiler.

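Before XLA, execution of a TensorFlow computation graph was driven by the runtime code: the runtime loads the graph definition, creates the graph, partitions it, optimizes it, assigns devices, manages dependencies between nodes, and schedules the execution of each node's kernel. The graph is the data and the runtime is the code. With XLA there is another option: the graph can be compiled directly into executable code for the target platform and run as is, without the runtime code being involved.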
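To make the contrast concrete, here is a toy sketch in plain Python. It is not TensorFlow's runtime or XLA; the graph format, `KERNELS`, `run_with_runtime`, and `compile_graph` are all hypothetical names invented for illustration. The first function walks the graph and dispatches one kernel per node on every call; the second translates the graph once into a single callable.

```python
import operator

# Toy graph: node name -> (op, input names). Hypothetical format, not a GraphDef.
GRAPH = {
    "prod": ("mul", ["x", "w"]),
    "out": ("add", ["prod", "b"]),
}
KERNELS = {"mul": operator.mul, "add": operator.add}


def run_with_runtime(graph, feeds, target):
    """Runtime-style execution: resolve dependencies and dispatch a kernel per node."""
    values = dict(feeds)

    def evaluate(name):
        if name not in values:
            op, inputs = graph[name]
            values[name] = KERNELS[op](*[evaluate(i) for i in inputs])
        return values[name]

    return evaluate(target)


def compile_graph(graph, target, arg_names):
    """Compile-style execution: turn the whole graph into one Python function, once."""
    symbols = {"mul": "*", "add": "+"}

    def expr(name):
        if name not in graph:  # a graph input, referenced by argument name
            return name
        op, (lhs, rhs) = graph[name]
        return "(%s %s %s)" % (expr(lhs), symbols[op], expr(rhs))

    source = "lambda %s: %s" % (", ".join(arg_names), expr(target))
    return eval(source)  # acceptable for a toy; the source is built from our own graph


if __name__ == "__main__":
    print(run_with_runtime(GRAPH, {"x": 2.0, "w": 3.0, "b": 1.0}, "out"))
    compiled = compile_graph(GRAPH, "out", ["x", "w", "b"])
    print(compiled(2.0, 3.0, 1.0))
```

Both paths compute the same value; the difference is that the compiled callable no longer needs the graph or the dispatch loop at call time.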

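XLA uses JIT compilation to analyze the TensorFlow graph that the user builds at runtime, specializes it for the actual runtime dimensions and types, fuses multiple operations together, and generates efficient native code for them. This works for devices such as CPUs and GPUs as well as custom accelerators (for example, Google's TPU).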
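The idea of operator fusion can be illustrated without TensorFlow. The sketch below is plain Python and only conceptual; it is not how XLA generates code, and the function names `unfused` and `fused` are made up for this example. The unfused version runs three separate passes and materializes two intermediate lists; the fused version does the same arithmetic in a single pass.

```python
def unfused(xs, ys, bias):
    """Three separate 'kernels', each producing a full intermediate list."""
    prod = [x * y for x, y in zip(xs, ys)]   # kernel 1: multiply
    shifted = [p + bias for p in prod]       # kernel 2: add bias
    return [max(s, 0.0) for s in shifted]    # kernel 3: ReLU


def fused(xs, ys, bias):
    """One 'kernel': multiply, add, and ReLU in a single pass, no intermediates."""
    return [max(x * y + bias, 0.0) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [1.0, -2.0, 3.0]
    ys = [0.5, 4.0, -1.0]
    print(unfused(xs, ys, 1.0))
    print(fused(xs, ys, 1.0))
```

On real hardware the benefit comes from avoiding the memory traffic for the intermediate buffers and the per-kernel launch overhead, which this toy does not measure.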

XLA is currently experimental. Most use cases will see no improvement in performance (faster execution or lower memory usage).

Code example

The code comes from tensorflow\examples\tutorials\mnist\mnist_softmax_xla.py in the TensorFlow source tree.

The principles behind this code are similar to those in my previous few posts, so I won't go over the shared points again.

Enabling JIT compilation

JIT compilation can be turned on in the following ways:

Option 1, through the Session config (applies to the whole session):

# Config to turn on JIT compilation
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)

Option 2, through tf.contrib.compiler.jit.experimental_jit_scope():

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
x = tf.placeholder(np.float32)
with jit_scope():
    y = tf.add(x, x)  # The "add" will be compiled with XLA.

Option 3, by placing operations on an XLA device:

with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
    output = tf.add(input1, input2)

Recording metadata and the timeline file

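The run metadata records the time and memory consumed during execution. This information can be exported and saved as a timeline file, which can then be viewed in the Chrome browser.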

from tensorflow.python.client import timeline

run_metadata = tf.RunMetadata()
sess = tf.Session(config=config)
# Request a full trace for this step and collect it into run_metadata.
sess.run(train_step,
         feed_dict={x: batch_xs, y_: batch_ys},
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)

An earlier post wrote this metadata into the TensorBoard event log.

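Here it is written to a timeline file on disk instead. The file is in JSON format and can be visualized with Chrome: open "chrome://tracing" in the browser and drag the file onto the page to see where the run time went. The interface is similar to the one Android uses for performance analysis.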
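Besides viewing the file in Chrome, the JSON can also be summarized with a few lines of standard-library Python, which is handy for comparing a run with XLA on against one with XLA off. This is a minimal sketch, assuming the file follows the Chrome trace event format with a top-level "traceEvents" list in which each timed operation is a complete event ("ph": "X") carrying a "dur" field in microseconds. The helper names `summarize_trace` and `summarize_trace_file` are hypothetical.

```python
import json
from collections import defaultdict


def summarize_trace(trace, top_n=10):
    """Return the top_n (event_name, total_microseconds) pairs, longest first."""
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # Only complete events ("X") carry a duration; metadata events are skipped.
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


def summarize_trace_file(path, top_n=10):
    """Load a trace file from disk and summarize it."""
    with open(path) as trace_file:
        return summarize_trace(json.load(trace_file), top_n)


if __name__ == "__main__":
    # Tiny hand-written trace standing in for a real timeline file.
    sample = {
        "traceEvents": [
            {"ph": "M", "name": "process_name", "pid": 0},
            {"ph": "X", "name": "MatMul", "ts": 0, "dur": 120, "pid": 0, "tid": 0},
            {"ph": "X", "name": "Add", "ts": 120, "dur": 30, "pid": 0, "tid": 0},
            {"ph": "X", "name": "MatMul", "ts": 150, "dur": 80, "pid": 0, "tid": 0},
        ]
    }
    for name, micros in summarize_trace(sample):
        print("%-10s %6d us" % (name, micros))
```

For a real run, call `summarize_trace_file` on the timeline file written by the training script.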

Full code

I added code that writes the graph structure to a TensorBoard file. The rest is essentially unchanged.

"""Simple MNIST classifier example with JIT XLA and timelines."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.client import timeline

FLAGS = None


def main(_):
    # Import data
    mnist = input_data.read_data_sets(FLAGS.data_dir)

    # Create the model
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.matmul(x, w) + b

    # Define loss and optimizer
    y_ = tf.placeholder(tf.int64, [None])

    # The raw formulation of cross-entropy,
    #
    #   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y)),
    #                                 reduction_indices=[1]))
    #
    # can be numerically unstable.
    #
    # So here we use tf.losses.sparse_softmax_cross_entropy on the raw
    # logit outputs of 'y', and then average across the batch.
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    config = tf.ConfigProto()
    jit_level = 0
    if FLAGS.xla:
        # Turns on XLA JIT compilation.
        jit_level = tf.OptimizerOptions.ON_1

    config.graph_options.optimizer_options.global_jit_level = jit_level
    run_metadata = tf.RunMetadata()
    sess = tf.Session(config=config)
    tf.global_variables_initializer().run(session=sess)

    writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph)
    writer.close()

    # Train
    train_loops = 1000
    for i in range(train_loops):
        batch_xs, batch_ys = mnist.train.next_batch(100)

        # Create a timeline for the last loop and export to json to view with
        # chrome://tracing/.
        if i == train_loops - 1:
            sess.run(train_step,
                     feed_dict={x: batch_xs, y_: batch_ys},
                     options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                     run_metadata=run_metadata)
            trace = timeline.Timeline(step_stats=run_metadata.step_stats)
            with open('timeline.ctf.json', 'w') as trace_file:
                trace_file.write(trace.generate_chrome_trace_format())
        else:
            sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    # Test trained model
    correct_prediction = tf.equal(tf.argmax(y, 1), y_)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy,
                   feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
    sess.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, default='./data',
                        help='Directory for storing input data')
    parser.add_argument('--xla', type=bool, default=True,
                        help='Turn xla via JIT on')
    parser.add_argument('--log_dir', type=str, default='./logs',
                        help='Directory to put the log data.')
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
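One caveat when using this script to compare runs with and without XLA: the --xla flag is declared with type=bool, and argparse simply calls bool() on the argument string, so any non-empty value, including "False", parses as True. The sketch below shows one possible workaround using only the standard library; `str2bool` and `build_parser` are hypothetical helpers, not part of the original example.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: parse common textual spellings of a boolean."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Build a parser whose --xla flag can actually be switched off."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--xla', type=str2bool, default=True,
                        help='Turn xla via JIT on')
    return parser


if __name__ == "__main__":
    # bool() on any non-empty string is True, which is why type=bool misleads.
    print(bool("False"))
    print(build_parser().parse_args(["--xla", "False"]).xla)
```

With a parser like this, passing --xla False on the command line would disable JIT compilation, so timelines for both settings can be produced and compared.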
