原文地址:http://www.reedbeta.com/blog/gpu-profiling-101/

GPU Profiling 101

October 12, 2011 · Coding, GPU, Graphics · Comments

In all my graphics demos, even the smallest ones, you’ll typically find a readout like this in one corner of the screen:

Draw time: 0.71 msShadow clear: 0.03 msShadow objects: 0.11 msMain clear: 0.10 msMain objects: 0.47 ms
GPU frame time: 16.70 ms
CPU frame time: 16.61 ms

Even for simple apps that hardly do any complex rendering, I still like to include a real-time performance measurement system. I choose to implement this early on in development because it’s my belief that although it can be too early to optimize, it’s never too early to profile.

Even at the very beginning of development, I want to know that the performance numbers I’m seeing are reasonable. I don’t need to worry about the details—I’m not going to worry that 0.47 ms is too long to spend drawing a 2700-triangle, untextured mesh—but I do want to know the numbers are at about the right order of magnitude. To put it another way, if I were spending 4.7 ms drawing a 2700-triangle, untextured mesh, then I’d be wondering what was going on! More than likely, it would be because I was doing something dumb that was forcing the driver or GPU into a slow mode of execution. This kind of bug can be hard to spot because the rendered frame is still visually correct, so you can’t see it, unless it causes you to drop from 60 to 30 Hz or something like that. But if you’re measuring your performance and you at least sanity-check your numbers from time to time, you’ll notice something’s wrong.

The other reason to implement GPU profiling early is that when it comes time to do your performance optimizations for real, you’ll have more trust in your profiling system. You’ll have ironed out most of the bugs, and you’ll have a feel for how much noise there is in the data and what sorts of artifacts there may be.

So now that you know why you should do GPU profiling, how do you actually do it? In this article, I’ll show you how to set it up in Direct3D 11. Similar commands should be available in any modern graphics API, although the details may vary.

First, let’s talk about how not to do GPU profiling. Do not measure performance in terms of framerate! This is something many beginners do, but it’s misleading at best, since framerate isn’t a linear measure of performance. People will make statements like “turning on XYZ dropped my framerate by 10 fps”, but this is meaningless, since a drop from 100 to 90 fps is a very different thing than a drop from 30 to 20 fps. You could clarify by reporting the fps you started from, but why bother? Just report performance results using units of real time. Milliseconds are fine, although you can go one step further and express everything in fractions of your frame budget. For instance, I’m currently targeting 60 Hz, so I have 16.67 ms to render a frame. Instead of saying that my objects rendered in 0.47 ms, I could make Idyll report that they rendered in 2.8% of a frame.

Another caveat that should be obvious: you can’t measure GPU performance by timing on the CPU. For instance, calling QueryPerformanceCounter before and after a draw call won’t measure how long the GPU took to draw your objects, only how long the graphics driver took to queue up that call in the various data structures it has under the hood. That might be useful information to have in general, but it’s not GPU profiling.

Placing Queries

The tools we’ll use to get profiling data out of the GPU are ID3D11Query objects. As described in the docs, these can be created with ID3D11Device::CreateQuery, and come in various flavors. There are two we’ll want: D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT. You’ll need one timestamp query for each stage of your renderer that you want to profile separately (in my case, one each for shadow clear, shadow objects, main clear, and main objects), plus two more to timestamp the beginning and ending of the whole frame. You’ll also need one “timestamp disjoint” query; I’ll return to that one later.

Executing a timestamp query causes the GPU to read an internal clock and store the current value of that clock somewhere for later retrieval. This is done by calling ID3D11DeviceContext::End and passing the query object you created. (There’s no call to Begin for timestamp queries, because they don’t operate like a stopwatch—they don’t measure the elapsed time between two points; they just take a clock reading at a single point.) You’ll want to place one of these calls at the beginning of the frame, and one after each chunk of rendering work you want to profile. You’ll also need one at the very end, after your call to IDXGISwapChain::Present. Each of these is a separate ID3D11Query object, because all of these timestamps need to be collected and stored separately as the GPU chews its way through the frame.

The “timestamp disjoint” query should enclose the whole frame. This one does require paired calls to ID3D11DeviceContext::Begin (just before your first timestamp) and ID3D11DeviceContext::End (just after your last one). What does the disjoint query do? Its most important role for our purposes is to tell us the clock frequency. All the timestamps we’re going to measure are expressed as counts of ticks of some internal GPU clock…but it’s the disjoint query that tells how to convert those ticks into real time. It also has a secondary purpose: if the something went wrong with the clock, such as the counter overflowing and wrapping back to zero or the clock frequency changing during the middle of the frame (as can happen on laptops due to CPU throttling), it’ll also tell us that. However, in those cases we just have to throw our data out, as we don’t have enough information to recover accurate timings. In practice, this rarely happens, and even when it does happen it only leads to one frame of missed data, so it’s not a big deal.

To recap, here’s what a render routine might look like after all these queries have been added:

void Render (ID3D11DeviceContext * pContext)
{// Begin disjoint query, and timestamp the beginning of the framepContext->Begin(pQueryDisjoint);pContext->End(pQueryBeginFrame);// Draw shadow mapClearShadowMap();pContext->End(pQueryShadowClear);DrawStuffInShadowMap();pContext->End(pQueryShadowObjects);// Draw main viewClearMainView();pContext->End(pQueryMainClear);DrawStuffInMainView();pContext->End(pQueryMainObjects);// Display frame on-screen and finish up queriespSwapChain->Present(1, 0);pContext->End(pQueryEndFrame);pContext->End(pQueryDisjoint);
}

Retrieving The Data

So far we’ve set up a number of query objects that tell the GPU when and how to collect the timing data we want. Once the GPU actually renders the frame and collects the data, we need to retrieve it so we can analyze and display it.

The idea is simple enough. We just use the ID3D11DeviceContext::GetData method to see if a particular query has finished yet, and return its result if it has. As the docs say, GetData returns S_FALSE if the query is still in flight on the GPU, and S_OK when it’s done. So, we just need to write a loop that waits (an idle wait, please, not a busy wait!) until GetData returns something other than S_FALSE. Then our query data will be waiting for us!

The timestamp queries all send back a UINT64 containing—you guessed it—the timestamp value, measured in some sort of clock tick. The disjoint query returns a D3D11_QUERY_DATA_TIMESTAMP_DISJOINT struct, which contains the clock frequency and, as mentioned before, the rarely-encountered “disjoint” flag that means you have to throw out the last frame’s data because something weird happened with the clock.

So all we have to do now is subtract adjacent timestamp values to get the deltas, convert those into milliseconds and off we go! The code for this might look like:

void CollectTimestamps (ID3D11DeviceContext * pContext)
{// Wait for data to be availablewhile (pContext->GetData(pQueryDisjoint, NULL, 0, 0) == S_FALSE){Sleep(1);       // Wait a bit, but give other threads a chance to run}// Check whether timestamps were disjoint during the last frameD3D10_QUERY_DATA_TIMESTAMP_DISJOINT tsDisjoint;pContext->GetData(pQueryDisjoint, &tsDisjoint, sizeof(tsDisjoint), 0);if (tsDisjoint.Disjoint){return;}// Get all the timestampsUINT64 tsBeginFrame, tsShadowClear, // ... etc.pContext->GetData(pQueryBeginFrame, &tsBeginFrame, sizeof(UINT64), 0);pContext->GetData(pQueryShadowClear, &tsShadowClear, sizeof(UINT64), 0);// ... etc.// Convert to real timefloat msShadowClear = float(tsShadowClear - tsBeginFrame) /float(tsDisjoint.Frequency) * 1000.0f;// ... etc.
}

Now you have the timings in milliseconds and can do with them as you please! Display them on screen, average them over time, log them to a file, or whatever else you can think of.

Double-Buffered Queries

Back toward the beginning of this article, I said that you would need one timestamp query for each rendering stage and one disjoint query. Actually, I lied. I wanted to keep things simple while explaining the basics, but the truth is that to do a proper GPU profiler, you’ll need to double-buffer your queries, meaning that you need two copies of everything.

This is because the CPU and GPU run simultaneously, not sequentially—or at least they should, for efficiency! While the GPU is rendering frame n, the CPU is forging ahead and working on frame n + 1. When the GPU finishes frame n, it puts it up on screen and starts on frame n + 1; then the CPU gets the go-ahead to start on frame n + 2. But this means that if we fire off a bunch of timestamp queries in frame n on the CPU, then wait for them to complete before we start the next frame, we’ll be serializing the system. The CPU will have to wait for the GPU to finish the frame it just dispatched, and then the GPU will idle waiting for the CPU to finish the next frame. This is a Bad Thing.

Double-buffering allows us to get around this by keeping separate queries for alternate frames. On frame n we’ll fire off one set of queries, then we’ll wait for the other set of queries to complete—these were the queries for frame n − 1. Then we’ll re-use the queries we just collected and fire off a new set for frame n + 1, then finally wait for the queries from frame n. This way we always give the GPU a whole frame to finish its work, and avoid dropping into the slow, serialized mode of execution.

By the way, this is a good example of what I mentioned at the top of this article about sanity-checking your performance numbers! You might have noticed that I’m measuring the total frame time on the CPU as well as on the GPU—on the CPU, it’s just QueryPerformanceCounter as normal. When the CPU and GPU are running simultaneously, as they should be, the CPU and GPU frame time will be equal (plus or minus some measurement noise). But if you didn’t double-buffer your queries, or did something else to make them drop into the slow, serialized execution mode, you’d see the CPU frame time shoot up to substantially larger than the GPU frame time, since it’s doing its own work and then waiting for the GPU to finish all its work.

At this point, you should have all the information you’ll need to go and write a GPU profiler of your own! But in case some of the details aren’t clear, I’ve posted my GPU profiling code from my home engine, which you’re free to include in your own project; it’s under the BSD license. In that code, I’ve defined an enum called GTS (GPU Time-Stamp) to label all the timestamped points in the frame; you should modify it to contain the set of blocks you care about. It handles all the double-buffering for you, and will also average all the deltas over a half-second, for smoother display. Now, go forth and measure how long the GPU is taking to render your scenes!

GPU Profiling 101相关推荐

  1. GPU Profiling

    提起优化,第一件事情要做的就是Profiling,因为没有经过Profiling谁都不知道瓶颈是什么,正确的Profiling是对游戏整体性能的全面认识.关于CPU的Profiling已经非常成熟,各 ...

  2. U3D性能分析 Profiling

    Ports that the Unity profiler uses: Unity分析器使用的端口如下: MulticastPort : 54998 组播端口:54998ListenPorts : 5 ...

  3. Android性能优化典范第二季

    原文链接:http://hukai.me/android-performance-patterns-season-2/ 1)Battery Drain and Networking 对于手机程序,网络 ...

  4. Android 系统性能优化(14)---Android性能优化典范 - 第2季

    1)Battery Drain and Networking 对于手机程序,网络操作相对来说是比较耗电的行为.优化网络操作能够显著节约电量的消耗.在性能优化第1季里面有提到过,手机硬件的各个模块的耗电 ...

  5. Android 性能优化探究

    使用ViewStub动态载入布局.避免一些不常常的视图长期握住引用: ViewStub的一些特点: 1. ViewStub仅仅能Inflate一次,之后ViewStub对象被置空:某个被ViewStu ...

  6. Android 2017面试题整理

    似乎自去年下半年以来,大家跳槽的少了,还有有些公司裁员了,前几年火热的移动端.前端岗位也越来越少,回归理性.现在各大公司对移动Android/ios的需求基本要求都是三年以上相关经验,有过大型互联网项 ...

  7. Android性能优化典范 - 第2季

    Google前几天刚发布了Android性能优化典范第2季的课程,一共20个短视频,包括的内容大致有:电量优化,网络优化,Wear上如何做优化,使用对象池来提高效率,LRU Cache,Bitmap的 ...

  8. Android性能优化典范(二)

    转自: http://www.csdn.net/article/2015-04-29/2824583-android-performance-patterns-season-2/1 摘要:Google ...

  9. Android性能优化典范第五季

    原文链接:http://hukai.me/android-performance-patterns-season-5/ 前言 本季内容大致有:多线程并发的性能问题,介绍了AsyncTask,Handl ...

最新文章

  1. 安装debian总结以及编译linux内核
  2. 四条使用Spring BeanUtils的总结,避免各种诡异的属性拷贝问题!
  3. python叫什么语言-Python代码是什么语言
  4. 几个基于jvm 的微服务框架
  5. (译)理解 LSTM 网络 (Understanding LSTM Networks by colah)
  6. [译]解密 Uber 数据科学团队路径选择算法的优化之路
  7. 华科一篇 20 年前硕士论文,「神预言」深圳赛格大厦晃动?导师回应
  8. 关于null,undefined,空字符串的判断
  9. 如何在Mac上删除其他存储
  10. txt文本阅读器 html,TxT小说阅读器(TxTReader)
  11. Activiti6新特性
  12. 软件工程项目学生管理系统
  13. c语言最小公倍数最简单求法,c语言最小公倍数与最大公约数的求法集锦
  14. 级数ex展开_关于e和ex级数型展开式的规律分析_数学专业.doc
  15. 用php实现本地文件的上传
  16. FreeMarker生成复杂word(包含图片,表格)
  17. python后端工程师岗位职责_【PYTHON后端开发工程师岗位职责_PYTHON后端开发工程师职责/工作内容】-猎聘岗位职责频道...
  18. 完美解决 unbuntu中vim编辑完成后 按ESC毫无反映
  19. 北京科技大学 数值计算方法实验代码
  20. 微信小程序背景图片background无法在手机端显示问题解决方案

热门文章

  1. 容灾技术及建设经验介绍
  2. Git实战技巧-多人协作开发出现代码冲突,如何合并代码
  3. Python: PS 图像调整--颜色梯度
  4. Cookie | Cookie的理论基础、Cookie中常用的方法
  5. 华为p20Pro投屏电脑 手机投屏电脑教程
  6. 【Rosalind】Rabbits and Recurrence Relations
  7. 如何提高项目估算精准度?关键看5大影响因子
  8. Camera成像原理(raw图如何产生的)
  9. FileInputStream 与 BufferedInputStream的比较
  10. 什么是mvc,mvc有什么用?