Cache Coherence for GPU Architectures

Table of Contents

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Background
    • 3.1 Baseline GPU Architecture

Abstract

  • Scalable coherence has been studied in CMPs,
    • but GPUs present new challenges.
  • Conventional directory protocols add unnecessary coherence traffic overhead to existing GPU applications.
  • These protocols also increase the verification complexity of the GPU memory system.
  • Recent research, Library Cache Coherence (LCC) [34, 54], explored
    • the use of time-based approaches
    • in CMP coherence protocols.

  • This paper describes a time-based coherence framework for GPUs,
    • called Temporal Coherence (TC),
    • which exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol.
  • Synchronized counters enable all coherence transitions,
    • such as invalidation of cache blocks,
    • to happen synchronously,
    • eliminating all coherence traffic and protocol races.
  • The paper presents an implementation of TC, called TC-Weak, which
    • eliminates LCC's trade-off between stalling stores and increasing L1 miss rates
    • to improve performance and reduce interconnect traffic.

  • By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU.
  • The authors also find that write-through protocols outperform a writeback protocol on a GPU;
    • the latter suffers from increased traffic
    • due to unnecessary refills of write-once data.

1 Introduction

  • GPUs abstract away the SIMD hardware,
    • providing the illusion of independent scalar threads executing in parallel.
  • Traditionally limited to regular parallelism,
    • recent studies [21, 41] show that highly irregular algorithms can attain significant speedups on a GPU.
  • The multi-level cache hierarchy in recent GPUs [6, 44] removes the burden of software-managed caches
  • and increases the GPU's attractiveness as a platform for accelerating applications with irregular memory access patterns [22, 40].

  • GPUs lack cache coherence and require disabling private caches if an application requires memory operations to be visible across all cores [6, 44, 45].
  • CMPs employ hardware cache coherence [17, 30, 32, 50] to enforce strict memory consistency models.
  • These consistency models form the basis
    of memory models for high-level languages [10, 35] and provide the synchronization primitives employed by multithreaded CPU applications.
  • Coherence greatly simplifies supporting well-defined consistency and memory models
    for high-level languages on GPUs.
  • It also helps enable a unified address space in heterogeneous architectures with
    single-chip CPU-GPU integration [11, 26].
  • This paper focuses on coherence among the GPU cores;
    • CPU-GPU cache coherence is left as future work.

  • Disabling L1 caches provides coherence at the cost of application performance.
  • Figure 1(a) shows the potential improvement for applications that
    • contain inter-workgroup communication and require coherent L1 caches.
  • Compared to disabling L1 caches,
    • an ideally coherent GPU,
    • where coherence traffic does not incur any latency or traffic costs, improves performance of these applications by 88%.

  • GPUs present three main challenges for coherence.
  • Figure 1(b) depicts the first of these challenges by
    • comparing the interconnect traffic of
    • the baseline non-coherent GPU system (NO-COH) to
    • writeback MESI,
    • inclusive write-through GPU-VI, and
    • non-inclusive write-through GPU-VIni (described in Section 4).
  • These protocols introduce unnecessary coherence traffic overheads for GPU applications
    • containing data that does not require coherence.

  • Second, on a GPU, CPU-like worst-case sizing [18] would require an impractical amount of storage for tracking thousands of in-flight coherence requests.
  • Third, existing coherence protocols introduce complexity in the form of transient states and additional message classes.
  • They require additional virtual networks [58] on GPU interconnects to ensure forward progress, increasing power consumption.
  • Tracking a large number of sharers [28, 64] is not a problem for current GPUs,
    • which have only tens of cores.

  • This paper proposes using a time-based coherence framework to
    • minimize the overheads of GPU coherence
    • without introducing design complexity.
  • Traditional coherence protocols rely on
    • explicit messages to
    • inform others
    • when an address needs to be invalidated.
  • The paper describes a time-based coherence framework, TC, which
    • uses synchronized counters to
      self-invalidate cache blocks and
    • maintain coherence invariants without explicit messages.
  • Existing hardware implements counters synchronized across components [23, Section 17.12.1] to provide efficient timer services.
  • Leveraging these counters allows TC to
    • eliminate coherence traffic,
    • lower area overheads, and
    • reduce protocol complexity for GPU coherence.
  • TC requires prediction of cache block lifetimes for self-invalidation.

  • Prior work [34, 54] proposed a time-based hardware coherence protocol, LCC, which
    • implements sequential consistency on CMPs by stalling
      writes to cache blocks until they have been self-invalidated by all sharers.
  • The paper describes one implementation of the TC framework, TC-Strong, similar to LCC.
  • Section 8.3 shows that TC-Strong performs poorly on a GPU.
  • A second implementation, TC-Weak, uses a novel timestamp-based memory fence to eliminate stalling of writes.
  • TC-Weak uses timestamps to drive all consistency operations.
  • It implements Release Consistency [19], enabling full support of C++ and Java memory models [58] on GPUs.

  • Figure 2 shows the high-level operation of TC-Strong and TC-Weak.
  • Cores C2 and C3 have addresses A and B cached in their private L1s.
  • In TC-Strong, C1's write to A stalls completion
    • until C2 self-invalidates
    • its locally cached copy of A.
  • C1's write to B stalls completion
    • until C3 self-invalidates
    • its copy of B.
  • In TC-Weak, C1's writes to A and B do not stall
    • waiting for other copies to be self-invalidated.
  • Instead, the fence operation ensures that all previously written addresses have been self-invalidated in other local caches.
  • This ensures that all previous writes from this core will be globally visible after the fence completes.
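The contrast above can be sketched as follows. This is an illustrative Python model and an assumption on my part, not the paper's implementation; `gwct` loosely stands for a per-core "global write completion time", and the shared counter is modeled as a one-item list.

```python
# Hypothetical sketch of TC-Weak's timestamp-driven fence.
# Names (Core, gwct, clock) are illustrative.

class Core:
    def __init__(self, clock):
        self.clock = clock  # shared synchronized counter (one-item list)
        self.gwct = 0       # latest lease expiry among this core's writes

    def write(self, lease_expiry):
        # The write completes immediately; the core only records when the
        # last remote L1 copy will have self-invalidated.
        self.gwct = max(self.gwct, lease_expiry)

    def fence(self):
        # The fence, not each write, is the single stall point: it waits
        # until the synchronized counter reaches gwct, after which all
        # prior writes from this core are globally visible.
        if self.clock[0] < self.gwct:
            self.clock[0] = self.gwct  # stand-in for waiting

clock = [0]
c1 = Core(clock)
c1.write(lease_expiry=15)  # remote copy of A valid until t=15
c1.write(lease_expiry=9)   # remote copy of B valid until t=9
c1.fence()                 # waits once, until t=15
```

The design point this illustrates: multiple writes overlap freely, and only the fence pays the invalidation latency, once, for the latest of them.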

  • The paper examines the challenges of introducing existing coherence protocols to GPUs, and introduces two optimizations to a VI protocol [30] to make it more suitable for GPUs.
  • It provides detailed complexity and performance evaluations of inclusive and non-inclusive directory protocols on a GPU.
  • It describes Temporal Coherence,
    • a GPU coherence framework that exploits synchronized counters in single-chip systems to eliminate coherence traffic and protocol races.
  • It proposes the TC-Weak coherence protocol, which employs timestamp-based memory fences to implement Release Consistency [19] on a GPU.
  • It proposes a simple lifetime predictor for TC-Weak that performs well across a range of GPU applications.
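As a rough illustration of what a simple lifetime predictor could look like, here is a hypothetical additive scheme. The constants, event names, and policy below are my assumptions, not the paper's exact mechanism: shorten predicted lifetimes when writes keep finding unexpired remote copies, lengthen them when blocks expire but are re-read.

```python
# Hypothetical additive lifetime predictor (illustrative only).

class LifetimePredictor:
    def __init__(self, initial=32, step=4, lo=1, hi=1024):
        self.lifetime = initial          # cycles granted per lease
        self.step, self.lo, self.hi = step, lo, hi

    def predict(self):
        return self.lifetime             # lease length for an L2 read

    def on_expired_reload(self):
        # Block expired but was re-read: prediction was too short,
        # costing extra L1 misses.
        self.lifetime = min(self.hi, self.lifetime + self.step)

    def on_unexpired_write(self):
        # A write found unexpired remote copies: prediction was too long,
        # lengthening fence waits.
        self.lifetime = max(self.lo, self.lifetime - self.step)
```

The tension being balanced is exactly the one the paper names: lifetimes that are too short increase L1 miss rates, while lifetimes that are too long delay writes (TC-Strong) or fences (TC-Weak).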

  • TC-Weak with a simple lifetime predictor improves the performance of applications with inter-workgroup communication by 85%
    • over the baseline non-coherent GPU.
  • It performs as well as the VI protocols and 23% faster than MESI across all benchmarks.
  • For applications with intra-workgroup communication, it reduces the traffic overheads of MESI, GPU-VI and GPU-VIni by 56%, 23% and 22%, reducing interconnect energy usage by 40%, 12% and 12%.
  • Compared to TC-Strong, TC-Weak
    performs 28% faster with 26% lower interconnect traffic across all applications.

  • Section 2 discusses related work,
  • Section 3 reviews GPU architectures and cache coherence,
  • Section 4 describes the directory protocols,
  • Section 5 describes the challenges of GPU coherence,
  • Section 6 details the implementations of TC-Strong and TC-Weak,
  • Sections 7 and 8 present the methodology and results, and
  • Section 9 concludes.

2 Related Work

  • Timestamps were explored in software coherence [42, 63];
  • Nandy [43] first considered them for hardware coherence.
  • Library Cache Coherence (LCC) [34, 54] is a time-based hardware coherence proposal that
    • stores timestamps in the directory and
    • delays stores to unexpired blocks
    • to enforce sequential consistency on CMPs.
  • TC-Strong is similar to LCC:
    • both enforce write atomicity
    • by stalling writes
    • at the shared last-level cache.
  • Unlike LCC, TC-Strong supports multiple outstanding writes from a core and implements a relaxed consistency model.
  • TC-Strong also includes optimizations to eliminate stalls due to private writes and L2 evictions.
  • However, the stalling of writes in TC-Strong
    causes poor performance on GPUs.
  • The authors propose TC-Weak and a novel time-based memory fence to eliminate all write-stalling, improve performance, and reduce interconnect traffic compared to TC-Strong.
  • Unlike for CPU applications [34, 54],
    the fixed timestamp prediction
    proposed by LCC is not suited for GPU applications.
  • They propose a simple yet effective lifetime predictor that can accommodate a range of GPU applications.
  • Lastly, they present a full description of the proposed protocol, including state transition tables that describe the
    implementation in detail.

3 Background

  • This section describes the memory system and cache hierarchy of the baseline non-coherent GPU,
    • similar to NVIDIA's Fermi [44],
    • that is evaluated in this paper.
  • Cache coherence is also briefly discussed.

3.1 Baseline GPU Architecture

  • Figure 3 shows the organization of the baseline non-coherent GPU.
  • An OpenCL [29] or CUDA [46] application begins execution on a CPU
    • and launches compute kernels onto a GPU.
  • Each kernel launches a hierarchy of threads (an NDRange of workgroups of wavefronts of work-items/scalar threads) onto a GPU.
  • Each workgroup is assigned to a single multi-threaded GPU core.
  • Scalar threads are managed as a SIMD execution group
    • consisting of 32 threads,
    • called a warp (NVIDIA terminology)
    • or wavefront (AMD terminology).
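The thread hierarchy above can be illustrated with simple index arithmetic. The function and sizes below are made-up examples for illustration, not part of any GPU API; only the 32-thread wavefront width comes from the text.

```python
# Illustrative decomposition of a flat thread id into the hierarchy
# described above: NDRange -> workgroups -> 32-wide wavefronts -> lanes.

WAVEFRONT_SIZE = 32  # warp (NVIDIA) / wavefront (AMD) width

def locate(flat_tid, workgroup_size):
    """Map a flat thread id to (workgroup, wavefront-in-group, lane)."""
    workgroup = flat_tid // workgroup_size
    local_id = flat_tid % workgroup_size   # position within the workgroup
    wavefront = local_id // WAVEFRONT_SIZE
    lane = local_id % WAVEFRONT_SIZE       # position within the SIMD group
    return workgroup, wavefront, lane
```

For example, with a (hypothetical) workgroup size of 128, thread 200 falls in workgroup 1, wavefront 2 of that group, lane 8.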
