Cache Coherence for GPU Architectures

Table of Contents

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 Background
    • 3.1 Baseline GPU Architecture

Abstract

  • Scalable coherence has been studied in CMPs,
    • but GPUs present new challenges.
  • Conventional directory protocols add unnecessary coherence traffic overhead to existing GPU applications.
  • These protocols also increase the verification complexity of the GPU memory system.
  • Recent research, Library Cache Coherence (LCC) [34, 54], explored
    • the use of time-based approaches
    • in CMP coherence protocols.

  • This paper describes a time-based coherence framework for GPUs,
    • called Temporal Coherence (TC),
    • which exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol.
  • Synchronized counters enable all coherence transitions,
    • such as invalidation of cache blocks,
    • to happen synchronously,
    • eliminating all coherence traffic and protocol races.
  • The paper presents an implementation of TC, called TC-Weak, which
    • eliminates LCC's trade-off between stalling stores and increasing L1 miss rates
    • to improve performance and reduce interconnect traffic.

  • By providing coherent L1 caches, TC-Weak improves the performance of GPU applications with inter-workgroup communication by 85% over disabling the non-coherent L1 caches in the baseline GPU.
  • The authors also find that write-through protocols outperform a writeback protocol on a GPU;
    • the latter suffers from increased traffic
    • due to unnecessary refills of write-once data.

1 Introduction

  • GPUs abstract away the SIMD hardware,
    • providing the illusion of independent scalar threads executing in parallel.
  • Traditionally limited to regular parallelism,
    • recent studies [21, 41] show that highly irregular algorithms can attain significant speedups on a GPU.
  • The multi-level cache hierarchy in recent GPUs [6, 44] removes the burden of software-managed caches
  • and increases the GPU's attractiveness as a platform for accelerating applications with irregular memory access patterns [22, 40].

  • GPUs lack cache coherence and require disabling private caches if an application requires memory operations to be visible across all cores [6, 44, 45].
  • CMPs employ hardware cache coherence [17, 30, 32, 50] to enforce strict memory consistency models.
  • These consistency models form the basis
    of memory models for high-level languages [10, 35] and provide the synchronization primitives employed by multithreaded CPU applications.
  • Coherence greatly simplifies supporting well-defined consistency and memory models
    for high-level languages on GPUs.
  • It also helps enable a unified address space in heterogeneous architectures with
    single-chip CPU-GPU integration [11, 26].
  • This paper focuses on coherence among the GPU cores;
    • CPU-GPU cache coherence is left as future work.

  • Disabling L1 caches provides coherence at the cost of application performance.
  • Figure 1(a) shows the potential improvement for applications that
    • contain inter-workgroup communication and require coherent L1 caches.
  • Compared to disabling L1 caches,
    • an ideally coherent GPU,
    • where coherence traffic does not incur any latency or traffic costs, improves performance of these applications by 88%.

  • GPUs present three main challenges for coherence.
  • Figure 1(b) depicts the first of these challenges by
    • comparing the interconnect traffic of
    • the baseline non-coherent GPU system (NO-COH) to
    • writeback MESI,
    • inclusive write-through GPU-VI, and
    • non-inclusive write-through GPU-VIni (described in Section 4).
  • These protocols introduce unnecessary coherence traffic overheads for GPU applications
    • containing data that does not require coherence.

  • Second, on a GPU, CPU-like worst-case sizing [18] would require an impractical amount of storage for tracking thousands of in-flight coherence requests.
  • Third, existing coherence protocols introduce complexity in the form of transient states and additional message classes.
  • They require additional virtual networks [58] on GPU interconnects to ensure forward progress, increasing power consumption.
  • Tracking a large number of sharers [28, 64] is not a problem for current GPUs,
    • which have only tens of cores.

  • This paper proposes using a time-based coherence framework to
    • minimize the overheads of GPU coherence
    • without introducing design complexity.
  • Traditional coherence protocols rely on
    • explicit messages to
    • inform others
    • when an address needs to be invalidated.
  • The paper describes a time-based coherence framework, TC, which
    • uses synchronized counters to
      self-invalidate cache blocks and
    • maintain coherence invariants without explicit messages.
  • Existing hardware implements counters synchronized across components [23, Section 17.12.1] to provide efficient timer services.
  • Leveraging these counters allows TC to
    • eliminate coherence traffic,
    • lower area overheads, and
    • reduce protocol complexity for GPU coherence.
  • TC requires prediction of cache block lifetimes for self-invalidation.

  • Prior work [34, 54] proposed a time-based hardware coherence protocol, LCC, which
    • implements sequential consistency on CMPs by stalling
      writes to cache blocks until they have been self-invalidated by all sharers.
  • The paper describes one implementation of the TC framework, TC-Strong, similar to LCC.
  • Section 8.3 shows that TC-Strong performs poorly on a GPU.
  • A second implementation, TC-Weak, uses a novel timestamp-based memory fence to eliminate stalling of writes.
  • TC-Weak uses timestamps to drive all consistency operations.
  • It implements Release Consistency [19], enabling full support of C++ and Java memory models [58] on GPUs.

  • Figure 2 shows the high-level operation of TC-Strong and TC-Weak.
  • Cores C2 and C3 have addresses A and B cached in their private L1s.
  • In TC-Strong, C1's write to A stalls completion
    • until C2 self-invalidates
    • its locally cached copy of A.
  • C1's write to B stalls completion
    • until C3 self-invalidates
    • its copy of B.
  • In TC-Weak, C1's writes to A and B do not stall
    • waiting for other copies to be self-invalidated.
  • Instead, the fence operation ensures that all previously written addresses have been self-invalidated in other local caches.
  • This ensures that all previous writes from this core will be globally visible after the fence completes.
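The contrast above can be sketched as follows. This is an illustrative Python model and an assumption on my part, not the paper's implementation; `gwct` loosely stands for a per-core "global write completion time", and the shared counter is modeled as a one-item list.

```python
# Hypothetical sketch of TC-Weak's timestamp-driven fence.
# Names (Core, gwct, clock) are illustrative.

class Core:
    def __init__(self, clock):
        self.clock = clock  # shared synchronized counter (one-item list)
        self.gwct = 0       # latest lease expiry among this core's writes

    def write(self, lease_expiry):
        # The write completes immediately; the core only records when the
        # last remote L1 copy will have self-invalidated.
        self.gwct = max(self.gwct, lease_expiry)

    def fence(self):
        # The fence, not each write, is the single stall point: it waits
        # until the synchronized counter reaches gwct, after which all
        # prior writes from this core are globally visible.
        if self.clock[0] < self.gwct:
            self.clock[0] = self.gwct  # stand-in for waiting

clock = [0]
c1 = Core(clock)
c1.write(lease_expiry=15)  # remote copy of A valid until t=15
c1.write(lease_expiry=9)   # remote copy of B valid until t=9
c1.fence()                 # waits once, until t=15
```

The design point this illustrates: multiple writes overlap freely, and only the fence pays the invalidation latency, once, for the latest of them.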

  • The paper examines the challenges of introducing existing coherence protocols to GPUs, and introduces two optimizations to a VI protocol [30] to make it more suitable for GPUs.
  • It provides detailed complexity and performance evaluations of inclusive and non-inclusive directory protocols on a GPU.
  • It describes Temporal Coherence,
    • a GPU coherence framework that exploits synchronized counters in single-chip systems to eliminate coherence traffic and protocol races.
  • It proposes the TC-Weak coherence protocol, which employs timestamp-based memory fences to implement Release Consistency [19] on a GPU.
  • It proposes a simple lifetime predictor for TC-Weak that performs well across a range of GPU applications.
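As a rough illustration of what a simple lifetime predictor could look like, here is a hypothetical additive scheme. The constants, event names, and policy below are my assumptions, not the paper's exact mechanism: shorten predicted lifetimes when writes keep finding unexpired remote copies, lengthen them when blocks expire but are re-read.

```python
# Hypothetical additive lifetime predictor (illustrative only).

class LifetimePredictor:
    def __init__(self, initial=32, step=4, lo=1, hi=1024):
        self.lifetime = initial          # cycles granted per lease
        self.step, self.lo, self.hi = step, lo, hi

    def predict(self):
        return self.lifetime             # lease length for an L2 read

    def on_expired_reload(self):
        # Block expired but was re-read: prediction was too short,
        # costing extra L1 misses.
        self.lifetime = min(self.hi, self.lifetime + self.step)

    def on_unexpired_write(self):
        # A write found unexpired remote copies: prediction was too long,
        # lengthening fence waits.
        self.lifetime = max(self.lo, self.lifetime - self.step)
```

The tension being balanced is exactly the one the paper names: lifetimes that are too short increase L1 miss rates, while lifetimes that are too long delay writes (TC-Strong) or fences (TC-Weak).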

  • TC-Weak with a simple lifetime predictor improves the performance of applications with inter-workgroup communication by 85%
    • over the baseline non-coherent GPU.
  • It performs as well as the VI protocols and 23% faster than MESI across all benchmarks.
  • For applications with intra-workgroup communication, it reduces the traffic overheads of MESI, GPU-VI and GPU-VIni by 56%, 23% and 22%, reducing interconnect energy usage by 40%, 12% and 12%.
  • Compared to TC-Strong, TC-Weak
    performs 28% faster with 26% lower interconnect traffic across all applications.

  • Section 2 discusses related work,
  • Section 3 reviews GPU architectures and cache coherence,
  • Section 4 describes the directory protocols,
  • Section 5 describes the challenges of GPU coherence,
  • Section 6 details the implementations of TC-Strong and TC-Weak,
  • Sections 7 and 8 present the methodology and results, and
  • Section 9 concludes.

2 Related Work

  • Timestamps were explored in software coherence [42, 63];
  • Nandy [43] first considered them for hardware coherence.
  • Library Cache Coherence (LCC) [34, 54] is a time-based hardware coherence proposal that
    • stores timestamps in the directory and
    • delays stores to unexpired blocks
    • to enforce sequential consistency on CMPs.
  • TC-Strong is similar to LCC:
    • both enforce write atomicity
    • by stalling writes
    • at the shared last-level cache.
  • Unlike LCC, TC-Strong supports multiple outstanding writes from a core and implements a relaxed consistency model.
  • TC-Strong also includes optimizations to eliminate stalls due to private writes and L2 evictions.
  • However, the stalling of writes in TC-Strong
    causes poor performance on GPUs.
  • The authors propose TC-Weak and a novel time-based memory fence to eliminate all write-stalling, improve performance, and reduce interconnect traffic compared to TC-Strong.
  • Unlike for CPU applications [34, 54],
    the fixed timestamp prediction
    proposed by LCC is not suited for GPU applications.
  • They propose a simple yet effective lifetime predictor that can accommodate a range of GPU applications.
  • Lastly, they present a full description of the proposed protocol, including state transition tables that describe the
    implementation in detail.

3 Background

  • This section describes the memory system and cache hierarchy of the baseline non-coherent GPU,
    • similar to NVIDIA's Fermi [44],
    • that is evaluated in this paper.
  • Cache coherence is also briefly discussed.

3.1 Baseline GPU Architecture

  • Figure 3 shows the organization of the baseline non-coherent GPU.
  • An OpenCL [29] or CUDA [46] application begins execution on a CPU
    • and launches compute kernels onto a GPU.
  • Each kernel launches a hierarchy of threads (an NDRange of workgroups of wavefronts of work-items/scalar threads) onto a GPU.
  • Each workgroup is assigned to a single multi-threaded GPU core.
  • Scalar threads are managed as a SIMD execution group
    • consisting of 32 threads,
    • called a warp (NVIDIA terminology)
    • or wavefront (AMD terminology).
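The thread hierarchy above can be illustrated with simple index arithmetic. The function and sizes below are made-up examples for illustration, not part of any GPU API; only the 32-thread wavefront width comes from the text.

```python
# Illustrative decomposition of a flat thread id into the hierarchy
# described above: NDRange -> workgroups -> 32-wide wavefronts -> lanes.

WAVEFRONT_SIZE = 32  # warp (NVIDIA) / wavefront (AMD) width

def locate(flat_tid, workgroup_size):
    """Map a flat thread id to (workgroup, wavefront-in-group, lane)."""
    workgroup = flat_tid // workgroup_size
    local_id = flat_tid % workgroup_size   # position within the workgroup
    wavefront = local_id // WAVEFRONT_SIZE
    lane = local_id % WAVEFRONT_SIZE       # position within the SIMD group
    return workgroup, wavefront, lane
```

For example, with a (hypothetical) workgroup size of 128, thread 200 falls in workgroup 1, wavefront 2 of that group, lane 8.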
