Reducing GPU Address Translation Overhead with Virtual Caching
Authors
Yoon, Hongil
Lowe-Power, Jason
Sohi, Gurindar S.
Type
Technical Report
Abstract
Heterogeneous computing on tightly integrated CPU-GPU systems is ubiquitous, and to increase programmability, many of these systems support virtual address accesses from GPU hardware. However, there is no free lunch. Supporting virtual memory entails an address translation on every memory access, which greatly degrades performance (by about 77% on average).
To mitigate this overhead, we propose a software-transparent, practical GPU virtual cache hierarchy. We show that a virtual cache hierarchy is an effective GPU address translation bandwidth filter. We make several empirical observations advocating for GPU virtual caches: (1) Mirroring a CPU-style memory management unit (MMU) in GPUs is not effective, because GPU workloads show very high Translation Lookaside Buffer (TLB) miss ratios and high miss bandwidth. (2) Many requests that miss in the TLBs find corresponding valid data in the GPU cache hierarchy. (3) The GPU's accelerator nature simplifies implementing a deep virtual cache hierarchy (i.e., there are fewer virtual address synonyms and homonyms).
We evaluate both L1-only virtual cache designs and an entire virtual cache hierarchy (private L1s and a shared L2 cache). We find that virtual caching on GPUs considerably improves performance. Our experimental evaluation shows that the proposed entire GPU virtual cache design significantly reduces the overhead of virtual address translation, providing an average speedup of 1.77x over a baseline physically cached system. L1-only virtual cache designs show modest performance benefits (1.35x speedup); extending virtual caching to the whole GPU cache hierarchy yields additional performance benefits.
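The filtering effect described in the abstract can be illustrated with a toy model. The sketch below is not the report's simulator or design; it only counts TLB lookups for a physically tagged cache (translate before every access) versus a virtually tagged cache (translate only on a cache miss). The page size, line size, structure capacities, the LRUSet helper, and the count_tlb_traffic function are all assumptions chosen for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096   # bytes per page (assumed for illustration)
LINE_SIZE = 128    # bytes per cache line (assumed for illustration)


class LRUSet:
    """Fully associative LRU structure holding up to `capacity` tags."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss, insert the tag and evict the LRU entry."""
        if tag in self.tags:
            self.tags.move_to_end(tag)
            return True
        self.tags[tag] = None
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)
        return False


def count_tlb_traffic(trace, virtual_cache, cache_lines=64, tlb_entries=4):
    """Return (TLB lookups, TLB misses) for a trace of virtual byte addresses."""
    cache = LRUSet(cache_lines)
    tlb = LRUSet(tlb_entries)
    lookups = misses = 0
    for vaddr in trace:
        line = vaddr // LINE_SIZE
        page = vaddr // PAGE_SIZE
        if virtual_cache:
            # Virtually tagged cache: a cache hit needs no translation at all.
            if cache.access(line):
                continue
            lookups += 1
            if not tlb.access(page):
                misses += 1
        else:
            # Physically tagged cache: translate before every cache access.
            lookups += 1
            if not tlb.access(page):
                misses += 1
            cache.access(line)
    return lookups, misses


# One line in each of 16 pages, revisited 10 times: the working set fits in
# the cache but touches more pages than the small TLB can hold.
trace = [p * PAGE_SIZE for p in range(16)] * 10
print(count_tlb_traffic(trace, virtual_cache=False))  # (160, 160)
print(count_tlb_traffic(trace, virtual_cache=True))   # (16, 16)
```

In this contrived trace the small TLB thrashes, yet every revisited line is still resident in the cache, so the virtually tagged configuration consults the TLB only on the 16 cold misses. This mirrors observation (2) in spirit only; the numbers say nothing about real GPU workloads.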
Citation
TR1842