gpu-systems-tech-report

A Technical Report on GPU systems for a directed study course
https://github.com/sir-nochill/gpu-systems-tech-report
Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:
✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic links in README
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (0.4%) to scientific vocabulary
Last synced: 10 months ago
Repository

Basic Info

Host: GitHub
Owner: Sir-NoChill
Language: TeX
Default Branch: main
Size: 4.78 MB
Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files

Citation
Citation (citations/ayrton.bib)

@book{kirk_programming_2016,
	title = {Programming Massively Parallel Processors: A Hands-on Approach},
	isbn = {978-0-12-811987-7},
	shorttitle = {Programming Massively Parallel Processors},
	abstract = {Programming Massively Parallel Processors: A Hands-on Approach, Third Edition shows both student and professional alike the basic concepts of parallel programming and {GPU} architecture, exploring, in detail, various techniques for constructing parallel programs. Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in-depth. For this new edition, the authors have updated their coverage of {CUDA}, including coverage of newer libraries, such as {CuDNN}, moved content that has become less important to appendices, added two new chapters on parallel patterns, and updated case studies to reflect current industry practices. - Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing - Utilizes {CUDA} version 7.5, {NVIDIA}'s software development tool created specifically for massively parallel environments - Contains new and updated case studies - Includes coverage of newer libraries, such as {CuDNN} for Deep Learning},
	pagetotal = {574},
	publisher = {Morgan Kaufmann},
	author = {Kirk, David B. and Hwu, Wen-mei W.},
	date = {2016-11-24},
	langid = {english},
	note = {Google-Books-{ID}: {wcS}\_DAAAQBAJ},
	keywords = {Computers / Distributed Systems / General, Computers / Programming / Parallel},
	file = {Kirk and Hwu - 2016 - Programming Massively Parallel Processors A Hands-on Approach.pdf:/home/stormblessed/Zotero/storage/VPCELARG/Kirk and Hwu - 2016 - Programming Massively Parallel Processors A Hands-on Approach.pdf:application/pdf},
}

@inproceedings{guz_threads_2010,
	title = {Threads vs. caches: Modeling the behavior of parallel workloads},
	url = {https://ieeexplore.ieee.org/document/5647747/?arnumber=5647747},
	doi = {10.1109/ICCD.2010.5647747},
	shorttitle = {Threads vs. caches},
	abstract = {A new generation of high-performance engines now combine graphics-oriented parallel processors with a cache architecture. In order to meet this new trend, new highly-parallel workloads are being developed. However, it is often difficult to predict how a given application would perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model, which, for a given application, describes its performance and power as a function of the number of threads it runs in parallel, on a range of architectures. We use our model (backed by simulations) to study both synthetic workloads and real ones from the {PARSEC} suite. Our findings recognize distinctly different behavior patterns for different application families and architectures.},
	eventtitle = {2010 {IEEE} International Conference on Computer Design},
	pages = {274--281},
	booktitle = {2010 {IEEE} International Conference on Computer Design},
	author = {Guz, Zvika and Itzhak, Oved and Keidar, Idit and Kolodny, Avinoam and Mendelson, Avi and Weiser, Uri C.},
	urldate = {2024-10-16},
	date = {2010-10},
	note = {{ISSN}: 1063-6404},
	keywords = {Computer architecture, Mathematical model, Analytical models, Bandwidth, Engines, Benchmark testing, Instruction sets},
	file = {Full Text PDF:/home/stormblessed/Zotero/storage/IWT2G94N/Guz et al. - 2010 - Threads vs. caches Modeling the behavior of parallel workloads.pdf:application/pdf;IEEE Xplore Abstract Record:/home/stormblessed/Zotero/storage/IPEFEH4A/5647747.html:text/html},
}

@book{GPGPUbook,
	title = {General-Purpose Graphics Processor Architectures},
	author = {Aamodt, Tor M and Fung, Wilson Wai Lun and Rogers, Timothy G},
	langid = {english},
	file = {Aamodt et al. - General-Purpose Graphics Processor Architecture.pdf:/home/stormblessed/Zotero/storage/9W3BSJP8/Aamodt et al. - General-Purpose Graphics Processor Architecture.pdf:application/pdf},
}

@inproceedings{pugsley_sandbox_2014,
	title = {Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers},
	url = {https://ieeexplore.ieee.org/document/6835971/?arnumber=6835971},
	doi = {10.1109/HPCA.2014.6835971},
	shorttitle = {Sandbox Prefetching},
	abstract = {Memory latency is a major factor in limiting {CPU} performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose a new mechanism to determine at run-time the appropriate prefetching mechanism for the currently executing program, called Sandbox Prefetching. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run-time by adding the prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see if the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of evaluated prefetchers exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6\% compared to not using any prefetching, and by 18.7\% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4\% compared to the Access Map Pattern Matching Prefetcher, while incurring considerably less logic and storage overheads.},
	eventtitle = {2014 {IEEE} 20th International Symposium on High Performance Computer Architecture ({HPCA})},
	pages = {626--637},
	booktitle = {2014 {IEEE} 20th International Symposium on High Performance Computer Architecture ({HPCA})},
	author = {Pugsley, Seth H and Chishti, Zeshan and Wilkerson, Chris and Chuang, Peng-fei and Scott, Robert L and Jaleel, Aamer and Lu, Shih-Lien and Chow, Kingsum and Balasubramonian, Rajeev},
	urldate = {2024-10-29},
	date = {2014-02},
	note = {{ISSN}: 2378-203X},
	keywords = {Bandwidth, Prefetching, Accuracy, Radiation detectors, Monitoring, Pattern matching},
	file = {Full Text PDF:/home/stormblessed/Zotero/storage/M9GPLN8I/Pugsley et al. - 2014 - Sandbox Prefetching Safe run-time evaluation of aggressive prefetchers.pdf:application/pdf;IEEE Xplore Abstract Record:/home/stormblessed/Zotero/storage/8WSEYQAK/6835971.html:text/html},
}

@inproceedings{lee_many-thread_2010,
	title = {Many-Thread Aware Prefetching Mechanisms for {GPGPU} Applications},
	url = {https://ieeexplore.ieee.org/document/5695538/?arnumber=5695538},
	doi = {10.1109/MICRO.2010.44},
	abstract = {We consider the problem of how to improve memory latency tolerance in massively multithreaded {GPGPUs} when the thread-level parallelism of an application is not sufficient to hide memory latency. One solution used in conventional {CPU} systems is prefetching, both in hardware and software. However, we show that straightforwardly applying such mechanisms to {GPGPU} systems does not deliver the expected performance benefits and can in fact hurt performance when not used judiciously. This paper proposes new hardware and software prefetching mechanisms tailored to {GPGPU} systems, which we refer to as many-thread aware prefetching ({MT}-prefetching) mechanisms. Our software {MT}-prefetching mechanism, called inter-thread prefetching, exploits the existence of common memory access behavior among fine-grained threads. For hardware {MT}-prefetching, we describe a scalable prefetcher training algorithm along with a hardware-based inter-thread prefetching mechanism. In some cases, blindly applying prefetching degrades performance. To reduce such negative effects, we propose an adaptive prefetch throttling scheme, which permits automatic {GPGPU} application- and hardware-specific adjustment. We show that adaptation reduces the negative effects of prefetching and can even improve performance. Overall, compared to the state-of-the-art software and hardware prefetching, our {MT}-prefetching improves performance on average by 16\%(software pref.)/15\% (hardware pref.) on our benchmarks.},
	eventtitle = {2010 43rd Annual {IEEE}/{ACM} International Symposium on Microarchitecture},
	pages = {213--224},
	booktitle = {2010 43rd Annual {IEEE}/{ACM} International Symposium on Microarchitecture},
	author = {Lee, Jaekyu and Lakshminarayana, Nagesh B. and Kim, Hyesoon and Vuduc, Richard},
	urldate = {2024-10-31},
	date = {2010-12},
	note = {{ISSN}: 2379-3155},
	keywords = {Hardware, Training, prefetching, Prefetching, Merging, {GPGPU}, {IP} networks, prefetch throttling},
	file = {Full Text PDF:/home/stormblessed/Zotero/storage/B2LGDIV2/Lee et al. - 2010 - Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.pdf:application/pdf;IEEE Xplore Abstract Record:/home/stormblessed/Zotero/storage/3YVW6YED/5695538.html:text/html},
}

@article{sinclair_cs_nodate,
	title = {{CS} 758: Advanced Topics in Computer Architecture},
	author = {Sinclair, Matthew D.},
	langid = {english},
	file = {PDF:/home/stormblessed/Zotero/storage/R6LIL759/Sinclair - CS 758 Advanced Topics in Computer Architecture.pdf:application/pdf},
}

@online{noauthor_prefetching_nodate,
	title = {Prefetching in a Texture Cache Architecture},
	url = {https://graphics.stanford.edu/papers/texture_prefetch/},
	urldate = {2024-11-04},
	file = {Prefetching in a Texture Cache Architecture:/home/stormblessed/Zotero/storage/TJWQCIPS/texture_prefetch.html:text/html},
}

@inproceedings{igehy_prefetching_1998,
	location = {Lisbon, Portugal},
	title = {Prefetching in a texture cache architecture},
	isbn = {978-1-58113-097-3},
	url = {https://dl.acm.org/doi/10.1145/285305.285321},
	doi = {10.1145/285305.285321},
	abstract = {Texture mapping has become so ubiquitous in real-time graphics hardware that many systems are able to perform filtered texturing without any penalty in fill rate. The computation rates available in hardware have been outpacing the memory access rates, and texture systems are becoming constrained by memory bandwidth and latency. Caching in conjunction with prefetching can be used to alleviate this problem.},
	eventtitle = {Euro98: 1998 Eurographics/{SIGGRAPH} on Graphics Hardware},
	pages = {133},
	booktitle = {Proceedings of the {ACM} {SIGGRAPH}/{EUROGRAPHICS} workshop on Graphics hardware},
	publisher = {{ACM}},
	author = {Igehy, Homan and Eldridge, Matthew and Proudfoot, Kekoa},
	urldate = {2024-11-04},
	date = {1998-08},
	langid = {english},
	file = {PDF:/home/stormblessed/Zotero/storage/QGDPXRYS/Igehy et al. - 1998 - Prefetching in a texture cache architecture.pdf:application/pdf},
}

@inproceedings{singh_cache_2013,
	title = {Cache coherence for {GPU} architectures},
	url = {https://ieeexplore.ieee.org/document/6522351/?arnumber=6522351},
	doi = {10.1109/HPCA.2013.6522351},
	abstract = {While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors ({CMPs}), {GPU} architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing {GPU} applications. Moreover, these protocols increase the verification complexity of the {GPU} memory system. Recent research, Library Cache Coherence ({LCC}) [34, 54], explored the use of time-based approaches in {CMP} coherence protocols. This paper describes a time-based coherence framework for {GPUs}, called Temporal Coherence ({TC}), that exploits globally synchronized counters in single-chip systems to develop a streamlined {GPU} coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present an implementation of {TC}, called {TC}-Weak, which eliminates {LCC}'s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, {TC}-Weak improves the performance of {GPU} applications with inter-workgroup communication by 85\% over disabling the non-coherent L1 caches in the baseline {GPU}. We also find that write-through protocols outperform a writeback protocol on a {GPU} as the latter suffers from increased traffic due to unnecessary refills of write-once data.},
	eventtitle = {2013 {IEEE} 19th International Symposium on High Performance Computer Architecture ({HPCA})},
	pages = {578--590},
	booktitle = {2013 {IEEE} 19th International Symposium on High Performance Computer Architecture ({HPCA})},
	author = {Singh, Inderpreet and Shriraman, Arrvindh and Fung, Wilson W. L. and O'Connor, Mike and Aamodt, Tor M.},
	urldate = {2024-12-05},
	date = {2013-02},
	note = {{ISSN}: 1530-0897},
	keywords = {Graphics processing units, Synchronization, Coherence, Instruction sets, Complexity theory, Protocols, Transient analysis},
	file = {PDF:/home/stormblessed/Zotero/storage/8I9UTB7D/Singh et al. - 2013 - Cache coherence for GPU architectures.pdf:application/pdf},
}

@misc{nie_time_2023,
	title = {A Time Series is Worth 64 Words: Long-term Forecasting with Transformers},
	url = {http://arxiv.org/abs/2211.14730},
	doi = {10.48550/arXiv.2211.14730},
	shorttitle = {A Time Series is Worth 64 Words},
	abstract = {We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer ({PatchTST}) can improve the long-term forecasting accuracy significantly when compared with that of {SOTA} Transformer-based models. We also apply our model to self-supervised pretraining tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces {SOTA} forecasting accuracy.},
	number = {{arXiv}:2211.14730},
	publisher = {{arXiv}},
	author = {Nie, Yuqi and Nguyen, Nam H. and Sinthong, Phanwadee and Kalagnanam, Jayant},
	urldate = {2024-12-09},
	date = {2023-03-05},
	langid = {english},
	eprinttype = {arxiv},
	eprint = {2211.14730},
	eprintclass = {cs},
	keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence},
	file = {PDF:/home/stormblessed/Zotero/storage/2UJPIV3Z/Nie et al. - 2023 - A Time Series is Worth 64 Words Long-term Forecasting with Transformers.pdf:application/pdf},
}

@book{noauthor_computer_nodate,
	title = {Computer Architecture},
	file = {PDF:/home/stormblessed/Zotero/storage/RQRY8JDP/Computer Architecture.pdf:application/pdf},
}

@book{noauthor_microprocessor_nodate,
	title = {Microprocessor Architecture},
	file = {PDF:/home/stormblessed/Zotero/storage/84IFRKUR/Microprocessor Architecture.pdf:application/pdf},
}

@article{noauthor_tn-ed-03_2017,
	title = {{TN}-{ED}-03: {GDDR}6: The Next-Generation Graphics {DRAM}},
	date = {2017},
	langid = {english},
	file = {PDF:/home/stormblessed/Zotero/storage/HL4ILCDX/2017 - TN-ED-03 GDDR6 The Next-Generation Graphics DRAM.pdf:application/pdf},
}

@online{amd:occupancy,
	title = {Occupancy explained},
	url = {https://gpuopen.com/learn/occupancy-explained/},
	abstract = {In this blog post we will try to demystify what exactly occupancy is, which factors limit occupancy, and how to use tools to identify occupancy-limited workloads.},
	titleaddon = {{AMD} {GPUOpen}},
	urldate = {2024-12-14},
	langid = {british},
	file = {Snapshot:/home/stormblessed/Zotero/storage/4WARIVB4/occupancy-explained.html:text/html},
}

@online{amd:register-pressure,
	title = {Register pressure in {AMD} {CDNA}2™ {GPUs} - amd-lab-notes},
	url = {https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-register-pressure-readme/},
	abstract = {Register pressure of {GPU} kernels has a tremendous impact on performance. This post provides a practical demo on applying recommendations.},
	titleaddon = {{AMD} {GPUOpen}},
	urldate = {2024-12-14},
	langid = {british},
	file = {Snapshot:/home/stormblessed/Zotero/storage/JRLPWI9Q/amd-lab-notes-register-pressure-readme.html:text/html},
}

@online{noauthor_gpu_nodate,
	title = {{GPU} architecture hardware specifications — {ROCm} Documentation},
	url = {https://rocm.docs.amd.com/en/docs-6.0.2/reference/gpu-arch/gpu-arch-spec-overview.html},
	urldate = {2024-12-14},
	file = {GPU architecture hardware specifications — ROCm Documentation:/home/stormblessed/Zotero/storage/X8DZ8F6A/gpu-arch-spec-overview.html:text/html},
}

@online{nvforum:l1-global-caching,
	title = {How does cuda global memory's L1 caching work - {CUDA} / {CUDA} Programming and Performance},
	url = {https://forums.developer.nvidia.com/t/how-does-cuda-global-memorys-l1-caching-work/299470},
	abstract = {In the programming guide, it says:   Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified  L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is,  by default, not cached in the unified L1/texture cache, but caching may be enabled using the following  mechanisms:  ▶ Perform the read using inline assembly with the appropriate modifier as described in the {PTX}  reference manual;  ▶ Compile with the -Xptxas -dl...},
	titleaddon = {{NVIDIA} Developer Forums},
	urldate = {2024-12-14},
	date = {2024-07-11},
	langid = {english},
	note = {Section: Accelerated Computing},
	file = {Snapshot:/home/stormblessed/Zotero/storage/T2ESAACN/299470.html:text/html},
}

@incollection{jacob_chapter_2008,
	location = {San Francisco},
	title = {{CHAPTER} 15 - Memory System Design Analysis},
	isbn = {978-0-12-379751-3},
	url = {https://www.sciencedirect.com/science/article/pii/B9780123797513500175},
	pages = {541--597},
	booktitle = {Memory Systems},
	publisher = {Morgan Kaufmann},
	author = {Jacob, Bruce and Ng, Spencer W. and Wang, David T.},
	editor = {Jacob, Bruce and Ng, Spencer W. and Wang, David T.},
	urldate = {2024-12-14},
	date = {2008-01-01},
	doi = {10.1016/B978-012379751-3.50017-5},
	file = {ScienceDirect Snapshot:/home/stormblessed/Zotero/storage/SQUH4AJU/B9780123797513500175.html:text/html},
}

@article{nvidia:programming-guide,
	title = {{CUDA} C++ Programming Guide},
	langid = {english},
	file = {PDF:/home/stormblessed/Zotero/storage/VHXRGV8R/CUDA C++ Programming Guide.pdf:application/pdf},
}

@article{nvidia:register-pressure,
	title = {Software-Directed Techniques for Improved {GPU} Register File Utilization},
	volume = {15},
	issn = {1544-3566, 1544-3973},
	url = {https://dl.acm.org/doi/10.1145/3243905},
	doi = {10.1145/3243905},
	abstract = {Throughput architectures such as {GPUs} require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While {GPU} register files are already enormous, reaching capacities of 256KB per streaming multiprocessor ({SM}), we find that nearly half of real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. This article seeks to increase the thread occupancy and improve performance of these register-bound applications by making more efficient use of the existing register file capacity. Our first technique eagerly deallocates register resources during execution. We show that releasing register resources based on value liveness as proposed in prior states of the art leads to unreliable performance and undue design complexity. To address these deficiencies, our article presents a novel compiler-driven approach that identifies and exploits last use of a register name (instead of the value contained within) to eagerly release register resources. Furthermore, while previous works have leveraged “scalar” and “narrow” operand properties of a program for various optimizations, their impact on thread occupancy has been relatively unexplored. Our article evaluates the effectiveness of these techniques in improving thread occupancy and demonstrates that while any one approach may fail to free very many registers, together they synergistically free enough registers to launch additional parallel work. An in-depth evaluation on a large suite of applications shows that just our early register technique outperforms previous work on dynamic register allocation, and together these approaches, on average, provide 12\% performance speedup (23\% higher thread occupancy) on register bound applications not already saturating other {GPU} resources.},
	pages = {1--23},
	number = {3},
	journaltitle = {{ACM} Transactions on Architecture and Code Optimization},
	shortjournal = {{ACM} Trans. Archit. Code Optim.},
	author = {Voitsechov, Dani and Zulfiqar, Arslan and Stephenson, Mark and Gebhart, Mark and Keckler, Stephen W.},
	urldate = {2024-12-14},
	date = {2018-09-30},
	langid = {english},
	file = {Full Text:/home/stormblessed/Zotero/storage/UEGE7P7J/Voitsechov et al. - 2018 - Software-Directed Techniques for Improved GPU Register File Utilization.pdf:application/pdf},
}

@online{noauthor_cuda_2020,
	title = {{CUDA} Refresher: The {CUDA} Programming Model},
	url = {https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/},
	shorttitle = {{CUDA} Refresher},
	abstract = {This is the fourth post in the {CUDA} Refresher series, which has the goal of refreshing key concepts in {CUDA}, tools, and optimization for beginning or intermediate developers.},
	titleaddon = {{NVIDIA} Technical Blog},
	urldate = {2024-12-14},
	date = {2020-06-26},
	langid = {american},
	file = {Snapshot:/home/stormblessed/Zotero/storage/K2QZ2N6W/cuda-refresher-cuda-programming-model.html:text/html},
}

@online{noauthor_sdkdocumentationgcn_architecture_whitepaperpdf_nodate,
	title = {{SDK}/documentation/{GCN}\_Architecture\_whitepaper.pdf at master · {AMD}-{FirePro}/{SDK}},
	url = {https://github.com/AMD-FirePro/SDK/blob/master/documentation/GCN_Architecture_whitepaper.pdf},
	abstract = {{SDK}. Contribute to {AMD}-{FirePro}/{SDK} development by creating an account on {GitHub}.},
	titleaddon = {{GitHub}},
	urldate = {2024-12-14},
	langid = {english},
	file = {Snapshot:/home/stormblessed/Zotero/storage/3SUD7YJG/GCN_Architecture_whitepaper.html:text/html},
}

@online{noauthor_ptx_nodate,
	title = {{PTX} {ISA} 8.5},
	url = {https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model},
	urldate = {2024-12-14},
	file = {PTX ISA 8.5:/home/stormblessed/Zotero/storage/XTGYWU7C/index.html:text/html},
}

@patent{minkin_cache_2012,
	title = {Cache miss processing using a defer/replay mechanism},
	url = {https://patents.google.com/patent/US8266383B1/en},
	holder = {Nvidia Corp},
	type = {patentus},
	number = {8266383B1},
	author = {Minkin, Alexander L. and Heinrich, Steven J. and Selvanesan, Rajeshwaran and {McCarver}, Charles and Carlton, Stewart Glenn and Siu, Ming Y. and Tang, Yan Yan and Stoll, Robert J.},
	urldate = {2024-12-14},
	date = {2012-09-11},
	keywords = {cache, client, data, request, requests},
	file = {Full Text PDF:/home/stormblessed/Zotero/storage/TMV4C9B5/Minkin et al. - 2012 - Cache miss processing using a deferreplay mechanism.pdf:application/pdf},
}
