cuda-notes

Personal notes on CUDA programming

https://github.com/leokruglikov/cuda-notes

Last synced: 10 months ago

Basic Info

Host: GitHub
Owner: leokruglikov
Language: TeX
Default Branch: main
Size: 29.1 MB

Statistics

Stars: 55
Watchers: 2
Forks: 0
Open Issues: 5
Releases: 0

Created about 4 years ago · Last pushed over 3 years ago

Introduction to CUDA

Personal notes

Disclaimers

This is a pre-alpha version
I did my best to cite all the references I used. I should mention, however, that everything was taken from books and open online sources.
I am not a computer science professional. I am not even a computer science student. Thus there may be some major or minor inaccuracies.

Comments

I want to share with you my personal notes on this topic.

At the moment, there are lots of things to be completed and added, as these notes were initially written for my personal use. I then decided that adding some images and references would be reason enough to share them publicly. The LaTeX document isn't clean enough yet with regard to references and paragraph indentation.

The author will do their best to add new chapters and sections, as well as to fix the shortcomings mentioned above. The PDF document and its LaTeX source, cuda_recap.tex, are in the repository.

ToDo's

[ ] Check spelling/grammar.
[ ] Add parallel algorithm - parallel scan.
[x] Add thread filtering (__all() & __any())
[ ] Constant memory, texture memory & peer access
[ ] CUDA-GDB

Structure

The document is written in LaTeX. The contents are divided into modules, which are included in cuda_recap.tex. To compile it, LaTeX and all the necessary packages must be installed. Compilation is done the usual LaTeX way, with the shell-escape flag that the minted package requires: pdflatex --shell-escape cuda_recap.tex.
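The build step described above can be sketched as a small shell script. This is only a sketch under a few assumptions: pdflatex is on the PATH, Pygments is installed (minted calls it through shell escape), and the script is run from the directory containing cuda_recap.tex. The second pass is the usual way to resolve cross-references and is an assumption about this document. A bibliography step is not shown, because the notes do not say which tool they use.

```shell
#!/bin/sh
# Minimal build sketch for the notes (assumptions are listed in the text above).
main="cuda_recap.tex"

if command -v pdflatex >/dev/null 2>&1 && [ -f "$main" ]; then
    # minted needs shell escape so it can run Pygments during compilation.
    pdflatex --shell-escape "$main"
    # A second pass is typically needed to settle cross-references.
    pdflatex --shell-escape "$main"
else
    # Report the problem instead of failing halfway through a build.
    echo "pdflatex or $main not found; install LaTeX and run this from the repository root" >&2
fi
```

If the build stops with a minted error, a missing Pygments installation is a likely cause.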

Owner

Name: Leo Kruglikov
Login: leokruglikov
Kind: user
Location: Lausanne
Company: @EPFLXplore

Repositories: 3
Profile: https://github.com/leokruglikov

Citation (citation.bib)

@book{tuomanen2018hands,
  title={Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA},
  author={Tuomanen, Brian},
  year={2018},
  publisher={Packt Publishing Ltd}
}
@misc{blog_2020, 
    title={Cooperative groups: Flexible cuda thread programming}, 
    url={https://developer.nvidia.com/blog/cooperative-groups/}, 
    journal={NVIDIA Technical Blog}, year={2020}, month={Aug}
}
@misc{center,
    title={CUDA C++ Programming Guide},
    url={https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model},
    journal={NVIDIA Documentation Center}
}

@misc{MemoryAlignment,
    title={Fang's Notebook. Nichijou.co},
    url={https://nichijou.co/cudaRandom-memAlign/},
    month={May}, year={2022}
}

@misc{atomics, 
    title={What are atomic operations for newbies?}, 
    url={https://stackoverflow.com/questions/52196678/what-are-atomic-operations-for-newbies}, 
    journal={Stack Overflow}, author={Brijendar Bakchodia and Amadan},
    year={2018}, 
    month={Apr}
} 

@misc{habr_car_vs_bus,
    title={Paraljeljnoje programmirovanije s cuda. chastj 1: Vvedenije},
    note={In Russian; English title: Parallel programming with CUDA. Part 1: Introduction},
    url={https://habr.com/ru/company/epam_systems/blog/245503/},
    journal={Habr},
    publisher={Habr},
    year={2015},
    month={Apr}
}

 @misc{memory_model, 
   title={Cuda\ -\ memory model},
   url={https://medium.com/analytics-vidhya/cuda-memory-model-823f02cef0bf#:~:text=Pageable\%20memory,-The\%20memory\%20allocated\&text=The\%20data\%20at\%20this\%20memory,allocation\%20and\%20transfer\%20is\%20slow},
  journal={Medium}, 
  publisher={Analytics Vidhya}, 
  author={Ponnuraj, Raj Prasanna}, 
  year={2020}, 
  month={Oct}
 }

@misc{async_memcpy,
  title={How to Overlap Data Transfers in CUDA C/C++},
  url={https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/},
  journal={Nvidia Developer blog},
  year={2013},
  month={Dec},
  author={Mark Harris}
}
@misc{chinese_sync, 
  title={Cuda \_\_shfl\_xor \_\_shfl \_\_shfl\_up() \_\_shfl\_down()}, 
  url={https://blog.csdn.net/jqw11/article/details/103071556}, 
  journal={cuda \_\_shfl\_xor \_\_shfl \_\_shfl\_up() \_\_shfl\_down()\_8BitCat-CSDN\_cuda}
}
@misc{conference_gpu,
    title={Kepler's SHUFFLE (SHFL): Tips and Tricks | GTC 2013},
    url={http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf}
}


@misc{cuda_events, 
  title={Introduction to stream and event}, 
  url={https://www.programmerall.com/article/4005578780/}, 
  journal={CUDA ---- Stream and Event - Programmer All}
} 

@misc{cuda_performance_metrics,
    title={How to Implement Performance Metrics in CUDA C/C++ | NVIDIA ...},
    url={https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc}
}

@misc{streams_best_practices,
   title={CUDA Streams: Best Practices and Common Pitfalls},
   url={https://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf}
}


@misc{ptx_nvidia, 
  title={Parallel thread execution ISA version 7.7}, 
  url={https://docs.nvidia.com/cuda/parallel-thread-execution/index.html}, 
  journal={NVIDIA Documentation Center}
}

@misc{ptx_wiki, 
 title={Parallel thread execution}, 
 url={https://en.wikipedia.org/wiki/Parallel_Thread_Execution}, 
 journal={Wikipedia}, 
 publisher={Wikimedia Foundation}, 
 year={2022}, 
 month={May}
} 

 @misc{nsight_cern,
 url={https://indico.cern.ch/event/962112/contributions/4110591/attachments/2159863/3643851/CERN_Nsight_Compute.pdf}
} 

@misc{armour_warp_nodate,
	address = {Oxford e-Research centre},
	type = {Lecture},
	title = {Warp shuffles, reduction and scan operations},
	url = {https://people.maths.ox.ac.uk/gilesm/cuda/2019/lecture_04.pdf},
	abstract = {In this fourth lecture we will learn about warp shuffle instructions, reduction and scan operations.
You will learn about:
• Different types of warp shuffle instructions and why they are useful.
• How warp shuffles can be used to construct different memory access patterns.
• The reduction algorithm and implementation on a GPU.
• The scan algorithm and implementation on a GPU},
	language = {English},
	urldate = {2022-11-10},
	author = {Armour, Wes},
}

@misc{giles_warp_nodate,
	address = {Oxford University Mathematical Institute},
	type = {Lecture},
	title = {Warp shuffles, reduction and scan operations},
	url = {https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec4.pdf},
	language = {English},
	urldate = {2022-11-10},
	author = {Giles, Mike},
}

@misc{talonmies_cudamemcpytosymbol_2013,
	type = {Answer},
	title = {{cudaMemcpyToSymbol} vs. {cudaMemcpy} why is it still around ({cudaMemcpyToSymbol})},
	url = {https://stackoverflow.com/questions/15984913/cudamemcpytosymbol-vs-cudamemcpy-why-is-it-still-around-cudamemcpytosymbol},
	journal = {stackoverflow.com},
	author = {talonmies},
	month = may,
	year = {2013},
}


@misc{noauthor_docsgl_nodate,
	title = {docs.gl},
	url = {https://docs.gl/},
	urldate = {2022-11-11},
	file = {docs.gl:/home/leo/Zotero/storage/KIHKSL57/docs.gl.html:text/html},
}

@misc{noauthor_steps3d_nodate,
	title = {{steps3D}\ -\ {Tutorials}\ -\ Vzaimodejstvije\ {CUDA}\ i\ {OpenGL}},
	url = {http://steps3d.narod.ru/tutorials/cuda-opengl-tutorial.html},
	urldate = {2022-11-11},
	file = {steps3D\ -\ Tutorials\ -\ Vzaimodejstvije\ CUDA\ и\ OpenGL:/home/leo/Zotero/storage/833WUXG9/cuda-opengl-tutorial.html:text/html},
}

@book{boreskov__nodate,
  publisher = {Moscow University Press},
  series = {Moskovskij gosudarstvennij universitet imeni M. V. Lomonosova},
  title = {{Paralleljnije Vichislenija na GPU Arhitektura i programmnaja modelj CUDA}},
	isbn = {978-5-19-011058-6},
	language = {Russian},
	author = {Boreskov and Kharlamov and Markovskij and Miljnjcev and Sakharnih and Frolov},
}

@misc{reeves_ams_nodate,
	type = {Course description},
	title = {{AMS} 148: {GPU} {Programming} {For} {Scientific} {Computation} {\textbar} {AMS148}, {Spring} 18, {Section} 01},
	url = {https://ams148-spring18-01.courses.soe.ucsc.edu/},
	abstract = {This is a first course in parallel programming with GPUs in CUDA C and C++. This course covers introductory parallelism, basic hardware, parallel communication patterns, and primitive algorithms. The students will apply these topics to problems in scientific computing, image/signal processing, and linear algebra.  At the end of the course, students will complete a final project in a topic of their choosing.},
	urldate = {2022-11-13},
	author = {Reeves, Steven},
	file = {AMS 148\: GPU Programming For Scientific Computation | AMS148, Spring 18, Section 01:/home/leo/Zotero/storage/5R9APMF6/ams148-spring18-01.courses.soe.ucsc.edu.html:text/html},
}


@misc{noauthor_prefix_2022,
	title = {Prefix sum},
	copyright = {Creative Commons Attribution-ShareAlike License},
	url = {https://en.wikipedia.org/w/index.php?title=Prefix_sum\&oldid=1120386053},
	abstract = {In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes (running totals) of the input sequence:

y0 = x0
y1 = x0 + x1
y2 = x0 + x1+ x2
...For instance, the prefix sums of the natural numbers are the triangular numbers:

Prefix sums are trivial to compute in sequential models of computation, by using the formula yi = yi − 1 + xi to compute each output value in sequence order. However, despite their ease of computation, prefix sums are a useful primitive in certain algorithms such as counting sort,
and they form the basis of the scan higher-order function in functional programming languages. Prefix sums have also been much studied in parallel algorithms, both as a test problem to be solved and as a useful primitive to be used as a subroutine in other parallel algorithms.Abstractly, a prefix sum requires only a binary associative operator ⊕, making it useful for many applications from calculating well-separated pair decompositions of points to string processing.Mathematically, the operation of taking prefix sums can be generalized from finite to infinite sequences; in that context, a prefix sum is known as a partial sum of a series. Prefix summation or partial summation form linear operators on the vector spaces of finite or infinite sequences; their inverses are finite difference operators.},
	language = {en},
	urldate = {2022-11-13},
	journal = {Wikipedia},
	month = nov,
	year = {2022},
	note = {Page Version ID: 1120386053},
	file = {Snapshot:/home/leo/Zotero/storage/8AHRTZXN/Prefix_sum.html:text/html},
}

@misc{harris_parallel_2007,
	title = {Parallel {Prefix} {Sum} ({Scan}) with {CUDA}},
	copyright = {2007 NVIDIA Corporation. All rights reserved.},
	url = {https://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf},
	abstract = {Parallel prefix sum, also known as parallel Scan, is a useful building block for many
parallel algorithms including sorting and building data structures. In this document
we introduce Scan and describe step-by-step how it can be implemented efficiently
in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through
more advanced techniques to obtain best performance. We then explain how to
scan arrays of arbitrary size that cannot be processed with a single block of threads.},
	language = {English},
	urldate = {2022-10-14},
	author = {Harris, Mark},
	month = apr,
	year = {2007},
}

@misc{luebke_blelloch_nodate,
	title = {Blelloch {Scan} - {Intro} to {Parallel} {Programming}},
	url = {https://www.youtube.com/watch?v=mmYv3Haj6uc},
	language = {En},
	urldate = {2022-11-18},
	publisher = {Udacity},
	author = {Luebke, David and Owens, John},
}

@misc{harris_chapter_nodate,
	title = {Chapter 39. {Parallel} {Prefix} {Sum} ({Scan}) with {CUDA}},
	url = {https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda},
	language = {en-US},
	urldate = {2022-11-23},
	journal = {NVIDIA Developer},
	author = {Harris, Mark},
	file = {Snapshot:/home/leo/Zotero/storage/X372HD24/chapter-39-parallel-prefix-sum-scan-cuda.html:text/html},
}

@book{osnovi_raboti,
  title = {Osnovi raboti s tehnologijej CUDA},
  author = {Boreskov and Harlamov},
  address = {Moscow},
  year = {2010},
  publisher = {DMK Press},
  isbn = {978-5-94074-578-5}
}

@book{tehnologija_cuda,
  title = {CUDA by Example, An introduction to general-purpose GPU programming},
  address = {Moscow},
  author = {Jason Sanders and Edward Kandrot},
  language = {Russian},
  isbn = {978-5-94074-504-4},
  year = {2011},
  publisher = {DMK Press}
}
