vector-sum-cuda
Comparing performance of sequential vs CUDA-based vector element sum.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary
Keywords
Repository
Comparing performance of sequential vs CUDA-based vector element sum.
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Comparing performance of sequential vs CUDA-based vector element sum.
We take a floating-point vector x, with the number of elements ranging from
1E+6 to 1E+9, and sum it up using CUDA (Σx). We attempt each element
count with various approaches, running each approach 5 times to get a good time
measure. Sum here represents any reduce() operation that processes several
values into a single value. I thank Prof. Kishore Kothapalli and
Prof. Dip Sankar Banerjee for their guidance.
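For reference, here is a minimal sketch of the sequential baseline and the 5-run timing loop described above (hypothetical names; the repository's actual code may differ):

```cuda
// Minimal sketch of the sequential baseline sum and the 5-run timing loop
// (hypothetical names; not the repository's exact code).
#include <chrono>
#include <cstdio>
#include <vector>

float sumSequential(const std::vector<float>& x) {
  float a = 0.0f;
  for (float v : x) a += v;   // plain loop, no SIMD (as in the experiments)
  return a;
}

int main() {
  std::vector<float> x(1000000, 1.0f / 64);   // element count varies from 1E+6 to 1E+9
  for (int r = 0; r < 5; ++r) {               // 5 runs per approach
    auto t0 = std::chrono::high_resolution_clock::now();
    float s = sumSequential(x);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("run %d: sum=%e time=%.3f ms\n", r, s, ms);
  }
  return 0;
}
```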
Adjusting launch config for Memcpy approach
In this experiment (memcpy-adjust-launch), we compare various launch configs for CUDA-based vector element sum, using the memcpy approach. We attempt different element counts with various CUDA launch configs.
This sum uses memcpy to transfer partial results to the CPU, where the final sum
is calculated. If the result can be used within the GPU itself, it might be faster
to calculate the complete sum [in-place] instead of transferring it to the CPU. Results
indicate that a grid_limit of 1024 and a block_size of 128/256 is
suitable for the float datatype, and a grid_limit of 1024 and a
block_size of 256 is suitable for the double datatype. Thus, using a
grid_limit of 1024 and a block_size of 256 could be a decent choice.
Interestingly, the sequential sum suffers from precision issues when using
the float datatype, while the CUDA-based sum does not; this is likely because
the tree-shaped CUDA reduction adds values of similar magnitude, whereas the
sequential loop keeps adding small elements into one increasingly large accumulator.
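A minimal sketch of the memcpy approach, assuming a grid-stride kernel and the hypothetical names sumKernel / sumMemcpyCuda (the repository's kernels may differ):

```cuda
// Sketch of the memcpy approach (assumed names, not the repository's exact
// code): each block reduces its grid-stride slice of x into one partial sum,
// the partials are copied to the CPU with cudaMemcpy, and the final sum is
// computed there.
#include <cuda_runtime.h>
#include <algorithm>

template <int BLOCK>
__global__ void sumKernel(float* partial, const float* x, size_t N) {
  __shared__ float cache[BLOCK];
  size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  float  a = 0;
  for (; i < N; i += stride) a += x[i];        // one read per loop iteration
  cache[threadIdx.x] = a;
  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s /= 2) {     // standard shared-memory reduce
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

float sumMemcpyCuda(const float* xD, size_t N) {
  const int BLOCK = 256, GRID_LIMIT = 1024;    // launch config from the results above
  size_t G = std::min((N + BLOCK - 1) / BLOCK, (size_t)GRID_LIMIT);
  float *partialD, partialH[GRID_LIMIT];
  cudaMalloc(&partialD, G * sizeof(float));
  sumKernel<BLOCK><<<(int)G, BLOCK>>>(partialD, xD, N);
  cudaMemcpy(partialH, partialD, G * sizeof(float), cudaMemcpyDeviceToHost);
  float a = 0;
  for (size_t b = 0; b < G; ++b) a += partialH[b];  // final sum on the CPU
  cudaFree(partialD);
  return a;
}
```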
Adjusting per-thread duty for Memcpy approach
In this experiment (memcpy-adjust-duty), we compare various per-thread duty numbers for CUDA-based vector element sum (memcpy). Here, we attempt each element count with various CUDA launch configs and per-thread duties. The rest of the experimental setup is similar to the [memcpy-adjust-launch] experiment. Results indicate no significant difference between the plain memcpy approach and this one.
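One way to read "per-thread duty" is that the launch is sized so each thread handles roughly a fixed number of elements; a host-side sketch under that assumption (hypothetical helper name), reusing the grid-stride sumKernel from the sketch above:

```cuda
// Host-side sketch only (hypothetical helper; assumed interpretation of
// per-thread duty): the grid-stride kernel stays the same; the duty just
// changes how the grid size is chosen, so each thread handles ~duty elements.
#include <algorithm>

int gridSizeForDuty(size_t N, int blockSize, int duty, int gridLimit) {
  size_t threadsWanted = (N + duty - 1) / duty;                // ~N/duty threads
  size_t grid = (threadsWanted + blockSize - 1) / blockSize;   // blocks needed
  return (int)std::min(grid, (size_t)gridLimit);               // cap at grid_limit
}
```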
Adjusting launch config for Inplace approach
In this experiment (inplace-adjust-launch), we compare various launch configs for CUDA-based vector element sum (in-place). We attempt different element counts with various CUDA launch configs. This is an in-place sum, meaning the single sum value is calculated entirely by the GPU. This is done using 2 kernel calls.
A number of possible optimizations, including multiple reads per loop
iteration, loop-unrolled reduce, atomic adds, and multiple kernels,
provided no benefit (see branches). A simple one-read-per-loop-iteration
and standard reduce loop (minimizing warp divergence) is both shorter
and works best. For float, a grid_limit of 1024 and a
block_size of 128 is a decent choice. For double, a grid_limit of
1024 and a block_size of 256 is a decent choice. Interestingly, the
sequential sum suffers from precision issues when using the float
datatype, while the CUDA-based sum does not (just like with the
memcpy sum).
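A sketch of the two-kernel in-place scheme, reusing the hypothetical sumKernel above (the repository's actual kernels may differ): the first launch produces per-block partial sums, the second launch (a single block) reduces them into one value that stays in GPU memory.

```cuda
// Sketch of the in-place approach with 2 kernel calls (assumed names):
// kernel 1 reduces x into per-block partial sums, kernel 2 reduces the
// partials into a single value; the result never leaves the GPU.
void sumInplaceCuda(float* resultD, const float* xD, size_t N) {
  const int BLOCK = 256, GRID_LIMIT = 1024;
  size_t G = std::min((N + BLOCK - 1) / BLOCK, (size_t)GRID_LIMIT);
  float* partialD;
  cudaMalloc(&partialD, G * sizeof(float));
  sumKernel<BLOCK><<<(int)G, BLOCK>>>(partialD, xD, N);   // kernel 1: partial sums
  sumKernel<BLOCK><<<1, BLOCK>>>(resultD, partialD, G);   // kernel 2: partials -> resultD[0]
  cudaFree(partialD);
}
```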
Comparison of Memcpy and Inplace approach
In this experiment (memcpy-vs-inplace), we compare the performance of the memcpy vs in-place CUDA-based vector element sum. It appears both approaches have similar performance.
Comparison with Sequential approach
In this experiment (compare-sequential, main), we compare the performance
of finding sum(x) using a single thread (sequential) and using CUDA
(with non-power-of-2 and power-of-2 reduce). Here, x is a 32-bit integer vector.
We attempt the approaches on a number of vector sizes. Note that the time taken to
copy data back and forth from the GPU is not measured, and the sequential
approach does not make use of SIMD instructions.
While it might seem that the CUDA approach would be a clear winner, the results indicate it depends upon the workload. From about 10^5 elements onward, the CUDA approach performs better than the sequential one. Both CUDA approaches (non-power-of-2 and power-of-2 reduce) seem to have similar performance. All outputs are saved in a gist. Some charts generated from sheets are also included below.
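A sketch of what the two block-level reduce variants might look like, for int data as in this experiment (hypothetical device helpers; the repository's kernels may differ):

```cuda
// Hypothetical block-level reduce helpers (assumed names). The power-of-2
// version halves the active range each step; the non-power-of-2 version
// rounds the half up so it also works when the active range is odd.
__device__ void blockReducePow2(int* cache, int tid, int n) {
  // n must be a power of 2 (e.g. blockDim.x = 256)
  for (int s = n / 2; s > 0; s /= 2) {
    if (tid < s) cache[tid] += cache[tid + s];
    __syncthreads();
  }
}

__device__ void blockReduceNotPow2(int* cache, int tid, int n) {
  // works for any n: element tid is paired with element tid + ceil(s/2)
  for (int s = n; s > 1;) {
    int h = (s + 1) / 2;                        // ceil(s/2)
    if (tid < s / 2) cache[tid] += cache[tid + h];
    __syncthreads();
    s = h;
  }
}
```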
References
- CUDA by Example :: Jason Sanders, Edward Kandrot
- Managed memory vs cudaHostAlloc - TK1
- How to enable C++17 code generation in VS2019 CUDA project
- "More than one operator + matches these operands" error
- How to import VSCode keybindings into Visual Studio?
- Explicit conversion constructors (C++ only)
- Configure X11 Forwarding with PuTTY and Xming
- code-server setup and configuration
- Installing snap on CentOS
- Git pulling a branch from another repository?
Owner
- Name: puzzlef
- Login: puzzlef
- Kind: organization
- Website: https://puzzlef.github.io/
- Repositories: 10
- Profile: https://github.com/puzzlef
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sahu
given-names: Subhajit
orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/sum-sequential-vs-cuda: Performance of sequential vs CUDA-based vector element sum"
version: 1.0.0
doi: 10.5281/zenodo.7258353
date-released: 2022-10-27
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0

