vector-sum-cuda
Comparing performance of sequential vs CUDA-based vector element sum.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary
Keywords
Repository
Comparing performance of sequential vs CUDA-based vector element sum.
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Comparing performance of sequential vs CUDA-based vector element sum.
We take a floating-point vector x, with the number of elements ranging from
1E+6 to 1E+9, and sum it up using CUDA (Σx). We attempt each element
count with various approaches, running each approach 5 times to get a good time
measure. Sum here represents any reduce() operation that processes several
values into a single value. I thank Prof. Kishore Kothapalli and
Prof. Dip Sankar Banerjee for their guidance.
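For reference, here is a minimal sketch of the sequential baseline and the 5-run timing loop described above (hypothetical names; the repository's actual code may differ):

```cuda
// Minimal sketch of the sequential baseline sum and the 5-run timing loop
// (hypothetical names; not the repository's exact code).
#include <chrono>
#include <cstdio>
#include <vector>

float sumSequential(const std::vector<float>& x) {
  float a = 0.0f;
  for (float v : x) a += v;   // plain loop, no SIMD (as in the experiments)
  return a;
}

int main() {
  std::vector<float> x(1000000, 1.0f / 64);   // element count varies from 1E+6 to 1E+9
  for (int r = 0; r < 5; ++r) {               // 5 runs per approach
    auto t0 = std::chrono::high_resolution_clock::now();
    float s = sumSequential(x);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("run %d: sum=%e time=%.3f ms\n", r, s, ms);
  }
  return 0;
}
```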
Adjusting launch config for Memcpy approach
In this experiment (memcpy-adjust-launch), we compare various launch configs for CUDA-based vector element sum, using the memcpy approach. We attempt different element counts with various CUDA launch configs.
This sum uses memcpy to transfer partial results to the CPU, where the final sum
is calculated. If the result can be used within the GPU itself, it might be faster
to calculate the complete sum [in-place] instead of transferring it to the CPU. Results
indicate that a grid_limit of 1024 and a block_size of 128/256 is
suitable for the float datatype, and a grid_limit of 1024 and a
block_size of 256 is suitable for the double datatype. Thus, using a
grid_limit of 1024 and a block_size of 256 could be a decent choice.
Interestingly, the sequential sum suffers from precision issues when using
the float datatype, while the CUDA-based sum does not; this is likely because
the tree-shaped CUDA reduction adds values of similar magnitude, whereas the
sequential loop keeps adding small elements into one increasingly large accumulator.
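A minimal sketch of the memcpy approach, assuming a grid-stride kernel and the hypothetical names sumKernel / sumMemcpyCuda (the repository's kernels may differ):

```cuda
// Sketch of the memcpy approach (assumed names, not the repository's exact
// code): each block reduces its grid-stride slice of x into one partial sum,
// the partials are copied to the CPU with cudaMemcpy, and the final sum is
// computed there.
#include <cuda_runtime.h>
#include <algorithm>

template <int BLOCK>
__global__ void sumKernel(float* partial, const float* x, size_t N) {
  __shared__ float cache[BLOCK];
  size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  float  a = 0;
  for (; i < N; i += stride) a += x[i];        // one read per loop iteration
  cache[threadIdx.x] = a;
  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s /= 2) {     // standard shared-memory reduce
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

float sumMemcpyCuda(const float* xD, size_t N) {
  const int BLOCK = 256, GRID_LIMIT = 1024;    // launch config from the results above
  size_t G = std::min((N + BLOCK - 1) / BLOCK, (size_t)GRID_LIMIT);
  float *partialD, partialH[GRID_LIMIT];
  cudaMalloc(&partialD, G * sizeof(float));
  sumKernel<BLOCK><<<(int)G, BLOCK>>>(partialD, xD, N);
  cudaMemcpy(partialH, partialD, G * sizeof(float), cudaMemcpyDeviceToHost);
  float a = 0;
  for (size_t b = 0; b < G; ++b) a += partialH[b];  // final sum on the CPU
  cudaFree(partialD);
  return a;
}
```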
Adjusting per-thread duty for Memcpy approach
In this experiment (memcpy-adjust-duty), we compare various per-thread duty numbers for CUDA-based vector element sum (memcpy). Here, we attempt each element count with various CUDA launch configs and per-thread duties. The rest of the experimental setup is similar to the [memcpy-adjust-launch] experiment. Results indicate no significant difference between the plain memcpy approach and this one.
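One way to read "per-thread duty" is that the launch is sized so each thread handles roughly a fixed number of elements; a host-side sketch under that assumption (hypothetical helper name), reusing the grid-stride sumKernel from the sketch above:

```cuda
// Host-side sketch only (hypothetical helper; assumed interpretation of
// per-thread duty): the grid-stride kernel stays the same; the duty just
// changes how the grid size is chosen, so each thread handles ~duty elements.
#include <algorithm>

int gridSizeForDuty(size_t N, int blockSize, int duty, int gridLimit) {
  size_t threadsWanted = (N + duty - 1) / duty;                // ~N/duty threads
  size_t grid = (threadsWanted + blockSize - 1) / blockSize;   // blocks needed
  return (int)std::min(grid, (size_t)gridLimit);               // cap at grid_limit
}
```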
Adjusting launch config for Inplace approach
In this experiment (inplace-adjust-launch), we compare various launch configs for CUDA-based vector element sum (in-place). We attempt different element counts with various CUDA launch configs. This is an in-place sum, meaning the single sum value is calculated entirely by the GPU. This is done using 2 kernel calls.
A number of possible optimizations, including multiple reads per loop
iteration, loop-unrolled reduce, atomic adds, and multiple kernels,
provided no benefit (see branches). A simple one-read-per-loop-iteration
and standard reduce loop (minimizing warp divergence) is both shorter
and works best. For float, a grid_limit of 1024 and a
block_size of 128 is a decent choice. For double, a grid_limit of
1024 and a block_size of 256 is a decent choice. Interestingly, the
sequential sum suffers from precision issues when using the float
datatype, while the CUDA-based sum does not (just like with the
memcpy sum).
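A sketch of the two-kernel in-place scheme, reusing the hypothetical sumKernel above (the repository's actual kernels may differ): the first launch produces per-block partial sums, the second launch (a single block) reduces them into one value that stays in GPU memory.

```cuda
// Sketch of the in-place approach with 2 kernel calls (assumed names):
// kernel 1 reduces x into per-block partial sums, kernel 2 reduces the
// partials into a single value; the result never leaves the GPU.
void sumInplaceCuda(float* resultD, const float* xD, size_t N) {
  const int BLOCK = 256, GRID_LIMIT = 1024;
  size_t G = std::min((N + BLOCK - 1) / BLOCK, (size_t)GRID_LIMIT);
  float* partialD;
  cudaMalloc(&partialD, G * sizeof(float));
  sumKernel<BLOCK><<<(int)G, BLOCK>>>(partialD, xD, N);   // kernel 1: partial sums
  sumKernel<BLOCK><<<1, BLOCK>>>(resultD, partialD, G);   // kernel 2: partials -> resultD[0]
  cudaFree(partialD);
}
```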
Comparison of Memcpy and Inplace approach
In this experiment (memcpy-vs-inplace), we compare the performance of the memcpy vs in-place CUDA-based vector element sum. It appears both approaches have similar performance.
Comparison with Sequential approach
In this experiment (compare-sequential, main), we compare the performance
of finding sum(x) using a single thread (sequential) and using CUDA
(with non-power-of-2 and power-of-2 reduce). Here, x is a 32-bit integer vector.
We attempt the approaches on a number of vector sizes. Note that the time taken to
copy data back and forth from the GPU is not measured, and the sequential
approach does not make use of SIMD instructions.
While it might seem that the CUDA approach would be a clear winner, the results indicate it depends upon the workload. From about 10^5 elements onward, the CUDA approach performs better than the sequential one. Both CUDA approaches (non-power-of-2 and power-of-2 reduce) seem to have similar performance. All outputs are saved in a gist. Some charts generated from sheets are also included below.
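A sketch of what the two block-level reduce variants might look like, for int data as in this experiment (hypothetical device helpers; the repository's kernels may differ):

```cuda
// Hypothetical block-level reduce helpers (assumed names). The power-of-2
// version halves the active range each step; the non-power-of-2 version
// rounds the half up so it also works when the active range is odd.
__device__ void blockReducePow2(int* cache, int tid, int n) {
  // n must be a power of 2 (e.g. blockDim.x = 256)
  for (int s = n / 2; s > 0; s /= 2) {
    if (tid < s) cache[tid] += cache[tid + s];
    __syncthreads();
  }
}

__device__ void blockReduceNotPow2(int* cache, int tid, int n) {
  // works for any n: element tid is paired with element tid + ceil(s/2)
  for (int s = n; s > 1;) {
    int h = (s + 1) / 2;                        // ceil(s/2)
    if (tid < s / 2) cache[tid] += cache[tid + h];
    __syncthreads();
    s = h;
  }
}
```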
References
- CUDA by Example :: Jason Sanders, Edward Kandrot
- Managed memory vs cudaHostAlloc - TK1
- How to enable C++17 code generation in VS2019 CUDA project
- "More than one operator + matches these operands" error
- How to import VSCode keybindings into Visual Studio?
- Explicit conversion constructors (C++ only)
- Configure X11 Forwarding with PuTTY and Xming
- code-server setup and configuration
- Installing snap on CentOS
- Git pulling a branch from another repository?
Owner
- Name: puzzlef
- Login: puzzlef
- Kind: organization
- Website: https://puzzlef.github.io/
- Repositories: 10
- Profile: https://github.com/puzzlef
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sahu
given-names: Subhajit
orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/sum-sequential-vs-cuda: Performance of sequential vs CUDA-based vector element sum"
version: 1.0.0
doi: 10.5281/zenodo.7258353
date-released: 2022-10-27
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0

