vector-sum-cuda

Comparing performance of sequential vs CUDA-based vector element sum.

https://github.com/puzzlef/vector-sum-cuda

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary

Keywords

cuda element experiment gpu sum vector
Last synced: 6 months ago

Repository

Comparing performance of sequential vs CUDA-based vector element sum.

Basic Info
  • Host: GitHub
  • Owner: puzzlef
  • License: MIT
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 595 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
cuda element experiment gpu sum vector
Created over 3 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Comparing performance of sequential vs CUDA-based vector element sum.

We take a floating-point vector x, with the number of elements ranging from 1E+6 to 1E+9, and sum them up using CUDA (Σx). We attempt each element count with various approaches, running each approach 5 times to get a good time measure. Sum here represents any reduce() operation that processes several values into a single value. I thank Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee for their guidance.
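
For reference, the sequential baseline is just a left-to-right accumulation. The sketch below is illustrative only and not the repository's actual code, but any reduce() that folds many values into one follows the same shape.

```cuda
// Illustrative sequential reference sum (not the repository's actual code).
#include <cstddef>

template <class T>
T sumSequential(const T *x, size_t N) {
  T s = T();
  for (size_t i = 0; i < N; ++i)
    s += x[i];   // plain left-to-right accumulation
  return s;
}
```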


Adjusting launch config for Memcpy approach

In this experiment (memcpy-adjust-launch), we compare various launch configs for CUDA-based vector element sum, using the memcpy approach. We attempt different element counts with various CUDA launch configs.

This sum uses memcpy to transfer partial results to the CPU, where the final sum is calculated. If the result can be used within the GPU itself, it might be faster to calculate the complete sum in-place instead of transferring it to the CPU. Results indicate that a grid_limit of 1024 and a block_size of 128/256 is suitable for the float datatype, and a grid_limit of 1024 and a block_size of 256 is suitable for the double datatype. Thus, using a grid_limit of 1024 and a block_size of 256 could be a decent choice. Interestingly, the sequential sum suffers from precision issues when using the float datatype, while the CUDA-based sum does not.
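
A minimal sketch of how such a memcpy-style sum might be structured is shown below. The names (sumKernel, sumMemcpyCuda) are hypothetical and this is not the repository's actual code: each block folds a grid-stride slice of x into one partial sum in shared memory, the per-block partials are copied back with cudaMemcpy, and the host adds them up.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride reduction: each block folds its slice of x into one partial sum.
template <class T>
__global__ void sumKernel(T *partial, const T *x, size_t N) {
  extern __shared__ unsigned char smem[];
  T *cache = reinterpret_cast<T*>(smem);
  size_t i      = blockIdx.x * (size_t) blockDim.x + threadIdx.x;
  size_t stride = (size_t) gridDim.x * blockDim.x;
  T s = T();
  for (; i < N; i += stride) s += x[i];              // one read per loop iteration
  cache[threadIdx.x] = s;
  __syncthreads();
  for (unsigned b = blockDim.x / 2; b > 0; b /= 2) { // standard power-of-2 reduce loop
    if (threadIdx.x < b) cache[threadIdx.x] += cache[threadIdx.x + b];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Memcpy approach: copy the per-block partials to the host and finish there.
template <class T>
T sumMemcpyCuda(const T *xD, size_t N, int gridLimit = 1024, int blockSize = 256) {
  size_t blocksNeeded = (N + blockSize - 1) / blockSize;
  int G = (int) std::min((size_t) gridLimit, std::max<size_t>(blocksNeeded, 1));
  T *partialD = nullptr;
  cudaMalloc(&partialD, G * sizeof(T));
  sumKernel<<<G, blockSize, blockSize * sizeof(T)>>>(partialD, xD, N);
  std::vector<T> partial(G);
  cudaMemcpy(partial.data(), partialD, G * sizeof(T), cudaMemcpyDeviceToHost);
  cudaFree(partialD);
  T s = T();
  for (T v : partial) s += v;                        // final sum on the CPU
  return s;
}
```

With grid_limit = 1024 and block_size = 256, at most 1024 partial values are copied back, which keeps the host-side finish cheap.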


Adjusting per-thread duty for Memcpy approach

In this experiment (memcpy-adjust-duty), we compare various per-thread duty numbers for CUDA-based vector element sum (memcpy). Here, we attempt each element count with various CUDA launch configs and per-thread duties. The rest of the experimental setup is similar to the memcpy-adjust-launch experiment. Results indicate no significant difference between the memcpy-adjust-launch approach and this one.
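
Because the kernel in the sketch above uses a grid-stride loop, adjusting the per-thread duty only changes how many blocks are launched. A hypothetical helper (again, not the repository's code) might pick the grid size like this:

```cuda
// Hypothetical helper (not the repository's code): choose the grid size so
// that each thread's grid-stride loop covers roughly `duty` elements.
#include <algorithm>
#include <cstddef>

inline int launchGridSize(size_t N, int duty, int gridLimit, int blockSize) {
  size_t threadsNeeded = (N + duty - 1) / duty;                      // target: duty elements per thread
  size_t blocksNeeded  = (threadsNeeded + blockSize - 1) / blockSize;
  return (int) std::min((size_t) gridLimit, std::max<size_t>(blocksNeeded, 1));
}

// e.g. sumKernel<<<launchGridSize(N, 8, 1024, 256), 256, 256 * sizeof(float)>>>(partialD, xD, N);
```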


Adjusting launch config for Inplace approach

In this experiment (inplace-adjust-launch), we compare various launch configs for CUDA-based vector element sum (in-place). We attempt different element counts with various CUDA launch configs. This is an in-place sum, meaning the single sum value is calculated entirely by the GPU. This is done using two kernel calls.

A number of possible optimizations, including multiple reads per loop iteration, loop-unrolled reduce, atomic adds, and multiple kernels, provided no benefit (see branches). A simple kernel with one read per loop iteration and a standard reduce loop (minimizing warp divergence) is both shorter and works best. For float, a grid_limit of 1024 and a block_size of 128 is a decent choice. For double, a grid_limit of 1024 and a block_size of 256 is a decent choice. Interestingly, the sequential sum suffers from precision issues when using the float datatype, while the CUDA-based sum does not (just as with the memcpy sum).
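
A minimal sketch of such a two-kernel in-place sum, reusing the hypothetical sumKernel from the memcpy sketch above (not the repository's actual code): the first call folds x into per-block partials, and the second call uses a single block to fold those partials into a one-element device result.

```cuda
// Minimal sketch of the in-place approach (hypothetical names; reuses the
// sumKernel shown in the memcpy sketch above, not the repository's code).
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

template <class T>
void sumInplaceCuda(T *sumD, const T *xD, size_t N,
                    int gridLimit = 1024, int blockSize = 256) {
  size_t blocksNeeded = (N + blockSize - 1) / blockSize;
  int G = (int) std::min((size_t) gridLimit, std::max<size_t>(blocksNeeded, 1));
  T *partialD = nullptr;
  cudaMalloc(&partialD, G * sizeof(T));
  // Kernel call 1: fold x into G per-block partial sums.
  sumKernel<<<G, blockSize, blockSize * sizeof(T)>>>(partialD, xD, N);
  // Kernel call 2: one block folds the G partials into sumD[0] on the GPU.
  sumKernel<<<1, blockSize, blockSize * sizeof(T)>>>(sumD, partialD, (size_t) G);
  cudaDeviceSynchronize();   // ensure both kernels have finished before freeing
  cudaFree(partialD);
}
```

The result stays in device memory (sumD), so it can feed further GPU work without a transfer; a single cudaMemcpy of one value suffices if the host needs it.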


Comparison of Memcpy and Inplace approach

In this experiment (memcpy-vs-inplace), we compare the performance of the memcpy-based vs in-place CUDA vector element sum. It appears both approaches have similar performance.


Comparison with Sequential approach

In this experiment (compare-sequential, main), we compare the performance of finding sum(x) using a single thread (sequential) and using CUDA (with not-power-of-2 and power-of-2 reduce). Here x is a 32-bit integer vector. We attempt the approaches on a number of vector sizes. Note that the time taken to copy data to and from the GPU is not measured, and the sequential approach does not make use of SIMD instructions.
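
For illustration, the two block-level reduce loops being compared might look roughly like the sketch below (hedged; the repository's kernels may differ). The power-of-2 variant assumes blockDim.x is a power of two and halves the active range each step; the not-power-of-2 variant handles any block size by rounding the active range up before folding.

```cuda
// Illustrative block-level reduce loops over a shared-memory cache of
// blockDim.x values (not the repository's actual kernels).

// Power-of-2 reduce: assumes blockDim.x is a power of two.
template <class T>
__device__ void blockReducePow2(T *cache) {
  for (unsigned b = blockDim.x / 2; b > 0; b /= 2) {
    if (threadIdx.x < b) cache[threadIdx.x] += cache[threadIdx.x + b];
    __syncthreads();
  }
}

// Not-power-of-2 reduce: works for any blockDim.x by rounding up.
template <class T>
__device__ void blockReduceAny(T *cache) {
  for (unsigned n = blockDim.x; n > 1; ) {
    unsigned b = (n + 1) / 2;                  // ceil(n/2): size of the surviving half
    if (threadIdx.x < n - b) cache[threadIdx.x] += cache[threadIdx.x + b];
    __syncthreads();
    n = b;
  }
}
// After either loop, cache[0] holds the block's sum.
```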

While it might seem that the CUDA approach would be a clear winner, the results indicate it depends upon the workload. Results indicate that from 10^5 elements onwards, the CUDA approach performs better than the sequential one. Both CUDA approaches (not-power-of-2 and power-of-2 reduce) seem to have similar performance. All outputs are saved in a gist. Some charts are also included below, generated from sheets.




Owner

  • Name: puzzlef
  • Login: puzzlef
  • Kind: organization

A summary of experiments.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sahu
    given-names: Subhajit
    orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/sum-sequential-vs-cuda: Performance of sequential vs CUDA-based vector element sum"
version: 1.0.0
doi: 10.5281/zenodo.7258353
date-released: 2022-10-27

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0