vector-sum-cuda - Performance of sequential vs CUDA-based vector element sum
This experiment compares the performance of:
1. Finding sum(x) using a single thread (sequential).
2. Finding sum(x) accelerated using CUDA (non power-of-2 reduce).
3. Finding sum(x) accelerated using CUDA (power-of-2 reduce).
Here x is a 32-bit integer vector. All three approaches were attempted on a number of vector sizes, running each approach 5 times per size to get a reliable time measure. Note that the time taken to copy data to and from the GPU is not measured, and the sequential approach does not make use of SIMD instructions. While it might seem that the CUDA approaches would be clear winners, the results indicate that this depends on the workload: from about 10^5 elements onwards, the CUDA approaches perform better than the sequential one. The two CUDA approaches (non power-of-2 and power-of-2 reduce) show similar performance.
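For illustration, here is a minimal sketch of the sequential baseline and a power-of-2 shared-memory reduction kernel. The names (sumSeq, sumReduceKernel, BLOCK_SIZE), the grid-stride accumulation, and the 64-bit accumulator are assumptions for this sketch and may not match the actual repository code.

```cuda
// Hypothetical sketch: sequential sum vs. a power-of-2 shared-memory reduce.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size (power of 2)

// 1. Sequential baseline: plain single-threaded loop (no SIMD).
long long sumSeq(const int *x, size_t N) {
  long long a = 0;
  for (size_t i = 0; i < N; ++i) a += x[i];
  return a;
}

// 3. Per-block power-of-2 tree reduction in shared memory; each block
// writes one partial sum, which the host (or a second kernel) reduces further.
__global__ void sumReduceKernel(long long *out, const int *x, size_t N) {
  __shared__ long long cache[BLOCK_SIZE];
  size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  long long a = 0;
  for (; i < N; i += stride) a += x[i];   // grid-stride accumulate
  cache[threadIdx.x] = a;
  __syncthreads();
  // Halve the active threads each step (valid because BLOCK_SIZE is a power of 2).
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = cache[0];
}

int main() {
  size_t N = 1 << 20;
  std::vector<int> x(N, 1);
  int *xD; long long *outD;
  int blocks = 1024;
  cudaMalloc(&xD, N * sizeof(int));
  cudaMalloc(&outD, blocks * sizeof(long long));
  cudaMemcpy(xD, x.data(), N * sizeof(int), cudaMemcpyHostToDevice);
  sumReduceKernel<<<blocks, BLOCK_SIZE>>>(outD, xD, N);
  // Reduce the per-block partial sums on the host.
  std::vector<long long> partial(blocks);
  cudaMemcpy(partial.data(), outD, blocks * sizeof(long long), cudaMemcpyDeviceToHost);
  long long total = 0;
  for (long long p : partial) total += p;
  printf("sumSeq=%lld sumCudaPow2=%lld\n", sumSeq(x.data(), N), total);
  cudaFree(xD); cudaFree(outD);
  return 0;
}
```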
All outputs are saved in a gist, and a small part of the output is listed below. Some charts, generated from the sheets, are also included. This experiment was done with guidance from Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.
```bash
$ nvcc -std=c++17 -Xcompiler -O3 main.cu
$ ./a.out
[00000.002 ms; 1e+03 elems.] [502942114] sumSeq
[00001.128 ms; 1e+03 elems.] [502942114] sumCuda
[00000.018 ms; 1e+03 elems.] [502942114] sumCudaPow2
...
```
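The per-kernel timing convention shown above (copy time excluded, averaged over the 5 runs) could be obtained with CUDA events. The following is a sketch of that measurement pattern using a stand-in kernel (dummyKernel), not necessarily how the repository measures it.

```cuda
// Hypothetical sketch: time only the kernel with CUDA events, so
// host<->device copy time is excluded, averaged over 5 runs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *x, size_t N) {  // stand-in for the sum kernels
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < N) x[i] += 1;
}

int main() {
  size_t N = 1 << 20;
  int *xD;
  cudaMalloc(&xD, N * sizeof(int));
  cudaMemset(xD, 0, N * sizeof(int));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int REPEATS = 5;
  float totalMs = 0.0f;
  for (int r = 0; r < REPEATS; ++r) {
    cudaEventRecord(start);
    dummyKernel<<<(N + 255) / 256, 256>>>(xD, N);  // only the kernel launch is timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
  }
  printf("[%09.3f ms; %.0e elems.] dummyKernel (mean of %d runs)\n",
         totalMs / REPEATS, (double)N, REPEATS);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(xD);
  return 0;
}
```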

