Recent Releases of aluminum
aluminum - v1.4.2
This is a minor update that primarily adds debugging support to Aluminum and support for newer versions of CUDA and ROCm.
- Add additional sanity checks for MPI initialization.
- Add various checks to ensure arguments are sane.
- Add support for Caliper annotations to Aluminum APIs (build with ALUMINUM_ENABLE_CALIPER=Yes).
- Add an option to disable all background streams for non-blocking operations (build with ALUMINUM_DISABLE_BACKGROUND_STREAMS=Yes).
- Various compilation fixes.
- Support for building with CUDA 12.
- Support for building with ROCm 6.
- C++
Published by ndryden over 2 years ago
aluminum - v1.4.0
This release addresses various issues and adds a new MultiSendRecv operation.
- The default internal stream pool size has changed to 1. This is to mitigate issues on ROCm platforms, but no performance impact was observed on other platforms.
- Fix a compilation error when building on CUDA 12 platforms.
- On ROCm platforms only: zero-size RCCL Send, Recv, and Sendrecv messages are skipped. This is to work around apparent hangs in RCCL with such messages and will be removed once the issue is fixed upstream.
- Fix a memory copy issue in the host-transfer Alltoallv.
- Updated to cxxopts 3.
- Added a compile-time traits API for describing what operations, types, etc. are supported by each backend.
- Added the MultiSendRecv operation, which supports an arbitrary sequence of sends and receives among ranks as a single operation.
- Various internal reorganizations for the test and benchmark code.
Published by ndryden almost 3 years ago
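The compile-time traits API added in v1.4.0 describes which operations and types each backend supports. The sketch below only illustrates the general pattern such an API can follow. `DemoMPIBackend`, `DemoNCCLBackend`, `Op`, `SupportsOp`, and `count_supported_ops` are hypothetical names invented for this example, and the support matrix shown is arbitrary; none of it is Aluminum's actual interface.

```cpp
#include <type_traits>

// Hypothetical stand-ins for backend tag types (not Aluminum's real classes).
struct DemoMPIBackend {};
struct DemoNCCLBackend {};

// Hypothetical operation identifiers.
enum class Op { Allreduce, MultiSendRecv };

// Primary template: by default, a backend supports nothing.
template <typename Backend, Op O>
struct SupportsOp : std::false_type {};

// Specializations opt a backend in to specific operations.
// This support matrix is illustrative only.
template <> struct SupportsOp<DemoMPIBackend, Op::Allreduce> : std::true_type {};
template <> struct SupportsOp<DemoNCCLBackend, Op::Allreduce> : std::true_type {};
template <> struct SupportsOp<DemoNCCLBackend, Op::MultiSendRecv> : std::true_type {};

// Generic code can query the trait at compile time with no runtime cost.
template <typename Backend>
constexpr int count_supported_ops() {
  return (SupportsOp<Backend, Op::Allreduce>::value ? 1 : 0) +
         (SupportsOp<Backend, Op::MultiSendRecv>::value ? 1 : 0);
}

// Unsupported combinations can be rejected before the program ever runs.
static_assert(SupportsOp<DemoNCCLBackend, Op::MultiSendRecv>::value,
              "demo NCCL backend is marked as supporting MultiSendRecv");
```

A caller could guard a code path with `static_assert(SupportsOp<Backend, Op::MultiSendRecv>::value, "...")` so that an unsupported combination fails at build time rather than at run time.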
aluminum - v1.2.2
This is primarily a bugfix release.
- Fixed an issue in progress engine binding that could lead to hangs. (See #182.)
- Traces include the stream of an operation.
- Tuning parameters are now configured via CMake rather than by manually editing tuning_params.hpp.
Published by ndryden about 3 years ago
aluminum - v1.2.0
This release adds better support for low-precision data.
- Support fp16 (IEEE half-precision) in all backends when support is available.
- Support bfloat16 in all backends when support is available.
- The NCCL/RCCL backend now supports averaging as a reduction operator (avg).
- Aluminum now requires at least CUDA 11 / ROCm 5 when GPU support is requested.
Published by ndryden over 3 years ago
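For readers unfamiliar with the avg reduction operator added in v1.2.0: conceptually, an averaging allreduce is a sum across all ranks divided by the number of ranks. The sketch below simulates that with in-memory buffers, one per rank. `allreduce_avg` is a hypothetical helper written for this illustration; it does not call NCCL, RCCL, or Aluminum, and says nothing about how those libraries implement the operator internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulated averaging allreduce: each inner vector is one rank's buffer.
// Hypothetical helper for illustration only; no communication library is used.
std::vector<float> allreduce_avg(
    const std::vector<std::vector<float>>& rank_buffers) {
  assert(!rank_buffers.empty());
  const std::size_t count = rank_buffers.front().size();
  std::vector<float> result(count, 0.0f);

  // Step 1: elementwise sum over all ranks.
  for (const auto& buf : rank_buffers) {
    assert(buf.size() == count);  // Every rank must contribute the same count.
    for (std::size_t i = 0; i < count; ++i) {
      result[i] += buf[i];
    }
  }

  // Step 2: scale by 1 / (number of ranks) to turn the sum into a mean.
  const float scale = 1.0f / static_cast<float>(rank_buffers.size());
  for (auto& v : result) {
    v *= scale;
  }
  return result;  // In a real allreduce, every rank would receive this result.
}
```

With low-precision types such as fp16 or bfloat16, an actual implementation also has to choose where the scaling happens relative to the summation, which this float-only sketch does not address.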
aluminum - v1.1.0
The highlight of this release is that Aluminum now has a logo.
There were some other, slightly less interesting, changes too, most notably full support for multi-threaded communication in Aluminum. There are also significant improvements to support on HIP/ROCm platforms and extensive internal cleanups.
- Aluminum has a logo now!
- Support the AL_MPI_SERIALIZED compile-time flag, which will run blocking MPI calls on the progress engine for situations where all calls need to come from the same thread.
- Support AL_THREAD_MULTIPLE for safe multi-threaded communication in Aluminum.
- Significant improvements in benchmarking/testing infrastructure.
- Removed support for custom MPI allreduce algorithms. Aluminum now uses the native MPI implementations.
- Added an al_info binary to provide basic info on Aluminum.
- Better progress engine binding on HIP/ROCm systems.
- The host-transfer backend uses stream memory operations on HIP/ROCm systems when available.
- Aluminum no longer relies on hipify for HIP/ROCm systems.
- Aluminum's CMake exports components to identify backend support at build time.
- Significant internal code reorganizations/cleanup for CUDA stuff and the progress engine.
- Various bugfixes and other minor improvements.
Published by ndryden over 3 years ago
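The AL_MPI_SERIALIZED flag in v1.1.0 routes blocking MPI calls through the progress engine so that they all come from the same thread. The following is a minimal sketch of that general idea, not Aluminum's progress engine: a hypothetical `SerialExecutor` owns one worker thread, and callers hand it a callable and block until the worker has run it. The class name, its `run_blocking` method, and the queueing scheme are all assumptions made for this example.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical sketch: one worker thread executes every submitted call.
class SerialExecutor {
public:
  SerialExecutor() : worker_([this] { loop(); }) {}

  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  SerialExecutor(const SerialExecutor&) = delete;
  SerialExecutor& operator=(const SerialExecutor&) = delete;

  // Run fn on the worker thread and block the caller until it finishes.
  template <typename F>
  auto run_blocking(F fn) -> decltype(fn()) {
    using R = decltype(fn());
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(fn));
    std::future<R> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return result.get();
  }

private:
  void loop() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Queue drained and shutdown requested.
        job = std::move(tasks_.front());
        tasks_.pop();
      }
      job();  // Run outside the lock so submitters are not blocked.
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;  // Declared last so the other members exist first.
};
```

Any number of application threads can call `run_blocking` concurrently, yet every submitted callable executes on the single worker thread, which is the property a serialized threading model requires.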
aluminum - v1.0.0
Aluminum is now officially stable.
Changes since v0.7.0:
* Aluminum communicators have been refactored and now always operate like objects (as opposed to handles). Communicators all have a stream interface.
* Added Barrier operation to all backends.
* Added support for vector collectives in the host-transfer backend.
* Fix bug in the NCCL Reduce_scatterv operation (#110).
* Various other code cleanups and bug fixes.
Published by ndryden about 5 years ago
aluminum - v0.7.0
The testing and benchmarking infrastructure has been entirely rewritten to be significantly more comprehensive and cleaner. There are also now scripts for nicely plotting benchmark results.
Numerous bugfixes and similar improvements:
* Aluminum no longer attempts to use bitwise reductions for long double.
* Fixed bug in the host-transfer Allreduce on one processor.
* Fix in-place bugs in the NCCL Gather, Gatherv, Scatter, and Scatterv operations.
* Fix MPI type for long int.
* The throw_al_exception macro works outside of the Al namespace.
* Added a check for mismatches between the version of HWLOC Aluminum was compiled with and the version used at runtime.
* All internal Aluminum headers are now included with the aluminum/ prefix to avoid conflicts with other projects.
Published by ndryden over 5 years ago
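The v0.7.0 fix for long double reflects a general rule: bitwise reductions are only meaningful for integral types. One way to enforce that rule is a compile-time check, sketched below. `DemoReduceOp` and `is_op_supported` are hypothetical names for this example and are not taken from Aluminum's source.

```cpp
#include <type_traits>

// Hypothetical reduction operators for illustration.
enum class DemoReduceOp { sum, max, bitwise_and, bitwise_or };

// Reports whether op makes sense for element type T:
// bitwise operators need an integral type, the rest any arithmetic type.
template <typename T>
constexpr bool is_op_supported(DemoReduceOp op) {
  return (op == DemoReduceOp::bitwise_and || op == DemoReduceOp::bitwise_or)
             ? std::is_integral<T>::value
             : std::is_arithmetic<T>::value;
}

// Integers may use bitwise reductions; floating-point types may not.
static_assert(is_op_supported<int>(DemoReduceOp::bitwise_and),
              "bitwise_and is valid for int");
static_assert(!is_op_supported<long double>(DemoReduceOp::bitwise_and),
              "bitwise_and is rejected for long double");
static_assert(is_op_supported<long double>(DemoReduceOp::sum),
              "sum is valid for long double");
```

A dispatch layer could consult such a predicate before mapping an operator and type pair onto a library reduction, and report an error for unsupported pairs instead of attempting the call.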
aluminum - v0.6.0
New features:
* Support for Send, Recv, and SendRecv in the NCCL backend.
* Add initial support for Gather, Scatter, and Alltoall to the NCCL backend.
* Initial support for vector collectives in the NCCL and MPI backends: Allgatherv, Alltoallv, Gatherv, Scatterv, and Reduce_scatterv.
* Added new benchmarks for all supported operations.
* Improved performance and correctness of the spin-wait kernel used in the host-transfer backend.
* Improved progress engine binding logic. Related environment variables have been removed. Failing to bind no longer throws an exception.
Other changes:
* Various code cleanups and enhancements.
* The pairwise-exchange/ring allreduce algorithm has been removed from the MPI backend.
* Internal CUB memory pool is used for temporary GPU memory allocations.
Published by ndryden over 5 years ago
aluminum - v0.2
New features/changes:
* Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
* Progress engine is now aware of separate compute streams. This enables better scheduling of non-interfering operations.
* Experimental RMA Put/Get operations.
* Improved Aluminum algorithm specification.
* Non-blocking point-to-point operations.
* Improved testing and benchmarks.
* Bugfixes and performance improvements.
Published by ndryden over 7 years ago
aluminum - Initial Release
Aluminum provides a generic interface to high-performance communication libraries, with a focus on allreduce algorithms. Blocking and non-blocking algorithms and GPU-aware algorithms are supported. Aluminum also contains custom implementations of select algorithms to optimize for certain situations.
Features:
- Blocking and non-blocking algorithms
- GPU-aware algorithms
- Implementations/interfaces:
  - MPI: MPI and custom algorithms implemented on top of MPI
  - NCCL: Interface to Nvidia's NCCL 2 library
  - MPI-CUDA: Custom GPU-aware algorithms
Published by bvanessen over 7 years ago