Recent Releases of aluminum

aluminum - v1.4.2

This is a minor update that primarily adds debugging support to Aluminum and support for newer versions of CUDA and ROCm.

  • Add additional sanity checks for MPI initialization.
  • Add various checks to ensure arguments are sane.
  • Add support for Caliper annotations to Aluminum APIs (build with ALUMINUM_ENABLE_CALIPER=Yes).
  • Add an option to disable all background streams for non-blocking operations (build with ALUMINUM_DISABLE_BACKGROUND_STREAMS=Yes).
  • Various compilation fixes.
  • Support for building with CUDA 12.
  • Support for building with ROCm 6.

- C++
Published by ndryden over 2 years ago

aluminum - v1.4.1

This is a bugfix release addressing a compilation issue with libc++ (see #209).

- C++
Published by ndryden almost 3 years ago

aluminum - v1.4.0

This release addresses various issues and adds a new MultiSendRecv operation.

  • The default internal stream pool size has changed to 1. This is to mitigate issues on ROCm platforms, but no performance impact was observed on other platforms.
  • Fix a compilation error when building on CUDA 12 platforms.
  • On ROCm platforms only: zero-size RCCL Send, Recv, and Sendrecv messages are skipped. This is to work around apparent hangs in RCCL with such messages and will be removed once the issue is fixed upstream.
  • Fix a memory copy issue in the host-transfer Alltoallv.
  • Updated to cxxopts 3.
  • Added a compile-time traits API for describing what operations, types, etc. are supported by each backend.
  • Added the MultiSendRecv operation, which supports an arbitrary sequence of sends and receives among ranks as a single operation.
  • Various internal reorganizations for the test and benchmark code.
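A compile-time traits API of the kind mentioned above can be sketched roughly as follows. This is only an illustration of the general pattern: every name here (DemoMPIBackend, DemoGPUBackend, DemoOp, DemoSupportsOp, demo_can_run) is hypothetical and is not taken from Aluminum's headers, and which backend supports which operation is invented for the example.

```cpp
#include <type_traits>

// Hypothetical backend tags; these are NOT Aluminum's real class names.
struct DemoMPIBackend {};
struct DemoGPUBackend {};

// Hypothetical operation identifiers.
enum class DemoOp { allreduce, multisendrecv };

// Primary template: by default a backend supports nothing.
template <DemoOp Op, typename Backend>
struct DemoSupportsOp : std::false_type {};

// Each backend opts in to an operation with an explicit specialization.
// The support matrix below is made up for illustration.
template <>
struct DemoSupportsOp<DemoOp::allreduce, DemoMPIBackend> : std::true_type {};
template <>
struct DemoSupportsOp<DemoOp::allreduce, DemoGPUBackend> : std::true_type {};
template <>
struct DemoSupportsOp<DemoOp::multisendrecv, DemoGPUBackend> : std::true_type {};

// Convenience wrapper usable in static_assert or if constexpr.
template <DemoOp Op, typename Backend>
constexpr bool demo_can_run() {
  return DemoSupportsOp<Op, Backend>::value;
}

// The query is resolved entirely at compile time.
static_assert(demo_can_run<DemoOp::allreduce, DemoMPIBackend>(),
              "demo MPI backend is declared to support allreduce");
```

With traits like these, generic code can skip or reject unsupported backend/operation combinations before anything runs, for example when deciding which tests or benchmarks to instantiate.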

- C++
Published by ndryden almost 3 years ago

aluminum - v1.3.1

This is a minor release that mainly fixes some linking issues on ROCm platforms.

  • Fix RCCL includes and linking.
  • Various improvements to the benchmarking and testing infrastructure.
  • Improved documentation.

- C++
Published by ndryden about 3 years ago

aluminum - v1.3.0

This adds in-place SendRecv support to Aluminum.

- C++
Published by ndryden about 3 years ago

aluminum - v1.2.3

This is a bugfix release adding threads linkage to the CMake export.

- C++
Published by ndryden about 3 years ago

aluminum - v1.2.2

This is primarily a bugfix release.

  • Fixed an issue in progress engine binding that could lead to hangs. (See #182.)
  • Traces include the stream of an operation.
  • Tuning parameters are now configured via CMake rather than by manually editing tuning_params.hpp.

- C++
Published by ndryden about 3 years ago

aluminum - v1.2.1

This is a minor bugfix release.

  • Fixed builds of the MPI-CUDA tests and MPI-CUDA RMA library.
  • Use locks to protect tracing when built with AL_THREAD_MULTIPLE.
  • Match the benchmarking script's type support to that of the tests.

- C++
Published by ndryden over 3 years ago

aluminum - v1.2.0

This release adds better support for low-precision data.

  • Support fp16 (IEEE half-precision) in all backends when support is available.
  • Support bfloat16 in all backends when support is available.
  • The NCCL/RCCL backend now supports averaging as a reduction operator (avg).
  • Aluminum now requires at least CUDA 11 / ROCm 5 when GPU support is requested.
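For readers unfamiliar with the avg operator: an averaging reduction is equivalent to an elementwise sum across ranks followed by division by the number of ranks. The sketch below models that on the host with plain vectors. It does not call Aluminum, NCCL, or RCCL; demo_allreduce_avg is a hypothetical helper, each inner vector stands in for one rank's buffer, and all buffers are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of an averaging allreduce.
// per_rank[r] stands in for the send buffer of rank r; all buffers are
// assumed to have the same length (not checked here).
std::vector<double> demo_allreduce_avg(
    const std::vector<std::vector<double>>& per_rank) {
  if (per_rank.empty()) {
    return {};
  }
  const std::size_t n = per_rank.front().size();
  std::vector<double> result(n, 0.0);
  // Elementwise sum over all ranks.
  for (const auto& buf : per_rank) {
    for (std::size_t i = 0; i < n; ++i) {
      result[i] += buf[i];
    }
  }
  // Divide by the number of participating ranks.
  for (auto& x : result) {
    x /= static_cast<double>(per_rank.size());
  }
  return result;
}
```

In a real collective every rank would receive this same result; the model returns a single copy for brevity.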

- C++
Published by ndryden over 3 years ago

aluminum - v1.1.0

The highlight of this release is that Aluminum now has a logo.

There were some other, slightly less interesting, changes too, notably full support for multi-threaded communication in Aluminum. There are also significant improvements to support on HIP/ROCm platforms and extensive internal cleanups.

  • Aluminum has a logo now!
  • Support the AL_MPI_SERIALIZED compile-time flag, which will run blocking MPI calls on the progress engine for situations where all calls need to come from the same thread.
  • Support AL_THREAD_MULTIPLE, which enables safe multi-threaded communication in Aluminum.
  • Significant improvements in benchmarking/testing infrastructure.
  • Removed support for custom MPI allreduce algorithms. Aluminum now uses the native MPI implementations.
  • Added an al_info binary to provide basic info on Aluminum.
  • Better progress engine binding on HIP/ROCm systems.
  • The host-transfer backend uses stream memory operations on HIP/ROCm systems when available.
  • Aluminum no longer relies on hipify for HIP/ROCm systems.
  • Aluminum's CMake exports components to identify backend support at build time.
  • Significant internal code reorganizations/cleanup for CUDA stuff and the progress engine.
  • Various bugfixes and other minor improvements.

- C++
Published by ndryden over 3 years ago

aluminum - v1.0.0

Aluminum is now officially stable.

Changes since v0.7.0:

  • Aluminum communicators have been refactored and now always operate like objects (as opposed to handles). Communicators all have a stream interface.
  • Added Barrier operation to all backends.
  • Added support for vector collectives in the host-transfer backend.
  • Fix bug in the NCCL Reduce_scatterv operation (#110).
  • Various other code cleanups and bug fixes.

- C++
Published by ndryden about 5 years ago

aluminum - v0.7.0

The testing and benchmarking infrastructure has been entirely rewritten to be significantly more comprehensive and cleaner. There are also now scripts for nicely plotting benchmark results.

Numerous bugfixes and similar improvements:

  • Aluminum no longer attempts to use bitwise reductions for long double.
  • Fixed bug in the host-transfer Allreduce on one processor.
  • Fix in-place bugs in the NCCL Gather, Gatherv, Scatter, and Scatterv operations.
  • Fix MPI type for long int.
  • The throw_al_exception macro works outside of the Al namespace.
  • Added a check for mismatches between the version of HWLOC Aluminum was compiled with and the one used at runtime.
  • All internal Aluminum headers are now included with the aluminum/ prefix to avoid conflicts with other projects.
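The long double fix reflects a general C++ constraint: bitwise operators such as & and | are only defined for integral types, so a bitwise reduction cannot be instantiated for floating-point types. A minimal sketch of guarding against that at compile time is shown below. The names demo_supports_bitwise_v and demo_bitwise_and are hypothetical and do not come from Aluminum's source.

```cpp
#include <type_traits>

// Hypothetical trait: bitwise reductions only make sense for integral types.
template <typename T>
constexpr bool demo_supports_bitwise_v = std::is_integral<T>::value;

// Hypothetical bitwise-AND combiner. Instantiating it with a non-integral
// type such as long double fails with a readable compile-time error.
template <typename T>
T demo_bitwise_and(T a, T b) {
  static_assert(demo_supports_bitwise_v<T>,
                "bitwise reductions require an integral type");
  return a & b;
}
```

Code that dispatches on the data type can consult the trait first and simply not offer bitwise operators for float, double, or long double.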

- C++
Published by ndryden over 5 years ago

aluminum - v0.6.0

New features:

  • Support for Send, Recv, and SendRecv in the NCCL backend.
  • Add initial support for Gather, Scatter, and Alltoall to the NCCL backend.
  • Initial support for vector collectives in the NCCL and MPI backends: Allgatherv, Alltoallv, Gatherv, Scatterv, and Reduce_scatterv.
  • Added new benchmarks for all supported operations.
  • Improved performance and correctness of the spin-wait kernel used in the host-transfer backend.
  • Improved progress engine binding logic. Related environment variables have been removed. Failing to bind no longer throws an exception.

Other changes:

  • Various code cleanups and enhancements.
  • The pairwise-exchange/ring allreduce algorithm has been removed from the MPI backend.
  • Internal CUB memory pool is used for temporary GPU memory allocations.

- C++
Published by ndryden over 5 years ago

aluminum - v0.5.0

  • Support for the entire Aluminum API in the MPI backend.
  • The MPI-CUDA backend has been renamed to the HostTransfer backend. (Except for RMA operations.)
  • Internal cleanups.

- C++
Published by ndryden almost 6 years ago

aluminum - v0.4.0

  • Bugfix for edge case that could cause hangs when not making progress.
  • Support for AMD GPUs using HIP/ROCm/RCCL.

- C++
Published by ndryden almost 6 years ago

aluminum - v0.3.3

Fixes a build issue with certain GPU backend configurations.

- C++
Published by ndryden over 6 years ago

aluminum - v0.3.2

  • Bugfixes related to ordering in MPI-CUDA.
  • Removed vector reduce-scatter.
  • Additional benchmarks and tests.

- C++
Published by ndryden over 6 years ago

aluminum - v0.2.1-1

Similar functionality to 0.2.1. Subsequent releases will break backwards compatibility.

- C++
Published by ndryden almost 7 years ago

aluminum - v0.2.1

Fixed the internal version number to properly reflect the release number. This is required for checking API compatibility.

- C++
Published by bvanessen over 7 years ago

aluminum - v0.2

New features/changes:

  • Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
  • Progress engine is now aware of separate compute streams. This enables better scheduling of non-interfering operations.
  • Experimental RMA Put/Get operations.
  • Improved Aluminum algorithm specification.
  • Non-blocking point-to-point operations.
  • Improved testing and benchmarks.
  • Bugfixes and performance improvements.

- C++
Published by ndryden over 7 years ago

aluminum - Initial Release

Aluminum provides a generic interface to high-performance communication libraries, with a focus on allreduce algorithms. Blocking and non-blocking algorithms and GPU-aware algorithms are supported. Aluminum also contains custom implementations of select algorithms to optimize for certain situations.

Features:

  • Blocking and non-blocking algorithms
  • GPU-aware algorithms
  • Implementations/interfaces:
      • MPI: MPI and custom algorithms implemented on top of MPI
      • NCCL: Interface to Nvidia's NCCL 2 library
      • MPI-CUDA: Custom GPU-aware algorithms

- C++
Published by bvanessen over 7 years ago