Recent Releases of cccl
cccl - CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the cuda.cccl libraries introduced in the pre-release 0.1.3.2.0.dev128 dated August 14th, 2025.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like reduce_into required four steps:
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, cuda.cccl.parallel introduces simpler, single-call APIs. For example, reduction looks like:
```python
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
```
If you wish to have more control over temporary memory allocation,
the previous API still exists (and always will). It has been renamed from reduce_into to make_reduce_into:
```python
# Object API
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer(temp_storage, d_input, d_output, num_items, h_init)
```
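The relationship between the two styles can be sketched on the CPU. The snippet below is only an illustration of the pattern, not the cuda.cccl implementation: `MockReducer` and this `reduce_into` are hypothetical stand-ins, the scratch size of 8 bytes is arbitrary, and plain Python lists take the place of device arrays.

```python
class MockReducer:
    """Hypothetical CPU stand-in for a two-phase reducer object."""

    def __init__(self, op):
        self.op = op

    def __call__(self, temp_storage, d_in, d_out, num_items, h_init):
        if temp_storage is None:
            # Phase 1: report the scratch size only (arbitrary value for the mock).
            return 8
        # Phase 2: perform the reduction and write the result to d_out[0].
        acc = h_init
        for i in range(num_items):
            acc = self.op(acc, d_in[i])
        d_out[0] = acc
        return None


def reduce_into(d_in, d_out, op, num_items, h_init):
    """Single-call wrapper: runs the size query, allocation, and reduction."""
    reducer = MockReducer(op)
    nbytes = reducer(None, d_in, d_out, num_items, h_init)
    temp_storage = bytearray(nbytes)
    reducer(temp_storage, d_in, d_out, num_items, h_init)


out = [0]
reduce_into([1, 2, 3, 4], out, lambda a, b: a + b, 4, 0)
print(out[0])
```

Keeping the object form around is useful when the same reduction runs many times, since the reducer and its temporary buffer can then be reused across calls.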
New algorithms
Device-wide histogram
The histogram_even function provides Python exposure of the corresponding CUB C++ API DeviceHistogram::HistogramEven.
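For readers unfamiliar with the even-bin histogram, here is a CPU reference of what such a histogram computes. It is a sketch of the semantics only: `histogram_even_reference` and its parameter names are illustrative choices modeled on the CUB C++ API, and the actual Python signature of histogram_even is not shown here. It assumes `num_levels` boundaries produce `num_levels - 1` equal-width bins over `[lower_level, upper_level)` and that samples outside that range are ignored.

```python
def histogram_even_reference(samples, num_levels, lower_level, upper_level):
    """CPU reference for an evenly spaced histogram (illustrative only)."""
    num_bins = num_levels - 1
    width = (upper_level - lower_level) / num_bins
    counts = [0] * num_bins
    for s in samples:
        # Samples outside [lower_level, upper_level) are not counted.
        if lower_level <= s < upper_level:
            b = int((s - lower_level) / width)
            # Guard against floating-point rounding at the upper edge.
            counts[min(b, num_bins - 1)] += 1
    return counts


samples = [2.2, 6.1, 7.1, 2.9, 3.5, 0.3, 2.9, 2.1, 6.1, 999.5]
print(histogram_even_reference(samples, 7, 0.0, 12.0))  # [1, 5, 0, 3, 0, 0]
```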
StripedToBlock exchange
cuda.cccl.cooperative adds block.exchange, providing Python exposure of the corresponding CUB C++ API BlockExchange.
Currently, only the StripedToBlock exchange pattern is supported.
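The data movement behind this exchange can be illustrated on the CPU. CUB describes two arrangements: in a striped arrangement, thread `t` holds logical elements `t`, `t + num_threads`, `t + 2 * num_threads`, and so on; in a blocked arrangement, each thread holds one contiguous run of elements. The function below is a hypothetical reference for that index remapping, not the block.exchange API itself.

```python
def striped_to_blocked(per_thread_items):
    """Rearrange per-thread items from striped to blocked order (CPU sketch).

    per_thread_items[t][i] is the i-th item held by thread t.
    """
    num_threads = len(per_thread_items)
    items_per_thread = len(per_thread_items[0])
    flat = [None] * (num_threads * items_per_thread)
    # Striped: slot i of thread t holds logical element i * num_threads + t.
    for t, items in enumerate(per_thread_items):
        for i, value in enumerate(items):
            flat[i * num_threads + t] = value
    # Blocked: thread t holds logical elements [t * ipt, (t + 1) * ipt).
    return [
        flat[t * items_per_thread:(t + 1) * items_per_thread]
        for t in range(num_threads)
    ]


striped = [[0, 4], [1, 5], [2, 6], [3, 7]]  # 4 threads, 2 items each
print(striped_to_blocked(striped))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```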
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter-weight cuda.core package. This means that installing cuda.cccl won't install CuPy as a dependency.
Support for CUDA 13 drivers
cuda.cccl can be used with CUDA 13 compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not
yet supported, meaning you still need the CUDA 12 toolkit. Full support for CUDA 13 toolkit is planned for the next
pre-release.
- C++
Published by shwina 5 months ago
cccl - v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in https://github.com/NVIDIA/cccl/pull/5352
Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.0.1...v3.0.2
- C++
Published by github-actions[bot] 5 months ago
cccl - v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in https://github.com/NVIDIA/cccl/pull/5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in https://github.com/NVIDIA/cccl/pull/5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in https://github.com/NVIDIA/cccl/pull/5281
Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.0.0...v3.0.1
- C++
Published by github-actions[bot] 6 months ago
cccl - v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
- Compilers:
- Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.
| Before CUDA 13.0 | After CUDA 13.0 |
| :---- | :---- |
| ${CTK_ROOT}/include/cuda/ | ${CTK_ROOT}/include/cccl/cuda/ |
| ${CTK_ROOT}/include/cub/ | ${CTK_ROOT}/include/cccl/cub/ |
| ${CTK_ROOT}/include/thrust/ | ${CTK_ROOT}/include/cccl/thrust/ |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with `nvcc`
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`
    - ✅ No action needed. (This is the recommended path. See example)
  - Other build systems
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
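Build scripts that locate the bundled headers by hand may need to branch on the toolkit version. The helper below is a small sketch that encodes the table above; `bundled_header_dir` and its parameters are hypothetical names, and it assumes POSIX-style paths and that the caller already knows the toolkit's major version.

```python
from pathlib import PurePosixPath

# Library directories listed in the header-location table.
CCCL_LIBRARIES = ("cuda", "cub", "thrust")


def bundled_header_dir(ctk_root: str, ctk_major: int, library: str) -> str:
    """Return the on-disk directory of a bundled CCCL library (illustrative)."""
    if library not in CCCL_LIBRARIES:
        raise ValueError(f"unknown CCCL library: {library}")
    include = PurePosixPath(ctk_root) / "include"
    if ctk_major >= 13:
        # From CUDA 13.0 on, the headers live under include/cccl/.
        include = include / "cccl"
    return str(include / library)


print(bundled_header_dir("/usr/local/cuda", 12, "cub"))
print(bundled_header_dir("/usr/local/cuda", 13, "cub"))
```

Note that source files keep writing `#include <cub/...>` in both cases; only the include search path changes.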
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See CCCL 2.x to 3.0 Migration Guide for complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See CCCL 2.x to 3.0 Migration Guide for complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- `NVFP8/6/4` extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
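As a rough guide to the arithmetic behind the integer utilities in this list, the Python functions below mirror what such helpers conventionally compute for positive integers. They are not bindings to the C++ functions, and they assume that the logarithms round down and that `round_up` and `round_down` snap a value to a multiple of their second argument; consult the libcu++ documentation for the exact contracts and preconditions.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value (positive inputs)."""
    return -(-value // multiple) * multiple


def round_down(value: int, multiple: int) -> int:
    """Largest multiple of `multiple` that is <= value (positive inputs)."""
    return (value // multiple) * multiple


def ilog2(value: int) -> int:
    """Floor of log2(value) for value >= 1."""
    return value.bit_length() - 1


def ilog10(value: int) -> int:
    """Floor of log10(value) for value >= 1, computed without floating point."""
    return len(str(value)) - 1


print(round_up(10, 4), round_down(10, 4), ilog2(1023), ilog10(1000))
```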
cub::
- `cub::DeviceSegmentedReduce` now supports large number of segments #3746
- `cub::DeviceCopy::Batched` now supports large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.
pip install cuda-cccl
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3260
- Drop thrust not1 and not2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3281
- Enable PDL in triple chevron launch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3282
- Drop Thrust legacy arch macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3332
- Deprecate cub::Swap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3333
- Drop CUB APIs with a debug_synchronous parameter by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3330
- Drop CUB's util_compiler.cuh for real by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3340
- Drop cub::ValueCache by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3346
- Drop CDPv1 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3344
- Use cuda::std::addressof in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3363
- Drop deprecated aliases in Thrust functional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3272
- Drop cub::DivideAndRoundUp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3347
- Use cuda::std::min/max in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3364
- Cleanup CUB util_arch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2773
- Deprecate thrust::null_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3367
- Deprecate thrust::async by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3324
- Review CUB `util.ptx` for CCCL 2.x by @fbusato in https://github.com/NVIDIA/cccl/pull/3342
- Deprecate thrust::numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3366
- Deprecate thrust::optional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3307
- Redefine and deprecate thrust::remove_cvref by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3394
- Replace and deprecate thrust::cuda_cub::terminate by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3421
- Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3419
- Moves agents to `detail::<algorithm_name>` namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3435
- Default transform_iterator's copy ctor by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3395
- Refactor allocator handling of contiguous_storage by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3050
- Drop thrust::detail::integer_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3391
- Deprecate a few CUB macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3456
- Deprecate thrust universal iterator categories by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3461
- Drop thrust universal iterator categories by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3474
- Moves CUB kernel entry points to a detail namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3468
- Deprecate block/warp algo specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3455
- Drop thrust numeric_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3476
- Deprecate and replace thrust::cuda_cub iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3422
- Deprecate thrust macros from type_deduction.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3501
- Deprecate thrust event, future and more by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3457
- Drop thrust::null_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3508
- Deprecates tuning policy hubs by @elstehle in https://github.com/NVIDIA/cccl/pull/3514
- Deprecate macros from cuda/detail/core/util.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3504
- Deprecate CUB iterators existing in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3304
- Deprecate thrust logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3538
- Fixes value type of `thrust::tabulate_output_iterator` by @elstehle in https://github.com/NVIDIA/cccl/pull/3573
- Internalize cuda/detail/core/* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3505
- Remove CUB `DeviceSpMV` by @fbusato in https://github.com/NVIDIA/cccl/pull/3549
- Remove `LEGACY_PTX_ARCH` by @fbusato in https://github.com/NVIDIA/cccl/pull/3551
- Removes deprecated `Agent*` alias templates in the public namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3717
- Move `ForceInclusive` parameter of `DispatchScan` before policy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3739
- Drop Thrust's cpp_compatibility.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3746
- Drop thrust::identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3747
- Drop deprecated entities from CUB util_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3743
- Drop cub::GridBarrier by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3745
- Move Dispatcher policy hub parameters to the back by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3740
- Drop small deprecated entities by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3748
- Error when users specialize BaseTraits but not numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3836
- Drop deprecated iterators from Thrust cuda utils by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3905
- Drop CUB thread operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3918
- Minimize usage of cub::Traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3863
- Drop/internalize some macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3936
- Drop public access to RegBoundScaling/MemBoundScaling by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3934
- Drop deprecated features from CUB util_ptx.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3935
- Fix definition of universal_host_pinned_memory_resource by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3988
- Assert offset type in `DispatchScan[ByKey]` to be unsigned and at least 4 bytes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3992
- Drop deprecated CUB macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3821
- Drop deprecated warp/block algo specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4007
- Drop remaining 2.8-deprecated entities by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4009
- Use cuda::std::array in histogram APIs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3973
- Test tuple of iterator reference assignment by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1964
- Rework counting_iterator difference by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3861
- [thrust, docs] Use the variadic overload of `make_zip_iterator` in the `zip_iterator` docs by @brycelelbach in https://github.com/NVIDIA/cccl/pull/4111
📚 Libcudacxx
- ptx: Add add_ptx_instruction.py by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3190
- Fix assert definition for NVHPC due to constexpr issues by @miscco in https://github.com/NVIDIA/cccl/pull/3418
- `ceil_div` return common type and optimize by @fbusato in https://github.com/NVIDIA/cccl/pull/3229
- attempt to work around msvc bug exposed by type_list.h by @ericniebler in https://github.com/NVIDIA/cccl/pull/3487
- Ensure that pointer_traits work nicely with proxy iterators by @miscco in https://github.com/NVIDIA/cccl/pull/3519
- Define is_floating_point_v in terms of is_floating_point by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3923
- Rework our `mdspan` implementation by @miscco in https://github.com/NVIDIA/cccl/pull/3343
- Implement more of cmath by @miscco in https://github.com/NVIDIA/cccl/pull/3963
📝 Documentation
- Improve docs of std headers by @miscco in https://github.com/NVIDIA/cccl/pull/3416
🔄 Other Changes
- Expands support for more offset types in segmented benchmark by @elstehle in https://github.com/NVIDIA/cccl/pull/3231
- Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects by @miscco in https://github.com/NVIDIA/cccl/pull/3253
- [Version] Update main to v2.9.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/3247
- Architecture and OS identification macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3237
- [Version] Update main to v3.0.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/3265
- CCCL Internal macro documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3238
- Require at least gcc7 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3268
- Drop ICC from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3277
- [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place by @caugonnet in https://github.com/NVIDIA/cccl/pull/3270
- Disambiguate line continuations and macro continuations in by @wmaxey in https://github.com/NVIDIA/cccl/pull/3244
- Drop VS 2017 from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3287
- Drop ICC support in code by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3279
- Make CUB NVRTC commandline arguments come from a cmake template by @wmaxey in https://github.com/NVIDIA/cccl/pull/3292
- Add components to the bug report template by @caugonnet in https://github.com/NVIDIA/cccl/pull/3295
- Use process isolation instead of default hyper-v for Windows. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3294
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/NVIDIA/cccl/pull/3248
- Drop CTK 11.x from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3275
- Update repo_man and packman versions by @shwina in https://github.com/NVIDIA/cccl/pull/3293
- Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload by @elstehle in https://github.com/NVIDIA/cccl/pull/2506
- Refactor scan tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3262
- Implement `views::empty` by @miscco in https://github.com/NVIDIA/cccl/pull/3254
- Refactor `limits` and `climits` by @davebayer in https://github.com/NVIDIA/cccl/pull/3221
- cuda.parallel: Add documentation for the current iterators along with examples and tests by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3311
- Drop clang<14 from CI, update devcontainers. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3309
- [STF] Cleanup task dependencies object constructors by @caugonnet in https://github.com/NVIDIA/cccl/pull/3291
- Disable test with a gcc-14 regression by @miscco in https://github.com/NVIDIA/cccl/pull/3297
- Remove dropped function objects from docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3319
- Document `NV_TARGET` macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3313
- [STF] Define ctx.pick_stream() which was missing for the unified context by @caugonnet in https://github.com/NVIDIA/cccl/pull/3326
- Clarify CUB transform output can overlap input by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3323
- Limits the number of different offset types for `DeviceMergeSort` by @elstehle in https://github.com/NVIDIA/cccl/pull/3328
- Drop thrust::void_t by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3362
- Fix all_of documentation for empty ranges by @upsj in https://github.com/NVIDIA/cccl/pull/3358
- [STF] Do not keep track of dangling events in a CUDA graph backend by @caugonnet in https://github.com/NVIDIA/cccl/pull/3327
- Extract scan kernels into NVRTC-compilable header by @shwina in https://github.com/NVIDIA/cccl/pull/3334
- Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` by @davebayer in https://github.com/NVIDIA/cccl/pull/3361
- Deprecate cub::DeviceSpmv by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3320
- Improves `DeviceSegmentedSort` test run time for large number of items and segments by @elstehle in https://github.com/NVIDIA/cccl/pull/3246
- Compile basic infra test with C++17 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3377
- Adds support for large number of items and large number of segments to `DeviceSegmentedSort` by @elstehle in https://github.com/NVIDIA/cccl/pull/3308
- Exit with error when RAPIDS CI fails. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3385
- cuda.parallel: Support structured types as algorithm inputs by @shwina in https://github.com/NVIDIA/cccl/pull/3218
- Fix broken `_CCCL_BUILTIN_ASSUME` macro by @fbusato in https://github.com/NVIDIA/cccl/pull/3314
- Replace `typedef` with `using` in libcu++ by @davebayer in https://github.com/NVIDIA/cccl/pull/3368
- Upgrade to Catch2 3.8 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3310
- refactor `<cuda/std/cstdint>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3325
- Update CODEOWNERS by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3331
- Fix sign-compare warning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3408
- Implement more cmath functions to be usable on host and device by @miscco in https://github.com/NVIDIA/cccl/pull/3382
- Extend CUB reduce benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3401
- Update upload-pages-artifact to v3 by @shwina in https://github.com/NVIDIA/cccl/pull/3423
- `std::linalg` accessors and `transposed_layout` by @fbusato in https://github.com/NVIDIA/cccl/pull/2962
- Add round up/down to multiple by @fbusato in https://github.com/NVIDIA/cccl/pull/3234
- [FEA]: Introduce Python module with CCCL headers by @rwgk in https://github.com/NVIDIA/cccl/pull/3201
- cuda.parallel: Add optional stream argument to reduce_into() by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3348
- Fix Deploy CCCL pages workflow by @rwgk in https://github.com/NVIDIA/cccl/pull/3434
- [CUDAX] Fix CI issues in the nightly testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3443
- Remove deprecated `cub::min` and `thrust::remove_cvref` by @miscco in https://github.com/NVIDIA/cccl/pull/3450
- Fix typo in builtin by @miscco in https://github.com/NVIDIA/cccl/pull/3451
- Uses unsigned offset types in thrust's scan algorithms by @elstehle in https://github.com/NVIDIA/cccl/pull/3436
- Turn C++ dialect warning into error by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3453
- Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` by @elstehle in https://github.com/NVIDIA/cccl/pull/3437
- Add `cuda::is_floating_point` supporting half and bfloat by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3379
- Drop C++11 and C++14 support for all of cccl by @miscco in https://github.com/NVIDIA/cccl/pull/3417
- [CUDAX] Fix block and grid dimension order in <<<>>> in one of the hierarchy tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3465
- Add `--extended-lambda` to the list of removed clangd flags by @fbusato in https://github.com/NVIDIA/cccl/pull/3432
- add `_CCCL_HAS_NVFP8` macro by @fbusato in https://github.com/NVIDIA/cccl/pull/3429
- Add `_CCCL_BUILTIN_PREFETCH` by @fbusato in https://github.com/NVIDIA/cccl/pull/3433
- Ensure that headers in `<cuda/*>` can be built with a C++ only compiler by @miscco in https://github.com/NVIDIA/cccl/pull/3472
- Specialize __is_extended_floating_point for FP8 types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3470
- Refactor CUB's util_debug by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3345
- Specialize `cuda::std::numeric_limits` for FP8 types by @davebayer in https://github.com/NVIDIA/cccl/pull/3478
- Fix typo in limits by @miscco in https://github.com/NVIDIA/cccl/pull/3491
- Add dynamic CUB dispatch for scan to support c.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3398
- Use a raw string literal for nvrtc source by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3486
- Add `popcount`, `clz`, `ctz` builtin intrinsics by @fbusato in https://github.com/NVIDIA/cccl/pull/3489
- [STF] Fix paths in the STF unittest infrastructure by @caugonnet in https://github.com/NVIDIA/cccl/pull/3396
- Increase test coverage now that we dropped half of our configs by @miscco in https://github.com/NVIDIA/cccl/pull/3500
- Fix issue with conversion between `mdspan<T>` and `mdspan<const T>` by @miscco in https://github.com/NVIDIA/cccl/pull/3469
- Extract merge sort kernels to NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3438
- [STF] Generate statistics about the DOT output by @caugonnet in https://github.com/NVIDIA/cccl/pull/3509
- [CUDAX] Align some naming and add missing docs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3497
- [CUDAX] Rename `hierarchy_dimensions_fragment` to `hierarchy_dimensions` and remove the old alias by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3496
- cuda.parallel: invoke pytest directly rather than via `python -m pytest` by @shwina in https://github.com/NVIDIA/cccl/pull/3523
- add a `__call_result_t` alias template, implement `__is_callable_v` with it by @ericniebler in https://github.com/NVIDIA/cccl/pull/3527
- cudastf (examples): Fix compiler errors when enabling examples for CUDA STF by @janciesko in https://github.com/NVIDIA/cccl/pull/3516
- A few improvements for internal macro documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3554
- Replace pipes.quote with shlex.quote in lit config by @wmaxey in https://github.com/NVIDIA/cccl/pull/3547
- Tune cub::DeviceTransform for Blackwell by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3543
- Refactor injecting benchmark policy_hub by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3561
- Try to always include the definition of barrier_native_handle when needed by @miscco in https://github.com/NVIDIA/cccl/pull/3556
- Fix transform iterator for non-copy-constructible types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3542
- Sync ptx helpers with libcudaptx by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3564
- Update ptx_isa.h to include 8.7 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3563
- add missing visibility annotations to ustdex types that have data members by @ericniebler in https://github.com/NVIDIA/cccl/pull/3571
- [STF] Document dot sections by @caugonnet in https://github.com/NVIDIA/cccl/pull/3506
- Remove nvks runners from testing pool. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3580
- Try and get rapids green by @miscco in https://github.com/NVIDIA/cccl/pull/3503
- Add `__int128` and `__float128` detection macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3413
- Remove all code paths and policies for SM37 and below by @fbusato in https://github.com/NVIDIA/cccl/pull/3466
- PTX: Update generated files with Blackwell instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3568
- Update CI matrix to use NVKS nodes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3572
- Deprecate and replace `CUB_IS_INT128_ENABLED` by @fbusato in https://github.com/NVIDIA/cccl/pull/3427
- Adds support for large num items to `DeviceMerge` by @elstehle in https://github.com/NVIDIA/cccl/pull/3530
- Support FP16 traits on CTK 12.0 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3535
- Suppress execution checks for vocabulary types by @miscco in https://github.com/NVIDIA/cccl/pull/3578
- [nv/target] Add sm_120 macros. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3550
- PTX: Remove internal instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3583
- Add dynamic CUB dispatch for merge_sort by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3525
- PTX: Update existing instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3584
- PTX: Add clusterlaunchcontrol by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3589
- PTX: Add st.bulk by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3604
- PTX: Add multimem instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3603
- PTX: Add cp.async.mbarrier.arrive{.noinc} by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3602
- PTX: Add tcgen05 instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3607
- Use a different implementation for `tuple_of_iterator_references` to tuple conversion by @miscco in https://github.com/NVIDIA/cccl/pull/3609
- work around erroneous "undefined in device code" error in `basic_any` by @ericniebler in https://github.com/NVIDIA/cccl/pull/3614
- Deprecate `AgentSegmentFixupPolicy` by @fbusato in https://github.com/NVIDIA/cccl/pull/3593
- Fix deadlocks by enabling eager module loading in libcudacxx tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3585
- Add b200 tunings for histogram by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3616
- make `uninitialized[_async]_buffer`'s range accessors const-correct by @ericniebler in https://github.com/NVIDIA/cccl/pull/3615
- Fix typo in index.rst by @cliffburdick in https://github.com/NVIDIA/cccl/pull/3620
- Disable X86-64 detection macro for Arm64 emulation on MSVC by @fbusato in https://github.com/NVIDIA/cccl/pull/3540
- Deprecate ABI v2 and v3 in libcudacxx by @wmaxey in https://github.com/NVIDIA/cccl/pull/3575
- Add b200 policies for reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3612
- Add b200 tunings for reduce.by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3610
- Remove CUDA 11.x support by @fbusato in https://github.com/NVIDIA/cccl/pull/3596
- PTX: fix cp.async.bulk.tensor and mbarrier.arrive by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3628
- Add b200 tunings for radix_sort.keys by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3611
- Try and make nvrtc on windows pass by @miscco in https://github.com/NVIDIA/cccl/pull/3623
- Sync PTX refactorings from libcudaptx by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3632
- Bump CI to use CTK 12.8, add sm100 build. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3544
- PTX: add bfind, exit and trap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3627
- Adds benchmarks for `cub::DeviceMerge` by @elstehle in https://github.com/NVIDIA/cccl/pull/3529
- remove AgentSegmentFixupPolicy by @fbusato in https://github.com/NVIDIA/cccl/pull/3639
- `__builtin_isfinite` is only available above nvrtc 12.2 by @miscco in https://github.com/NVIDIA/cccl/pull/3644
- Turn `TEST_[HALF|BF]_T` into function-style macros and fix some tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3608
- [STF] frozen_logical_data::get_access_mode() by @caugonnet in https://github.com/NVIDIA/cccl/pull/3646
- Internalize `triple_chevron` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3648
- This improves the detection logic for `__cccl_ptx_isa` for clang-cuda by @miscco in https://github.com/NVIDIA/cccl/pull/3647
- Try to fix backport workflow by @leofang in https://github.com/NVIDIA/cccl/pull/3634
- Revert #3623 by @leofang in https://github.com/NVIDIA/cccl/pull/3654
- Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3635
- Fix transform_iterator and drop result_of_adaptable_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3652
- Transition build system of cuda_cccl and cuda_parallel to scikit-build-core by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3597
- Replaces bool template parameters on `Dispatch*` class templates to use `enum class` by @elstehle in https://github.com/NVIDIA/cccl/pull/3643
- Add b200 policies for device.select.if,flagged,unique by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3545
- Add b200 tunings for radix_sort.pairs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3626
- Fix the vectorized loading of BlockLoad by @ChristinaZ in https://github.com/NVIDIA/cccl/pull/3517
- PTX: mbarrier.{test,try}_wait: Fix return value by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3670
- Add b200 policies for cub.select.unique_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3557
- Update RAPIDS CI build to 25.04. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3539
- Fix issues with nvrtc compilation by @miscco in https://github.com/NVIDIA/cccl/pull/3666
- Function-like macros for FP6/BF16 macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3588
- Remove `cub::ArrayWrapper` by @fbusato in https://github.com/NVIDIA/cccl/pull/3677
- Internalize cub::PolicyWrapper by @fbusato in https://github.com/NVIDIA/cccl/pull/3681
- Modernize MSVC 2005/nvcc workaround by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3606
- Deprecate `cub::AliasTemporaries` by @fbusato in https://github.com/NVIDIA/cccl/pull/3679
- [CUB] Remove pre-c++17 conditions and code by @fbusato in https://github.com/NVIDIA/cccl/pull/3684
- Internalize cub::KernelConfig by @fbusato in https://github.com/NVIDIA/cccl/pull/3683
- remove MSVC 2017 paths by @fbusato in https://github.com/NVIDIA/cccl/pull/3553
- [Thrust] Remove pre-c++17 conditions and code by @fbusato in https://github.com/NVIDIA/cccl/pull/3687
- Remove cugraph-ops from RAPIDS 25.04 builds. by @bdice in https://github.com/NVIDIA/cccl/pull/3675
- Refactor radix_sort tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3657
- Make thrust iterators work with NVRTC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3676
- Deprecate and replace thrust::identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3649
- Replace CUB iterators by Thrust ones by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3480
- Drop Thrust's global workaround by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3692
- replace Int2Type in CUB library by @fbusato in https://github.com/NVIDIA/cccl/pull/3641
- Add b200 policies for cub.device.run_length_encode.encode,non_trivial_runs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3546
- Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3689
- Fix sccache reporting in CI summaries. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3621
- Make THRUST_DEVICE_SYSTEM and THRUST_CPP_DIALECT independent of THRUST_HOST_SYSTEM by @adams381 in https://github.com/NVIDIA/cccl/pull/3659
- Deprecate `cub::RegBoundScaling` and `cub::MemBoundScaling` by @fbusato in https://github.com/NVIDIA/cccl/pull/3685
- Fix devcontainers' `initializeCommand` by @trxcllnt in https://github.com/NVIDIA/cccl/pull/3533
- [cuda.cooperative] Add missing overloads to block.reduce and block.sum by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2691
- clean up the cudax `__launch_transform` code and document its purpose and design by @ericniebler in https://github.com/NVIDIA/cccl/pull/3526
- Add b200 policies for partition.three_way by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3708
- Fix multiple CI arches in matrix by @alliepiper in https://github.com/NVIDIA/cccl/pull/3702
- Minor cleanups following bool-to-enum template parameter PR by @elstehle in https://github.com/NVIDIA/cccl/pull/3716
- Remove V2 and V3 ABI support from libcudacxx. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3662
- Add b200 tunings for scan.exclusive.by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3560
- assorted bug fixes for the std::execution implementation in cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/3721
- Minor fix for a regressing tuning in reduce.by_key by @gonidelis in https://github.com/NVIDIA/cccl/pull/3723
- Fix SM100 histogram tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3691
- Move `zip_iterator` to internally use `cuda::std::tuple` by @miscco in https://github.com/NVIDIA/cccl/pull/3725
- Remove reduce tunings with no benefit by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3724
- fix ::cuda::discard_memory by @fbusato in https://github.com/NVIDIA/cccl/pull/3733
- Add b200 policies for cub.device.partition.flagged,if by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3617
- Add b200 tunings for scan.exclusive.sum by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3559
- Fix cub trait deprecations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3742
- Nightly fixes by @alliepiper in https://github.com/NVIDIA/cccl/pull/3720
- Clarify scan benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3709
- Drop thrust::future|event|async::* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3730
- Replace raw arm64/x86_64 macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3732
- Add Merge Sort implementation for c.parallel by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3636
- Extracted Segmented Reduce kernels into NVRTC compilable header by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3727
- Remove unsupported CPU architecture paths (32-bit) by @fbusato in https://github.com/NVIDIA/cccl/pull/3752
- [Automation] Add release workflow for tagging and testing new RCs by @wmaxey in https://github.com/NVIDIA/cccl/pull/3009
- fix cuda std namespace by @fbusato in https://github.com/NVIDIA/cccl/pull/3751
- Remove `cuda/__init__.py` in `cuda-parallel` package by @shwina in https://github.com/NVIDIA/cccl/pull/3750
- Simplify `cuda::std::{min,max}` by @miscco in https://github.com/NVIDIA/cccl/pull/3758
- Add dynamic CUB dispatch for SegmentedReduce by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3753
- [STF] Implement kernel chains in the graph backend without child graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3707
- Add Scan implementation for c.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3462
- cuda.parallel: Minor perf improvements by @shwina in https://github.com/NVIDIA/cccl/pull/3718
- refactor `<cuda/std/cstdlib>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3339
- Fix python editable builds by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3762
- Reinstate `thrust::optional` by @miscco in https://github.com/NVIDIA/cccl/pull/3759
- Drop unsupported dialects for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3695
- Disable `[[no_unique_address]]` for MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/3757
- cuda.coop: Generalize war_introspection utility for any # of arguments by @shwina in https://github.com/NVIDIA/cccl/pull/3769
- Avoid issues with nvcc compilation in c++ mode by @miscco in https://github.com/NVIDIA/cccl/pull/3770
- Refactor `cuda/cmath` functions documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3773
- [STF] Factorize large event lists in CUDA graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3756
- Replace pre-c++17 traits with modern ones in CUB by @fbusato in https://github.com/NVIDIA/cccl/pull/3774
- Drop cugraph-gnn from rapids CI by @miscco in https://github.com/NVIDIA/cccl/pull/3771
- [STF] Ensure dot_section::guard is actually movable by @caugonnet in https://github.com/NVIDIA/cccl/pull/3778
- Guard PDL by availability by @miscco in https://github.com/NVIDIA/cccl/pull/3779
- [STF] virtual to_string() method for STF contexts by @caugonnet in https://github.com/NVIDIA/cccl/pull/3781
- [STF] Enable freeze on logical tokens by @caugonnet in https://github.com/NVIDIA/cccl/pull/3782
- Refactors `DeviceMemcpy`'s `vectorized_copy` tests by @elstehle in https://github.com/NVIDIA/cccl/pull/3777
- More h100 usage. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3776
- Add Python wrappers for c.parallel scan API by @shwina in https://github.com/NVIDIA/cccl/pull/3592
- Replace `_CCCL_IF_CONSTEXPR` by @fbusato in https://github.com/NVIDIA/cccl/pull/3775
- Remove `_CCCL_CONSTEXPR_CXX14/17` by @fbusato in https://github.com/NVIDIA/cccl/pull/3793
- Bump -std from 14 to 17 in `./ci/(build|test)_cub.sh` examples. by @tpn in https://github.com/NVIDIA/cccl/pull/3792
- [CUDAX] Add host launch API allowing stream ordered host execution by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3555
- Moves `DeviceMemcpy`'s `BitPackedCounter` tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/3794
- Refactor `<cuda/std/cstring>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3484
- fix NoopExecutor by @fbusato in https://github.com/NVIDIA/cccl/pull/3811
- Unifies workload generation for `DeviceMerge` benchmarks by @elstehle in https://github.com/NVIDIA/cccl/pull/3645
- Optimize and clean `countl`, `countr`, `popcount`, `has_single_bit` by @fbusato in https://github.com/NVIDIA/cccl/pull/3414
- fix `-Werror=unused-result` by @fbusato in https://github.com/NVIDIA/cccl/pull/3810
- Enable `cuda::std::ssize` for C++17 by @miscco in https://github.com/NVIDIA/cccl/pull/3813
- fix `_LIBCUDACXX_HAS_NO_INT128` with NVRTC by @fbusato in https://github.com/NVIDIA/cccl/pull/3802
- Move radix sort kernels to separate NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3803
- Fix `popc` parentheses warning by @fbusato in https://github.com/NVIDIA/cccl/pull/3820
- Add arch_traits for sm100 to cudax. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3818
- Remove unused function parameter by @ericniebler in https://github.com/NVIDIA/cccl/pull/3828
- CI summary fix by @alliepiper in https://github.com/NVIDIA/cccl/pull/3826
- Refactor Thrust allocator example by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3830
- [STF] Improved cache mechanism for executable CUDA graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3768
- Drop deprecated CUB iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3831
- Use libcu++ limits/trait in tests/benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3822
- Move unique_by_key kernels to NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3815
- Specialize `numeric_limits` for CUDA 12.8 FP types by @davebayer in https://github.com/NVIDIA/cccl/pull/3832
- Refactor thrust::zip_iterator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3834
- Refactor Thrust iterators 2/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3840
- Refactor Thrust iterators 3/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3842
- Refactor Thrust iterators 4/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3833
- Increase libcudacxx test timeout by @alliepiper in https://github.com/NVIDIA/cccl/pull/3850
- Use lower case variable name to avoid macro collisions by @miscco in https://github.com/NVIDIA/cccl/pull/3856
- Fix incorrect availability of `variant` in docs by @miscco in https://github.com/NVIDIA/cccl/pull/3859
- Add cuda_cccl to the list of Python packages for which test suite is run by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3846
- Refactor Thrust iterators 1/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3839
- Rewrites `DeviceMemcpy::Batched` tests to use device-side data generation and Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/3849
- Refactor CUB transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3825
- Add Python wrappers for c.parallel merge_sort API by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3763
- Add c parallel segmented reduce api by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3838
- [libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel) by @gonzalobg in https://github.com/NVIDIA/cccl/pull/3671
- Consider specializations of `std::iterator_traits` by @miscco in https://github.com/NVIDIA/cccl/pull/3837
- Update supported C++ dialects in README by @davebayer in https://github.com/NVIDIA/cccl/pull/3879
- Refactor `assume_aligned` implementation by @fbusato in https://github.com/NVIDIA/cccl/pull/3765
- Refactor and make NVRTC compile `<cub/util_device>` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3880
- Cache the result of `merge_sort()` by @shwina in https://github.com/NVIDIA/cccl/pull/3881
- do not try to use clang-19's support for c++26 pack indexing by @ericniebler in https://github.com/NVIDIA/cccl/pull/3888
- Add support for single item per thread calls to block_scan.exclusive_scan by @tpn in https://github.com/NVIDIA/cccl/pull/3829
- Document `cuda::maximum`, `cuda::minimum` by @fbusato in https://github.com/NVIDIA/cccl/pull/3883
- Refactor Thrust iterator_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3892
- Update Blackwell PTX instruction availability tables by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3894
- Fix CCCL C headers to be compileable by C compiler by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3885
- Move transform kernels to NVRTC compilable header by @shwina in https://github.com/NVIDIA/cccl/pull/3875
- PTX `shfl_sync` by @fbusato in https://github.com/NVIDIA/cccl/pull/3241
- Add a warning that we cannot tune transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3896
- Extend tuning guide by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3904
- Drop join_iterator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3891
- Revert Thrust find_if_not implementation to please nvc++ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3901
- [CUB/docs] Add missing closing braces to `BlockReduce` kernel examples in CUB docs. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/3916
- [STF] Executable CUDA graphs caching policies by @caugonnet in https://github.com/NVIDIA/cccl/pull/3868
- Refactor Thrust iterator internals by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3893
- Revert Thrust mismatch implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3899
- Replace usage of CUB_MIN|MAX in reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3927
- Move to cuda::std::iterator_traits in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3924
- Add C++ test for single-item-per-thread BlockScan Sum routines. by @tpn in https://github.com/NVIDIA/cccl/pull/3889
- Rename threads_in_block -> threads_per_block to be consistent with CUB. by @tpn in https://github.com/NVIDIA/cccl/pull/3919
- Implement cuda.cooperative.block_scan.inclusive_sum(). by @tpn in https://github.com/NVIDIA/cccl/pull/3921
- Replace CUB macros in more places by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3930
- [PTX] Add shl, shr, bmsk, prmt by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3939
- Add testreduceapi.py::testreducestructtypeminmax by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3938
- Add `cuda::std::aligned_accessor` by @fbusato in https://github.com/NVIDIA/cccl/pull/3731
- [STF] Thread safe graph_ctx by @caugonnet in https://github.com/NVIDIA/cccl/pull/3925
- Replace CUB macros in tunings and benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3931
- Deprecate and replace some Thrust iterator traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3928
- Optimize `bit_floor`, `bit_ceil`, `bit_width` by @fbusato in https://github.com/NVIDIA/cccl/pull/3296
- Allow RAPIDS workflow to run on an arbitrary branch. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3945
- Initial CUDA C++ Execution Model documentation by @gonzalobg in https://github.com/NVIDIA/cccl/pull/3873
- [STF] Remove unmaintained CUDASTF_DEBUG option by @caugonnet in https://github.com/NVIDIA/cccl/pull/3944
- Revert "Initial CUDA C++ Execution Model documentation (#3873)" by @alliepiper in https://github.com/NVIDIA/cccl/pull/3950
- Implement `ranges::ref_view` by @miscco in https://github.com/NVIDIA/cccl/pull/3316
- Expose CCCL branch controls on Actions UI for RAPIDS workflow. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3948
- Drop unused `TEST_COMPILER_CUDACC_BELOW_11_3` macro by @miscco in https://github.com/NVIDIA/cccl/pull/3946
- Allow NVRTC to compile more of CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3951
- Use `_CCCL_REQUIRES_EXPR` in test code by @miscco in https://github.com/NVIDIA/cccl/pull/3954
- Improve `<cuda/std/bit>` documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3959
- [STF] Support generation of multiple CUDA graphs from separate threads by @caugonnet in https://github.com/NVIDIA/cccl/pull/3943
- Add segmented_reduce python api by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3906
- Implement `__cccl_is_integer` trait by @davebayer in https://github.com/NVIDIA/cccl/pull/3962
- Implement `cudax::async_buffer` by @miscco in https://github.com/NVIDIA/cccl/pull/3460
- Add dynamic CUB dispatch for unique_by_key by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3816
- Fix typo in `_LIBCUDACXX_HAS_NVFP16` macro by @davebayer in https://github.com/NVIDIA/cccl/pull/3965
- Drop obsolete thrust tuple algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3966
- Extend CUB policy and tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3933
- Fix thrust::raw_reference_cast for tuple_of_iterator_references and simplify thrust::generate by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3970
- [PTX] Add `st`, `ld` instructions by @fbusato in https://github.com/NVIDIA/cccl/pull/3974
- [cuda.cooperative] Support multidimensional thread blocks in block load/store and improve load/store docs by @brycelelbach in https://github.com/NVIDIA/cccl/pull/3161
- Disable automatic header inclusion for clangd by @miscco in https://github.com/NVIDIA/cccl/pull/3365
- Deprecate and replace `THRUST_STATIC_ASSERT` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3971
- Avoid int overflow during multipl
- C++
Published by github-actions[bot] 6 months ago
cccl - v2.8.5
What's Changed
- Avoid plain `assert` in device code by @miscco in https://github.com/NVIDIA/cccl/pull/4707
- Do not use open-coded `INFINITY` for tests that also test extended floating points by @miscco in https://github.com/NVIDIA/cccl/pull/4744
- [Version] Update branch/2.8.x to v2.8.5 by @github-actions in https://github.com/NVIDIA/cccl/pull/4755
- [Backport branch/2.8.x] Update Blackwell PTX instruction availability tables by @github-actions in https://github.com/NVIDIA/cccl/pull/3900
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.4...v2.8.5
- C++
Published by github-actions[bot] 7 months ago
cccl - v2.8.4
What's Changed
- [BACKPORT] Do not use pack indexing with clang-19 by @miscco in https://github.com/NVIDIA/cccl/pull/4447
- [Backport branch/2.8.x] Always bypass automatic atomic storage checks to prevent potential compiler issues by @github-actions in https://github.com/NVIDIA/cccl/pull/4616
- [Version] Update branch/2.8.x to v2.8.4 by @github-actions in https://github.com/NVIDIA/cccl/pull/4655
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.3...v2.8.4
- C++
Published by github-actions[bot] 8 months ago
cccl - v2.8.3
What's Changed
- [BACKPORT: 2.8] Set NO_CMAKE_FIND_ROOT_PATH for cudax. (#4162) by @miscco in https://github.com/NVIDIA/cccl/pull/4216
- [BACKPORT 2.8] Fix the cuda python setup by @miscco in https://github.com/NVIDIA/cccl/pull/4218
- Backport PR #4221 to branch/2.8.x — Remove python/cuda_cooperative/setup.py by @rwgk in https://github.com/NVIDIA/cccl/pull/4235
- [Backport branch/2.8.x] Remove invalid single `#` in builtin.h by @github-actions in https://github.com/NVIDIA/cccl/pull/4326
- [BACKPORT 2.8] Allow rapids to avoid unrolling some loops in sort (#4253) by @miscco in https://github.com/NVIDIA/cccl/pull/4387
- [Backport branch/2.8.x] Fix uninitialized read in local atomic code path. by @github-actions in https://github.com/NVIDIA/cccl/pull/4424
- [Version] Update branch/2.8.x to v2.8.3 by @github-actions in https://github.com/NVIDIA/cccl/pull/4423
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.2...v2.8.3
- C++
Published by github-actions[bot] 9 months ago
cccl - v2.8.2
What's Changed
- [Version] Update branch/2.8.x to v2.8.2 by @github-actions in https://github.com/NVIDIA/cccl/pull/4079
- Ignore `Wmaybe-uninitialized` in dispatch_reduce.cuh. by @bdice in https://github.com/NVIDIA/cccl/pull/4054
- backport: fix numeric_limits digits for nvfp8/6/4 (#4070) by @miscco in https://github.com/NVIDIA/cccl/pull/4130
- [BACKPORT]: Avoid compiler issue with MSVC and `span` constructor by @miscco in https://github.com/NVIDIA/cccl/pull/4127
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.1...v2.8.2
- C++
Published by github-actions[bot] 9 months ago
cccl - v2.8.1
What's Changed
- Backport to 2.8: NVHPC fixes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4021
- [Backport 2.8.x] [cuda::ptx] Fix .cta_group::2 definition (#4038) by @wmaxey in https://github.com/NVIDIA/cccl/pull/4044
- [Version] Update branch/2.8.x to v2.8.1 by @github-actions in https://github.com/NVIDIA/cccl/pull/4049
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.0...v2.8.1
- C++
Published by github-actions[bot] 10 months ago
cccl - CCCL 2.8.0
What's Changed
- Adds benchmarks for `DeviceSelect::Unique` by @elstehle in https://github.com/NVIDIA/cccl/pull/2359
- CUB - Enable DPX Reduction by @fbusato in https://github.com/NVIDIA/cccl/pull/2286
- [CUDAX] add a small c++17 implementation of `std::execution` (aka P2300) by @ericniebler in https://github.com/NVIDIA/cccl/pull/2301
- Add thrust::transform_inclusive_scan with init value by @gonidelis in https://github.com/NVIDIA/cccl/pull/2326
- Widen histogram agent constructor to more types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2380
- Use a constant for the amount of static SMEM by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2374
- Add `cub::DeviceTransform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2086
- Update toolkit to CTK 12.6 by @miscco in https://github.com/NVIDIA/cccl/pull/2348
- implement `make_integer_sequence` in terms of intrinsics whenever possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/2384
- Implement `cuda::mr::cuda_async_memory_resource` by @miscco in https://github.com/NVIDIA/cccl/pull/1637
- Drop implementation of `thrust::pair` and `thrust::tuple` by @miscco in https://github.com/NVIDIA/cccl/pull/2395
- Pull out `_LIBCUDACXX_UNREACHABLE` into its own file by @miscco in https://github.com/NVIDIA/cccl/pull/2399
- Share common compiler flags in new CCCL-level targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2386
- conditionally include `<crt/host_defines.h>` from `__cccl/execution_space.h` header by @ericniebler in https://github.com/NVIDIA/cccl/pull/2406
- add some simple utilities for manipulating lists of types by @ericniebler in https://github.com/NVIDIA/cccl/pull/2370
- Drop thrusts diagnostic suppression warnings by @miscco in https://github.com/NVIDIA/cccl/pull/2392
- [PoC]: Implement `cuda::experimental::uninitialized_async_buffer` by @miscco in https://github.com/NVIDIA/cccl/pull/1854
- Fix thrust package to work with newer FindOpenMP.cmake. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2421
- Introduce `cccl_configure_target` cmake function. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2388
- Fix sccache errors in RAPIDS builds by @trxcllnt in https://github.com/NVIDIA/cccl/pull/2417
- Replace `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries` (only in README.md). by @rwgk in https://github.com/NVIDIA/cccl/pull/2424
- Minor cleanup with `cuda/atomic` by @miscco in https://github.com/NVIDIA/cccl/pull/2418
- `uninitialized_buffer::get_resource` returns a ref to an `any_resource` that can be copied by @ericniebler in https://github.com/NVIDIA/cccl/pull/2431
- Refactor `cuda::ceil_div` to take two different types by @miscco in https://github.com/NVIDIA/cccl/pull/2376
- Reduce PR testing matrix. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2436
- Implement `cudax::shared_resource` by @miscco in https://github.com/NVIDIA/cccl/pull/2398
- Increase the libcu++ timeout by @miscco in https://github.com/NVIDIA/cccl/pull/2435
- Move `c/include/cccl/*.h` files to `c/include/cccl/c/*.h` by @rwgk in https://github.com/NVIDIA/cccl/pull/2428
- Make `any_resource` emplacable by @miscco in https://github.com/NVIDIA/cccl/pull/2425
- Fix issues with `__host__` and `__device__` definitions by @miscco in https://github.com/NVIDIA/cccl/pull/2413
- Make `bit_cast` play nice with extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2434
- Do not include our own string.h file by @miscco in https://github.com/NVIDIA/cccl/pull/2444
- Move nightly time by @bdice in https://github.com/NVIDIA/cccl/pull/2437
- Remove a ton of lines in thrust tests by @gonidelis in https://github.com/NVIDIA/cccl/pull/2356
- [CUDAX] Add placeholder green context type and logical device that can hold both a green ctx and a device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2446
- Fix typo in CCCLBuildCompilerTargets.cmake by @alliepiper in https://github.com/NVIDIA/cccl/pull/2453
- Drop superfluous compile definition from thrust tests by @miscco in https://github.com/NVIDIA/cccl/pull/2450
- Consolidate packages and install rules by @alliepiper in https://github.com/NVIDIA/cccl/pull/2456
- Prune CUB's ChainedPolicy by CUDA_ARCH_LIST by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2154
- fixes merge conflict for policy pruning by @elstehle in https://github.com/NVIDIA/cccl/pull/2466
- Add CCCL_ENABLE_WERROR flag. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2463
- Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments by @fbusato in https://github.com/NVIDIA/cccl/pull/2254
- Propagate compiler flags down to libcu++ LIT tests by @Artem-B in https://github.com/NVIDIA/cccl/pull/2420
- Drop remaining uses of `_LIBCUDACXX_COMPILER_*` by @miscco in https://github.com/NVIDIA/cccl/pull/2467
- Avoid C++17 extension in c++11 tests by @miscco in https://github.com/NVIDIA/cccl/pull/2469
- Add span to example and templated block size by @Kh4ster in https://github.com/NVIDIA/cccl/pull/2470
- Drop Objective C++ support by @miscco in https://github.com/NVIDIA/cccl/pull/2468
- removes superfluous template keyword in call to Dereference by @andrewcorrigan in https://github.com/NVIDIA/cccl/pull/2482
- Improve build times in several heavyweight libcudacxx tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2478
- Drop `__availability` header by @miscco in https://github.com/NVIDIA/cccl/pull/2484
- Replace a few more instances of `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries`. by @rwgk in https://github.com/NVIDIA/cccl/pull/2447
- Fix `common_type` specialization for extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2483
- Implement some CUDA API calls for `async_memory_pool` by @miscco in https://github.com/NVIDIA/cccl/pull/2455
- Move cudax example project to CCCL project examples. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2462
- Disable system header for narrowing conversion check by @miscco in https://github.com/NVIDIA/cccl/pull/2465
- Require resources to always provide at least one execution space property by @miscco in https://github.com/NVIDIA/cccl/pull/2489
- Rework builtin handling by @miscco in https://github.com/NVIDIA/cccl/pull/2461
- Disable execution checks for `std::equal` by @miscco in https://github.com/NVIDIA/cccl/pull/2491
- replace `_CCCL_ALWAYS_INLINE` with `_CCCL_FORCEINLINE` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2439
- Drop 2 relative includes that snuck in by @miscco in https://github.com/NVIDIA/cccl/pull/2492
- re-express the `cudax::__tupl::__apply` member to make nvc++ happy by @ericniebler in https://github.com/NVIDIA/cccl/pull/2493
- Drop badly named `_One_of` concept by @miscco in https://github.com/NVIDIA/cccl/pull/2490
- Unify assert handling in cccl by @miscco in https://github.com/NVIDIA/cccl/pull/2382
- Reduce scope of Thrust linkage in cudax. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2496
- Centralize CPM logic. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2495
- Fix typo in presets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2497
- Refactor away per-project TOPLEVEL flags. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2498
- [FEA]: Validate cuda.parallel type matching in build and execution by @rwgk in https://github.com/NVIDIA/cccl/pull/2429
- avoid gcc optimizer bug by not force inlining part of `thrust::transform` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2509
- Cleanup and modularize `<cuda/std/barrier>` by @miscco in https://github.com/NVIDIA/cccl/pull/2443
- Consolidate header testing infra. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2460
- Add ForEachN from CUB to cccl/c. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2378
- Adds support for large number of items in `DeviceSelect` and `DevicePartition` by @elstehle in https://github.com/NVIDIA/cccl/pull/2400
- Adds support for large number of items to `DeviceScan::*ByKey` family of algorithms by @elstehle in https://github.com/NVIDIA/cccl/pull/2477
- Integrate c/parallel with CCCL build system and CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2514
- Create a command list utility for nvrtc/jitlink steps. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2511
- Fix the example project which the documentation refers to by @caugonnet in https://github.com/NVIDIA/cccl/pull/2531
- Enable tests/headertests for c/parallel in all-dev presets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2566
- Rename cudax test targets to match CCCL conventions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2568
- Update project list in issue template by @alliepiper in https://github.com/NVIDIA/cccl/pull/2532
- Disable compiler extensions on CCCL targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2559
- Fixes
cub::DeviceMemcpy::Batchedto be able to copy fromconstsource pointers by @elstehle in https://github.com/NVIDIA/cccl/pull/2573 - Fix documentation error in ci/build_common.sh for -arch by @caugonnet in https://github.com/NVIDIA/cccl/pull/2574
- gcc-14 gained the ability to mangle `noexcept` expressions by @ericniebler in https://github.com/NVIDIA/cccl/pull/2565
- Miscellaneous simple fixes by @rwgk in https://github.com/NVIDIA/cccl/pull/2575
- Avoid including `yvals.h` when the compiler is not MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2545
- Fix popc.h when architecture is not x86 on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2524
- test for exceptions support on msvc with the `_CPPUNWIND` macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2576
- fix the forwarding of the receiver in the `just_from` algorithm by @ericniebler in https://github.com/NVIDIA/cccl/pull/2569
- Block type pack indexing on NVCC by @wmaxey in https://github.com/NVIDIA/cccl/pull/2563
- Cleanup the semaphore headers by @miscco in https://github.com/NVIDIA/cccl/pull/2441
- Add `_CCCL_GRID_CONSTANT` macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2530
- Add `_CCCL_RESTRICT` macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2529
- Try to use the same redefinition of `__assert_fail` as pytorch has by @miscco in https://github.com/NVIDIA/cccl/pull/2577
- Fix miscellaneous bugs in cub/iterator documentation. by @rwgk in https://github.com/NVIDIA/cccl/pull/2580
- Expose parts of `<cuda/std/memory>` by @fbusato in https://github.com/NVIDIA/cccl/pull/2502
- add a config macro for testing support for inline variables by @ericniebler in https://github.com/NVIDIA/cccl/pull/2581
- add dialect macros `_CCCL_NO_RTTI` and `_CCCL_NO_TYPEID` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2578
- fix misspelling in the `_CCCL_NO_VARIABLE_TEMPLATES` macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2584
- Add `atomic_ref` support for 8 and 16b types. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2255
- add `_LIBCUDACXX_REQUIRES_EXPR` to the concepts emulation macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/2564
- Ensure CuPy arrays can be used with `cuda.parallel` too by @leofang in https://github.com/NVIDIA/cccl/pull/2335
- assert that `cuda::std::declval` is `noexcept` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2588
- Revert accidental force push to main. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2596
- add `__is_callable_v` variable template when possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/2598
- Cleanup threading support by @miscco in https://github.com/NVIDIA/cccl/pull/2507
- `CCCL_TOPLEVEL_PROJECT` always needs to be defined by @robertmaynard in https://github.com/NVIDIA/cccl/pull/2597
- Strip prefix paths from cudax documentation by @caugonnet in https://github.com/NVIDIA/cccl/pull/2603
- examples/cudax/CMakeLists.txt should not be executable by @caugonnet in https://github.com/NVIDIA/cccl/pull/2594
- [CUDAX] Peer access control on `async_memory_pool` and `async_memory_resource` by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2587
- Introduce `_CCCL_PRAGMA` to CCCL by @davebayer in https://github.com/NVIDIA/cccl/pull/2610
- Only enable CUDA language when needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2612
- Modularize latch by @miscco in https://github.com/NVIDIA/cccl/pull/2508
- Unify kernel dispatch paths for device reduce between CUB and c.parallel. by @griwes in https://github.com/NVIDIA/cccl/pull/2591
- Integrate CUDASTF -> CudaX by @caugonnet in https://github.com/NVIDIA/cccl/pull/2572
- [STF] The cmake example for stf was not updated when moving to main branch by @caugonnet in https://github.com/NVIDIA/cccl/pull/2618
- Rework `head_flags` so that we do not rely on the tuple being unevaluated by @miscco in https://github.com/NVIDIA/cccl/pull/2619
- [CUDAX] size_bytes in buffer types by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2621
- fix portability bug in libcu++'s implementation of `char_traits` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2623
- [cccl/c] Unify some build boilerplate by @wmaxey in https://github.com/NVIDIA/cccl/pull/2625
- devcontainer: replace `VAULT_HOST` with `AWS_ROLE_ARN` by @jjacobelli in https://github.com/NVIDIA/cccl/pull/2604
- Add checks to unique_id by @andralex in https://github.com/NVIDIA/cccl/pull/2622
- Add `cuda::get_device_address` by @miscco in https://github.com/NVIDIA/cccl/pull/2611
- Do not pass integral constants to ptx by @miscco in https://github.com/NVIDIA/cccl/pull/2229
- Add nvhpc devcontainer to CI by @miscco in https://github.com/NVIDIA/cccl/pull/1488
- Use a default initialization for CUDA graph mem alloc nodes by @caugonnet in https://github.com/NVIDIA/cccl/pull/2632
- [CUDAX] Add `get_name` to `device_ref` by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2631
- Add 12.5 devcontainer needed for nvhpc by @miscco in https://github.com/NVIDIA/cccl/pull/2634
- a substitute for `std::type_info` when the compiler doesn't support RTTI by @ericniebler in https://github.com/NVIDIA/cccl/pull/2582
- Check for missing `inline` on functions in public headers. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2570
- fix linker errors about multiply defined symbols in STF by @ericniebler in https://github.com/NVIDIA/cccl/pull/2641
- Add installation presets and update README with install steps by @alliepiper in https://github.com/NVIDIA/cccl/pull/2643
- Fix annotated_ptr test failures. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2607
- Issue a deprecation warning when compiling with ICC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2076
- Include all python libs in inspect_changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2648
- Add reusable workflow for updating version in branch with a PR by @wmaxey in https://github.com/NVIDIA/cccl/pull/2589
- define `_CCCL_NO_RTTI` in device code; RTTI isn't available there by @ericniebler in https://github.com/NVIDIA/cccl/pull/2639
- Migrate C2H library to top-level library by @alliepiper in https://github.com/NVIDIA/cccl/pull/2629
- [CUDAX] Add `can_peer_access_to` API to `device_ref` and check both ways access in `get_peers` by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2642
- Use `_CCCL_ASSERT` for stf by @miscco in https://github.com/NVIDIA/cccl/pull/2645
- un-templatize CUDASTF's `callback_completion_kernel` per @robertmaynard by @ericniebler in https://github.com/NVIDIA/cccl/pull/2656
- Implement C++20 `<source_location>` by @miscco in https://github.com/NVIDIA/cccl/pull/2628
- Disable `[[no_unique_address]]` for clang and mdspan by @miscco in https://github.com/NVIDIA/cccl/pull/2646
- [STF] Adapt `timing_with_fences` test to be more reliable by @caugonnet in https://github.com/NVIDIA/cccl/pull/2658
- Add prefetching kernel as new fallback for `cub::DeviceTransform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2396
- Drop `cub::DeviceTransform` fallback to `cub::DeviceFor` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2660
- Ignore more files when detecting CI changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2654
- Add `thrust::universal_host_pinned_vector` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2653
- add new type-list algorithms `copy_if`, `remove_if`, `find_if`, and `unique` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2644
- abide by CCCL config macro naming conventions for `_CCCL_PRETTY_FUNCTION` and `_CCCL_NO_BUILTIN_STRLEN` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2640
- [STF] Fix how we define multi-dimensional shapes in the documentation by @caugonnet in https://github.com/NVIDIA/cccl/pull/2662
- Automate creating a CCCL release from RC tags. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2657
- Enable span to work with contiguous std containers in C++17 by @miscco in https://github.com/NVIDIA/cccl/pull/2613
- [Version] Update main to v2.8.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/2670
- promote the cudax `__async/config.cuh` to be the config for all of cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/2638
- avoid using nvcc's `__type_pack_element` before 12.2 by @ericniebler in https://github.com/NVIDIA/cccl/pull/2673
- Update ninja_summary.py to support ninja log v6. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2663
- Rename new CUB headers to follow conventions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2675
- consistent use of `_CUDAX` function attributes in the cudax `__async/` directory by @ericniebler in https://github.com/NVIDIA/cccl/pull/2676
- [CUDAX] Add forwarding reference to functor accepting launch by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2677
- [CUDAX] Add initial bits of `copy_bytes` and `fill_bytes` by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2608
- suppress msvc warning "qualifier applied to function type" in `is_function` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2683
- Disable ublkcp CUB transform kernel for NVHPC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2664
- Deprecate `thrust::cuda_cub::identity` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2688
- Remove an unused variable by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2690
- Setup cudax examples. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2697
- portability fixes for `_CCCL_BUILTIN_PRETTY_FUNCTION` and `_CCCL_TYPEID` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2695
- address portability issues found while using the typelist/typeset utilities by @ericniebler in https://github.com/NVIDIA/cccl/pull/2694
- Make tests technically correct by initializing the barrier by @miscco in https://github.com/NVIDIA/cccl/pull/2701
- Fix invalid memory reads in `test_device_batch_copy`. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2698
- revert config macros `_CCCL_CUDACC_BELOW_XX_X` to their original semantics by @ericniebler in https://github.com/NVIDIA/cccl/pull/2700
- This cleans up our function objects a bit by @miscco in https://github.com/NVIDIA/cccl/pull/2702
- Drop handling of 32bit Windows by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2689
- Guard inclusion of `cuda_runtime_api` by using a cuda compiler by @miscco in https://github.com/NVIDIA/cccl/pull/2704
- Fix race condition in `block_reduce_raking`. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2699
- Honor `CCCL_ENABLE_WERROR` for CUDA targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2705
- Fix nvbench helper compilation for clang-18 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2707
- Default ctor of `device_ptr` and `normal_iterator` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2708
- Add `cuda::minimum` and `cuda::maximum` by @Jacobfaib in https://github.com/NVIDIA/cccl/pull/2681
- Various fixes to `cub::DeviceTransform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2709
- Make `thrust::transform` use `cub::DeviceTransform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2389
- Ensure that we only use the inline variable trait when it is actually available by @miscco in https://github.com/NVIDIA/cccl/pull/2712
- [CUDAX] Rename memory resource and memory pool from async to device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2710
- triple_chevron fix by @fbusato in https://github.com/NVIDIA/cccl/pull/2720
- Improve `uninitialized_{async_}buffer` API by @miscco in https://github.com/NVIDIA/cccl/pull/2713
- Fix merge conflict from renaming of `async_memory_resource` by @miscco in https://github.com/NVIDIA/cccl/pull/2728
- [STF] Improve DOT graph outputs by @caugonnet in https://github.com/NVIDIA/cccl/pull/2703
- Implement `_CCCL_SUPPRESS_DEPRECATED_[PUSH|POP]` for ICC and NVHPC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2730
- Clean up CUB thread operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2716
- Deprecate/replace more of Thrust functional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2105
- Alias `cuda::std::identity` to `__identity` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2733
- Do not read uninitialized memory for OOB elements. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2739
- Add option to conditionally build CUDASTF by @miscco in https://github.com/NVIDIA/cccl/pull/2731
- fix `cuda::std::bit_width()` return type by @fbusato in https://github.com/NVIDIA/cccl/pull/2745
- [STF] Option to disable kernel generation in CUDASTF by @caugonnet in https://github.com/NVIDIA/cccl/pull/2723
- fix `static_extent()` return type by @fbusato in https://github.com/NVIDIA/cccl/pull/2751
- make the empty parens after level constructors optional by @ericniebler in https://github.com/NVIDIA/cccl/pull/2750
- cudax: rename ustdex's `__query` member function to `query` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2757
- Implement execution policies by @miscco in https://github.com/NVIDIA/cccl/pull/2715
- Document some transform iterator corner cases by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2740
- Shorten the git commit message in the ci scripts by @miscco in https://github.com/NVIDIA/cccl/pull/2760
- Separate CUDA and C++ code in C2H by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2734
- Make `get_stream` work with queries by @miscco in https://github.com/NVIDIA/cccl/pull/2761
- Allow `thrust::identity` to forward value category by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2732
- Proclaim Thrust/CUB/libcu++ functor address stability by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2719
- give `declval` an implementation that compiles 2x faster by @ericniebler in https://github.com/NVIDIA/cccl/pull/2758
- [CUDAX] Add modernized simpleP2P sample by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2696
- `s/get_delegatee_scheduler/get_delegation_scheduler/` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2766
- remove duplicated `__apply_cv` type trait by @ericniebler in https://github.com/NVIDIA/cccl/pull/2754
- merge metaprogramming libs from libcudac++ and µstdex by @ericniebler in https://github.com/NVIDIA/cccl/pull/2767
- Doc fix scan by @karthikeyann in https://github.com/NVIDIA/cccl/pull/2769
- Remove obsolete ways to set iterator category in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2759
- Run `thrust::transform` benchmarks with more elements by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2764
- Increase libcu++ timeout by @miscco in https://github.com/NVIDIA/cccl/pull/2774
- [STF] Rename the redux access mode into relaxed by @caugonnet in https://github.com/NVIDIA/cccl/pull/2776
- Enable type trait aliases in all standard modes by @miscco in https://github.com/NVIDIA/cccl/pull/2763
- Optimize, Cleanup, and Expose CUB Thread-Level Reduction by @fbusato in https://github.com/NVIDIA/cccl/pull/2756
- Disable execution checks for tuple by @miscco in https://github.com/NVIDIA/cccl/pull/2780
- Avoid benchmarking first-time setup in Thrust algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2782
- Improve listing benchmarks and text by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2778
- Fix thrust partition docs typo by @gonidelis in https://github.com/NVIDIA/cccl/pull/2791
- Drop unused sanitizer hook by @miscco in https://github.com/NVIDIA/cccl/pull/2793
- use `_CCCL_HAS_FEATURE` instead of plain `__has_feature` everywhere by @davebayer in https://github.com/NVIDIA/cccl/pull/2794
- Avoid `make_zip_iterator(make_tuple(...))` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2796
- implement `_CCCL_HAS_INCLUDE` by @davebayer in https://github.com/NVIDIA/cccl/pull/2786
- add `__cpp_lib_mdspan` feature-test macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2787
- Remove redundant cmake from example. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2804
- change `__as_type_list` so it doesn't cause the instantiation of its argument by @ericniebler in https://github.com/NVIDIA/cccl/pull/2803
- [CUDAX] Enable passing hierarchy levels directly into make_config by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2755
- Fix cudacc/cluster detection macro in launch path of libcudacxx tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/2811
- [STF] Replace `CUDASTF_CODE_GENERATION` by `!CUDASTF_DISABLE_CODE_GENERATION` by @caugonnet in https://github.com/NVIDIA/cccl/pull/2797
- Reduce P0 benchmark variations for `merge_sort_pairs` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2798
- Replace macros by lambdas in cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2817
- Add `nvrtc_sm_top_level::add_link_list()` and use in c/parallel/src/reduce.cu by @rwgk in https://github.com/NVIDIA/cccl/pull/2781
- give `completion_signatures` a fast lookup cache by @ericniebler in https://github.com/NVIDIA/cccl/pull/2812
- implement new compiler checks for NVHPC by @davebayer in https://github.com/NVIDIA/cccl/pull/2816
- Unify `[CCCL|CUB|THRUST]_ENABLE_BENCHMARKS` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2827
- Remove traces of metal from CCCL by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2828
- Move our CUDACC version checks towards the new version check design by @miscco in https://github.com/NVIDIA/cccl/pull/2826
- Extend CUB benchmarking documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2831
- Remove all warm-up runs from Thrust benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2838
- Utility scripts for benchmark database by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2847
- [CUDAX] Add missing sm_61 traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2848
- Move `_CCCL_COMPILER_ICC` to the new macro by @miscco in https://github.com/NVIDIA/cccl/pull/2849
- Fix wrong include in Thrust benchmark by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2854
- Add missing include by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2855
- Move `_CCCL_COMPILER_GCC` to the new macro by @davebayer in https://github.com/NVIDIA/cccl/pull/2850
- Add benchmarking and tuning presets by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2856
- Fix race condition in block-RLD test harness. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2706
- Add MatX build to CCCL CI by @alliepiper in https://github.com/NVIDIA/cccl/pull/2682
- Fix DeviceSegmentedSort NVTX range name by @davidwendt in https://github.com/NVIDIA/cccl/pull/2857
- Make discovery mechanism for `cuda/_include` directory compatible with `pip install --editable` by @rwgk in https://github.com/NVIDIA/cccl/pull/2846
- add missing `DOXYGEN_*` predefined macros when building the cudax docs by @ericniebler in https://github.com/NVIDIA/cccl/pull/2858
- correct the names of `shared_resource`'s async allocate/deallocate members by @ericniebler in https://github.com/NVIDIA/cccl/pull/2880
- [Docs/PTX] Add device tensor map init example by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1983
- Fix rst typos in benchmarking.html by @gonidelis in https://github.com/NVIDIA/cccl/pull/2868
- Include use of NVHPC in CUB/Thrust magic namespace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2771
- backport `to_underlying` by @davebayer in https://github.com/NVIDIA/cccl/pull/2853
- move `_CCCL_COMPILER_CLANG` to the new macro by @davebayer in https://github.com/NVIDIA/cccl/pull/2859
- Automate release branch creation by @wmaxey in https://github.com/NVIDIA/cccl/pull/2685
- Add `thrust_create_target` `DISPATCH` option. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2844
- `for_each_in_extent` by @fbusato in https://github.com/NVIDIA/cccl/pull/2518
- Fix old gcc version check by @davebayer in https://github.com/NVIDIA/cccl/pull/2904
- Move implementation of `_LIBCUDACXX_TEMPLATE` to CCCL by @miscco in https://github.com/NVIDIA/cccl/pull/2832
- Try to work around issue with NVHPC in conjunction with older CTK versions by @miscco in https://github.com/NVIDIA/cccl/pull/2889
- Refactor nvbench helper less_t by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2905
- add "`interface`" to `_CCCL_PUSH_MACROS` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2919
- Replace inconsistent Doxygen macros with `_CCCL_DOXYGEN_INVOKED` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2921
- implement C++26 `std::span::at` by @davebayer in https://github.com/NVIDIA/cccl/pull/2924
- move msvc compiler macros to new version by @davebayer in https://github.com/NVIDIA/cccl/pull/2885
- Reorganize PTX tests to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2930
- Reorganize PTX docs to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2929
- Improve build instructions for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2881
- Reorganize PTX headers to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2925
- implement C++26 `std::span`'s constructor from `std::initializer_list` by @davebayer in https://github.com/NVIDIA/cccl/pull/2923
- Add tuple protocol to `cuda::std::complex` from C++26 by @davebayer in https://github.com/NVIDIA/cccl/pull/2882
- Add missing qualifier for cuda namespace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2940
- Try to fix a clang warning: by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2941
- minor consistency improvements in concepts macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/2928
- Drop some of the mdspan fold implementation by @miscco in https://github.com/NVIDIA/cccl/pull/2949
- [STF] Implement `CUDASTF_DOT_TIMING` for the `ctx.cuda_kernel` construct by @caugonnet in https://github.com/NVIDIA/cccl/pull/2950
- Avoid potential null dereference in `annotated_ptr` by @miscco in https://github.com/NVIDIA/cccl/pull/2951
- make compiler version comparison utility generic by @davebayer in https://github.com/NVIDIA/cccl/pull/2952
- Add SM100 descriptor to target by @miscco in https://github.com/NVIDIA/cccl/pull/2954
- Regenerate `cuda::ptx` headers/docs and run format by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2937
- Regenerate `cuda::ptx` test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2953
- Do not include extended floating point headers if they are not needed by @miscco in https://github.com/NVIDIA/cccl/pull/2956
- [CUDAX] Add `copy_bytes` and `fill_bytes` overloads for mdspan by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2932
- add a `_CCCL_NO_CONCEPTS` config macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2945
- remove definition of macro (`_LIBCUDACXX_NO_RTTI`) that is no longer used by @ericniebler in https://github.com/NVIDIA/cccl/pull/2957
- Avoid symbol clashes with libc++ by @miscco in https://github.com/NVIDIA/cccl/pull/2955
- Add more CUB transform benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2906
- Start reworking our math functions by @miscco in https://github.com/NVIDIA/cccl/pull/2749
- Drop memory resources in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2860
- `std::dims` by @fbusato in https://github.com/NVIDIA/cccl/pull/2961
- Fix merge conflict from moving resources up a namespace by @miscco in https://github.com/NVIDIA/cccl/pull/2965
- [CUDAX] Add a way to combine thread hierarchies by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2746
- Require approval to run CI on draft PRs by @bdice in https://github.com/NVIDIA/cccl/pull/2969
- fix thread-reduce performance regression by @fbusato in https://github.com/NVIDIA/cccl/pull/2944
- add a `__type_switch` utility and use it in the ptx generator by @ericniebler in https://github.com/NVIDIA/cccl/pull/2946
- replace use of old `_CONCEPT_FRAGMENT` macro in cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/2973
- remove vestigial uses of the old `DOXYGEN_SHOULD_SKIP_THIS` macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2978
- Fix `proclaim_copyable_arguments` for lambdas by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2833
- Forward declare half types in `cuda::ptx` by @ahendriksen in https://github.com/NVIDIA/cccl/pull/2981
- Fix tuning benchmark for `cub::DeviceTransform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2970
- fix old gcc version check by @davebayer in https://github.com/NVIDIA/cccl/pull/2989
- Fix a typo in thrust/binary_search.h (#2980) by @hzhangxyz in https://github.com/NVIDIA/cccl/pull/2992
- Enable assertions for CCCL users in CMake Debug builds by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2986
- Fix CMake warning for FindPythonInterp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2982
- Further clarify host compiler support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2991
- Drop `_CCCL_ELSE_IF_CONSTEXPR` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2966
- implement C++26 `std::ignore` by @davebayer in https://github.com/NVIDIA/cccl/pull/2922
- make the upper limit on TMP loop unrolling configurable by @ericniebler in https://github.com/NVIDIA/cccl/pull/2971
- Update docs with recent features by @davebayer in https://github.com/NVIDIA/cccl/pull/2994
- Restore thrust single config options. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2977
- Document tuning DB comparison scripts by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2968
- Build CUB and Thrust tests with assertions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2987
- Issue a deprecation warning when compiling with Visual Studio 2017 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2990
- Guard forward declarations of extended FP types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2998
- [STF] Create dot sections to possibly collapse nodes when displaying large DOT graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/2988
- Remove redundant pre c++11 checks by @davebayer in https://github.com/NVIDIA/cccl/pull/2999
- Avoid checking unsigned values for negativity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2997
- Rename thrust example `version.cu` to `print_version.cu` by @j3soon in https://github.com/NVIDIA/cccl/pull/3002
- don't bother sync-ing a stream with itself by @ericniebler in https://github.com/NVIDIA/cccl/pull/3007
- Backport `is_scoped_enum` by @davebayer in https://github.com/NVIDIA/cccl/pull/3003
- Put `monostate` in `<utility>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3000
- backport std integer comparison functions to C++11 by @davebayer in https://github.com/NVIDIA/cccl/pull/2805
- backport `forward_like` by @davebayer in https://github.com/NVIDIA/cccl/pull/2995
- Document how to profile benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3015
- Update Thrust examples ReadMe by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3004
- Deprecate public CUB/Thrust deprecation macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3010
- Fix libcudacxx example by @j3soon in https://github.com/NVIDIA/cccl/pull/3013
- Refactor BlockLoad test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3005
- Fix NVBench profile flags in docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3016
- Update RAPIDS to 25.02. by @bdice in https://github.com/NVIDIA/cccl/pull/2967
- Tweak tuning database plot and comparison scripts by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2883
- Allow passing debug flags to NVRTC in libcudacxx tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/3020
- Add missing template parameter to BlockRadixRank example. by @esoha-nvidia in https://github.com/NVIDIA/cccl/pull/2736
- Fix value range overflows in tests by @Artem-B in https://github.com/NVIDIA/cccl/pull/3022
- Avoid relative includes that have slipped in by @miscco in https://github.com/NVIDIA/cccl/pull/3042
- Fix word count example in Thrust by @caugonnet in https://github.com/NVIDIA/cccl/pull/3014
- revise `<cuda/std/version>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3043
- Replace thrust::swap by cuda::std::swap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2985
- add a converting constructor to `cudax::stream_ref` from `cuda::stream_ref` by @ericniebler in https://github.com/NVIDIA/cccl/pull/3052
- [CUDAX] Remove launch overloads taking dimensions and make everything except `make_hierarchy` return `kernel_config` by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2979
- move sender support library to `__async/sender/` by @ericniebler in https://github.com/NVIDIA/cccl/pull/3056
- [cuda.cooperative] Add block.load and block.store. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2693
- Backport `unreachable` by @davebayer in https://github.com/NVIDIA/cccl/pull/3018
- Define the destructor of `kernel_arg` by @miscco in https://github.com/NVIDIA/cccl/pull/3060
- Add missing `__syncthreads()` to test by @miscco in https://github.com/NVIDIA/cccl/pull/3061
- Add assertions in the mdspan accessors that we are not out of bounds by @miscco in https://github.com/NVIDIA/cccl/pull/3055
- Do not use cudaGetErrorString on GPU. by @Artem-B in https://github.com/NVIDIA/cccl/pull/3059
- Reduce number of per-PR CI jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2931
- Rework CUDA compiler checks by @davebayer in https://github.com/NVIDIA/cccl/pull/3057
- implement C++23 `invoke_r` by @davebayer in https://github.com/NVIDIA/cccl/pull/3041
- Consider `NV_TARGET_SM_INTEGER_LIST` for ChainedPolicy pruning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2772
- Add environment to encapsulate information needed for `cudax::vector` by @miscco in https://github.com/NVIDIA/cccl/pull/2775
- We should not call `cudaGetErrorString` on device by @miscco in https://github.com/NVIDIA/cccl/pull/3062
- Introduce cuda.cooperative overloads not requiring temporary storage by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2528
- `basic_any`: a utility for defining type-erasing wrappers in terms of an interface description by @ericniebler in https://github.com/NVIDIA/cccl/pull/2633
- Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes by @wmaxey in https://github.com/NVIDIA/cccl/pull/3066
- Don't use exact comparison for FP values. by @Artem-B in https://github.com/NVIDIA/cccl/pull/2742
- Use consistent spelling for aliasing select benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3073
- Improve handling of language level features by @miscco in https://github.com/NVIDIA/cccl/pull/3069
- Only tune streaming DeviceSelect versions for 64-bit offsets by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3072
- Disable nvrtc workaround by @miscco in https://github.com/NVIDIA/cccl/pull/1116
- fix assorted problems in cudax memory resource equality fns by @ericniebler in https://github.com/NVIDIA/cccl/pull/3079
- Support fancy iterators in cuda.parallel by @rwgk in https://github.com/NVIDIA/cccl/pull/2788
- fix feature test for operator<=> by @ericniebler in https://github.com/NVIDIA/cccl/pull/3075
- Mark test as potentially passing by @miscco in https://github.com/NVIDIA/cccl/pull/3078
- Avoid padding warning with MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/3077
- Improve CUB tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3058
- Optimise tuning compile-time by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3074
- Use consistent spelling for `CounterT` in histogram benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3089
- [Improvement] Don't require specifying output type when constructing TransformIterator (cuda.parallel) by @shwina in https://github.com/NVIDIA/cccl/pull/3083
- simplify the definition of the `basic_any` class template by @ericniebler in https://github.com/NVIDIA/cccl/pull/3085
- Use only signed offset types in CUB benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3087
- Improve readability of DispatchSelectIf parameterization by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3092
- [cudax] Simplify implementation of device attributes by @davebayer in https://github.com/NVIDIA/cccl/pull/3084
- suppress `-Werror=empty-body` in `char_traits` implementation by @ericniebler in https://github.com/NVIDIA/cccl/pull/3098
- help older clang and gcc to disambiguate `basic_any<__ireference<I>>` and `basic_any<I&>` bases by @ericniebler in https://github.com/NVIDIA/cccl/pull/3102
- [PERF] cuda.parallel: Cache intermediate results to improve performance of `cudax.reduce_into` by @shwina in https://github.com/NVIDIA/cccl/pull/3001
- [Improvement] cuda.parallel: Don't require value_type when constructing iterators by @shwina in https://github.com/NVIDIA/cccl/pull/3105
- Fix zip and permutation iterator EBO on MSVC by @wmaxey in https://github.com/NVIDIA/cccl/pull/3106
- Avoid signed unsigned warnings in `annotated_ptr` test by @miscco in https://github.com/NVIDIA/cccl/pull/3076
- Changes `DispatchScan[ByKey]` documentation to advise using unsigned offset types by @elstehle in https://github.com/NVIDIA/cccl/pull/3111
- [STF] reduce access mode by @caugonnet in https://github.com/NVIDIA/cccl/pull/2830
- add support for comparing type-erased wrappers to non-type-erased objects by @ericniebler in https://github.com/NVIDIA/cccl/pull/3100
- backport `byte` by @davebayer in https://github.com/NVIDIA/cccl/pull/3091
- Add bound checks for each dimension of `mdspan` by @fbusato in https://github.com/NVIDIA/cccl/pull/3065
- Move some CUB tunings to dedicated headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3096
- [CUDAX] Add combine API to kernel_config and allow adding default configuration to kernel functors by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3082
- Extend tuning guide by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3117
- Densen sm90 policy by @gonidelis in https://github.com/NVIDIA/cccl/pull/3121
- Fix a typo in the documentation of cub::DeviceReduce::Reduce by @caugonnet in https://github.com/NVIDIA/cccl/pull/3123
- Cleanup select if tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3120
- Modularize `<cuda/std/cstddef>` by @davebayer in https://github.com/NVIDIA/cccl/pull/3119
- Use programmatic dependent launch in CUB merge sort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3114
- Refactor selecting default tuning for select_if by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3124
- Refactor SM90 radix_sort tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3125
- [STF] Improved sparse CG example and rename scalar to scalar_view by @caugonnet in https://github.com/NVIDIA/cccl/pull/3112
- [CUDAX] Fix the other copy of vector_add after migration to use configs in launch by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3129
- Refactor cub histogram tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3128
- Refactor RLE tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3127
- Make PDL available with CTK 12.0 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3136
- Refactor reduce_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3137
- Refactor scan tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3138
- Fix analyze.py bug by @gonidelis in https://github.com/NVIDIA/cccl/pull/3067
- Refactor scan_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3139
- Refactor three_way_partition tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3140
- Clarify passing ValueT to scan_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3143
- Move remaining CUB policy hubs to tuning headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3141
- [Internal Cleanup] pre-commit ruff (excluding docs/tools, libcudacxx/test) by @rwgk in https://github.com/NVIDIA/cccl/pull/3110
- Add Python codeowners by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3150
- make `basic_any` compile for device by stubbing out the virtual tables by @ericniebler in https://github.com/NVIDIA/cccl/pull/3109
- Refactoring unique by key by @gonidelis in https://github.com/NVIDIA/cccl/pull/3145
- Add missing header in bench scan exclusive base header by @gonidelis in https://github.com/NVIDIA/cccl/pull/3157
- Use synchronize_optional for device-to-device copy in thrust::copy() by @davidwendt in https://github.com/NVIDIA/cccl/pull/3149
- [Internal Cleanup] pre-commit ruff libcudacxx/tests by @rwgk in https://github.com/NVIDIA/cccl/pull/3152
- Clarify unknown tuning axis are ignored by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3156
- address portability issue in `basic_any` with older nvcc versions by @ericniebler in https://github.com/NVIDIA/cccl/pull/3160
- Add limited H100 testing for CUB by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3151
- Unify policy hub handling and update documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3142
- make the `_CCCL_REQUIRES_EXPR` macro more robust by @ericniebler in https://github.com/NVIDIA/cccl/pull/3164
- [Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base by @shwina in https://github.com/NVIDIA/cccl/pull/3118
- the streams created by `cudax::stream` should not synchronize with the null stream by @ericniebler in https://github.com/NVIDIA/cccl/pull/3167
- [STF] Implement CUDASTF_DOT_TIMING for the host_launch construct by @caugonnet in https://github.com/NVIDIA/cccl/pull/3170
- Add support for sm101 and sm101a to NV_TARGET by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3166
- implement C++23 `byteswap` by @davebayer in https://github.com/NVIDIA/cccl/pull/3093
- Unifies large problem test helper infrastructure by @elstehle in https://github.com/NVIDIA/cccl/pull/3171
- Deprecate C++11 and C++14 for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3173
- Implement `abs` and `div` from `cstdlib` by @davebayer in https://github.com/NVIDIA/cccl/pull/3153
- Fix missing radix sort policies by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3174
- Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators by @elstehle in https://github.com/NVIDIA/cccl/pull/3148
- Extend tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3179
- Add codespell pre-commit hook, fix typos in CCCL by @bdice in https://github.com/NVIDIA/cccl/pull/3168
- Fix parameter space for TUNE_LOAD in scan benchmark by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3176
- Fix various old compiler version checks by @davebayer in https://github.com/NVIDIA/cccl/pull/3178
- Implement ADL-proof `std::projected` from C++26 by @davebayer in https://github.com/NVIDIA/cccl/pull/3175
- Fix pre-commit config for codespell and remaining typos by @shwina in https://github.com/NVIDIA/cccl/pull/3182
- Massive cleanup of our config by @miscco in https://github.com/NVIDIA/cccl/pull/3155
- Fix UB in atomics with automatic storage by @wmaxey in https://github.com/NVIDIA/cccl/pull/2586
- Refactor the source code layout for `cuda.parallel` by @shwina in https://github.com/NVIDIA/cccl/pull/3177
- new type-erased memory resources by @ericniebler in https://github.com/NVIDIA/cccl/pull/2824
- rename `_LIBCUDACXX_DECLSPEC_EMPTY_BASES` to `_CCCL_DECLSPEC_EMPTY_BASES` by @ericniebler in https://github.com/NVIDIA/cccl/pull/3186
- Document address stability of `thrust::transform` by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3181
- turn off cuda version check for clangd by @ericniebler in https://github.com/NVIDIA/cccl/pull/3194
- [STF] jacobi example based on parallel_for by @caugonnet in https://github.com/NVIDIA/cccl/pull/3187
- Fixes pre-CTK 11.5 diag suppression issues by @elstehle in https://github.com/NVIDIA/cccl/pull/3189
- Prefer c2h::type_name over c2h::demangle by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3195
- Fix memcpy_async* tests by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3197
- Add type annotations and mypy checks for `cuda.parallel` by @shwina in https://github.com/NVIDIA/cccl/pull/3180
- Fix rendering of cuda.parallel docs by @shwina in https://github.com/NVIDIA/cccl/pull/3192
- Enable PDL for DeviceMergeSortBlockSortKernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3199
- Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` by @elstehle in https://github.com/NVIDIA/cccl/pull/2647
- Fixes for Python 3.7 docs environment by @shwina in https://github.com/NVIDIA/cccl/pull/3206
- Adds support for large number of items to `DeviceTransform` by @elstehle in https://github.com/NVIDIA/cccl/pull/3172
- cp_async_bulk: Fix test by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3198
- cudax fixes for msvc 14.41 by @ericniebler in https://github.com/NVIDIA/cccl/pull/3200
- avoid instantiating class templates in `is_same` implementation when possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/3203
- Fix: make launchers a CUB detail; make kernel source functions hidden. by @griwes in https://github.com/NVIDIA/cccl/pull/3209
- help the ranges concepts recognize standard contiguous iterators in c++14/17 by @ericniebler in https://github.com/NVIDIA/cccl/pull/3202
- unify macros and cmake options that control the suppression of deprecation warnings by @ericniebler in https://github.com/NVIDIA/cccl/pull/3220
- Fix thread-reduce performance regression by @fbusato in https://github.com/NVIDIA/cccl/pull/3225
- cuda.parallel: In-memory caching of `cuda.parallel` build objects by @shwina in https://github.com/NVIDIA/cccl/pull/3216
- clean up the `cuda::std::span` implementation with minimal c++14 range support by @ericniebler in https://github.com/NVIDIA/cccl/pull/3211
- use generalized concepts portability macros to simplify the `range` concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/3217
- Use Ruff to sort imports by @shwina in https://github.com/NVIDIA/cccl/pull/3230
- Fix scan / sm90 perf regression by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3236
- [STF] Logical token by @caugonnet in https://github.com/NVIDIA/cccl/pull/3196
- Fix ReduceByKey tuning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3240
- Fix RLE tuning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3239
- cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) by @shwina in https://github.com/NVIDIA/cccl/pull/3233
- Backport to 2.8: Make CUB NVRTC commandline arguments come from a cmake template (#3292) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3322
- Backport to 2.8: Deprecate GridBarrier and GridBarrierLifetime (#3258) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3288
- Backport to 2.8: Deprecate cub::Swap (#3333) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3350
- Backport to 2.8: Deprecate Thrust's cpp_compatibility.h macros (#3299) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3321
- Backport to 2.8: Deprecate cub::IterateThreadStore (#3337) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3351
- Backport to 2.8: Deprecate thrust::null_type (#3367) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3373
- Backport to 2.8: Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3389
- Backport to 2.8: Deprecate thrust::optional (#3307) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3393
- Backport to 2.8: Deprecate thrust::numeric_limits (#3366) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3392
- Backport to 2.8: Redefine and deprecate thrust::remove_cvref (#3394) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3420
- Backport to 2.8: Replace and deprecate thrust::cuda_cub::terminate (#3421) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3425
- [BACKPORT]: Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) by @miscco in https://github.com/NVIDIA/cccl/pull/3447
- Backport to 2.8: Deprecate thrust::async (#3324) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3388
- [BACKPORT]: Moves agents to detail:: namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3454
- Backport to 2.8: Deprecate a few CUB macros (#3456) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3463
- [BACKPORT]: Fix assert definition for NVHPC due to constexpr issues (#3418) by @miscco in https://github.com/NVIDIA/cccl/pull/3448
- Backport to 2.8: Deprecate cub::DeviceSpmv (#3320) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3374
- Backport to 2.8: some FP8 support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3479
- Backport to 2.8: Deprecate block/warp algo specializations (#3455) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3481
- Backport to 2.8: Refactor `limits` and `climits` (#3221) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3488
- Backport to 2.8: Fix typo in limits (#3491) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3498
- Backport to 2.8: Update upload-pages-artifact to v3 (#3423) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3513
- Backport to 2.8: Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3490
- Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. by @rwgk in https://github.com/NVIDIA/cccl/pull/3536
- [Backport 2.8] work around msvc bug exposed by `__type_index` in `type_list.h` (#3487) by @wmaxey in https://github.com/NVIDIA/cccl/pull/3537
- [Backport] #3572 to the 2.8.x branch. by @miscco in https://github.com/NVIDIA/cccl/pull/3605
- Backport to 2.8: Specialize `cuda::std::numeric_limits` for FP8 types (#3478) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3492
- Backport to 2.8: Deprecate thrust universal iterator categories (#3461) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3471
- Backport to 2.8: Deprecate and replace thrust::cuda_cub iterators (#3422) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3510
- Backport to 2.8: Deprecate thrust macros from type_deduction.h (#3501) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3511
- Backport to 2.8: Deprecate macros from cuda/detail/core/util.h (#3504) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3520
- [BACKPORT]: Try to always include the definition of barrier_native_handle when needed (#3556) by @miscco in https://github.com/NVIDIA/cccl/pull/3569
- Backport to 2.8: Deprecates tuning policy hubs by @elstehle in https://github.com/NVIDIA/cccl/pull/3531
- [Backport 2.8] Add extended data type macro identification by @fbusato in https://github.com/NVIDIA/cccl/pull/3586
- Backport to 2.8: Deprecate thrust logical meta functions (#3538) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3567
- Backport to 2.8: Refactor (#3561) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3566
- Backport to 2.8: Tune cub::DeviceTransform for Blackwell (#3543) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3565
- Backport to 2.8: Deprecate and replace `CUB_IS_INT128_ENABLED` (#3427) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3629
- Backport to 2.8: Deprecate CUB iterators existing in Thrust (#3304) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3534
- Backport to 2.8: Deprecate thrust event, future and more (#3457) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3512
- Backport to 2.8: PTX support for Blackwell by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3624
- Backport to 2.8: Support FP16 traits on CTK 12.0 (#3535) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3625
- [Backport 2.8] Deprecate `AgentSegmentFixupPolicy` by @fbusato in https://github.com/NVIDIA/cccl/pull/3638
- Backport to 2.8: PTX: fix cp.async.bulk.tensor and mbarrier.arrive (#3628) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3630
- Backport to 2.8: Suppress execution checks for vocabulary types (#3578) by @miscco in https://github.com/NVIDIA/cccl/pull/3599
- [BACKPORT]: Try and get rapids green (#3503) by @miscco in https://github.com/NVIDIA/cccl/pull/3598
- Backport to 2.8: Internalize triple_chevron (#3648) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3650
- [BACKPORT]: Ensure that headers in `<cuda/*>` can be built with a C++ only compiler (#3472) by @miscco in https://github.com/NVIDIA/cccl/pull/3651
- Backport to 2.8: `__builtin_isfinite` is only available above nvrtc 12.2 by @leofang in https://github.com/NVIDIA/cccl/pull/3653
- [Backport 2.8.x] Backport #3575 deprecating old ABIs in libcudacxx by @wmaxey in https://github.com/NVIDIA/cccl/pull/3660
- [Backport 2.8.x] Backport [nv/target] Add sm_120 macros. (#3550) by @wmaxey in https://github.com/NVIDIA/cccl/pull/3661
- Backport to 2.8: Add b200 policies for device.select.if,flagged,unique (#3545) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3667
- Backport to 2.8: Add b200 tunings for radix_sort.pairs (#3626) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3668
- [Backport branch/2.8.x] PTX: mbarrier.{test,try}_wait: Fix return value by @github-actions in https://github.com/NVIDIA/cccl/pull/3672
- Backport to 2.8: Add b200 tunings for radix_sort.keys (#3611) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3655
- [Backport branch/2.8.x] Fix issues with nvrtc compilation by @github-actions in https://github.com/NVIDIA/cccl/pull/3674
- [Backport branch/2.8.x] Add b200 policies for cub.select.unique_by_key by @github-actions in https://github.com/NVIDIA/cccl/pull/3673
- [Backport branch/2.8.x] Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @github-actions in https://github.com/NVIDIA/cccl/pull/3658
- Backport to 2.8: Deprecate `cub::AliasTemporaries` (#3679) and `cub::PolicyWrapper` (#3681) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3690
- [Backport branch/2.8.x] Internalize cub::KernelConfig by @github-actions in https://github.com/NVIDIA/cccl/pull/3688
- Backport to 2.8: Fix transform_iterator (#3652) and Deprecate thrust::identity (#3649) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3693
- Backport to 2.8: Add b200 policies for cub.device.runlengthencode.encode,non_trivialruns (#3546) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3704
- [BACKPORT] Remove cugraph-ops from RAPIDS 25.04 builds. (#3675) by @miscco in https://github.com/NVIDIA/cccl/pull/3696
- Backport to 2.8: Make thrust iterators work with NVRTC (#3676) and replace CUB iterators by Thrust ones (#3480) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3697
- [Backport branch/2.8.x] Deprecate `cub::RegBoundScaling` and `cub::MemBoundScaling` by @github-actions in https://github.com/NVIDIA/cccl/pull/3706
- [backport 2.8] Deprecate and replace `Int2Type` by @fbusato in https://github.com/NVIDIA/cccl/pull/3705
- [Backport branch/2.8.x] Add b200 policies for partition.three_way by @github-actions in https://github.com/NVIDIA/cccl/pull/3710
- Backport to 2.8: Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE (#3689) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3703
- [Backport branch/2.8.x] Add b200 tunings for scan.exclusive.by_key by @github-actions in https://github.com/NVIDIA/cccl/pull/3719
- Backport to 2.8: B200 reduce.by_key tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3726
- Backport to 2.8: B200 tunings for histogram by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3728
- Backport to 2.8: B200 reduce tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3735
- Backport to 2.8: Add b200 policies for cub.device.partition.flagged,if (#3617) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3736
- Backport to 2.8: Add b200 tunings for scan.exclusive.sum (#3559) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3738
- [Backport branch/2.8.x] fix ::cuda::discard_memory by @github-actions in https://github.com/NVIDIA/cccl/pull/3737
- [Backport branch/2.8.x] Fix cub trait deprecations by @github-actions in https://github.com/NVIDIA/cccl/pull/3744
- [Backport branch/2.8.x] [Automation] Add release workflow for tagging and testing new RCs by @github-actions in https://github.com/NVIDIA/cccl/pull/3754
- Suppress deprecatings on logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3795
- Revert back to cub::Traits::CATEGORY|PRIMITIVE by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3866
- [2.8.x] Disable `[[no_unique_address]]` for MSVC (#3757) by @miscco in https://github.com/NVIDIA/cccl/pull/3869
- [Backport branch/2.8.x] do not try to use clang-19's support for c++26 pack indexing by @github-actions in https://github.com/NVIDIA/cccl/pull/3903
New Contributors
- @Artem-B made their first contribution in https://github.com/NVIDIA/cccl/pull/2420
- @Kh4ster made their first contribution in https://github.com/NVIDIA/cccl/pull/2470
- @andrewcorrigan made their first contribution in https://github.com/NVIDIA/cccl/pull/2482
- @jjacobelli made their first contribution in https://github.com/NVIDIA/cccl/pull/2604
- @andralex made their first contribution in https://github.com/NVIDIA/cccl/pull/2622
- @Jacobfaib made their first contribution in https://github.com/NVIDIA/cccl/pull/2681
- @karthikeyann made their first contribution in https://github.com/NVIDIA/cccl/pull/2769
- @davidwendt made their first contribution in https://github.com/NVIDIA/cccl/pull/2857
- @hzhangxyz made their first contribution in https://github.com/NVIDIA/cccl/pull/2992
- @j3soon made their first contribution in https://github.com/NVIDIA/cccl/pull/3002
- @esoha-nvidia made their first contribution in https://github.com/NVIDIA/cccl/pull/2736
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.7.0...v2.8.0
Published by wmaxey 10 months ago
cccl - CCCL 2.7.0
What’s New
C++
Thrust / CUB
- Inclusive scan now supports initial value https://github.com/NVIDIA/cccl/pull/1940
- Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements https://github.com/NVIDIA/cccl/pull/2171
- New `cub::DeviceMerge::MergeKeys` and `cub::DeviceMerge::MergePairs` algorithms https://github.com/NVIDIA/cccl/pull/1817
- New `thrust::tabulate_output_iterator` fancy iterator https://github.com/NVIDIA/cccl/pull/2282
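To illustrate what "inclusive scan with an initial value" means, here is a minimal pure-Python reference model. It is only a sketch of the semantics, not the CUB or Thrust API: the function name `inclusive_scan_init` is hypothetical, and the example assumes the initial value is folded in before the first element, so the output has the same length as the input.

```python
from operator import add


def inclusive_scan_init(values, op, init):
    """Reference model (hypothetical helper, not a CCCL function).

    out[i] = op(... op(op(init, values[0]), values[1]) ..., values[i])
    """
    out = []
    acc = init
    for v in values:
        acc = op(acc, v)  # fold the current element into the running result
        out.append(acc)   # inclusive: element i is part of out[i]
    return out


result = inclusive_scan_init([1, 2, 3, 4], add, 10)
print(result)  # [11, 13, 16, 20]
```

Without an initial value, the same input would scan to `[1, 3, 6, 10]`; the initial value shifts every partial result by the seed.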
Libcudacxx
- Enable assertions on host and device depending on the user's choice
- C++26 inplace_vector has been implemented and backported to C++14
- Improved support for extended floating point types `__half` and `__nv_bfloat16` both for cmath functions and complex
- `cuda::std::tuple` is now trivially copyable if the stored types are trivially copyable
- Reworked our atomics implementation
- Improved `<cuda/std/bit>` conformance
- Implemented `<cuda/std/bitset>` and backported to C++14
- Implemented and backported C++20 `bit_cast`. It is available in all standard modes and constexpr with compiler support
- Various backports and constexpr improvements (`bool_constant`, `cuda::std::max`)
- Moved the experimental memory resources from `<cuda/memory_resource>` into `<cuda/experimental/memory_resource.cuh>`
Python
cuda.cooperative
The CCCL best practices for making CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.
Block and warp-level cooperative algorithms are now available in Python https://github.com/NVIDIA/cccl/pull/1973. Experimental versions of reduce, scan, merge and radix sort are available in numba.cuda kernels.
cuda.parallel
Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.
What's Changed
- Fix documentation generation for thrust::pair by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1976
- Correct typo in a launch configuration header name by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1972
- Fix thrust::sort for large problem sizes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1952
- Avoid SIGPIPE when truncating verbose output in CI scripts. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1971
- Clarify compiler support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1970
- Experimental Python cooperative algorithms by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1973
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/NVIDIA/cccl/pull/1928
- Guard against an overflow in sort tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1980
- Remove obsolete Thrust function traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1962
- Python: Add version string & wheel build command by @leofang in https://github.com/NVIDIA/cccl/pull/1985
- Add device inclusive scan with init_value by @gonidelis in https://github.com/NVIDIA/cccl/pull/1845
- Fix BWUtil report on early exit by @gonidelis in https://github.com/NVIDIA/cccl/pull/1994
- Use libcu++ void_t everywhere by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1977
- Drop zipped_binary_op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1988
- Clarify PtxVersion and SmVersion by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2004
- More simplifications for CUB util_device by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1948
- fix some typos in `<cuda/stream_ref>` by @ericniebler in https://github.com/NVIDIA/cccl/pull/2003
- Add CI slack notifications. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1961
- Allow nightly workflow to be manually invoked. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2007
- Need to use a different approach to reuse secrets in reusable workflows vs. actions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2008
- Enable RAPIDS builds for manually dispatched workflows. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2009
- clean up complex.inl by @ZelboK in https://github.com/NVIDIA/cccl/pull/1655
- Add github token to nightly workflow-results action. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2012
- Remove obsolete build system glue from the Thrust/CUB submodule structure. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2016
- Benchmark thrust::copy with non-trivially relocatable type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1989
- Make bool_constant available in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1997
- Spell value initialization where used in thrust vectors by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1990
- Do not redefine `__ELF__` macro by @miscco in https://github.com/NVIDIA/cccl/pull/2018
- Port `thrust::merge[_by_key]` to CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1817
- Simplify some pointer traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2020
- Simplify test data setup by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2023
- Add tests to ensure that we properly propagate common_type for complex types by @miscco in https://github.com/NVIDIA/cccl/pull/2025
- Update Thrust CMake README to use CCCL repo. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2026
- Include container toolkit in manual prereqs by @bryevdv in https://github.com/NVIDIA/cccl/pull/2064
- Avoid ADL issues with `thrust::distance` by @miscco in https://github.com/NVIDIA/cccl/pull/2053
- Simplify thrust::detail::wrapped_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2019
- Add a test for Thrust scan with non-commutative op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2024
- Update memory_resource docs by @miscco in https://github.com/NVIDIA/cccl/pull/1883
- Temporarily switch nightly H100 CI to build-only. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2060
- Do not rely on conversions between float and extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2046
- experimental wrapper types for `cudaEvent_t` that provide a modern C++ interface. by @ericniebler in https://github.com/NVIDIA/cccl/pull/2017
- [CUDAX] Add a dummy device struct for now by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2066
- Allow (somewhat) different input value types for merge by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2075
- Avoid `::result_type` for partial sums in TBB reduce_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1998
- Fix formatting by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2090
- Rename and refactor transform_iterator_base by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1987
- Benchmark analysis: Print all top rows when asked for by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2089
- Makes user-provided functors in our examples use `__device__` instead of `CUB_RUNTIME_FUNCTION` by @elstehle in https://github.com/NVIDIA/cccl/pull/2088
- Separate `cuda/experimental` when sorting includes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2094
- add support to `cudax::device` for querying a device's attributes by @ericniebler in https://github.com/NVIDIA/cccl/pull/2084
- [CUDAX] Add experimental owning abstraction for cudaStream_t by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2093
- Do not query NVRTC for cuda runtime header by @miscco in https://github.com/NVIDIA/cccl/pull/2102
- Cleanup CUB block/thread load and exchange by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1946
- Improve binary function objects and replace thrust implementation by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/1872
- Replace `_LIBCUDACXX_CPO_ACCESSIBILITY` with `_CCCL_GLOBAL_VARIABLE` by @miscco in https://github.com/NVIDIA/cccl/pull/1881
- Add script to update RAPIDS version. by @bdice in https://github.com/NVIDIA/cccl/pull/2082
- Update bad links by @bryevdv in https://github.com/NVIDIA/cccl/pull/2080
- Fix line break issues that break doxygen code examples by @miscco in https://github.com/NVIDIA/cccl/pull/2103
- Add internal wrapper for cuda driver APIs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2070
- Use `common_type` for complex `pow` by @miscco in https://github.com/NVIDIA/cccl/pull/1800
- [CUDAX] rename `device` to `device_ref`, add immovable `device` as a place to cache properties by @ericniebler in https://github.com/NVIDIA/cccl/pull/2110
- Use the float flavors of the cmath functions in the extended floating point fallbacks by @miscco in https://github.com/NVIDIA/cccl/pull/2106
- [PoC]: Implement `cuda::experimental::uninitialized_buffer` by @miscco in https://github.com/NVIDIA/cccl/pull/1831
- Ensure that we avoid ABI Version conflicts by @miscco in https://github.com/NVIDIA/cccl/pull/2137
- Ensure that `cuda_memory_resource` allocates memory on the proper device by @miscco in https://github.com/NVIDIA/cccl/pull/2073
- Clarify compatibility wrt. template specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2138
- Implement a `cudax::get_stream` CPO by @miscco in https://github.com/NVIDIA/cccl/pull/2135
- Make `cuda::std::tuple` trivially copyable by @miscco in https://github.com/NVIDIA/cccl/pull/2127
- Fix missing copy of docs artifacts by @miscco in https://github.com/NVIDIA/cccl/pull/2162
- Fix g++-14 warning on uninitialized copying by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2157
- Fix flakey heterogeneous tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/2085
- Fix multiple definition of InclusiveScanKernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2169
- [CUDAX] Add a global constexpr `cudax::devices` range for all devices in the system by @ericniebler in https://github.com/NVIDIA/cccl/pull/2100
- fix use of `cudaStream_t` as if it were a stream wrapper by @ericniebler in https://github.com/NVIDIA/cccl/pull/2190
- Fix uninitialized_buffer self assignment by @miscco in https://github.com/NVIDIA/cccl/pull/2170
- Fix trivial_copy_device_to_device execution space by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2164
- Clarify libcu++ use by non-CUDA compilers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1969
- Warn when using C++14 in CUB and Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2166
- Fix the clang-format path in the devcontainers by @miscco in https://github.com/NVIDIA/cccl/pull/2194
- Mount a temporary build volume for CCCL projects if WSL is detected by @wmaxey in https://github.com/NVIDIA/cccl/pull/2035
- 2118 [CUDAX] Change the RAII device swapper to use driver API and add it in places where it was missing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2192
- Fix singular vs plural typo in thread scope documentation. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2198
- [CUDAX] fixing some minor issues with device attribute queries by @ericniebler in https://github.com/NVIDIA/cccl/pull/2183
- Integrate Python docs by @bryevdv in https://github.com/NVIDIA/cccl/pull/2196
- [FEA] Atomics codegen refactor by @wmaxey in https://github.com/NVIDIA/cccl/pull/1993
- [CUDAX] add __launch_transform to transform arguments to cudax::launch prior to launching the kernel by @ericniebler in https://github.com/NVIDIA/cccl/pull/2202
- Cleanup common testing headers and correct asserts in launch testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2204
- [CUDAX] Add an API to get device_ref from stream and add comparison operator to device_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2203
- Update devcontainer docs for WSL by @jrhemstad in https://github.com/NVIDIA/cccl/pull/2200
- add cudax::distribute<threadsPrBlock>(numElements) by @ericniebler in https://github.com/NVIDIA/cccl/pull/2210
- Rework mdspan concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/2213
- Un-doc functions taking debug_synchronous by @bryevdv in https://github.com/NVIDIA/cccl/pull/2209
- CUDA vector_add sample project by @ericniebler in https://github.com/NVIDIA/cccl/pull/2160
- avoid constraint recursion in the resource concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/2215
- fix cuda_memory_resource test for properly aligned memory by @ericniebler in https://github.com/NVIDIA/cccl/pull/2227
- Fix including <complex> when bad CUDA bfloat/half macros are used. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2226
- Add license & fix long_description in setup.py by @leofang in https://github.com/NVIDIA/cccl/pull/2211
- Extract reduction kernels into NVRTC-compilable header by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2231
- Implement <cuda/std/bitset> by @griwes in https://github.com/NVIDIA/cccl/pull/1496
- Refactor Thrust placeholder operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2233
- Add missing annotations for deprecated debug_sync APIs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2212
- Test thrust headers for disabled half/bf16 support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2219
- Make cuda::std::max constexpr in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2107
- Fix ForEachCopyN for non-contiguous iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2220
- Configure CUB/Thrust for C++17 by default by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2217
- Allow installing components when downstream by @stephenswat in https://github.com/NVIDIA/cccl/pull/2096
- Rename the memory resources to drop the superfluous prefix cuda_ by @miscco in https://github.com/NVIDIA/cccl/pull/2243
- Fix and simplify by @wmaxey in https://github.com/NVIDIA/cccl/pull/2197
- Proclaim pair and tuple trivially relocatable by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2010
- Make cuda::std::min constexpr in C++11 by @miscco in https://github.com/NVIDIA/cccl/pull/2249
- Add CCCL_DISABLE_NVTX macro by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2173
- Workaround GCC 13 issue with empty histogram decoder op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2252
- Refactor Thrust's logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2260
- Fix use of doxygen \file command by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2259
- Add tests for transform_iterator's reference type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2221
- Small tuning script output improvements by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2262
- Fix Thrust::vector ctor selection for int,int by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2261
- Adds support for large number of items to DeviceScan by @elstehle in https://github.com/NVIDIA/cccl/pull/2171
- Use and test radix sort for int128, half and bfloat16 in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2168
- Implement C API for device reduction by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2256
- Move cooperative module by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2269
- Move compiler version macros into libcu++ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2250
- Introduce cuda.parallel module by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2276
- Adds thrust::tabulate_output_iterator by @elstehle in https://github.com/NVIDIA/cccl/pull/2282
- Drop macos string that lit cannot parse properly by @miscco in https://github.com/NVIDIA/cccl/pull/2283
- Flatten forwarding headers by @miscco in https://github.com/NVIDIA/cccl/pull/2284
- 2270 static compute capabilities queries by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2271
- Fix read of dangling reference in thrust placeholders by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2290
- Implement any_resource, an owning wrapper around a memory resource by @ericniebler in https://github.com/NVIDIA/cccl/pull/2266
- Fixes formatting of tabulate_output_iterator.inl by @elstehle in https://github.com/NVIDIA/cccl/pull/2298
- use NV_IF_TARGET to conditionally compile CUDAX tests by @ericniebler in https://github.com/NVIDIA/cccl/pull/2297
- Make for_each compatible with NVRTC by @wmaxey in https://github.com/NVIDIA/cccl/pull/2288
- refactor cmake so more cudax samples can be easily added by @ericniebler in https://github.com/NVIDIA/cccl/pull/2296
- Use the in, out, and inout parameter decorators from cudax::launch by @ericniebler in https://github.com/NVIDIA/cccl/pull/2294
- Implement std::bit_cast by @miscco in https://github.com/NVIDIA/cccl/pull/2258
- Cleanup the <cuda/std/bit> header by @miscco in https://github.com/NVIDIA/cccl/pull/2299
- change cudax::uninitialized_buffer to own its memory resource with cudax::any_resource by @ericniebler in https://github.com/NVIDIA/cccl/pull/2293
- Documentation typos by @fbusato in https://github.com/NVIDIA/cccl/pull/2302
- Add thrust::inclusive_scan with init_value support by @gonidelis in https://github.com/NVIDIA/cccl/pull/1940
- Assure placeholder expressions are semi-regular by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2305
- Add documentation for any_resource by @miscco in https://github.com/NVIDIA/cccl/pull/2309
- Implement P0843 inplace_vector by @miscco in https://github.com/NVIDIA/cccl/pull/1936
- Cleanup __config and unify most visibility macros by @miscco in https://github.com/NVIDIA/cccl/pull/2285
- Add a fast, low memory "limited" mode to CUB testing. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2317
- [CUDAX] Add event_ref::is_done() and update event tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2304
- Minor cleanup to memory resources by @miscco in https://github.com/NVIDIA/cccl/pull/2308
- Drop ICC from the cudax support matrix by @miscco in https://github.com/NVIDIA/cccl/pull/2330
- Do not hardcode Thrust's host system to cpp. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2332
- [CUDAX] Add compute_capability device attribute and handle arch_traits for future architectures by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2328
- Disable exec checks on ranges CPOs by @miscco in https://github.com/NVIDIA/cccl/pull/2331
- Enable exceptions by default by @miscco in https://github.com/NVIDIA/cccl/pull/2329
- Make the thrust dispatch mechanisms configurable by @miscco in https://github.com/NVIDIA/cccl/pull/2310
- [CUDAX] give all the cudax headers the .cuh extension by @ericniebler in https://github.com/NVIDIA/cccl/pull/2340
- Compiler version improvements by @fbusato in https://github.com/NVIDIA/cccl/pull/2316
- Fix hardcoding __THRUST_HOST_SYSTEM_NAMESPACE to cpp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2341
- Improvements to the Cuda Core C library infrastructure by @miscco in https://github.com/NVIDIA/cccl/pull/2336
- Fix bug remaining on thrust::inclusive_scan with init value with CDP by @gonidelis in https://github.com/NVIDIA/cccl/pull/2346
- [CUDAX] make uninitialized_buffer usable with launch by @ericniebler in https://github.com/NVIDIA/cccl/pull/2342
- Test and fix failing nightly libcudacxx + CUB jobs by @miscco in https://github.com/NVIDIA/cccl/pull/1847
- Update Memory Model docs for HMM by @gonzalobg in https://github.com/NVIDIA/cccl/pull/2272
- Harden thrust algorithms against evil iterators that overload operator, by @miscco in https://github.com/NVIDIA/cccl/pull/2349
- Avoid circular concept definition with memory resources by @miscco in https://github.com/NVIDIA/cccl/pull/2351
- add IWYU export pragma on config headers by @ericniebler in https://github.com/NVIDIA/cccl/pull/2352
- Add cuda_parallel to CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2338
- [CUDAX] Branch out an experimental version of stream_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2343
- Improve visibility macros for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2337
- Add missing cuKernelGetFunction call to reduce by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2355
- Move invalid_stream to the proper file by @miscco in https://github.com/NVIDIA/cccl/pull/2360
- fix the cudax vector_add sample by @ericniebler in https://github.com/NVIDIA/cccl/pull/2372
- Add -Wmissing-field-initializers to cudax by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2373
- Update CCCL version to 2.7.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/2364
- Backport several fixes into 2.7.x. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2579
- [BACKPORT]: Rework head_flags so that we do not rely on the tuple being unevaluated (#2619) by @miscco in https://github.com/NVIDIA/cccl/pull/2620
- [Backport] Fix cluster launch error in branch/2.7.x by @wmaxey in https://github.com/NVIDIA/cccl/pull/2866
- Disable execution checks for tuple (#2780) by @wmaxey in https://github.com/NVIDIA/cccl/pull/2867
- [BACKPORT]: Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes (#3066) by @miscco in https://github.com/NVIDIA/cccl/pull/3068
- [Backport] Fix EBO in zip_iterator on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3107
New Contributors
- @bryevdv made their first contribution in https://github.com/NVIDIA/cccl/pull/2064
- @stephenswat made their first contribution in https://github.com/NVIDIA/cccl/pull/2096
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.6.1...v2.7.0
- C++
Published by wmaxey about 1 year ago
cccl - CCCL 2.6.1
This release includes backports for PRs #2332 and #2341. Please see release 2.6.0 for the full list of changes included in the release.
What's Changed
- Backport PR #2332 and #2341 by @wmaxey in https://github.com/NVIDIA/cccl/pull/2368
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.6.0...v2.6.1
- C++
Published by wmaxey over 1 year ago
cccl - CCCL 2.6.0
What's Changed
- Restrict active histogram channels to channel count by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1796
- Cleanup internal thrust CUDA utils by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1802
- Use variadic interfaces in agent launcher by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1804
- Use nullptr over NULL by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1805
- Rework the documentation to be built with sphinx by @miscco in https://github.com/NVIDIA/cccl/pull/1753
- Let Catch2 report cudaError descriptions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1808
- Check size-querying CUB API invocation in tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1809
- Update docs link by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1812
- Add missing inline specifiers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1813
- Upgrade actions that use node16 to versions that use node20 by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1779
- Document NVTX range behavior during graph capture by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1814
- Clean up AliasTemporaries by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1815
- Drop removed clang-tidy option by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1810
- Exclude docs from cccl infra changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1821
- Clean up thrust merge unit tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1819
- Fix atomic performance regressions by avoiding use of memcpy with natively supported atomic types. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1801
- Clean up merge_by_key and merge_key_value tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1824
- Restore the old thrust api documentation in rst by @miscco in https://github.com/NVIDIA/cccl/pull/1818
- Drop all internal implementations of exceptions by @miscco in https://github.com/NVIDIA/cccl/pull/1806
- Fix span for non-ranges by @miscco in https://github.com/NVIDIA/cccl/pull/1836
- Cleanup thrust test special types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1837
- Add inclusive_scan with initial value support (warp/block) by @gonidelis in https://github.com/NVIDIA/cccl/pull/1749
- Fix loading from incorrect URI on 404 page. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1843
- Port CUB temporary storage layout test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1835
- Port CUB thread operators test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1834
- Adds ceil_div by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1825
- Split workflow into multiple dispatch groups to avoid skipped jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1797
- Fix broken CUB doc build and add 404 page to Sphinx. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1846
- Port CUB thread sort test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1838
- Cleanup CUB temporary storage layout test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1848
- Propagate error when docs build fails, add docs build to CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1852
- Cleanup CUB util_macro.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1849
- Provide libcu++ transparent functors in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1851
- Roll back upload-pages-artifact to v2. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1861
- Port CUB iterator test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1822
- Symbol visibility is now invariant in regards to __cuda_std__ definition by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1832
- Add dimensions description functionality to CUDA Experimental library by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1743
- Document Asynchronous Operations by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1781
- Remove cpp11_required.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1860
- Add workflow to build RAPIDS from source with local CCCL by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1667
- Refactor CI matrix. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1844
- Adds tests for large number of items in cub::DeviceScan by @elstehle in https://github.com/NVIDIA/cccl/pull/1830
- Make CUB test launch wrappers functor instances by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1850
- Improve CUB test overview docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1867
- Skip devcontainer validation jobs if not needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1853
- Improve CUB device-scope documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1862
- Make integer sequence et al. available in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1859
- Minimize template instantiations in CUB thread_load by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1857
- Create major version 2.6.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1880
- Drop facilities deprecated in CUB 1.x by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1868
- Make thrust::sort use radix sort with more comparators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1884
- Make cuda::ptx::*_multicast pass on all architectures by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1874
- Replace typedef by alias declarations in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1885
- Remove legacy benchmarks and other dvs/p4 remnants by @alliepiper in https://github.com/NVIDIA/cccl/pull/1901
- Qualify call to distance in thrust::async_reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1904
- Rename CUB uninitialized_copy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1913
- Sanitizer fixes by @alliepiper in https://github.com/NVIDIA/cccl/pull/1916
- Use c2h::vectors in all non-example CUB tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1914
- Renamed overlooked uninitialized_copy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1920
- Add assert implementation for device side testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1918
- Thrust and CUB: README: Fix copy-paste from libcu++ and links by @pauleonix in https://github.com/NVIDIA/cccl/pull/1878
- Follow-up fixes to CUB iterator test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1875
- Replace typedef by alias declarations in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1915
- Cleanup CUB util_type.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1863
- Fix include for in cub/util_type.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1929
- Fix issues with comments in the concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/1931
- Deprecate and reduce use of old functional stuff by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1925
- Deprecate more nested aliases in thrust functors by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1932
- Fix various typos in CUB documentation and comments. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/1933
- Add BabelStream flavors as thrust::transform benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1921
- Some cleanup in Thrust config headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1934
- Update to CUDA 12.5 containers by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1935
- Check that the current version of CMake supports policy 141 before se… by @alliepiper in https://github.com/NVIDIA/cccl/pull/1924
- Fix memmove optimization by @miscco in https://github.com/NVIDIA/cccl/pull/1937
- Fixes thrust::unique_by_key examples by @elstehle in https://github.com/NVIDIA/cccl/pull/1943
- Use only explicit NVTX3 V1 API in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1751
- Suppress a clang warning on array size computation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1942
- Add a benchmark for thrust::equal by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1944
- Strip prefix paths to improve doc rendering by @bdice in https://github.com/NVIDIA/cccl/pull/1954
- Modernize Thrust's alignment.h and triple_chevron_launch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1905
- Restore RAPIDS devcontainer by @bdice in https://github.com/NVIDIA/cccl/pull/1955
- Fix for in-place DeviceSelect & thrust::remove_if by @elstehle in https://github.com/NVIDIA/cccl/pull/1782
- Drop Thrust's cstdint.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1959
- Use make_devcontainers.sh --clean when validating. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1963
- Fix missing binary_pred in thrust::unique_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1957
- cuda::launch and launch configuration object with minimal functionality by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1950
- Backport PR #2046 - Fixing FP16 conversions. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2222
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.5.0...v2.6.0
- C++
Published by wmaxey over 1 year ago
cccl - CCCL 2.5.0
What's New
This release includes several notable improvements and new features:
- CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
- We have added a new cub::DeviceSelect::FlaggedIf API, which selects items by applying a predicate to their corresponding flags. This addition provides more flexibility and control over item selection.
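To illustrate the selection rule behind cub::DeviceSelect::FlaggedIf, here is a minimal host-side sketch in Python. It is not the CUB API: flagged_if_reference is a hypothetical helper name chosen for this example, and the real algorithm runs on the device with caller-provided temporary storage and output locations. The sketch only models the idea that an item is kept, in its original order, when the predicate applied to its flag is true.

```python
def flagged_if_reference(items, flags, select_op):
    """Host-side model of flag-predicated selection (hypothetical helper).

    Keeps items[i] whenever select_op(flags[i]) is true, preserving order,
    and returns the selected items together with their count.
    """
    if len(items) != len(flags):
        raise ValueError("items and flags must have the same length")
    selected = [item for item, flag in zip(items, flags) if select_op(flag)]
    return selected, len(selected)


# Example data (made up for illustration): keep items whose flag is even.
items = [10, 20, 30, 40, 50, 60]
flags = [1, 4, 2, 8, 3, 6]
selected, num_selected = flagged_if_reference(items, flags, lambda f: f % 2 == 0)
print(selected, num_selected)
```

The difference from a plain flagged selection is that the flags do not need to be booleans: the predicate decides what counts as "selected", so the same flag array can be reused with different predicates.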
What's Changed
- Clean up libcu++ docs landing page by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1492
- PTX: Add cuda::ptx::elect_sync by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1537
- Print a summary of all tests sorted by execution time. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1539
- Fix unused variable warning for __can_use_complete_tx by @wmaxey in https://github.com/NVIDIA/cccl/pull/1547
- Fix usage of naked array with 0 elements in sm90 barrier tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1546
- Add support for stream operators for complex by @miscco in https://github.com/NVIDIA/cccl/pull/1538
- Fix __half for older architectures by @miscco in https://github.com/NVIDIA/cccl/pull/1543
- Feat 565 remove redundant thrust dialect conditional by @ZelboK in https://github.com/NVIDIA/cccl/pull/566
- fix missing device hint in WarpMergeSort Documentation by @MARD1NO in https://github.com/NVIDIA/cccl/pull/1553
- Minor fixes and additions on cub developer guides by @gonidelis in https://github.com/NVIDIA/cccl/pull/1559
- Consolidate handling of constexpr and if constexpr by @miscco in https://github.com/NVIDIA/cccl/pull/1562
- Ensure that cuda::aligned_size_t is usable in a constexpr context by @miscco in https://github.com/NVIDIA/cccl/pull/1564
- Group CUB docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1565
- Update toolkit to 12.4 by @miscco in https://github.com/NVIDIA/cccl/pull/1554
- Work around change in cuTensorMapEncode by @miscco in https://github.com/NVIDIA/cccl/pull/1567
- Remove stdlib arg from .clangd. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1569
- Add the DeviceSelect::FlaggedIf algorithm by @gonidelis in https://github.com/NVIDIA/cccl/pull/1533
- Catch2 segmented sort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1484
- Do not emit diagnostic with extended device lambdas with preserved re… by @Revaj in https://github.com/NVIDIA/cccl/pull/1495
- Use absolute includes for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1560
- [NFC] Modularize <exception> by @miscco in https://github.com/NVIDIA/cccl/pull/199
- Add test support for launching kernels with cluster size > 1 by @ahendriksen in https://github.com/NVIDIA/cccl/pull/416
- Fix typo in README.md by @bprb in https://github.com/NVIDIA/cccl/pull/1574
- [FEA]: Modularize <cuda/memory_resource> by @miscco in https://github.com/NVIDIA/cccl/pull/1532
- Cleanup_complex by @miscco in https://github.com/NVIDIA/cccl/pull/1555
- Add missing comma in barrier __try_wait by @miscco in https://github.com/NVIDIA/cccl/pull/1593
- Segmented sort test fix by @alliepiper in https://github.com/NVIDIA/cccl/pull/1591
- Add pre-commit configuration by @bdice in https://github.com/NVIDIA/cccl/pull/1596
- Preserve .devcontainer/img/ when cleaning. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1604
- Add some documentation for recent additions to libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1594
- Ensure cuda::std::nullopt is visible in device code by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1598
- Fix ordering of alignas and __shared__ by @miscco in https://github.com/NVIDIA/cccl/pull/1601
- Update Thrust CI tests. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1605
- Implement tuple interface for cuda vector types by @miscco in https://github.com/NVIDIA/cccl/pull/1410
- Inspect PR changes to determine if subproject builds are needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1572
- Apply clang-format to cub by @bdice in https://github.com/NVIDIA/cccl/pull/1602
- Add missing non-volatile atomic overloads. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1582
- Drop unused libcxx files by @miscco in https://github.com/NVIDIA/cccl/pull/1606
- Apply formatting to libcudacxx by @miscco in https://github.com/NVIDIA/cccl/pull/1610
- Add conda documentation to the README. by @bdice in https://github.com/NVIDIA/cccl/pull/1581
- Allow jobs to be skipped. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1611
- Make libcu++ work with exceptions by @miscco in https://github.com/NVIDIA/cccl/pull/1607
- Implement cuda::mr::cuda_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1578
- Implement cuda::mr::managed_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1579
- Apply formatting to thrust by @miscco in https://github.com/NVIDIA/cccl/pull/1616
- Update example_device_radix_sort.cu by @eriktedhamre in https://github.com/NVIDIA/cccl/pull/1608
- Implement cuda::mr::pinned_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1580
- Set the devcontainers to format on save. by @miscco in https://github.com/NVIDIA/cccl/pull/1624
- Enable internal use of std::allocator related functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1583
- Adds tests for large number of items for cub::DeviceSelect by @elstehle in https://github.com/NVIDIA/cccl/pull/1612
- Add pre-commit docs to CONTRIBUTING.md. by @bdice in https://github.com/NVIDIA/cccl/pull/1627
- Move visibility attributes to cccl by @miscco in https://github.com/NVIDIA/cccl/pull/1595
- Work around thrust/memory.h circular include by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1634
- Fix mbarrier.init addressing by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1636
- Trim trailing whitespace and normalize newlines. by @bdice in https://github.com/NVIDIA/cccl/pull/1633
- Add a git-blame-ignore-revs file by @miscco in https://github.com/NVIDIA/cccl/pull/1629
- Revert "PTX: Add cuda::ptx::elect_sync (#1537)" by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1638
- Address potential oob in cub when passing in an invalid device counter by @miscco in https://github.com/NVIDIA/cccl/pull/1641
- Allow ninja_summary to fail by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1644
- Mostly flatten the folder structure of libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1630
- Make --cmake-options="" always override others. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1648
- Fix invalid _CCCL_CUDACC definition for clang cuda by @miscco in https://github.com/NVIDIA/cccl/pull/1656
- Add missing #pragma once in some headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1668
- Add NVTX ranges for all CUB algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1657
- Implement LWG-3843 and LWG-3940 by @miscco in https://github.com/NVIDIA/cccl/pull/1621
- Modularize <memory> by @miscco in https://github.com/NVIDIA/cccl/pull/1639
- Expose <cuda/std/numeric> to be publicly available by @miscco in https://github.com/NVIDIA/cccl/pull/1671
- Add nsight support for automated debugging by @gonidelis in https://github.com/NVIDIA/cccl/pull/1660
- Format core headers by @miscco in https://github.com/NVIDIA/cccl/pull/1670
- Guard resource_ref and friends behind feature flag by @miscco in https://github.com/NVIDIA/cccl/pull/1675
- Create major version 2.5.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1677
- Install CUB headers with .hpp extension by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1687
- Update CMakePresets.json by @alliepiper in https://github.com/NVIDIA/cccl/pull/1686
- Fix deprecated status by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1692
- Test combined internal/user-side use of NVTX by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1690
- CI Overhaul, new nightly workflow by @alliepiper in https://github.com/NVIDIA/cccl/pull/1654
- Fix CMake option handling. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1698
- Fix issues that came up with building cuDF with main by @miscco in https://github.com/NVIDIA/cccl/pull/1643
- Drop new properties until we are certain about the design by @miscco in https://github.com/NVIDIA/cccl/pull/1681
- Remove more uses of __cuda_std__ by @miscco in https://github.com/NVIDIA/cccl/pull/1669
- Fix usage of result_of in thrust by @miscco in https://github.com/NVIDIA/cccl/pull/1705
- Fix thrust::optional::emplace() by @Snektron in https://github.com/NVIDIA/cccl/pull/1707
- Remove old f(void) function signatures by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1708
- Fix code sample in README and docs by @pauleonix in https://github.com/NVIDIA/cccl/pull/1652
- Format libcudacxx/include files without extensions by @bdice in https://github.com/NVIDIA/cccl/pull/1676
- Several improvements to zip_iterator/zip_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1710
- Expose thrust's contiguous iterator unwrap helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1717
- Fix flakey heterogeneous tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/1712
- Ensure that we can use cuda::std::optional with types that are not __host__ __device__ by @miscco in https://github.com/NVIDIA/cccl/pull/1663
- Fix a typo in barrier docs and update the godbolt link by @PointKernel in https://github.com/NVIDIA/cccl/pull/1718
- Massively improve test times in heterogeneous atomics tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/1719
- Consolidate more common functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1716
- Increase timeout for the libcu++ test runs by @miscco in https://github.com/NVIDIA/cccl/pull/1720
- Fix nightly CI: H100 runners are not in a testing pool. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1723
- Add a new CUDA Next library and a first entry in it with hierarchy_dimensions type template by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1485
- Atomics backend refactor by @wmaxey in https://github.com/NVIDIA/cccl/pull/1631
- Const-qualify half_t::operator+/* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1726
- Reenable previously failing histogram test for icc by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1725
- Enable testing for the other half of the heterogeneous managed memory tests on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1729
- PTX: mark cpasyncbulk*multicast functions sm90a by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1734
- Improve libcu++ documentation a bit more by @miscco in https://github.com/NVIDIA/cccl/pull/1732
- Make atomic_ref ctor constexpr. again. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1737
- Various and sundry fixes for Thrust's CPP backends. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1722
- Avoid ABI issues due to MSVC EBCO issues by @miscco in https://github.com/NVIDIA/cccl/pull/1739
- Drop unused header from ptx by @miscco in https://github.com/NVIDIA/cccl/pull/1740
- Allow an override matrix to reduce CI workload. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1701
- Fix docs generation by @miscco in https://github.com/NVIDIA/cccl/pull/1741
- Add docs instructions on how to utilize CMake Presets by @gonidelis in https://github.com/NVIDIA/cccl/pull/1694
- Ensure that {cr}begin works with types that pull in namespace std via ADL by @miscco in https://github.com/NVIDIA/cccl/pull/1685
- Merge prep jobs for verify-devcontainers CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1754
- Fix typo in ci docs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1756
- Add runtime + sccache info to CI comment by @alliepiper in https://github.com/NVIDIA/cccl/pull/1744
- Add section about SSH signing keys to developer docs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1755
- Add sm100 support to for NVCC by @wmaxey in https://github.com/NVIDIA/cccl/pull/1745
- Fixduplicatejob_checks by @alliepiper in https://github.com/NVIDIA/cccl/pull/1759
- Const-qualify histogram pointer input parameters by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1762
- Return demangled name in c2h::type_name by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1773
- Simplify argument forwarding in CUB histogram entry-points by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1776
- Add guard against half support by @miscco in https://github.com/NVIDIA/cccl/pull/1735
- Refactor CUB test launch helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1770
- Replace cub::ArrayWrapper by cuda::std::array and deprecate it by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1764
- Fix missing qualification of pow in two instances by @miscco in https://github.com/NVIDIA/cccl/pull/1784
- Add mechanism to split project tests into parallel jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1696
- Fix __half conversion to float in histogram by @miscco in https://github.com/NVIDIA/cccl/pull/1785
- Implement P3029R1: deduction from integral_constant by @miscco in https://github.com/NVIDIA/cccl/pull/1786
- Revert to showing skipped jobs to WAR GHA bug. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1794
- Port to Catch2 and rework device histogram test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1695
- Add gcc13, clang17, clang18 to CI by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1757
- Drop more of thrust type traits by @miscco in https://github.com/NVIDIA/cccl/pull/1721
- Show workflow walltime, job max time in CI comment. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1795
- Fix span for non-ranges by @miscco in https://github.com/NVIDIA/cccl/pull/1840
- Drop all internal implementations of exceptions (#1806) by @miscco in https://github.com/NVIDIA/cccl/pull/1839
- Backport atomic regression fix #1801 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1833
- [BACKPORT] Symbol visibility is now invariant in regards to __cuda_std__ definition (#1832) by @miscco in https://github.com/NVIDIA/cccl/pull/1864
New Contributors
- @MARD1NO made their first contribution in https://github.com/NVIDIA/cccl/pull/1553
- @Revaj made their first contribution in https://github.com/NVIDIA/cccl/pull/1495
- @bprb made their first contribution in https://github.com/NVIDIA/cccl/pull/1574
- @eriktedhamre made their first contribution in https://github.com/NVIDIA/cccl/pull/1608
- @Snektron made their first contribution in https://github.com/NVIDIA/cccl/pull/1707
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.4.0...v2.5.0
- C++
Published by wmaxey over 1 year ago
cccl - v2.4.0
What’s New
We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
Thrust
As part of our kernel consolidation effort, kernels of thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) into Thrust algorithms.
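To see why 64-bit offsets matter for large inputs, here is a minimal host-only sketch in standard C++. It does not call Thrust or CUB. The names fits_in_offset and large_problem are hypothetical, introduced only for illustration, and the sketch assumes nothing beyond offsets being signed integers:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Thrust or CUB): reports whether a problem
// of `num_items` elements can be indexed with the signed offset type OffsetT.
template <typename OffsetT>
constexpr bool fits_in_offset(std::uint64_t num_items)
{
  return num_items <= static_cast<std::uint64_t>(std::numeric_limits<OffsetT>::max());
}

// A problem size of 2^32 items, chosen only for illustration.
constexpr std::uint64_t large_problem = std::uint64_t{1} << 32;

// A 32-bit signed offset tops out at 2^31 - 1, so it cannot address this input.
static_assert(!fits_in_offset<std::int32_t>(large_problem), "2^32 items overflow a 32-bit offset");

// A 64-bit signed offset covers it with plenty of headroom.
static_assert(fits_in_offset<std::int64_t>(large_problem), "2^32 items fit in a 64-bit offset");
```

Both checks are evaluated at compile time, so the snippet documents the limit without needing a GPU.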
CUB
- cub::DeviceSelect::UniqueByKey now supports an equality operator and large problem sizes.
- New cub::DeviceFor family of algorithms goes beyond conventional cub::DeviceFor::ForEach. cub::DeviceFor::ForEachCopy can provide you with additional performance benefits from vectorized memory accesses.
- Many CUB algorithms now support CUDA graph capture mode.
libcudacxx
- Added new cuda::ptx namespace with wrappers for inline-PTX instructions
- cuda::std::complex specializations for CUDA types bfloat and half.
What's Changed
- Implement remaining ranges iterator concepts and modernize array by @miscco in https://github.com/NVIDIA/cccl/pull/627
- Fix C++11 support of recently added tests by @ahendriksen in https://github.com/NVIDIA/cccl/pull/651
- Update CUDA newest to CTK 12.3 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/629
- Add cuda::ptx::* namespace by @ahendriksen in https://github.com/NVIDIA/cccl/pull/574
- The test seems to pass just fine by @miscco in https://github.com/NVIDIA/cccl/pull/654
- Fixes discard_memory compilation failure for pre-Volta by @elstehle in https://github.com/NVIDIA/cccl/pull/637
- Reduce benchmarking time by @gevtushenko in https://github.com/NVIDIA/cccl/pull/657
- Add CCCL_VERSION and script for updating version by @jrhemstad in https://github.com/NVIDIA/cccl/pull/652
- Fixes compiler error for extended fp type data gen by @elstehle in https://github.com/NVIDIA/cccl/pull/666
- fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in https://github.com/NVIDIA/cccl/pull/664
- Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in https://github.com/NVIDIA/cccl/pull/669
- Rework our system header approach to be more error proof by @miscco in https://github.com/NVIDIA/cccl/pull/661
- Project automation - fix sync action and draft setting step by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/625
- Fix fallback when checking git repo by @wmaxey in https://github.com/NVIDIA/cccl/pull/1085
- Currently the verbose option does not work beacuse of a typo in the argument handling by @miscco in https://github.com/NVIDIA/cccl/pull/1088
- Adds virtual shared memory helper and tests by @elstehle in https://github.com/NVIDIA/cccl/pull/619
- Add cuda::ptx::st_async by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1078
- Add cuda::ptx::red_async by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1080
- Remove libcudacxx symlinks by @wmaxey in https://github.com/NVIDIA/cccl/pull/1075
- Move PTX tests that missed the symlink PR by @wmaxey in https://github.com/NVIDIA/cccl/pull/1098
- Fix truncation of constant value by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1097
- Add cuda::ptx:mbarrier_{try/test}_wait{_parity} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/674
- Initial CUB/NVRTC support by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1081
- Fix cuda::ptx::red.async for int32_t types by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1102
- Fix local test runs with lit by @miscco in https://github.com/NVIDIA/cccl/pull/1108
- Fix config when only non-CDPv1 arches are enabled. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1109
- Do not replace the sccache binary for windows by @miscco in https://github.com/NVIDIA/cccl/pull/1115
- Test cuda graph capture by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1112
- Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in https://github.com/NVIDIA/cccl/pull/1074
- Introduce CUB transform reduce by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1091
- Add infrastructure for compile-time CUB tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1124
- Fix GCC6 / FP8 warning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1130
- Fix thrust transform reduce bench by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1133
- Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1132
- Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1114
- Allow filtering P0 benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1135
- Update barrier_arrive_tx.md docs by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1147
- Update std iterators by @miscco in https://github.com/NVIDIA/cccl/pull/672
- Fix argument name in windows CI by @miscco in https://github.com/NVIDIA/cccl/pull/1145
- Fix XFAIL condition for subsumption tests by @miscco in https://github.com/NVIDIA/cccl/pull/1144
- Project Automation - remove draft automation + reduce permissions by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/1154
- Use rst in block-scope docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1150
- Fix errors when find_package(CCCL) is called twice. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1157
- Fix icc / cub by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1152
- Abort testing on unsupported dialect flags by @wmaxey in https://github.com/NVIDIA/cccl/pull/1158
- Run with latest nvbench by @robertmaynard in https://github.com/NVIDIA/cccl/pull/583
- Set finer-grain workflow permissions by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1163
- Port device docs to rst by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1160
- CI log improvements by @jrhemstad in https://github.com/NVIDIA/cccl/pull/621
- Setup documentation and corresponding github action by @wmaxey in https://github.com/NVIDIA/cccl/pull/1118
- Update Docs links in README.md by @wmaxey in https://github.com/NVIDIA/cccl/pull/1169
- Fix GCC 13 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1175
- Add missing exit from run-as-coder by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1176
- Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in https://github.com/NVIDIA/cccl/pull/1117
- Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1164
- Implement math functions for thrust::complex by @miscco in https://github.com/NVIDIA/cccl/pull/1178
- Use anchors in matrix.yaml by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1193
- Ensure the targets that Thrust creates are global. by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1182
- Fix availability of is_constant_evaluated on old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1180
- Enable std::variant for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1076
- Implement enable_borrowed_range by @miscco in https://github.com/NVIDIA/cccl/pull/1196
- Reduce thrust benchmarks noise by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1203
- Prepare more algorithms by @miscco in https://github.com/NVIDIA/cccl/pull/1161
- Add icc compiler to CI matrix by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1159
- Unify handling of dialects by @miscco in https://github.com/NVIDIA/cccl/pull/1200
- Add argument to build/test scripts for additional cmake options by @jrhemstad in https://github.com/NVIDIA/cccl/pull/620
- Move definitions of execution space macros into cccl by @miscco in https://github.com/NVIDIA/cccl/pull/1199
- Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in https://github.com/NVIDIA/cccl/pull/1197
- Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1214
- Fix the example on README.md by @so298 in https://github.com/NVIDIA/cccl/pull/1220
- Add missing overloads for thrust::pow by @miscco in https://github.com/NVIDIA/cccl/pull/1222
- Fix 'nvc++ -stdpar' by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1224
- Fix examples in reduce docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1230
- Do not benchmark small problem sizes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1243
- Implement enable_view by @miscco in https://github.com/NVIDIA/cccl/pull/1208
- Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in https://github.com/NVIDIA/cccl/pull/1245
- Fix merge conflict from incoming PR by @miscco in https://github.com/NVIDIA/cccl/pull/1250
- Disable fast-math for ICC by @miscco in https://github.com/NVIDIA/cccl/pull/1252
- Fix a typo in thrust-config.cmake by @valgur in https://github.com/NVIDIA/cccl/pull/1259
- Implement ranges::{c}begin and ranges::{c}end by @miscco in https://github.com/NVIDIA/cccl/pull/1256
- Switch to entropy-based stopping criterion by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1280
- Fix a sync bug in stream_ref::wait by @PointKernel in https://github.com/NVIDIA/cccl/pull/1238
- Silence some static asserts in ptx helpers by @miscco in https://github.com/NVIDIA/cccl/pull/1257
- Restore docs images by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1285
- Clarify Thrust/CUB ABI guarantees by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1269
- Fix MSVC issues by @miscco in https://github.com/NVIDIA/cccl/pull/1261
- Ensure that cuda::std::pair is potentially trivially copyable by @miscco in https://github.com/NVIDIA/cccl/pull/1249
- Update packman to fix CUB docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1291
- Implement ranges::{c}rbegin by @miscco in https://github.com/NVIDIA/cccl/pull/1295
- Make cuda::stream_ref universally available by @miscco in https://github.com/NVIDIA/cccl/pull/1293
- Properly test internal headers by @miscco in https://github.com/NVIDIA/cccl/pull/1258
- Remove remaining C++03 compatibility from unit tests by @Blonck in https://github.com/NVIDIA/cccl/pull/1228
- Add some documentation for memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1217
- Filter axis values in perf analysis by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1304
- Get CCCL revision outside of git repo by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1305
- [DOC]: Move ptx.md out of extended API by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1308
- Implement ranges::{c}rend by @miscco in https://github.com/NVIDIA/cccl/pull/1301
- thrust/mr: fix the case of reuising a block for a smaller alloc. by @griwes in https://github.com/NVIDIA/cccl/pull/1232
- Allow offloading samples by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1316
- [DOC]: Fix documentation links by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1311
- Separate windows and Linux CI matrix by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1206
- Revert "Separate windows and Linux CI matrix " by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1324
- Introduce CUB ForEach algorithms by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1302
- Cleanup transitive includes of <cuda/std/functional> by @miscco in https://github.com/NVIDIA/cccl/pull/1253
- Implement ranges::{c}data by @miscco in https://github.com/NVIDIA/cccl/pull/1313
- Remove stale comments from README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1328
- Ports cub::DeviceMergeSort tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/1319
- Implement ranges::size and ranges::ssize by @miscco in https://github.com/NVIDIA/cccl/pull/1330
- PTX: Add helper functions for dsmem by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1336
- Remove double "ignore" in discard_iterator.h docs by @gonidelis in https://github.com/NVIDIA/cccl/pull/1342
- PTX: Add cuda::ptx::fence by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1341
- Replace deprecated _VSTD macro with std by @rupprecht in https://github.com/NVIDIA/cccl/pull/1331
- PTX: Add cuda::ptx::mapa and cuda::ptx::getctarank by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1345
- Cleanup our __cccl_config by @miscco in https://github.com/NVIDIA/cccl/pull/1322
- Update to devcontainers 24.04 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1357
- ♻️📝 Update mode example to use thrust::unique_count by @codereport in https://github.com/NVIDIA/cccl/pull/1354
- Switch to NV runners for Windows. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1356
- Implement ranges::empty by @miscco in https://github.com/NVIDIA/cccl/pull/1338
- PTX: Add cuda::ptx::get_sreg by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1351
- Fix godbolt link. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1369
- Implement ranges concepts by @miscco in https://github.com/NVIDIA/cccl/pull/1364
- Print helpful error message in test scripts when no GPU is found by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1362
- Implement ranges::dangling by @miscco in https://github.com/NVIDIA/cccl/pull/1371
- Ensure that thrust fancy iterators are trivially_copy_constructible when possible by @miscco in https://github.com/NVIDIA/cccl/pull/1368
- Improve compiler detection defines by @Yaraslaut in https://github.com/NVIDIA/cccl/pull/1320
- Use relative includes for our public headers by @miscco in https://github.com/NVIDIA/cccl/pull/1325
- Implement ranges::view_interface by @miscco in https://github.com/NVIDIA/cccl/pull/1377
- Use checked allocators in CUB catch2 tests by @alliepiper in https://github.com/NVIDIA/cccl/pull/1271
- small update to docs for CTK by @ZelboK in https://github.com/NVIDIA/cccl/pull/1378
- Fix order of system_header supression and includes by @miscco in https://github.com/NVIDIA/cccl/pull/1323
- Hide API accepting kernel pointers by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1395
- Refactors ChooseOffsetT to use ::cuda::std and introduces alias template choose_offset_t by @elstehle in https://github.com/NVIDIA/cccl/pull/1405
- Cleanup our delegated constructor workaround by @miscco in https://github.com/NVIDIA/cccl/pull/1404
- Implement ranges::subrange by @miscco in https://github.com/NVIDIA/cccl/pull/1387
- Test large arrays in in device radix sort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1349
- CMake support absolute CMAKE_INSTALL_LIBDIR values by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1393
- Fixes integer overflows in index computation when indexes approach numeric_limits<OffsetT>::max() by @elstehle in https://github.com/NVIDIA/cccl/pull/1419
- Fix ptx usage to account for PTX ISA availability by @miscco in https://github.com/NVIDIA/cccl/pull/1359
- Refactors thrust::copy_if to use cub::DeviceSelect by @elstehle in https://github.com/NVIDIA/cccl/pull/1379
- Fix include of with NVC++ by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1417
- Do not use VLAs in cp_async_bulk_tensor_* tests by @miscco in https://github.com/NVIDIA/cccl/pull/1423
- Add support for sm_90a in API by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1411
- Add additional build job for sm90 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1428
- Rework <span> to be latest revision by @miscco in https://github.com/NVIDIA/cccl/pull/1415
- PTX: Add cuda::ptx:cp_async_bulk_* by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1403
- Prepare namespace ranges::views by @miscco in https://github.com/NVIDIA/cccl/pull/1434
- PTX: Add cuda::ptx:barrier_cluster_{arrive,wait} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1366
- Refactor thrust::[stable_]partition[_copy] to use cub::DevicePartition by @elstehle in https://github.com/NVIDIA/cccl/pull/1435
- Fix common_reference of pair by @miscco in https://github.com/NVIDIA/cccl/pull/1438
- Properly check whether a string is alphanumeric by @miscco in https://github.com/NVIDIA/cccl/pull/1443
- Remove cuda::ptx::mapa by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1442
- Add cuda::ptx:tensormap_{replace,cp_fenceproxy} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1441
- Enable more algorithms for internal use by @miscco in https://github.com/NVIDIA/cccl/pull/1432
- Cleanup diagnostic handling by @miscco in https://github.com/NVIDIA/cccl/pull/1420
- Create patch 2.4.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1455
- Address various issues from internal CI by @miscco in https://github.com/NVIDIA/cccl/pull/1462
- Extent gcc miscompilation workaround for replace.cu by @miscco in https://github.com/NVIDIA/cccl/pull/1461
- Fix CUB docs image fetcher by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1466
- Add cuda::ptx::cp_reduce_async_bulk by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1445
- Restore disabling benchmarks from ci scripts (removed in #493) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1458
- Add test coverage for SM90 without PTX ISA 8.0 by @miscco in https://github.com/NVIDIA/cccl/pull/1468
- Ensure that we can use std::ignore on device by @miscco in https://github.com/NVIDIA/cccl/pull/1470
- Move .multicast tests out into their own file by @miscco in https://github.com/NVIDIA/cccl/pull/1478
- Ensure that we can test libcu++ against architectures < 70 by @miscco in https://github.com/NVIDIA/cccl/pull/1475
- Reduce number of instantiations in set_symmetric_difference tests by @miscco in https://github.com/NVIDIA/cccl/pull/1476
- Fixx test issues against gcc-6 by @miscco in https://github.com/NVIDIA/cccl/pull/1477
- Improve code block CSS in libcu++ docs by @Nyrio in https://github.com/NVIDIA/cccl/pull/1483
- Address issues with MSVC2017 by @miscco in https://github.com/NVIDIA/cccl/pull/1479
- Remove libcxx tests by @miscco in https://github.com/NVIDIA/cccl/pull/1480
- Separate CUB's catch2 test binaries by default for CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1482
- Add Dev Containers guide for WSL by @gonidelis in https://github.com/NVIDIA/cccl/pull/1394
- PTX: add cuda::mbarrier_init by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1491
- Remove legacy Thrust/CUB CI files. by @bdice in https://github.com/NVIDIA/cccl/pull/1504
- Fix issues with ambiguous calls to addressof in thrust::optional by @miscco in https://github.com/NVIDIA/cccl/pull/1499
- Ensure that we play nicely with std::iterators by @miscco in https://github.com/NVIDIA/cccl/pull/1511
- Try harder to unwrap nested thrust::tuple_of_iterator_references by @miscco in https://github.com/NVIDIA/cccl/pull/1469
- Match_any testing single bit by fusing into single LOP3 instruction by @IlyaGrebnov in https://github.com/NVIDIA/cccl/pull/1372
- Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in https://github.com/NVIDIA/cccl/pull/1497
- Removes arch filtering of sm 90 for rdc builds by @elstehle in https://github.com/NVIDIA/cccl/pull/1506
- Adds test for cub::PtxVersion by @elstehle in https://github.com/NVIDIA/cccl/pull/1521
- Fix tuple backwards compatibility by @miscco in https://github.com/NVIDIA/cccl/pull/1522
- [FEA] Split ptx.h by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1520
- Make libcudacxx's codegen part of CI and add it to the project. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1526
- Ensure that we can run reduce_by_key with const inputs by @miscco in https://github.com/NVIDIA/cccl/pull/1528
- Disallow float offset type in cub::segmented_reducde by @gonidelis in https://github.com/NVIDIA/cccl/pull/1430
- cuda::std::complex specializations for half and bfloat by @griwes in https://github.com/NVIDIA/cccl/pull/1140
- Rebase 2.4.x with main. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1472
- [BACKPORT]: Provide backfills for missing __half functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1544
- [BACKPORT] Fix usage of naked array with 0 elements in sm90 barrier tests. (#1546) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1549
- [BACKPORT] Fix unused variable warning for _canusecompletetx (#1547) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1550
New Contributors
- @djns99 made their first contribution in https://github.com/NVIDIA/cccl/pull/1074
- @so298 made their first contribution in https://github.com/NVIDIA/cccl/pull/1220
- @valgur made their first contribution in https://github.com/NVIDIA/cccl/pull/1259
- @PointKernel made their first contribution in https://github.com/NVIDIA/cccl/pull/1238
- @rupprecht made their first contribution in https://github.com/NVIDIA/cccl/pull/1331
- @codereport made their first contribution in https://github.com/NVIDIA/cccl/pull/1354
- @Yaraslaut made their first contribution in https://github.com/NVIDIA/cccl/pull/1320
- @Nyrio made their first contribution in https://github.com/NVIDIA/cccl/pull/1483
- @IlyaGrebnov made their first contribution in https://github.com/NVIDIA/cccl/pull/1372
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.2...v2.4.0
- C++
Published by wmaxey over 1 year ago
cccl - v2.3.1
What's Changed
- [BACKPORT]: Fix bug in stream_ref::wait by @miscco in https://github.com/NVIDIA/cccl/pull/1283
- Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in https://github.com/NVIDIA/cccl/pull/1286
- Create patch 2.3.1 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1287
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.0...v2.3.1
- C++
Published by wmaxey over 1 year ago
cccl - v2.3.2
What's Changed
- [BACKPORT]: Silence some static asserts in ptx helpers (#1257) by @miscco in https://github.com/NVIDIA/cccl/pull/1284
- [BACKPORT]: Ensure that pair is trivially copyable (#1249) by @miscco in https://github.com/NVIDIA/cccl/pull/1292
- [BACKPORT]: Properly test internal headers (#1258) by @miscco in https://github.com/NVIDIA/cccl/pull/1299
- [Backport]: Fix errors when find_package(CCCL) is called twice. (#1157) by @miscco in https://github.com/NVIDIA/cccl/pull/1298
- [BACKPORT] Fix MSVC issues (#1261) by @miscco in https://github.com/NVIDIA/cccl/pull/1297
- [backport] thrust/mr: fix the case of reuising a block for a smaller alloc. (#1232) by @griwes in https://github.com/NVIDIA/cccl/pull/1317
- [BACKPORT]: Fix ptx usage to account for PTX ISA availability (#1359) by @miscco in https://github.com/NVIDIA/cccl/pull/1421
- Create patch 2.3.2 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1530
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.1...v2.3.2
- C++
Published by wmaxey almost 2 years ago
cccl - CCCL 2.3.0
What’s New
In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
System Headers and Warnings
Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because it will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.
To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see https://github.com/NVIDIA/cccl/issues/527.
TL;DR: You should never see warnings emitted from a CCCL header ever again!
Linkage Issues
Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
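The idea behind the older namespace-wrapping approach can be sketched in plain standard C++. This is not Thrust or CUB code: the macro DEMO_DEFINE_LIBRARY, the namespaces lib_a, lib_b, and demo, and the version function are all hypothetical stand-ins. The sketch only illustrates how placing the same library code under different outer namespaces yields different symbol names:

```cpp
// Hypothetical stand-in for a header-only library whose contents can be
// wrapped in a caller-chosen outer namespace. Each expansion defines a
// separate function with its own fully qualified name.
#define DEMO_DEFINE_LIBRARY(WRAPPER, VERSION)  \
  namespace WRAPPER {                          \
  namespace demo {                             \
  inline int version() { return VERSION; }     \
  }                                            \
  }

// Two "shared libraries" built against different versions of the same header,
// each with its own wrapper namespace.
DEMO_DEFINE_LIBRARY(lib_a, 1)
DEMO_DEFINE_LIBRARY(lib_b, 2)

// Each library resolves to its own copy, so the two definitions cannot be
// merged or confused with one another at link or load time.
inline int version_seen_by_a() { return lib_a::demo::version(); }
inline int version_seen_by_b() { return lib_b::demo::version(); }
```

Without the wrapper, both expansions would define the same demo::version symbol, and which definition a given caller ends up with would depend on how the libraries are linked and loaded.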
Thrust
thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.
CUB
Up to 60% performance improvements of cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
libcudacxx
- The cuda::ptx namespace and <cuda/ptx> header is now available and provides access to various inline PTX functions that enumerate various async memcpy and barrier intrinsics. - #379
- Added experimental bulk TMA memcpy under <cuda/barrier>
What's Changed
- Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/303
- Branch/2.2.x by @gevtushenko in https://github.com/NVIDIA/cccl/pull/305
- Tune unique by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/306
- Merge branch/2.2.x to main by @jrhemstad in https://github.com/NVIDIA/cccl/pull/308
- Add example cmake project by @jrhemstad in https://github.com/NVIDIA/cccl/pull/177
- Adds catch2 tests for reduce-by-key by @elstehle in https://github.com/NVIDIA/cccl/pull/311
- Tune scan by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/325
- Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in https://github.com/NVIDIA/cccl/pull/281
- Fix MSVC / CUB tests build by @gevtushenko in https://github.com/NVIDIA/cccl/pull/336
- gdb pretty printer: handle non-cuda device vectors by @siboehm in https://github.com/NVIDIA/cccl/pull/264
- Add a nvrtc configuration for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/202
- GH Infra: project automation and issue template fixes by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/297
- Tune reduce by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/346
- Merge commits from 2.2 branch by @miscco in https://github.com/NVIDIA/cccl/pull/350
- Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in https://github.com/NVIDIA/cccl/pull/334
- Assorted fixes for MSVC 2017 by @miscco in https://github.com/NVIDIA/cccl/pull/341
- [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in https://github.com/NVIDIA/cccl/pull/355
- Port cub::DeviceScan tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/347
- Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in https://github.com/NVIDIA/cccl/pull/349
- Project Automation: add conditional steps due to context errors by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/353
- Work around strange gcc bug by @miscco in https://github.com/NVIDIA/cccl/pull/363
- Implement iter_swap CPO by @miscco in https://github.com/NVIDIA/cccl/pull/332
- Replace default, constexpr, and delete macros by original keywords by @Blonck in https://github.com/NVIDIA/cccl/pull/360
- Add clang16 devcontainer and CI job by @miscco in https://github.com/NVIDIA/cccl/pull/362
- [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in https://github.com/NVIDIA/cccl/pull/369
- [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in https://github.com/NVIDIA/cccl/pull/370
- Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in https://github.com/NVIDIA/cccl/pull/368
- Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in https://github.com/NVIDIA/cccl/pull/375
- Use copy-pr-bot by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/381
- Implement the permutable concept by @miscco in https://github.com/NVIDIA/cccl/pull/367
- [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in https://github.com/NVIDIA/cccl/pull/371
- Implement identity changes for c++20 by @miscco in https://github.com/NVIDIA/cccl/pull/383
- Hide third party cmake options in our cmake developer builds. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/300
- Port cub::DeviceScanByKey tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/380
- Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in https://github.com/NVIDIA/cccl/pull/399
- Add commit information to the test output by @miscco in https://github.com/NVIDIA/cccl/pull/401
- Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/387
- Project Automation: set
Roadmapproject value on issue/pr close and Auto-type new issues by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/389 - Add support for tests that should fail at runtime by @ahendriksen in https://github.com/NVIDIA/cccl/pull/418
- Port `DeviceAdjacentDifference::SubtractRight` tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/390
- Project automation - Fix indentation for `continue-on-error` by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/425
- [BUG] Ensure that all headers build on their own by @miscco in https://github.com/NVIDIA/cccl/pull/200
- Remove `util_device.cuh` from iterator headers to enable online compilation by @leofang in https://github.com/NVIDIA/cccl/pull/412
- Fix ci-overview example by @gevtushenko in https://github.com/NVIDIA/cccl/pull/428
- Port `cub::DeviceRunLengthEncode` tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/411
- Add cuda::device::barrier_arrive tx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/358
- Fix CubDebug by @gevtushenko in https://github.com/NVIDIA/cccl/pull/430
- Do not use static member functions to initialize static member variables. by @miscco in https://github.com/NVIDIA/cccl/pull/438
- Implement the `projected` helper struct by @miscco in https://github.com/NVIDIA/cccl/pull/385
- Add PTX wrapping functions for TMA features by @ahendriksen in https://github.com/NVIDIA/cccl/pull/379
- Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in https://github.com/NVIDIA/cccl/pull/320
- Enable lit to determine the compute architectures by @miscco in https://github.com/NVIDIA/cccl/pull/447
- Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in https://github.com/NVIDIA/cccl/pull/434
- Improve documentation of `cuda::barrier` by @ahendriksen in https://github.com/NVIDIA/cccl/pull/440
- Extend `thrust::complex` unit tests to prepare for upcoming replacement with `std::complex` by @Blonck in https://github.com/NVIDIA/cccl/pull/413
- Remove having two install rules for -header-search.cmake by @robertmaynard in https://github.com/NVIDIA/cccl/pull/298
- Run `.devcontainer/launch.sh` with bash + add error checking by @wence- in https://github.com/NVIDIA/cccl/pull/407
- Remove C++03 compatibility from unit tests by @Blonck in https://github.com/NVIDIA/cccl/pull/378
- [libcu++] Fix use of `__ppc64__` by @miscco in https://github.com/NVIDIA/cccl/pull/451
- Update the README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/291
- [libcu++] Try to avoid gcc miscompilation issues by @miscco in https://github.com/NVIDIA/cccl/pull/452
- Consolidate matrix logic into single script/job by @jrhemstad in https://github.com/NVIDIA/cccl/pull/361
- Implement the `indirectly_comparable` concept by @miscco in https://github.com/NVIDIA/cccl/pull/445
- Fix compute matrix dropping trailing zeros by @jrhemstad in https://github.com/NVIDIA/cccl/pull/466
- Avoid integer promotion warnings with MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/460
- Implement ranges comparison objects by @miscco in https://github.com/NVIDIA/cccl/pull/464
- Fix CUB/MSVC/RDC tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/469
- Fix Thrust/CUB Linkage Issues by @gevtushenko in https://github.com/NVIDIA/cccl/pull/443
- Script for Running CUB Benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/472
- [skip ci] Add list of CCCL users to README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/474
- `constexpr` all the things by @pb-dseifert in https://github.com/NVIDIA/cccl/pull/476
- Add Gonzalo/Allard to trustees by @jrhemstad in https://github.com/NVIDIA/cccl/pull/482
- Implement the `sortable` concept by @miscco in https://github.com/NVIDIA/cccl/pull/471
- [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in https://github.com/NVIDIA/cccl/pull/479
- Refactor `thrust::complex` as a struct derived from `cuda::std::complex` by @Blonck in https://github.com/NVIDIA/cccl/pull/454
- Add ci scripts for windows by @miscco in https://github.com/NVIDIA/cccl/pull/251
- Enable complex interop on MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/490
- [skip ci] Add related projects to readme. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/492
- Reenable nvrtc tests by @miscco in https://github.com/NVIDIA/cccl/pull/488
- Implement the `mergeable` concept by @miscco in https://github.com/NVIDIA/cccl/pull/484
- 64-bit indexing for DeviceSegmentedReduce by @jecs in https://github.com/NVIDIA/cccl/pull/414
- Implement `move_sentinel` by @miscco in https://github.com/NVIDIA/cccl/pull/496
- Support skipped benches in run script by @gevtushenko in https://github.com/NVIDIA/cccl/pull/508
- Implement `unreachable_sentinel` by @miscco in https://github.com/NVIDIA/cccl/pull/506
- Disable flaky barrier tests by @miscco in https://github.com/NVIDIA/cccl/pull/510
- Add constant initialization of managed variable to silence gcc warning by @miscco in https://github.com/NVIDIA/cccl/pull/509
- Add verbose flag to ninja build. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/491
- Add devcontainer readme by @jrhemstad in https://github.com/NVIDIA/cccl/pull/481
- Add contributor guide by @jrhemstad in https://github.com/NVIDIA/cccl/pull/500
- [skip ci] Fix devcontainer guide link by @jrhemstad in https://github.com/NVIDIA/cccl/pull/518
- [skip ci] Add example godbolt link. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/519
- Replace cuda::atomic with legacy functions for old arch compatibility. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/516
- Simplify examples matrix. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/517
- Disable PR workflow triggering on pushes to main. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/532
- Add CI job to verify devcontainers are always up to date by @jrhemstad in https://github.com/NVIDIA/cccl/pull/514
- [CI] Sink error when git repo is missing from build. by @wmaxey in https://github.com/NVIDIA/cccl/pull/533
- Rework our tuple implementation to work with older MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/530
- Add jobs using clang as CUDA compiler by @jrhemstad in https://github.com/NVIDIA/cccl/pull/493
- Remove cudaDeviceSetSharedMemConfig from CUB tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/538
- Implement `__bounded_iter` by @miscco in https://github.com/NVIDIA/cccl/pull/540
- Fix cub::BlockAdjacentDifference documentation by @pauleonix in https://github.com/NVIDIA/cccl/pull/542
- Add cuda::device::memcpy_async_tx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/405
- Introduce Thrust benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/534
- Fix MSVC benchmarks build by @gevtushenko in https://github.com/NVIDIA/cccl/pull/536
- Fix nvc++ as host compiler by @gevtushenko in https://github.com/NVIDIA/cccl/pull/560
- Add missing overload definition of thrust::complex operator!= by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/564
- Make template parameters consistent in thrust::complex operators by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/555
- Migrate CI configs to CMake presets. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/324
- Replace thrust::detail::integral_constant with libcudacxx implementation by @ZelboK in https://github.com/NVIDIA/cccl/pull/561
- Add `cuda::device::barrier_expect_tx` by @ahendriksen in https://github.com/NVIDIA/cccl/pull/498
- Add ARM build configs for latest gcc/clang. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/468
- Fea/486 Improve thrust::complex operators compile time throughput by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/567
- Define compiler env vars for CMake in dev containers. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/576
- Revert back to working nvbench commit by @miscco in https://github.com/NVIDIA/cccl/pull/582
- use clang-format in dev containers by @miscco in https://github.com/NVIDIA/cccl/pull/513
- Introduce CCCL clang-format by @gevtushenko in https://github.com/NVIDIA/cccl/pull/551
- Add `cp.async.bulk` global -> shared support to `cuda::memcpy_async` by @ahendriksen in https://github.com/NVIDIA/cccl/pull/501
- [skip ci] Also update the base image by @miscco in https://github.com/NVIDIA/cccl/pull/584
- Replace `thrust::tuple` implementation with `cuda::std::tuple` by @miscco in https://github.com/NVIDIA/cccl/pull/262
- Fix clangd integration by @gevtushenko in https://github.com/NVIDIA/cccl/pull/588
- Always treat CCCL as system headers by @miscco in https://github.com/NVIDIA/cccl/pull/531
- Refactor inline comments by @gevtushenko in https://github.com/NVIDIA/cccl/pull/581
- Relax Catch2 include order requirements by @gevtushenko in https://github.com/NVIDIA/cccl/pull/601
- Project Automation - Fix issue/pr sync workflow by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/504
- [skip-tests] Add a preset that builds all configs of all projects. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/580
- Implement `ranges::advance` by @miscco in https://github.com/NVIDIA/cccl/pull/546
- Update status check job to check status of precursor jobs by @jrhemstad in https://github.com/NVIDIA/cccl/pull/605
- Report times for libcudacxx tests in CI by @jrhemstad in https://github.com/NVIDIA/cccl/pull/606
- Fix bug in the construct_at optimization by @miscco in https://github.com/NVIDIA/cccl/pull/608
- [skip-tests] Disable rdc tests for windows. by @miscco in https://github.com/NVIDIA/cccl/pull/615
- Implement `ranges::next` by @miscco in https://github.com/NVIDIA/cccl/pull/611
- Support FP8 in radix sort by @gevtushenko in https://github.com/NVIDIA/cccl/pull/623
- Fix examples/cccl_infra mixup in ci. by @wmaxey in https://github.com/NVIDIA/cccl/pull/633
- Fixes block-scope run-length decode one-past-the-end memory access into smem TempStorage by @elstehle in https://github.com/NVIDIA/cccl/pull/626
- Harmonize CUB includes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/632
- Create NVRTCC, a utility for running tests under NVRTC by @wmaxey in https://github.com/NVIDIA/cccl/pull/494
- Fix typo and grammar errors by @VaibhavWakde52 in https://github.com/NVIDIA/cccl/pull/639
- [Backport branch/2.3.x] Add CCCL_VERSION and script for updating version by @github-actions in https://github.com/NVIDIA/cccl/pull/667
- Backport 574 ptx by @miscco in https://github.com/NVIDIA/cccl/pull/663
- [Backport branch/2.3.x] Fix C++11 support of recently added tests by @github-actions in https://github.com/NVIDIA/cccl/pull/658
- [Backport branch/2.3.x] Update CUDA newest to CTK 12.3 by @github-actions in https://github.com/NVIDIA/cccl/pull/1072
- [Backport to branch/2.3.x] Rework our system header approach to be more error proof (#661) by @miscco in https://github.com/NVIDIA/cccl/pull/675
- [Backport branch/2.3.x] Fix fallback when checking git repo by @github-actions in https://github.com/NVIDIA/cccl/pull/1086
- [Backport branch/2.3.x] Currently the verbose option does not work because of a typo in the argument handling by @github-actions in https://github.com/NVIDIA/cccl/pull/1090
- [Backport branch/2.3.x] Add `cuda::ptx::st_async` by @github-actions in https://github.com/NVIDIA/cccl/pull/1093
- [Backport branch/2.3.x] Add `cuda::ptx::red_async` by @github-actions in https://github.com/NVIDIA/cccl/pull/1094
- Backport PR #1075 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1100
- [Backport branch/2.3.x] Add `cuda::ptx::mbarrier_{try/test}_wait{_parity}` by @github-actions in https://github.com/NVIDIA/cccl/pull/1106
- [Backport branch/2.3.x] Fix `cuda::ptx::red.async` for int32_t types by @github-actions in https://github.com/NVIDIA/cccl/pull/1107
- [Backport branch/2.3.x] Fix local test runs with lit by @github-actions in https://github.com/NVIDIA/cccl/pull/1110
- [Backport branch/2.3.x] Fix config when only non-CDPv1 arches are enabled. by @github-actions in https://github.com/NVIDIA/cccl/pull/1111
- [Backport branch/2.3.x] Fix GCC6 / FP8 warning by @github-actions in https://github.com/NVIDIA/cccl/pull/1131
- [Backport branch/2.3.x] Fix `ptx.st.async.compile.pass.cpp` failing in C++11. by @github-actions in https://github.com/NVIDIA/cccl/pull/1136
- BACKPORT: Fix `_LIBCUDACXX_UNREACHABLE` for old MSVC (#1114) by @miscco in https://github.com/NVIDIA/cccl/pull/1143
- [2.3.x] Backport benchmarking PRs by @wmaxey in https://github.com/NVIDIA/cccl/pull/1168
- Backport P0 filter commit. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1172
- [BACKPORT] Implement math functions for thrust::complex by @miscco in https://github.com/NVIDIA/cccl/pull/1191
- Backport fix icc / cub (#1152) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1171
- [BACKPORT]: Fix availability of is_constant_evaluated on old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1198
- [BACKPORT] Add icc to the ci matrix by @miscco in https://github.com/NVIDIA/cccl/pull/1209
- [BACKPORT]: Add missing overloads for thrust::pow by @miscco in https://github.com/NVIDIA/cccl/pull/1223
New Contributors
- @siboehm made their first contribution in https://github.com/NVIDIA/cccl/pull/264
- @hageboeck made their first contribution in https://github.com/NVIDIA/cccl/pull/334
- @Blonck made their first contribution in https://github.com/NVIDIA/cccl/pull/349
- @leofang made their first contribution in https://github.com/NVIDIA/cccl/pull/412
- @HapeMask made their first contribution in https://github.com/NVIDIA/cccl/pull/320
- @jecs made their first contribution in https://github.com/NVIDIA/cccl/pull/414
- @pauleonix made their first contribution in https://github.com/NVIDIA/cccl/pull/542
- @srinivasyadav18 made their first contribution in https://github.com/NVIDIA/cccl/pull/564
- @ZelboK made their first contribution in https://github.com/NVIDIA/cccl/pull/561
- @VaibhavWakde52 made their first contribution in https://github.com/NVIDIA/cccl/pull/639
Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.2.0...2.3.0
- C++
Published by wmaxey almost 2 years ago
cccl - CCCL 2.2.0
What's Changed
- Add axis for docker builds by @raydouglass in https://github.com/NVIDIA/cccl/pull/1
- Docker: Add support for ICPC and NVC++, install newer CMake, and add curl by @brycelelbach in https://github.com/NVIDIA/cccl/pull/4
- Update excludes by @raydouglass in https://github.com/NVIDIA/cccl/pull/5
- Docker: OS and CUDA upgrades, support for additional configurations by @brycelelbach in https://github.com/NVIDIA/cccl/pull/9
- Docker: Add Thrust/CUB documentation toolchain to Ubuntu docker images by @brycelelbach in https://github.com/NVIDIA/cccl/pull/15
- Re-enable CentOS images. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/16
- Add sccache to dockerfile by @msadang in https://github.com/NVIDIA/cccl/pull/17
- Update base containers. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/18
- Update `sccache` version by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/19
- Build `11.5.1` containers by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/20
- Add ops-bot.yaml by @jrhemstad in https://github.com/NVIDIA/cccl/pull/80
- Monorepo workflow by @jrhemstad in https://github.com/NVIDIA/cccl/pull/99
- Add devcontainers by @jrhemstad in https://github.com/NVIDIA/cccl/pull/105
- Update the libcu++ submodule by @miscco in https://github.com/NVIDIA/cccl/pull/109
- Update libcudacxx again by @miscco in https://github.com/NVIDIA/cccl/pull/110
- Remove submodules from CI workflow by @jrhemstad in https://github.com/NVIDIA/cccl/pull/115
- Fix CUB CI by @senior-zero in https://github.com/NVIDIA/cccl/pull/114
- Fix async scan / counting iterator tests by @senior-zero in https://github.com/NVIDIA/cccl/pull/118
- Make sccache work locally by @jrhemstad in https://github.com/NVIDIA/cccl/pull/113
- Fix compilation of thrust and cub by @miscco in https://github.com/NVIDIA/cccl/pull/120
- Fix segfault in cub::CachingDeviceAllocator by @senior-zero in https://github.com/NVIDIA/cccl/pull/119
- Initial GH Infra Setup by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/23
- Visualize variant space coverage by @senior-zero in https://github.com/NVIDIA/cccl/pull/125
- Fix broken issue templates by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/124
- Tune scan by key for SM90 by @senior-zero in https://github.com/NVIDIA/cccl/pull/121
- Update PR template to more explicitly prompt for a linked issue closed by the PR by @jrhemstad in https://github.com/NVIDIA/cccl/pull/134
- Change component section to more general "area" by @jrhemstad in https://github.com/NVIDIA/cccl/pull/132
- Try and fix CI for old CTK by @miscco in https://github.com/NVIDIA/cccl/pull/116
- Fix `tuple_cat` for `std::` qualified types by @miscco in https://github.com/NVIDIA/cccl/pull/144
- Add ccache to lit invocation by @miscco in https://github.com/NVIDIA/cccl/pull/147
- Benchmark batched memcpy by @senior-zero in https://github.com/NVIDIA/cccl/pull/136
- Properly query `CMAKE_CUDA_COMPILER_LAUNCHER` for ccache support by @miscco in https://github.com/NVIDIA/cccl/pull/152
- Implement Three-Way Partition Tuning / Benchmark by @senior-zero in https://github.com/NVIDIA/cccl/pull/155
- Port three-way partition to use Catch2 by @senior-zero in https://github.com/NVIDIA/cccl/pull/156
- Add gcc-6 to the test matrix by @miscco in https://github.com/NVIDIA/cccl/pull/160
- Tune reduce / unique by key for SM90 by @senior-zero in https://github.com/NVIDIA/cccl/pull/163
- Remove unused folders by @miscco in https://github.com/NVIDIA/cccl/pull/145
- Fix documentation of `atomic_ref` by @miscco in https://github.com/NVIDIA/cccl/pull/164
- New iterator traits by @miscco in https://github.com/NVIDIA/cccl/pull/158
- Improve implementation of `destructible` by @miscco in https://github.com/NVIDIA/cccl/pull/157
- Build script improvements by @jrhemstad in https://github.com/NVIDIA/cccl/pull/149
- Fix icpc / denormals by @senior-zero in https://github.com/NVIDIA/cccl/pull/185
- Enable tests by @jrhemstad in https://github.com/NVIDIA/cccl/pull/167
- Monorepo by @jrhemstad in https://github.com/NVIDIA/cccl/pull/194
- Multi-benchmark tuning by @senior-zero in https://github.com/NVIDIA/cccl/pull/208
- Fixes universal_vector test failure on CTK 11.1 & gcc-6 by @elstehle in https://github.com/NVIDIA/cccl/pull/209
- Delete several directories for older CI infra. by @wmaxey in https://github.com/NVIDIA/cccl/pull/218
- Memory-safe radix sort test by @senior-zero in https://github.com/NVIDIA/cccl/pull/222
- [FEA] Implement `iter_move` CPO by @miscco in https://github.com/NVIDIA/cccl/pull/197
- Build cub benchmarks in build_cub.sh by @jrhemstad in https://github.com/NVIDIA/cccl/pull/216
- [skip-tests] Do not run tests when `skip-tests` is part of the latest commit message by @miscco in https://github.com/NVIDIA/cccl/pull/224
- Factor out build job logic into a "run-as-coder" reusable workflow. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/205
- Fix instances of 'scan' copy-pasted into reduction documentation by @milesvant in https://github.com/NVIDIA/cccl/pull/221
- Add clangd to devcontainer by @senior-zero in https://github.com/NVIDIA/cccl/pull/225
- Add initial CODEOWNERS file by @jrhemstad in https://github.com/NVIDIA/cccl/pull/226
- Attempt to fix codeowners by @jrhemstad in https://github.com/NVIDIA/cccl/pull/231
- Make libcudacxx respect CMake options for CUDA archs. by @wmaxey in https://github.com/NVIDIA/cccl/pull/235
- Optimize Three-Way Partition by @senior-zero in https://github.com/NVIDIA/cccl/pull/228
- [BUG] Rework how we handle feature test macros by @miscco in https://github.com/NVIDIA/cccl/pull/195
- Enable use of `cudaMemcpyAsync` for `thrust::copy` by @miscco in https://github.com/NVIDIA/cccl/pull/211
- Enable additional arguments in build_common.sh by @wmaxey in https://github.com/NVIDIA/cccl/pull/236
- [BUG] Properly uglify all qualifiers in product headers by @miscco in https://github.com/NVIDIA/cccl/pull/201
- Port `cub::Device{Select, Partition}` tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/229
- Fix CUB tests / MSVC 2022 by @senior-zero in https://github.com/NVIDIA/cccl/pull/255
- Ensure that any CMake re-rooting doesn't break our find_file by @miscco in https://github.com/NVIDIA/cccl/pull/257
- [BUG] Fix compilation issues with MSVC 2017 by @miscco in https://github.com/NVIDIA/cccl/pull/196
- Implement iterator concepts by @miscco in https://github.com/NVIDIA/cccl/pull/223
- Tune Histogram on H100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/266
- Add WarpExchangeAlgorithm customization for WarpExchange class by @pb-dseifert in https://github.com/NVIDIA/cccl/pull/256
- [BUG]: Avoid deprecation warning for `std::aligned_storage` when building with c++23 by @miscco in https://github.com/NVIDIA/cccl/pull/258
- Port cub::DeviceReduce tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/267
- Add support for nvcc-specific matrix. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/243
- Fix anchor link to cooperative groups in CUDA programming guide by @wence- in https://github.com/NVIDIA/cccl/pull/274
- Fix BibTeX syntax in CITATION.md [skip-tests] by @wence- in https://github.com/NVIDIA/cccl/pull/276
- Enforce C++17 for benches by @senior-zero in https://github.com/NVIDIA/cccl/pull/275
- Project Automation: Move PR and Linked Issues to In Progress by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/170
- Update to 23.08 devcontainers and CUDA 12.2 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/270
- [skip-tests] CTK 12.2 tuning image by @senior-zero in https://github.com/NVIDIA/cccl/pull/282
- Fix single-thread block reduction by @senior-zero in https://github.com/NVIDIA/cccl/pull/287
- Tune Select and Partition on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/289
- Fix CUB tests / MSVC by @senior-zero in https://github.com/NVIDIA/cccl/pull/292
- Allow building CUB tests without cuRand by @senior-zero in https://github.com/NVIDIA/cccl/pull/250
- Fixup to CUB build - s/curand/cudart/ by @wmaxey in https://github.com/NVIDIA/cccl/pull/301
- Fix OOB in `cub::DeviceRunLengthEncode::NonTrivialRuns` by @senior-zero in https://github.com/NVIDIA/cccl/pull/294
- Tune RLE on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/295
- Tune scan on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/302
- Add new CCCL:: CMake targets by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/244
- Fix `cudacc` and `nvcc` mixup. by @wmaxey in https://github.com/NVIDIA/cccl/pull/329
- [skip-tests] Use builtin for `destructible` concept on MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/333
- Fix merge conflict from two inflight PRs by @miscco in https://github.com/NVIDIA/cccl/pull/338
New Contributors
- @raydouglass made their first contribution in https://github.com/NVIDIA/cccl/pull/1
- @brycelelbach made their first contribution in https://github.com/NVIDIA/cccl/pull/4
- @msadang made their first contribution in https://github.com/NVIDIA/cccl/pull/17
- @wmaxey made their first contribution in https://github.com/NVIDIA/cccl/pull/218
- @milesvant made their first contribution in https://github.com/NVIDIA/cccl/pull/221
- @pb-dseifert made their first contribution in https://github.com/NVIDIA/cccl/pull/256
- @wence- made their first contribution in https://github.com/NVIDIA/cccl/pull/274
Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0
- C++
Published by jrhemstad over 2 years ago