Recent Releases of cccl

cccl - CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)

These are the changes in the cuda.cccl libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025. cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Major API improvements
- Single-call APIs in cuda.cccl.parallel
New algorithms
- Device-wide histogram
- StripedToBlock exchange
Infrastructure improvements
- CuPy dependency replaced with cuda.core
- Support for CUDA 13 drivers

Major API improvements

Single-call APIs in `cuda.cccl.parallel`

Previously, performing an operation like reduce_into required four API invocations: (1) create a reducer object, (2) compute the amount of temporary storage required for the reduction, (3) allocate that amount of temporary memory, and (4) perform the reduction.

In this version, cuda.cccl.parallel introduces simpler, single-call APIs. For example, reduction looks like:

```python
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
```

If you wish to have more control over temporary memory allocation, the previous API still exists (and always will). It has been renamed from reduce_into to make_reduce_into:

```python
# Object API
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer(temp_storage, d_input, d_output, num_items, h_init)
```

New algorithms

Device-wide histogram

The histogram_even function provides Python exposure of the corresponding CUB C++ API DeviceHistogram::HistogramEven.
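As a rough illustration of what an even-width histogram computes, here is a pure-Python reference model. It is only a sketch of the binning rule documented for CUB's `DeviceHistogram::HistogramEven` (`num_levels - 1` equal-width bins over `[lower_level, upper_level)`, with out-of-range samples ignored). The function name `histogram_even_reference` and its argument order are hypothetical and do not mirror the actual `histogram_even` signature in cuda.cccl.

```python
def histogram_even_reference(samples, num_levels, lower_level, upper_level):
    """CPU reference model (hypothetical helper, not the cuda.cccl API).

    Counts samples into (num_levels - 1) equal-width bins that cover
    the half-open range [lower_level, upper_level).
    """
    num_bins = num_levels - 1
    if num_bins < 1:
        raise ValueError("num_levels must be at least 2")
    bin_width = (upper_level - lower_level) / num_bins
    counts = [0] * num_bins
    for sample in samples:
        # Samples outside [lower_level, upper_level) are not counted.
        if lower_level <= sample < upper_level:
            index = int((sample - lower_level) / bin_width)
            # Guard against floating-point rounding at the upper edge.
            counts[min(index, num_bins - 1)] += 1
    return counts


samples = [2.2, 6.1, 7.1, 2.9, 3.5, 0.3, 2.9, 2.1, 6.1, 999.5]
histogram = histogram_even_reference(samples, 7, 0.0, 12.0)
print(histogram)  # [1, 5, 0, 3, 0, 0]
```

A model like this can serve as a small correctness check when comparing against results computed on the device.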

`StripedToBlock` exchange

cuda.cccl.cooperative adds a block.exchange providing Python exposure of the corresponding CUB C++ API BlockExchange. Currently, only the StripedToBlock exchange pattern is supported.
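To make the data movement concrete, the following pure-Python sketch models a striped-to-blocked exchange on the host (CUB's C++ method for this pattern is `BlockExchange::StripedToBlocked`). It assumes the usual CUB definitions: in a striped arrangement, thread `t` holds logical elements `t`, `t + num_threads`, and so on, while in a blocked arrangement each thread holds a contiguous run of elements. The helper `striped_to_blocked` is hypothetical and is not the cuda.cccl.cooperative API.

```python
def striped_to_blocked(per_thread_items):
    """Host-side model of a striped-to-blocked exchange (hypothetical helper).

    per_thread_items[t][i] is the i-th item held by thread t in a striped
    arrangement, i.e. logical element i * num_threads + t.
    Returns the same data rearranged so that thread t holds the contiguous
    logical elements [t * items_per_thread, (t + 1) * items_per_thread).
    """
    num_threads = len(per_thread_items)
    items_per_thread = len(per_thread_items[0])

    # Scatter every item to its logical position in the tile.
    tile = [None] * (num_threads * items_per_thread)
    for t, items in enumerate(per_thread_items):
        for i, item in enumerate(items):
            tile[i * num_threads + t] = item

    # Hand each thread a contiguous slice of the tile.
    return [
        tile[t * items_per_thread:(t + 1) * items_per_thread]
        for t in range(num_threads)
    ]


striped = [[0, 4], [1, 5], [2, 6], [3, 7]]  # 4 threads, 2 items each
blocked = striped_to_blocked(striped)
print(blocked)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```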

Infrastructure improvements

CuPy dependency replaced with `cuda.core`

Use of CuPy within the library has been replaced with the lighter weight cuda.core package. This means that installing cuda.cccl won't install CuPy as a dependency.

Support for CUDA 13 drivers

cuda.cccl can be used with CUDA 13 compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not yet supported, meaning you still need the CUDA 12 toolkit. Full support for CUDA 13 toolkit is planned for the next pre-release.

Published by shwina 5 months ago

cccl - v3.0.2

What's Changed

🔄 Other Changes

[Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5348
Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5330
[Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in https://github.com/NVIDIA/cccl/pull/5352

Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.0.1...v3.0.2

Published by github-actions[bot] 5 months ago

cccl - v3.0.1

What's Changed

🔄 Other Changes

[Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5256
[Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5258
[BACKPORT 3.0] Make sure that nested tuple and pair have the expected size (#5246) by @miscco in https://github.com/NVIDIA/cccl/pull/5265
[BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in https://github.com/NVIDIA/cccl/pull/5271
[BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in https://github.com/NVIDIA/cccl/pull/5281

Full Changelog: https://github.com/NVIDIA/cccl/compare/v3.0.0...v3.0.1

Published by github-actions[bot] 6 months ago

cccl - v3.0.0

CCCL 3.0 Release

The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.

While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.

Key Changes in CCCL 3.0

Requirements

C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
Compilers:
- GCC 7+ (support for GCC < 7 has been dropped #3268)
- Clang 14+ (support for Clang < 14 has been dropped #3309)
- MSVC 2019+ (support for MSVC 2017 has been dropped #3287, #3553)
Dropped support for
- ICC #3277, #3279
- CUDA Dynamic Parallelism v1 (CDPv1) #3344

Header Directory Changes in CUDA Toolkit 13.0

CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.

| Before CUDA 13.0 | After CUDA 13.0 |
| :---- | :---- |
| ${CTK_ROOT}/include/cuda/ | ${CTK_ROOT}/include/cccl/cuda/ |
| ${CTK_ROOT}/include/cub/ | ${CTK_ROOT}/include/cccl/cub/ |
| ${CTK_ROOT}/include/thrust/ | ${CTK_ROOT}/include/cccl/thrust/ |

These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.

What you need to know

❌ Do NOT write #include <cccl/...>; this will break.
If using CCCL headers only in files compiled with nvcc:
- ✅ No action needed. This is the default for most users.
If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
- Using CMake and linking CCCL::CCCL
  - ✅ No action needed. (This is the recommended path. See example)
- Other build systems
  - ⚠️ Add ${CTK_ROOT}/include/cccl to your compiler's include search path (e.g., with -I)

These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.

Major API Changes

Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.

Removed Macros

Over 50 legacy macros have been removed in favor of modern C++ alternatives:

CUB_{MIN,MAX}: use cuda::std::{min,max} instead #3821
THRUST_NODISCARD: use [[nodiscard]] instead #3746
THRUST_INLINE_CONSTANT: use `inline constexpr` instead #3746
See CCCL 2.x to 3.0 Migration Guide for complete list

Removed Functions and Classes

thrust::optional: use cuda::std::optional instead #4172
thrust::tuple: use cuda::std::tuple instead #2395
thrust::pair: use cuda::std::pair instead #2395
thrust::numeric_limits: use cuda::std::numeric_limits instead #3366
cub::BFE: use cuda::bitfield_insert and cuda::bitfield_extract instead #4031
cub::ConstantInputIterator: use thrust::constant_iterator instead #3831
cub::CountingInputIterator: use thrust::counting_iterator instead #3831
cub::GridBarrier: use cooperative groups instead #3745
cub::DeviceSpmv: use cuSPARSE instead #3320
cub::Mutex: use cuda::std::mutex instead #3251
See CCCL 2.x to 3.0 Migration Guide for complete list

New Features

C++

`cuda::`

cuda::std::numeric_limits now supports __float128 #4059
cuda::std::optional<T&> implementation (P2988) #3631
cuda::std::numbers header for mathematical constants #3355
NVFP8/6/4 extended floating-point types support in <cuda/std/cmath> #3843
cuda::overflow_cast for safe numeric conversions #4151
cuda::ilog2 and cuda::ilog10 integer logarithms #4100
cuda::round_up and cuda::round_down utilities #3234

`cub::`

`cub::DeviceSegmentedReduce` now supports a large number of segments #3746
`cub::DeviceCopy::Batched` now supports a large number of buffers #4129
`cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065

`thrust::`

New `thrust::offset_iterator` iterator #4073
Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204

Python

CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.

pip install cuda-cccl

cuda.cccl.cooperative

Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
Block-level data movement now supports multi-dimensional thread blocks #3161
New block-level inclusive sum algorithm #3921

cuda.cccl.parallel

New device-level segmented-reduce algorithm #3906
New device-level unique-by-key algorithm #3947
New device-level merge-sort algorithm #3763
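For readers unfamiliar with the first two algorithms, the sketch below gives pure-Python reference models of their semantics as documented for the corresponding CUB device algorithms. Segmented reduce reduces each `[start, end)` slice independently, and unique-by-key keeps the first key/value pair of every run of consecutive equal keys. Both function names and signatures are hypothetical and are not the cuda.cccl.parallel API.

```python
from functools import reduce
import operator


def segmented_reduce_reference(values, start_offsets, end_offsets, op, init):
    """Reduce each segment values[start:end] with `op`, starting from `init`.

    Hypothetical host-side model; an empty segment yields `init`.
    """
    return [
        reduce(op, values[start:end], init)
        for start, end in zip(start_offsets, end_offsets)
    ]


def unique_by_key_reference(keys, values):
    """Keep the first (key, value) pair of each run of equal consecutive keys.

    Hypothetical host-side model of unique-by-key semantics.
    """
    out_keys, out_values = [], []
    for key, value in zip(keys, values):
        if not out_keys or out_keys[-1] != key:
            out_keys.append(key)
            out_values.append(value)
    return out_keys, out_values


segment_sums = segmented_reduce_reference(
    [8, 6, 7, 5, 3, 0, 9], [0, 3, 3], [3, 3, 7], operator.add, 0
)
print(segment_sums)  # [21, 0, 17]

unique_keys, unique_values = unique_by_key_reference(
    [0, 2, 2, 9, 5, 5, 5, 8], [1, 2, 3, 4, 5, 6, 7, 8]
)
print(unique_keys, unique_values)  # [0, 2, 9, 5, 8] [1, 2, 4, 5, 8]
```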

What's Changed

🚀 Thrust / CUB

Drop cub::Mutex by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3251
Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3257
Remove thrust::[unary|binary]_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3260
Drop thrust not1 and not2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3264
Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3258
Drop thrust::[unary|binary]_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3274
Enable thrust::identity test for non-MSVC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3281
Enable PDL in triple chevron launch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3282
Drop Thrust legacy arch macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3298
Drop Thrust's compiler_fence.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3300
Drop CUB's util_compiler.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3302
Drop Thrust's deprecated compiler macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3301
Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3305
Require C++17 for compiling Thrust and CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3255
Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3299
Deprecate cub::IterateThreadStore by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3337
Drop CUB's BinaryFlip operator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3332
Deprecate cub::Swap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3333
Drop CUB APIs with a debug_synchronous parameter by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3330
Drop CUB's util_compiler.cuh for real by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3340
Drop cub::ValueCache by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3346
Drop CDPv1 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3344
Use cuda::std::addressof in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3363
Drop deprecated aliases in Thrust functional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3272
Drop cub::DivideAndRoundUp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3347
Use cuda::std::min/max in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3364
Cleanup CUB util_arch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2773
Deprecate thrust::null_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3367
Deprecate thrust::async by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3324
Review CUB util.ptx for CCCL 2.x by @fbusato in https://github.com/NVIDIA/cccl/pull/3342
Deprecate thrust::numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3366
Deprecate thrust::optional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3307
Redefine and deprecate thrust::remove_cvref by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3394
Replace and deprecate thrust::cuda_cub::terminate by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3421
Deprecate cub::{min, max} and replace internal uses with those from libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3419
Moves agents to detail::<algorithm_name> namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3435
Default transform_iterator's copy ctor by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3395
Refactor allocator handling of contiguous_storage by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3050
Drop thrust::detail::integer_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3391
Deprecate a few CUB macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3456
Deprecate thrust universal iterator categories by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3461
Drop thrust universal iterator categories by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3474
Moves CUB kernel entry points to a detail namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3468
Deprecate block/warp algo specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3455
Drop thrust numeric_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3476
Deprecate and replace thrust::cuda_cub iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3422
Deprecate thrust macros from type_deduction.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3501
Deprecate thrust event, future and more by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3457
Drop thrust::null_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3508
Deprecates tuning policy hubs by @elstehle in https://github.com/NVIDIA/cccl/pull/3514
Deprecate macros from cuda/detail/core/util.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3504
Deprecate CUB iterators existing in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3304
Deprecate thrust logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3538
Fixes value type of thrust::tabulate_output_iterator by @elstehle in https://github.com/NVIDIA/cccl/pull/3573
Internalize cuda/detail/core/* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3505
Remove CUB DeviceSpMV by @fbusato in https://github.com/NVIDIA/cccl/pull/3549
Remove LEGACY_PTX_ARCH by @fbusato in https://github.com/NVIDIA/cccl/pull/3551
Removes deprecated Agent* alias templates in the public namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3717
Move ForceInclusive parameter of DispatchScan before policy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3739
Drop Thrust's cpp_compatibility.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3746
Drop thrust::identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3747
Drop deprecated entities from CUB util_type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3743
Drop cub::GridBarrier by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3745
Move Dispatcher policy hub parameters to the back by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3740
Drop small deprecated entites by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3748
Error when users specialize BaseTraits but not numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3836
Drop deprecated iterators from Thrust cuda utils by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3905
Drop CUB thread operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3918
Minimize usage of cub::Traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3863
Drop/internalize some macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3936
Drop public access to RegBoundScaling/MemBoundScaling by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3934
Drop deprecated features from CUB util_ptx.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3935
Fix definition of universal_host_pinned_memory_resource by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3988
Assert offset type in DispatchScan[ByKey] to be unsigned and at least 4 bytes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3992
Drop deprecated CUB macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3821
Drop deprecated warp/block algo specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4007
Drop remaining 2.8-deprecated entities by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4009
Use cuda::std::array in histogram APIs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3973
Test tuple of iterator reference assignment by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1964
Rework counting_iterator difference by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3861
[thrust, docs] Use the variadic overload of make_zip_iterator in the zip_iterator docs by @brycelelbach in https://github.com/NVIDIA/cccl/pull/4111

📚 Libcudacxx

ptx: Add add_ptx_instruction.py by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3190
Fix assert definition for NVHPC due to constexpr issues by @miscco in https://github.com/NVIDIA/cccl/pull/3418
ceil_div return common type and optmize by @fbusato in https://github.com/NVIDIA/cccl/pull/3229
attempt to work around msvc bug exposed by type_list.h by @ericniebler in https://github.com/NVIDIA/cccl/pull/3487
Ensure that pointer_traits work nicely with proxy iterators by @miscco in https://github.com/NVIDIA/cccl/pull/3519
Define is_floating_point_v in terms of is_floating_point by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3923
Rework our mdspan implementation by @miscco in https://github.com/NVIDIA/cccl/pull/3343
Implement more of cmath by @miscco in https://github.com/NVIDIA/cccl/pull/3963

📝 Documentation

Improve docs of std headers by @miscco in https://github.com/NVIDIA/cccl/pull/3416

🔄 Other Changes

Expands support for more offset types in segmented benchmark by @elstehle in https://github.com/NVIDIA/cccl/pull/3231
Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects by @miscco in https://github.com/NVIDIA/cccl/pull/3253
[Version] Update main to v2.9.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/3247
Architecture and OS identification macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3237
[Version] Update main to v3.0.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/3265
CCCL Internal macro documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3238
Require at least gcc7 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3268
Drop ICC from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3277
[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place by @caugonnet in https://github.com/NVIDIA/cccl/pull/3270
Disambiguate line continuations and macro continuations in by @wmaxey in https://github.com/NVIDIA/cccl/pull/3244
Drop VS 2017 from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3287
Drop ICC support in code by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3279
Make CUB NVRTC commandline arguments come from a cmake template by @wmaxey in https://github.com/NVIDIA/cccl/pull/3292
Add components to the bug report template by @caugonnet in https://github.com/NVIDIA/cccl/pull/3295
Use process isolation instead of default hyper-v for Windows. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3294
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/NVIDIA/cccl/pull/3248
Drop CTK 11.x from CI by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3275
Update repo_man and packman versions by @shwina in https://github.com/NVIDIA/cccl/pull/3293
Adds support for large number of items to DevicePartition::If with the ThreeWayPartition overload by @elstehle in https://github.com/NVIDIA/cccl/pull/2506
Refactor scan tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3262
Implement views::empty by @miscco in https://github.com/NVIDIA/cccl/pull/3254
Refactor limits and climits by @davebayer in https://github.com/NVIDIA/cccl/pull/3221
cuda.parallel: Add documentation for the current iterators along with examples and tests by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3311
Drop clang<14 from CI, update devcontainers. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3309
[STF] Cleanup task dependencies object constructors by @caugonnet in https://github.com/NVIDIA/cccl/pull/3291
Disable test with a gcc-14 regression by @miscco in https://github.com/NVIDIA/cccl/pull/3297
Remove dropped function objects from docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3319
Document NV_TARGET macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3313
[STF] Define ctx.pick_stream() which was missing for the unified context by @caugonnet in https://github.com/NVIDIA/cccl/pull/3326
Clarify CUB transform output can overlap input by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3323
Limits the number of different offset types for DeviceMergeSort by @elstehle in https://github.com/NVIDIA/cccl/pull/3328
Drop thrust::void_t by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3362
Fix all_of documentation for empty ranges by @upsj in https://github.com/NVIDIA/cccl/pull/3358
[STF] Do not keep track of dangling events in a CUDA graph backend by @caugonnet in https://github.com/NVIDIA/cccl/pull/3327
Extract scan kernels into NVRTC-compilable header by @shwina in https://github.com/NVIDIA/cccl/pull/3334
Implement cuda::std::numeric_limits for __half and __nv_bfloat16 by @davebayer in https://github.com/NVIDIA/cccl/pull/3361
Deprecate cub::DeviceSpmv by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3320
Improves DeviceSegmentedSort test run time for large number of items and segments by @elstehle in https://github.com/NVIDIA/cccl/pull/3246
Compile basic infra test with C++17 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3377
Adds support for large number of items and large number of segments to DeviceSegmentedSort by @elstehle in https://github.com/NVIDIA/cccl/pull/3308
Exit with error when RAPIDS CI fails. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3385
cuda.parallel: Support structured types as algorithm inputs by @shwina in https://github.com/NVIDIA/cccl/pull/3218
Fix broken _CCCL_BUILTIN_ASSUME macro by @fbusato in https://github.com/NVIDIA/cccl/pull/3314
Replace typedef with using in libcu++ by @davebayer in https://github.com/NVIDIA/cccl/pull/3368
Upgrade to Catch2 3.8 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3310
refactor <cuda/std/cstdint> by @davebayer in https://github.com/NVIDIA/cccl/pull/3325
Update CODEOWNERS by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3331
Fix sign-compare warning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3408
Implement more cmath functions to be usable on host and device by @miscco in https://github.com/NVIDIA/cccl/pull/3382
Extend CUB reduce benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3401
Update upload-pages-artifact to v3 by @shwina in https://github.com/NVIDIA/cccl/pull/3423
std::linalg accessors and transposed_layout by @fbusato in https://github.com/NVIDIA/cccl/pull/2962
Add round up/down to multiple by @fbusato in https://github.com/NVIDIA/cccl/pull/3234
[FEA]: Introduce Python module with CCCL headers by @rwgk in https://github.com/NVIDIA/cccl/pull/3201
cuda.parallel: Add optional stream argument to reduce_into() by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3348
Fix Deploy CCCL pages workflow by @rwgk in https://github.com/NVIDIA/cccl/pull/3434
[CUDAX] Fix CI issues in the nightly testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3443
Remove deprecated cub::min and thrust::remove_cvref by @miscco in https://github.com/NVIDIA/cccl/pull/3450
Fix typo in builtin by @miscco in https://github.com/NVIDIA/cccl/pull/3451
Uses unsigned offset types in thrust's scan algorithms by @elstehle in https://github.com/NVIDIA/cccl/pull/3436
Turn C++ dialect warning into error by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3453
Uses unsigned offset types in thrust's sort algorithm calling into DispatchMergeSort by @elstehle in https://github.com/NVIDIA/cccl/pull/3437
Add cuda::is_floating_point supporting half and bfloat by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3379
Drop C++11 and C++14 support for all of cccl by @miscco in https://github.com/NVIDIA/cccl/pull/3417
[CUDAX] Fix block and grid dimension order in <<<>>> in one of the hierarchy tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3465
Add --extended-lambda to the list of removed clangd flags by @fbusato in https://github.com/NVIDIA/cccl/pull/3432
add _CCCL_HAS_NVFP8 macro by @fbusato in https://github.com/NVIDIA/cccl/pull/3429
Add _CCCL_BUILTIN_PREFETCH by @fbusato in https://github.com/NVIDIA/cccl/pull/3433
Ensure that headers in <cuda/*> can be build with a C++ only compiler by @miscco in https://github.com/NVIDIA/cccl/pull/3472
Specialize __is_extended_floating_point for FP8 types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3470
Refactor CUB's util_debug by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3345
Specialize cuda::std::numeric_limits for FP8 types by @davebayer in https://github.com/NVIDIA/cccl/pull/3478
Fix typo in limits by @miscco in https://github.com/NVIDIA/cccl/pull/3491
Add dynamic CUB dispatch for scan to support c.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3398
Use a raw string literal for nvrtc source by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3486
Add popcount, clz, ctz builtin intrinsics by @fbusato in https://github.com/NVIDIA/cccl/pull/3489
[STF] Fix paths in the STF unittest infrastructure by @caugonnet in https://github.com/NVIDIA/cccl/pull/3396
Increase test coverage now that we dropped half of our configs by @miscco in https://github.com/NVIDIA/cccl/pull/3500
Fix issue with conversion between mdspan<T> and mdspan<const T> by @miscco in https://github.com/NVIDIA/cccl/pull/3469
Extract merge sort kernels to NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3438
[STF] Generate statistics about the DOT output by @caugonnet in https://github.com/NVIDIA/cccl/pull/3509
[CUDAX] Align some naming and add missing docs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3497
[CUDAX] Rename hierarchy_dimensions_fragment to hierarchy_dimensions and remove the old alias by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3496
cuda.parallel: invoke pytest directly rather than via python -m pytest by @shwina in https://github.com/NVIDIA/cccl/pull/3523
add a __call_result_t alias template, implement __is_callable_v with it by @ericniebler in https://github.com/NVIDIA/cccl/pull/3527
cudastf (examples): Fix compiler errors when enabling examples for CUDA STF by @janciesko in https://github.com/NVIDIA/cccl/pull/3516
A few improvements for internal macro documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3554
Replace pipes.quote with shlex.quote in lit config by @wmaxey in https://github.com/NVIDIA/cccl/pull/3547
Tune cub::DeviceTransform for Blackwell by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3543
Refactor injecting benchmark policy_hub by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3561
Try to always include the definition of barrier_native_handle when needed by @miscco in https://github.com/NVIDIA/cccl/pull/3556
Fix transform iterator for non-copy-constructible types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3542
Sync ptx helpers with libcudaptx by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3564
Update ptx_isa.h to include 8.7 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3563
add missing visibility annotations to ustdex types that have data members by @ericniebler in https://github.com/NVIDIA/cccl/pull/3571
[STF] Document dot sections by @caugonnet in https://github.com/NVIDIA/cccl/pull/3506
Remove nvks runners from testing pool. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3580
Try and get rapids green by @miscco in https://github.com/NVIDIA/cccl/pull/3503
Add __int128 and __float128 detection macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3413
Remove all code paths and policies for SM37 and below by @fbusato in https://github.com/NVIDIA/cccl/pull/3466
PTX: Update generated files with Blackwell instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3568
Update CI matrix to use NVKS nodes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3572
Deprecate and replace CUB_IS_INT128_ENABLED by @fbusato in https://github.com/NVIDIA/cccl/pull/3427
Adds support for large num items to DeviceMerge by @elstehle in https://github.com/NVIDIA/cccl/pull/3530
Support FP16 traits on CTK 12.0 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3535
Suppress execution checks for vocabulary types by @miscco in https://github.com/NVIDIA/cccl/pull/3578
[nv/target] Add sm_120 macros. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3550
PTX: Remove internal instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3583
Add dynamic CUB dispatch for merge_sort by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3525
PTX: Update existing instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3584
PTX: Add clusterlaunchcontrol by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3589
PTX: Add st.bulk by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3604
PTX: Add multimem instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3603
PTX: Add cp.async.mbarrier.arrive{.noinc} by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3602
PTX: Add tcgen05 instructions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3607
Use a differrent implementation for tuple_of_iterator_references to tuple conversion by @miscco in https://github.com/NVIDIA/cccl/pull/3609
work around erroneous "undefined in device code" error in basic_any by @ericniebler in https://github.com/NVIDIA/cccl/pull/3614
Deprecate AgentSegmentFixupPolicy by @fbusato in https://github.com/NVIDIA/cccl/pull/3593
Fix deadlocks by enabling eager module loading in libcudacxx tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3585
Add b200 tunings for histogram by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3616
make uninitialized[_async]_buffer's range accessors const-correct by @ericniebler in https://github.com/NVIDIA/cccl/pull/3615
Fix typo in index.rst by @cliffburdick in https://github.com/NVIDIA/cccl/pull/3620
Disable X86-64 detection macro for Arm64 emulation on MSVC by @fbusato in https://github.com/NVIDIA/cccl/pull/3540
Deprecate ABI v2 and v3 in libcudacxx by @wmaxey in https://github.com/NVIDIA/cccl/pull/3575
Add b200 policies for reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3612
Add b200 tunings for reduce.by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3610
Remove CUDA 11.x support by @fbusato in https://github.com/NVIDIA/cccl/pull/3596
PTX: fix cp.async.bulk.tensor and mbarrier.arrive by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3628
Add b200 tunings for radix_sort.keys by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3611
Try and make nvrtc on windows pass by @miscco in https://github.com/NVIDIA/cccl/pull/3623
Sync PTX refactorings from libcudaptx by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3632
Bump CI to use CTK 12.8, add sm100 build. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3544
PTX: add bfind, exit and trap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3627
Adds benchmarks for cub::DeviceMerge by @elstehle in https://github.com/NVIDIA/cccl/pull/3529
remove AgentSegmentFixupPolicy by @fbusato in https://github.com/NVIDIA/cccl/pull/3639
__builtin_isfinite is only available above nvrtc 12.2 by @miscco in https://github.com/NVIDIA/cccl/pull/3644
Turn TEST_[HALF|BF]_T into function-style macros and fix some tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3608
[STF] frozen_logical_data::get_access_mode() by @caugonnet in https://github.com/NVIDIA/cccl/pull/3646
Internalize triple_chevron by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3648
This improves the detection logic for __cccl_ptx_isa for clang-cuda by @miscco in https://github.com/NVIDIA/cccl/pull/3647
Try to fix backport workflow by @leofang in https://github.com/NVIDIA/cccl/pull/3634
Revert #3623 by @leofang in https://github.com/NVIDIA/cccl/pull/3654
Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3635
Fix transform_iterator and drop result_of_adaptable_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3652
Transition build system of cuda_cccl and cuda_parallel to scikit-build-core by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3597
Replaces bool template parameters on Dispatch* class templates to use enum class by @elstehle in https://github.com/NVIDIA/cccl/pull/3643
Add b200 policies for device.select.if,flagged,unique by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3545
Add b200 tunings for radix_sort.pairs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3626
Fix the vectorized loading of BlockLoad by @ChristinaZ in https://github.com/NVIDIA/cccl/pull/3517
PTX: mbarrier.{test,try}_wait: Fix return value by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3670
Add b200 policies for cub.select.unique_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3557
Update RAPIDS CI build to 25.04. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3539
Fix issues with nvrtc compilation by @miscco in https://github.com/NVIDIA/cccl/pull/3666
Function-like macros for FP6/BF16 macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3588
Remove cub::ArrayWrapper by @fbusato in https://github.com/NVIDIA/cccl/pull/3677
Internalize cub::PolicyWrapper by @fbusato in https://github.com/NVIDIA/cccl/pull/3681
Modernize MSVC 2005/nvcc workaround by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3606
Deprecate cub::AliasTemporaries by @fbusato in https://github.com/NVIDIA/cccl/pull/3679
[CUB] Remove pre-c++17 conditions and code by @fbusato in https://github.com/NVIDIA/cccl/pull/3684
Internalize cub::KernelConfig by @fbusato in https://github.com/NVIDIA/cccl/pull/3683
remove MSVC 2017 paths by @fbusato in https://github.com/NVIDIA/cccl/pull/3553
[Thrust] Remove pre-c++17 conditions and code by @fbusato in https://github.com/NVIDIA/cccl/pull/3687
Remove cugraph-ops from RAPIDS 25.04 builds. by @bdice in https://github.com/NVIDIA/cccl/pull/3675
Refactor radix_sort tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3657
Make thrust iterators work with NVRTC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3676
Deprecate and replace thrust::identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3649
Replace CUB iterators by Thrust ones by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3480
Drop Thrust's global workaround by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3692
replace Int2Type in CUB library by @fbusato in https://github.com/NVIDIA/cccl/pull/3641
Add b200 policies for cub.device.run_length_encode.encode,non_trivial_runs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3546
Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3689
Fix sccache reporting in CI summaries. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3621
Make THRUST_DEVICE_SYSTEM and THRUST_CPP_DIALECT independent of THRUST_HOST_SYSTEM by @adams381 in https://github.com/NVIDIA/cccl/pull/3659
Deprecate cub::RegBoundScaling and cub::MemBoundScaling by @fbusato in https://github.com/NVIDIA/cccl/pull/3685
Fix devcontainers' initializeCommand by @trxcllnt in https://github.com/NVIDIA/cccl/pull/3533
[cuda.cooperative] Add missing overloads to block.reduce and block.sum by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2691
clean up the cudax __launch_transform code and document its purpose and design by @ericniebler in https://github.com/NVIDIA/cccl/pull/3526
Add b200 policies for partition.three_way by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3708
Fix multiple CI arches in matrix by @alliepiper in https://github.com/NVIDIA/cccl/pull/3702
Minor cleanups following bool-to-enum template parameter PR by @elstehle in https://github.com/NVIDIA/cccl/pull/3716
Remove V2 and V3 ABI support from libcudacxx. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3662
Add b200 tunings for scan.exclusive.by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3560
assorted bug fixes for the std::execution implementation in cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/3721
Minor fix for a regressing tuning in reduce.by_key by @gonidelis in https://github.com/NVIDIA/cccl/pull/3723
Fix SM100 histogram tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3691
Move zip_iterator to internally use cuda::std::tuple by @miscco in https://github.com/NVIDIA/cccl/pull/3725
Remove reduce tunings with no benefit by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3724
fix ::cuda::discard_memory by @fbusato in https://github.com/NVIDIA/cccl/pull/3733
Add b200 policies for cub.device.partition.flagged,if by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3617
Add b200 tunings for scan.exclusive.sum by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3559
Fix cub trait deprecations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3742
Nightly fixes by @alliepiper in https://github.com/NVIDIA/cccl/pull/3720
Clarify scan benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3709
Drop thrust::future|event|async::* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3730
Replace raw arm64/x86_64 macros by @fbusato in https://github.com/NVIDIA/cccl/pull/3732
Add Merge Sort implementation for c.parallel by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3636
Extracted Segmented Reduce kernels into NVRTC compilable header by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3727
Remove unsupported CPU architecture paths (32-bit) by @fbusato in https://github.com/NVIDIA/cccl/pull/3752
[Automation] Add release workflow for tagging and testing new RCs by @wmaxey in https://github.com/NVIDIA/cccl/pull/3009
fix cuda std namespace by @fbusato in https://github.com/NVIDIA/cccl/pull/3751
Remove cuda/__init__.py in cuda-parallel package by @shwina in https://github.com/NVIDIA/cccl/pull/3750
Simplify cuda::std::{min,max} by @miscco in https://github.com/NVIDIA/cccl/pull/3758
Add dynamic CUB dispatch for SegmentedReduce by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3753
[STF] Implement kernel chains in the graph backend without child graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3707
Add Scan implementation for c.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3462
cuda.parallel: Minor perf improvements by @shwina in https://github.com/NVIDIA/cccl/pull/3718
refactor <cuda/std/cstdlib> by @davebayer in https://github.com/NVIDIA/cccl/pull/3339
Fix python editable builds by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3762
Reinstate thrust::optional by @miscco in https://github.com/NVIDIA/cccl/pull/3759
Drop unsupported dialects for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3695
Disable [[no_unique_address]] for MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/3757
cuda.coop: Generalize war_introspection utility for any # of arguments by @shwina in https://github.com/NVIDIA/cccl/pull/3769
Avoid issues with nvcc compilation in c++ mode by @miscco in https://github.com/NVIDIA/cccl/pull/3770
Refactor cuda/cmath functions documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3773
[STF] Factorize large event lists in CUDA graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3756
Replace pre-c++17 traits with modern ones in CUB by @fbusato in https://github.com/NVIDIA/cccl/pull/3774
Drop cugraph-gnn from rapids CI by @miscco in https://github.com/NVIDIA/cccl/pull/3771
[STF] Ensure dot_section::guard is actually movable by @caugonnet in https://github.com/NVIDIA/cccl/pull/3778
Guard PDL by availability by @miscco in https://github.com/NVIDIA/cccl/pull/3779
[STF] virtual to_string() method for STF contexts by @caugonnet in https://github.com/NVIDIA/cccl/pull/3781
[STF] Enable freeze on logical tokens by @caugonnet in https://github.com/NVIDIA/cccl/pull/3782
Refactors DeviceMemcpy's vectorized_copy tests by @elstehle in https://github.com/NVIDIA/cccl/pull/3777
More h100 usage. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3776
Add Python wrappers for c.parallel scan API by @shwina in https://github.com/NVIDIA/cccl/pull/3592
Replace _CCCL_IF_CONSTEXPR by @fbusato in https://github.com/NVIDIA/cccl/pull/3775
Remove _CCCL_CONSTEXPR_CXX14/17 by @fbusato in https://github.com/NVIDIA/cccl/pull/3793
Bump -std from 14 to 17 in `./ci/(build|test)_cub.sh` examples. by @tpn in https://github.com/NVIDIA/cccl/pull/3792
[CUDAX] Add host launch API allowing stream ordered host execution by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3555
Moves DeviceMemcpy's BitPackedCounter tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/3794
Refactor <cuda/std/cstring> by @davebayer in https://github.com/NVIDIA/cccl/pull/3484
fix NoopExecutor by @fbusato in https://github.com/NVIDIA/cccl/pull/3811
Unifies workload generation for DeviceMerge benchmarks by @elstehle in https://github.com/NVIDIA/cccl/pull/3645
Optimize and clean countl, countr, popcount, has_single_bit by @fbusato in https://github.com/NVIDIA/cccl/pull/3414
fix -Werror=unused-result by @fbusato in https://github.com/NVIDIA/cccl/pull/3810
Enable cuda::std::ssize for C++17 by @miscco in https://github.com/NVIDIA/cccl/pull/3813
fix _LIBCUDACXX_HAS_NO_INT128 with NVRTC by @fbusato in https://github.com/NVIDIA/cccl/pull/3802
Move radix sort kernels to separate NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3803
Fix popc parentheses warning by @fbusato in https://github.com/NVIDIA/cccl/pull/3820
Add arch_traits for sm100 to cudax. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3818
Remove unused function parameter by @ericniebler in https://github.com/NVIDIA/cccl/pull/3828
CI summary fix by @alliepiper in https://github.com/NVIDIA/cccl/pull/3826
Refactor Thrust allocator example by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3830
[STF] Improved cache mechanism for executable CUDA graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/3768
Drop deprecated CUB iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3831
Use libcu++ limits/trait in tests/benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3822
Move unique_by_key kernels to NVRTC compilable header by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3815
Specialize numeric_limits for CUDA 12.8 FP types by @davebayer in https://github.com/NVIDIA/cccl/pull/3832
Refactor thrust::zip_iterator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3834
Refactor Thrust iterators 2/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3840
Refactor Thrust iterators 3/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3842
Refactor Thrust iterators 4/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3833
Increase libcudacxx test timeout by @alliepiper in https://github.com/NVIDIA/cccl/pull/3850
Use lower case variable name to avoid macro collisions by @miscco in https://github.com/NVIDIA/cccl/pull/3856
Fix incorrect availability of variant in docs by @miscco in https://github.com/NVIDIA/cccl/pull/3859
Add cuda_cccl to the list of Python packages for which test suite is run by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3846
Refactor Thrust iterators 1/4 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3839
Rewrites DeviceMemcpy::Batched tests to use device-side data generation and Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/3849
Refactor CUB transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3825
Add Python wrappers for c.parallel merge_sort API by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3763
Add c parallel segmented reduce api by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3838
[libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel) by @gonzalobg in https://github.com/NVIDIA/cccl/pull/3671
Consider specializations of std::iterator_traits by @miscco in https://github.com/NVIDIA/cccl/pull/3837
Update supported C++ dialects in README by @davebayer in https://github.com/NVIDIA/cccl/pull/3879
Refactor assume_aligned implementation by @fbusato in https://github.com/NVIDIA/cccl/pull/3765
Refactor and make NVRTC compile <cub/util_device> by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3880
Cache the result of merge_sort() by @shwina in https://github.com/NVIDIA/cccl/pull/3881
do not try to use clang-19's support for c++26 pack indexing by @ericniebler in https://github.com/NVIDIA/cccl/pull/3888
Add support for single item per thread calls to blockscan.exclusivescan by @tpn in https://github.com/NVIDIA/cccl/pull/3829
Document cuda::maximum, cuda::minimum by @fbusato in https://github.com/NVIDIA/cccl/pull/3883
Refactor Thrust iterator_traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3892
Update Blackwell PTX instruction availability tables by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3894
Fix CCCL C headers to be compileable by C compiler by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3885
Move transform kernels to NVRTC compilable header by @shwina in https://github.com/NVIDIA/cccl/pull/3875
PTX shfl_sync by @fbusato in https://github.com/NVIDIA/cccl/pull/3241
Add a warning that we cannot tune transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3896
Extend tuning guide by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3904
Drop join_iterator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3891
Revert Thrust find_if_not implementation to please nvc++ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3901
[CUB/docs] Add missing closing braces to BlockReduce kernel examples in CUB docs. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/3916
[STF] Executable CUDA graphs caching policies by @caugonnet in https://github.com/NVIDIA/cccl/pull/3868
Refactor Thrust iterator internals by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3893
Revert Thrust mismatch implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3899
Replace usage of CUB_MIN|MAX in reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3927
Move to cuda::std::iterator_traits in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3924
Add C++ test for single-item-per-thread BlockScan Sum routines. by @tpn in https://github.com/NVIDIA/cccl/pull/3889
Rename threads_in_block -> threads_per_block to be consistent with CUB. by @tpn in https://github.com/NVIDIA/cccl/pull/3919
Implement cuda.cooperative.blockscan.inclusivesum(). by @tpn in https://github.com/NVIDIA/cccl/pull/3921
Replace CUB macros in more places by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3930
[PTX] Add shl, shr, bmsk, prmt by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3939
Add testreduceapi.py::testreducestructtypeminmax by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3938
Add cuda::std::aligned_accessor by @fbusato in https://github.com/NVIDIA/cccl/pull/3731
[STF] Thread safe graph_ctx by @caugonnet in https://github.com/NVIDIA/cccl/pull/3925
Replace CUB macros in tunings and benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3931
Deprecate and replace some Thrust iterator traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3928
Optimize bit_floor, bit_ceil, bit_width by @fbusato in https://github.com/NVIDIA/cccl/pull/3296
Allow RAPIDS workflow to run on an arbitrary branch. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3945
Initial CUDA C++ Execution Model documentation by @gonzalobg in https://github.com/NVIDIA/cccl/pull/3873
[STF] Remove unmaintained CUDASTF_DEBUG option by @caugonnet in https://github.com/NVIDIA/cccl/pull/3944
Revert "Initial CUDA C++ Execution Model documentation (#3873)" by @alliepiper in https://github.com/NVIDIA/cccl/pull/3950
Implement ranges::ref_view by @miscco in https://github.com/NVIDIA/cccl/pull/3316
Expose CCCL branch controls on Actions UI for RAPIDS workflow. by @alliepiper in https://github.com/NVIDIA/cccl/pull/3948
Drop unused TEST_COMPILER_CUDACC_BELOW_11_3 macro by @miscco in https://github.com/NVIDIA/cccl/pull/3946
Allow NVRTC to compile more of CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3951
Use _CCCL_REQUIRES_EXPR in test code by @miscco in https://github.com/NVIDIA/cccl/pull/3954
Improve <cuda/std/bit> documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/3959
[STF] Support generation of multiple CUDA graphs from separate threads by @caugonnet in https://github.com/NVIDIA/cccl/pull/3943
Add segmented_reduce python api by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/3906
Implement __cccl_is_integer trait by @davebayer in https://github.com/NVIDIA/cccl/pull/3962
Implement cudax::async_buffer by @miscco in https://github.com/NVIDIA/cccl/pull/3460
Add dynamic CUB dispatch for unique_by_key by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/3816
Fix typo in _LIBCUDACXX_HAS_NVFP16 macro by @davebayer in https://github.com/NVIDIA/cccl/pull/3965
Drop obsolete thrust tuple algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3966
Extend CUB policy and tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3933
Fix thrust::raw_reference_cast for tuple_of_iterator_references and simplify thrust::generate by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3970
[PTX] Add st, ld instructions by @fbusato in https://github.com/NVIDIA/cccl/pull/3974
[cuda.cooperative] Support multidimensional thread blocks in block load/store and improve load/store docs by @brycelelbach in https://github.com/NVIDIA/cccl/pull/3161
Disable automatic header inclusion for clangd by @miscco in https://github.com/NVIDIA/cccl/pull/3365
Deprecate and replace THRUST_STATIC_ASSERT by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3971
Avoid int overflow during multipl

- C++
Published by github-actions[bot] 6 months ago

cccl - v2.8.5

What's Changed

Avoid plain assert in device code by @miscco in https://github.com/NVIDIA/cccl/pull/4707
Do not use open-coded INFINITY for tests that also test extended floating points by @miscco in https://github.com/NVIDIA/cccl/pull/4744
[Version] Update branch/2.8.x to v2.8.5 by @github-actions in https://github.com/NVIDIA/cccl/pull/4755
[Backport branch/2.8.x] Update Blackwell PTX instruction availability tables by @github-actions in https://github.com/NVIDIA/cccl/pull/3900

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.4...v2.8.5

- C++
Published by github-actions[bot] 7 months ago

cccl - v2.8.4

What's Changed

[BACKPORT] Do not use pack indexing with clang-19 by @miscco in https://github.com/NVIDIA/cccl/pull/4447
[Backport branch/2.8.x] Always bypass automatic atomic storage checks to prevent potential compiler issues by @github-actions in https://github.com/NVIDIA/cccl/pull/4616
[Version] Update branch/2.8.x to v2.8.4 by @github-actions in https://github.com/NVIDIA/cccl/pull/4655

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.3...v2.8.4

- C++
Published by github-actions[bot] 8 months ago

cccl - v2.8.3

What's Changed

[BACKPORT: 2.8] Set NO_CMAKE_FIND_ROOT_PATH for cudax. (#4162) by @miscco in https://github.com/NVIDIA/cccl/pull/4216
[BACKPORT 2.8] Fix the cuda python setup by @miscco in https://github.com/NVIDIA/cccl/pull/4218
Backport PR #4221 to branch/2.8.x — Remove python/cuda_cooperative/setup.py by @rwgk in https://github.com/NVIDIA/cccl/pull/4235
[Backport branch/2.8.x] Remove invalid single # in builtin.h by @github-actions in https://github.com/NVIDIA/cccl/pull/4326
[BACKPORT 2.8] Allow rapids to avoid unrolling some loops in sort (#4253) by @miscco in https://github.com/NVIDIA/cccl/pull/4387
[Backport branch/2.8.x] Fix uninitialized read in local atomic code path. by @github-actions in https://github.com/NVIDIA/cccl/pull/4424
[Version] Update branch/2.8.x to v2.8.3 by @github-actions in https://github.com/NVIDIA/cccl/pull/4423

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.2...v2.8.3

- C++
Published by github-actions[bot] 9 months ago

cccl - v2.8.2

What's Changed

[Version] Update branch/2.8.x to v2.8.2 by @github-actions in https://github.com/NVIDIA/cccl/pull/4079
Ignore Wmaybe-uninitialized in dispatch_reduce.cuh. by @bdice in https://github.com/NVIDIA/cccl/pull/4054
backport: fix numeric_limits digits for nvfp8/6/4 (#4070) by @miscco in https://github.com/NVIDIA/cccl/pull/4130
[BACKPORT]: Avoid compiler issue with MSVC and span constructor by @miscco in https://github.com/NVIDIA/cccl/pull/4127

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.1...v2.8.2

- C++
Published by github-actions[bot] 9 months ago

cccl - v2.8.1

What's Changed

Backport to 2.8: NVHPC fixes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4021
[Backport 2.8.x] [cuda::ptx] Fix .cta_group::2 definition (#4038) by @wmaxey in https://github.com/NVIDIA/cccl/pull/4044
[Version] Update branch/2.8.x to v2.8.1 by @github-actions in https://github.com/NVIDIA/cccl/pull/4049

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.8.0...v2.8.1

- C++
Published by github-actions[bot] 10 months ago

cccl - CCCL 2.8.0

What's Changed

Adds benchmarks for DeviceSelect::Unique by @elstehle in https://github.com/NVIDIA/cccl/pull/2359
CUB - Enable DPX Reduction by @fbusato in https://github.com/NVIDIA/cccl/pull/2286
[CUDAX] add a small c++17 implementation of std::execution (aka P2300) by @ericniebler in https://github.com/NVIDIA/cccl/pull/2301
Add thrust::transform_inclusive_scan with init value by @gonidelis in https://github.com/NVIDIA/cccl/pull/2326
Widen histogram agent constructor to more types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2380
Use a constant for the amount of static SMEM by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2374
Add cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2086
Update toolkit to CTK 12.6 by @miscco in https://github.com/NVIDIA/cccl/pull/2348
implement make_integer_sequence in terms of intrinsics whenever possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/2384
Implement cuda::mr::cuda_async_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1637
Drop implementation of thrust::pair and thrust::tuple by @miscco in https://github.com/NVIDIA/cccl/pull/2395
Pull out _LIBCUDACXX_UNREACHABLE into its own file by @miscco in https://github.com/NVIDIA/cccl/pull/2399
Share common compiler flags in new CCCL-level targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2386
conditionally include <crt/host_defines.h> from __cccl/execution_space.h header by @ericniebler in https://github.com/NVIDIA/cccl/pull/2406
add some simple utilities for manipulating lists of types by @ericniebler in https://github.com/NVIDIA/cccl/pull/2370
Drop thrusts diagnostic suppression warnings by @miscco in https://github.com/NVIDIA/cccl/pull/2392
[PoC]: Implement cuda::experimental::uninitialized_async_buffer by @miscco in https://github.com/NVIDIA/cccl/pull/1854
Fix thrust package to work with newer FindOpenMP.cmake. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2421
Introduce cccl_configure_target cmake function. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2388
Fix sccache errors in RAPIDS builds by @trxcllnt in https://github.com/NVIDIA/cccl/pull/2417
Replace CUDA C++ Core Libraries with CUDA Core Compute Libraries (only in README.md). by @rwgk in https://github.com/NVIDIA/cccl/pull/2424
Minor cleanup with cuda/atomic by @miscco in https://github.com/NVIDIA/cccl/pull/2418
uninitialized_buffer::get_resource returns a ref to an any_resource that can be copied by @ericniebler in https://github.com/NVIDIA/cccl/pull/2431
Refactor cuda::ceil_div to take two different types by @miscco in https://github.com/NVIDIA/cccl/pull/2376
Reduce PR testing matrix. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2436
Implement cudax::shared_resource by @miscco in https://github.com/NVIDIA/cccl/pull/2398
Increase the libcu++ timeout by @miscco in https://github.com/NVIDIA/cccl/pull/2435
Move c/include/cccl/*.h files to c/include/cccl/c/*.h by @rwgk in https://github.com/NVIDIA/cccl/pull/2428
Make any_resource emplacable by @miscco in https://github.com/NVIDIA/cccl/pull/2425
Fix issues with __host__ and __device__ definitions by @miscco in https://github.com/NVIDIA/cccl/pull/2413
Make bit_cast play nice with extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2434
Do not include our own string.h file by @miscco in https://github.com/NVIDIA/cccl/pull/2444
Move nightly time by @bdice in https://github.com/NVIDIA/cccl/pull/2437
Remove a ton of lines in thrust tests by @gonidelis in https://github.com/NVIDIA/cccl/pull/2356
[CUDAX] Add placeholder green context type and logical device that can hold both a green ctx and a device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2446
Fix typo in CCCLBuildCompilerTargets.cmake by @alliepiper in https://github.com/NVIDIA/cccl/pull/2453
Drop superfluous compile definition from thrust tests by @miscco in https://github.com/NVIDIA/cccl/pull/2450
Consolidate packages and install rules by @alliepiper in https://github.com/NVIDIA/cccl/pull/2456
Prune CUB's ChainedPolicy by __CUDA_ARCH_LIST__ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2154
fixes merge conflict for policy pruning by @elstehle in https://github.com/NVIDIA/cccl/pull/2466
Add CCCL_ENABLE_WERROR flag. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2463
Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments by @fbusato in https://github.com/NVIDIA/cccl/pull/2254
Propagate compiler flags down to libcu++ LIT tests by @Artem-B in https://github.com/NVIDIA/cccl/pull/2420
Drop remaining uses of _LIBCUDACXX_COMPILER_* by @miscco in https://github.com/NVIDIA/cccl/pull/2467
Avoid C++17 extension in c++11 tests by @miscco in https://github.com/NVIDIA/cccl/pull/2469
Add span to example and templated block size by @Kh4ster in https://github.com/NVIDIA/cccl/pull/2470
Drop Objective C++ support by @miscco in https://github.com/NVIDIA/cccl/pull/2468
removes superfluous template keyword in call to Dereference by @andrewcorrigan in https://github.com/NVIDIA/cccl/pull/2482
Improve build times in several heavyweight libcudacxx tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2478
Drop __availability header by @miscco in https://github.com/NVIDIA/cccl/pull/2484
Replace a few more instances of CUDA C++ Core Libraries with CUDA Core Compute Libraries. by @rwgk in https://github.com/NVIDIA/cccl/pull/2447
Fix common_type specialization for extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2483
Implement some CUDA API calls for async_memory_pool by @miscco in https://github.com/NVIDIA/cccl/pull/2455
Move cudax example project to CCCL project examples. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2462
Disable system header for narrowing conversion check by @miscco in https://github.com/NVIDIA/cccl/pull/2465
Require resources to always provide at least one execution space property by @miscco in https://github.com/NVIDIA/cccl/pull/2489
Rework builtin handling by @miscco in https://github.com/NVIDIA/cccl/pull/2461
Disable execution checks for std::equal by @miscco in https://github.com/NVIDIA/cccl/pull/2491
replace _CCCL_ALWAYS_INLINE with _CCCL_FORCEINLINE by @ericniebler in https://github.com/NVIDIA/cccl/pull/2439
Drop 2 relative includes that snuck in by @miscco in https://github.com/NVIDIA/cccl/pull/2492
re-express the cudax::__tupl::__apply member to make nvc++ happy by @ericniebler in https://github.com/NVIDIA/cccl/pull/2493
Drop badly named _One_of concept by @miscco in https://github.com/NVIDIA/cccl/pull/2490
Unify assert handling in cccl by @miscco in https://github.com/NVIDIA/cccl/pull/2382
Reduce scope of Thrust linkage in cudax. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2496
Centralize CPM logic. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2495
Fix typo in presets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2497
Refactor away per-project TOPLEVEL flags. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2498
[FEA]: Validate cuda.parallel type matching in build and execution by @rwgk in https://github.com/NVIDIA/cccl/pull/2429
avoid gcc optimizer bug by not force inlining part of thrust::transform by @ericniebler in https://github.com/NVIDIA/cccl/pull/2509
Cleanup and modularize <cuda/std/barrier> by @miscco in https://github.com/NVIDIA/cccl/pull/2443
Consolidate header testing infra. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2460
Add ForEachN from CUB to cccl/c. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2378
Adds support for large number of items in DeviceSelect and DevicePartition by @elstehle in https://github.com/NVIDIA/cccl/pull/2400
Adds support for large number of items to DeviceScan::*ByKey family of algorithms by @elstehle in https://github.com/NVIDIA/cccl/pull/2477
Integrate c/parallel with CCCL build system and CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2514
Create a command list utility for nvrtc/jitlink steps. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2511
Fix the example project which the documentation refers to by @caugonnet in https://github.com/NVIDIA/cccl/pull/2531
Enable tests/headertests for c/parallel in all-dev presets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2566
Rename cudax test targets to match CCCL conventions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2568
Update project list in issue template by @alliepiper in https://github.com/NVIDIA/cccl/pull/2532
Disable compiler extensions on CCCL targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2559
Fixes cub::DeviceMemcpy::Batched to be able to copy from const source pointers by @elstehle in https://github.com/NVIDIA/cccl/pull/2573
Fix documentation error in ci/build_common.sh for -arch by @caugonnet in https://github.com/NVIDIA/cccl/pull/2574
gcc-14 gained the ability to mangle noexcept expressions by @ericniebler in https://github.com/NVIDIA/cccl/pull/2565
Miscellaneous simple fixes by @rwgk in https://github.com/NVIDIA/cccl/pull/2575
Avoid including yvals.h when the compiler is not MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2545
Fix popc.h when architecture is not x86 on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2524
test for exceptions support on msvc with the _CPPUNWIND macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2576
fix the forwarding of the receiver in the just_from algorithm by @ericniebler in https://github.com/NVIDIA/cccl/pull/2569
Block type pack indexing on NVCC by @wmaxey in https://github.com/NVIDIA/cccl/pull/2563
Cleanup the semaphore headers by @miscco in https://github.com/NVIDIA/cccl/pull/2441
Add _CCCL_GRID_CONSTANT macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2530
Add _CCCL_RESTRICT macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2529
Try to use the same redefinition of __assert_fail as pytorch has by @miscco in https://github.com/NVIDIA/cccl/pull/2577
Fix miscellaneous bugs in cub/iterator documentation. by @rwgk in https://github.com/NVIDIA/cccl/pull/2580
Expose parts of <cuda/std/memory> by @fbusato in https://github.com/NVIDIA/cccl/pull/2502
add a config macro for testing support for inline variables by @ericniebler in https://github.com/NVIDIA/cccl/pull/2581
add dialect macros _CCCL_NO_RTTI and _CCCL_NO_TYPEID by @ericniebler in https://github.com/NVIDIA/cccl/pull/2578
fix misspelling in the _CCCL_NO_VARIABLE_TEMPLATES macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2584
Add atomic_ref support for 8 and 16b types. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2255
add _LIBCUDACXX_REQUIRES_EXPR to the concepts emulation macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/2564
Ensure CuPy arrays can be used with cuda.parallel too by @leofang in https://github.com/NVIDIA/cccl/pull/2335
assert that cuda::std::declval is noexcept by @ericniebler in https://github.com/NVIDIA/cccl/pull/2588
Revert accidental force push to main. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2596
add __is_callable_v variable template when possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/2598
Cleanup threading support by @miscco in https://github.com/NVIDIA/cccl/pull/2507
CCCL_TOPLEVEL_PROJECT always needs to be defined by @robertmaynard in https://github.com/NVIDIA/cccl/pull/2597
Strip prefix paths from cudax documentation by @caugonnet in https://github.com/NVIDIA/cccl/pull/2603
examples/cudax/CMakeLists.txt should not be executable by @caugonnet in https://github.com/NVIDIA/cccl/pull/2594
[CUDAX] Peer access control on async_memory_pool and async_memory_resource by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2587
Introduce _CCCL_PRAGMA to CCCL by @davebayer in https://github.com/NVIDIA/cccl/pull/2610
Only enable CUDA language when needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2612
Modularize latch by @miscco in https://github.com/NVIDIA/cccl/pull/2508
Unify kernel dispatch paths for device reduce between CUB and c.parallel. by @griwes in https://github.com/NVIDIA/cccl/pull/2591
Integrate CUDASTF -> CudaX by @caugonnet in https://github.com/NVIDIA/cccl/pull/2572
[STF] The cmake example for stf was not updated when moving to main branch by @caugonnet in https://github.com/NVIDIA/cccl/pull/2618
Rework head_flags so that we do not rely on the tuple being unevaluated by @miscco in https://github.com/NVIDIA/cccl/pull/2619
[CUDAX] size_bytes in buffer types by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2621
fix portability bug in libcu++'s implementation of char_traits by @ericniebler in https://github.com/NVIDIA/cccl/pull/2623
[cccl/c] Unify some build boilerplate by @wmaxey in https://github.com/NVIDIA/cccl/pull/2625
devcontainer: replace VAULT_HOST with AWS_ROLE_ARN by @jjacobelli in https://github.com/NVIDIA/cccl/pull/2604
Add checks to unique_id by @andralex in https://github.com/NVIDIA/cccl/pull/2622
Add cuda::get_device_address by @miscco in https://github.com/NVIDIA/cccl/pull/2611
Do not pass integral constants to ptx by @miscco in https://github.com/NVIDIA/cccl/pull/2229
Add nvhpc devcontainer to CI by @miscco in https://github.com/NVIDIA/cccl/pull/1488
Use a default initialization for CUDA graph mem alloc nodes by @caugonnet in https://github.com/NVIDIA/cccl/pull/2632
[CUDAX] Add get_name to device_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2631
Add 12.5 devcontainer needed for nvhpc by @miscco in https://github.com/NVIDIA/cccl/pull/2634
a substitute for std::type_info when the compiler doesn't support RTTI by @ericniebler in https://github.com/NVIDIA/cccl/pull/2582
Check for missing inline on functions in public headers. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2570
fix linker errors about multiply defined symbols in STF by @ericniebler in https://github.com/NVIDIA/cccl/pull/2641
Add installation presets and update README with install steps by @alliepiper in https://github.com/NVIDIA/cccl/pull/2643
Fix annotated_ptr test failures. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2607
Issue a deprecation warning when compiling with ICC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2076
Include all python libs in inspect_changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2648
Add reusable workflow for updating version in branch with a PR by @wmaxey in https://github.com/NVIDIA/cccl/pull/2589
define _CCCL_NO_RTTI in device code; RTTI isn't available there by @ericniebler in https://github.com/NVIDIA/cccl/pull/2639
Migrate C2H library to top-level library by @alliepiper in https://github.com/NVIDIA/cccl/pull/2629
[CUDAX] Add can_peer_access_to API to device_ref and check both ways access in get_peers by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2642
Use _CCCL_ASSERT for stf by @miscco in https://github.com/NVIDIA/cccl/pull/2645
un-templatize CUDASTF's callback_completion_kernel per @robertmaynard by @ericniebler in https://github.com/NVIDIA/cccl/pull/2656
Implement C++20 <source_location> by @miscco in https://github.com/NVIDIA/cccl/pull/2628
Disable [[no_unique_address]] for clang and mdspan by @miscco in https://github.com/NVIDIA/cccl/pull/2646
[STF] Adapt timingwithfences test to be more reliable by @caugonnet in https://github.com/NVIDIA/cccl/pull/2658
Add prefetching kernel as new fallback for cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2396
Drop cub::DeviceTransform fallback to cub::DeviceFor by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2660
Ignore more files when detecting CI changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2654
Add thrust::universal_host_pinned_vector by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2653
add new type-list algorithms copy_if, remove_if, find_if, and unique by @ericniebler in https://github.com/NVIDIA/cccl/pull/2644
abide by CCCL config macro naming conventions for _CCCL_PRETTY_FUNCTION and _CCCL_NO_BUILTIN_STRLEN by @ericniebler in https://github.com/NVIDIA/cccl/pull/2640
[STF] Fix how we define multi-dimensional shapes in the documentation by @caugonnet in https://github.com/NVIDIA/cccl/pull/2662
Automate creating a CCCL release from RC tags. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2657
Enable span to work with contiguous std containers in C++17 by @miscco in https://github.com/NVIDIA/cccl/pull/2613
[Version] Update main to v2.8.0 by @github-actions in https://github.com/NVIDIA/cccl/pull/2670
promote the cudax __async/config.cuh to be the config for all of cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/2638
avoid using nvcc's __type_pack_element before 12.2 by @ericniebler in https://github.com/NVIDIA/cccl/pull/2673
Update ninja_summary.py to support ninja log v6. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2663
Rename new CUB headers to follow conventions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2675
consistent use of _CUDAX function attributes in the cudax __async/ directory by @ericniebler in https://github.com/NVIDIA/cccl/pull/2676
[CUDAX] Add forwarding reference to functor accepting launch by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2677
[CUDAX] Add initial bits of copy_bytes and fill_bytes by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2608
suppress msvc warning "qualifier applied to function type" in is_function by @ericniebler in https://github.com/NVIDIA/cccl/pull/2683
Disable ublkcp CUB transform kernel for NVHPC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2664
Deprecate thrust::cuda_cub::identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2688
Remove an unused variable by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2690
Setup cudax examples. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2697
portability fixes for _CCCL_BUILTIN_PRETTY_FUNCTION and _CCCL_TYPEID by @ericniebler in https://github.com/NVIDIA/cccl/pull/2695
address portability issues found while using the typelist/typeset utilities by @ericniebler in https://github.com/NVIDIA/cccl/pull/2694
Make tests technically correct by initializing the barrier by @miscco in https://github.com/NVIDIA/cccl/pull/2701
Fix invalid memory reads in test_device_batch_copy. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2698
revert config macros _CCCL_CUDACC_BELOW_XX_X to their original semantics by @ericniebler in https://github.com/NVIDIA/cccl/pull/2700
This cleans up our function objects a bit by @miscco in https://github.com/NVIDIA/cccl/pull/2702
Drop handling of 32bit Windows by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2689
Guard inclusion of cuda_runtime_api by using a cuda compiler by @miscco in https://github.com/NVIDIA/cccl/pull/2704
Fix race condition in block_reduce_raking. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2699
Honor CCCL_ENABLE_WERROR for CUDA targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2705
Fix nvbench helper compilation for clang-18 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2707
Default ctor of device_ptr and normal_iterator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2708
Add cuda::minimum and cuda::maximum by @Jacobfaib in https://github.com/NVIDIA/cccl/pull/2681
Various fixes to cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2709
Make thrust::transform use cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2389
Ensure that we only use the inline variable trait when it is actually available by @miscco in https://github.com/NVIDIA/cccl/pull/2712
[CUDAX] Rename memory resource and memory pool from async to device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2710
triple_chevron fix by @fbusato in https://github.com/NVIDIA/cccl/pull/2720
Improve uninitialized_{async_}buffer API by @miscco in https://github.com/NVIDIA/cccl/pull/2713
Fix merge conflict from renaming of async_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/2728
[STF] Improve DOT graph outputs by @caugonnet in https://github.com/NVIDIA/cccl/pull/2703
Implement _CCCL_SUPPRESS_DEPRECATED_[PUSH|POP] for ICC and NVHPC by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2730
Clean up CUB thread operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2716
Deprecate/replace more of Thrust functional by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2105
Alias cuda::std::identity to __identity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2733
Do not read uninitialized memory for OOB elements. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2739
Add option to conditionally build CUDASTF by @miscco in https://github.com/NVIDIA/cccl/pull/2731
fix cuda::std::bit_width() return type by @fbusato in https://github.com/NVIDIA/cccl/pull/2745
[STF] Option to disable kernel generation in CUDASTF by @caugonnet in https://github.com/NVIDIA/cccl/pull/2723
fix static_extent() return type by @fbusato in https://github.com/NVIDIA/cccl/pull/2751
make the empty parens after level constructors optional by @ericniebler in https://github.com/NVIDIA/cccl/pull/2750
cudax: rename ustdex's __query member function to query by @ericniebler in https://github.com/NVIDIA/cccl/pull/2757
Implement execution policies by @miscco in https://github.com/NVIDIA/cccl/pull/2715
Document some transform iterator corner cases by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2740
Shorten the git commit message in the ci scripts by @miscco in https://github.com/NVIDIA/cccl/pull/2760
Separate CUDA and C++ code in C2H by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2734
Make get_stream work with queries by @miscco in https://github.com/NVIDIA/cccl/pull/2761
Allow thrust::identity to forward value category by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2732
Proclaim Thrust/CUB/libcu++ functor address stability by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2719
give declval an implementation that compiles 2x faster by @ericniebler in https://github.com/NVIDIA/cccl/pull/2758
[CUDAX] Add modernized simpleP2P sample by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2696
s/get_delegatee_scheduler/get_delegation_scheduler/ by @ericniebler in https://github.com/NVIDIA/cccl/pull/2766
remove duplicated __apply_cv type trait by @ericniebler in https://github.com/NVIDIA/cccl/pull/2754
merge metaprogramming libs from libcudac++ and µstdex by @ericniebler in https://github.com/NVIDIA/cccl/pull/2767
Doc fix scan by @karthikeyann in https://github.com/NVIDIA/cccl/pull/2769
Remove obsolete ways to set iterator category in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2759
Run thrust::transform benchmarks with more elements by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2764
Increase libcu++ timeout by @miscco in https://github.com/NVIDIA/cccl/pull/2774
[STF] Rename the redux access mode into relaxed by @caugonnet in https://github.com/NVIDIA/cccl/pull/2776
Enable type trait aliases in all standard modes by @miscco in https://github.com/NVIDIA/cccl/pull/2763
Optimize, Cleanup, and Expose CUB Thread-Level Reduction by @fbusato in https://github.com/NVIDIA/cccl/pull/2756
Disable execution checks for tuple by @miscco in https://github.com/NVIDIA/cccl/pull/2780
Avoid benchmarking first-time setup in Thrust algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2782
Improve listing benchmarks and text by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2778
Fix thrust partition docs typo by @gonidelis in https://github.com/NVIDIA/cccl/pull/2791
Drop unused sanitizer hook by @miscco in https://github.com/NVIDIA/cccl/pull/2793
use _CCCL_HAS_FEATURE instead of plain __has_feature everywhere by @davebayer in https://github.com/NVIDIA/cccl/pull/2794
Avoid make_zip_iterator(make_tuple(...)) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2796
implement _CCCL_HAS_INCLUDE by @davebayer in https://github.com/NVIDIA/cccl/pull/2786
add __cpp_lib_mdspan feature-test macro by @fbusato in https://github.com/NVIDIA/cccl/pull/2787
Remove redundant cmake from example. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2804
change __as_type_list so it doesn't cause the instantiation of its argument by @ericniebler in https://github.com/NVIDIA/cccl/pull/2803
[CUDAX] Enable passing hierarchy levels directly into make_config by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2755
Fix cudacc/cluster detection macro in launch path of libcudacxx tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/2811
[STF] Replace CUDASTF_CODE_GENERATION by !CUDASTF_DISABLE_CODE_GENERATION by @caugonnet in https://github.com/NVIDIA/cccl/pull/2797
Reduce P0 benchmark variations for merge_sort_pairs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2798
Replace macros by lambdas in cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2817
Add nvrtc_sm_top_level::add_link_list() and use in c/parallel/src/reduce.cu by @rwgk in https://github.com/NVIDIA/cccl/pull/2781
give completion_signatures a fast lookup cache by @ericniebler in https://github.com/NVIDIA/cccl/pull/2812
implement new compiler checks for NVHPC by @davebayer in https://github.com/NVIDIA/cccl/pull/2816
Unify [CCCL|CUB|THRUST]_ENABLE_BENCHMARKS by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2827
Remove traces of metal from CCCL by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2828
Move our CUDACC version checks towards the new version check design by @miscco in https://github.com/NVIDIA/cccl/pull/2826
Extend CUB benchmarking documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2831
Remove all warm-up runs from Thrust benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2838
Utility scripts for benchmark database by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2847
[CUDAX] Add missing sm_61 traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2848
Move _CCCL_COMPILER_ICC to the new macro by @miscco in https://github.com/NVIDIA/cccl/pull/2849
Fix wrong include in Thrust benchmark by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2854
Add missing include by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2855
Move _CCCL_COMPILER_GCC to the new macro by @davebayer in https://github.com/NVIDIA/cccl/pull/2850
Add benchmarking and tuning presets by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2856
Fix race condition in block-RLD test harness. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2706
Add MatX build to CCCL CI by @alliepiper in https://github.com/NVIDIA/cccl/pull/2682
Fix DeviceSegmentedSort NVTX range name by @davidwendt in https://github.com/NVIDIA/cccl/pull/2857
Make discovery mechanism for cuda/_include directory compatible with pip install --editable by @rwgk in https://github.com/NVIDIA/cccl/pull/2846
add missing DOXYGEN_* predefined macros when building the cudax docs by @ericniebler in https://github.com/NVIDIA/cccl/pull/2858
correct the names of shared_resource's async allocate/deallocate members by @ericniebler in https://github.com/NVIDIA/cccl/pull/2880
[Docs/PTX] Add device tensor map init example by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1983
Fix rst typos in benchmarking.html by @gonidelis in https://github.com/NVIDIA/cccl/pull/2868
Include use of NVHPC in CUB/Thrust magic namespace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2771
backport to_underlying by @davebayer in https://github.com/NVIDIA/cccl/pull/2853
move _CCCL_COMPILER_CLANG to the new macro by @davebayer in https://github.com/NVIDIA/cccl/pull/2859
Automate release branch creation by @wmaxey in https://github.com/NVIDIA/cccl/pull/2685
Add thrust_create_target DISPATCH option. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2844
for_each_in_extent by @fbusato in https://github.com/NVIDIA/cccl/pull/2518
Fix old gcc version check by @davebayer in https://github.com/NVIDIA/cccl/pull/2904
Move implementation of _LIBCUDACXX_TEMPLATE to CCCL by @miscco in https://github.com/NVIDIA/cccl/pull/2832
Try to work around issue with NVHPC in conjunction with older CTK versions by @miscco in https://github.com/NVIDIA/cccl/pull/2889
Refactor nvbench helper less_t by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2905
add "interface" to _CCCL_PUSH_MACROS by @ericniebler in https://github.com/NVIDIA/cccl/pull/2919
Replace inconsistent Doxygen macros with _CCCL_DOXYGEN_INVOKED by @ericniebler in https://github.com/NVIDIA/cccl/pull/2921
implement C++26 std::span::at by @davebayer in https://github.com/NVIDIA/cccl/pull/2924
move msvc compiler macros to new version by @davebayer in https://github.com/NVIDIA/cccl/pull/2885
Reorganize PTX tests to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2930
Reorganize PTX docs to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2929
Improve build instructions for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2881
Reorganize PTX headers to match generator by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2925
implement C++26 std::span's constructor from std::initializer_list by @davebayer in https://github.com/NVIDIA/cccl/pull/2923
Add tuple protocol to cuda::std::complex from C++26 by @davebayer in https://github.com/NVIDIA/cccl/pull/2882
Add missing qualifier for cuda namespace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2940
Try to fix a clang warning: by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2941
minor consistency improvements in concepts macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/2928
Drop some of the mdspan fold implementation by @miscco in https://github.com/NVIDIA/cccl/pull/2949
[STF] Implement CUDASTF_DOT_TIMING for the ctx.cuda_kernel construct by @caugonnet in https://github.com/NVIDIA/cccl/pull/2950
Avoid potential null dereference in annotated_ptr by @miscco in https://github.com/NVIDIA/cccl/pull/2951
make compiler version comparison utility generic by @davebayer in https://github.com/NVIDIA/cccl/pull/2952
Add SM100 descriptor to target by @miscco in https://github.com/NVIDIA/cccl/pull/2954
Regenerate cuda::ptx headers/docs and run format by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2937
Regenerate cuda::ptx test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2953
Do not include extended floating point headers if they are not needed by @miscco in https://github.com/NVIDIA/cccl/pull/2956
[CUDAX] Add copy_bytes and fill_bytes overloads for mdspan by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2932
add a _CCCL_NO_CONCEPTS config macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2945
remove definition of macro (_LIBCUDACXX_NO_RTTI) that is no longer used by @ericniebler in https://github.com/NVIDIA/cccl/pull/2957
Avoid symbol clashes with libc++ by @miscco in https://github.com/NVIDIA/cccl/pull/2955
Add more CUB transform benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2906
Start reworking our math functions by @miscco in https://github.com/NVIDIA/cccl/pull/2749
Drop memory resources in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2860
std::dims by @fbusato in https://github.com/NVIDIA/cccl/pull/2961
Fix merge conflict from moving resources up a namespace by @miscco in https://github.com/NVIDIA/cccl/pull/2965
[CUDAX] Add a way to combine thread hierarchies by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2746
Require approval to run CI on draft PRs by @bdice in https://github.com/NVIDIA/cccl/pull/2969
fix thread-reduce performance regression by @fbusato in https://github.com/NVIDIA/cccl/pull/2944
add a __type_switch utility and use it in the ptx generator by @ericniebler in https://github.com/NVIDIA/cccl/pull/2946
replace use of old _CONCEPT_FRAGMENT macro in cudax by @ericniebler in https://github.com/NVIDIA/cccl/pull/2973
remove vestigial uses of the old DOXYGEN_SHOULD_SKIP_THIS macro by @ericniebler in https://github.com/NVIDIA/cccl/pull/2978
Fix proclaim_copyable_arguments for lambdas by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2833
Forward declare half types in cuda::ptx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/2981
Fix tuning benchmark for cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2970
fix old gcc version check by @davebayer in https://github.com/NVIDIA/cccl/pull/2989
Fix a typo in thrust/binary_search.h (#2980) by @hzhangxyz in https://github.com/NVIDIA/cccl/pull/2992
Enable assertions for CCCL users in CMake Debug builds by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2986
Fix CMake warning for FindPythonInterp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2982
Further clarify host compiler support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2991
Drop _CCCL_ELSE_IF_CONSTEXPR by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2966
implement C++26 std::ignore by @davebayer in https://github.com/NVIDIA/cccl/pull/2922
make the upper limit on TMP loop unrolling configurable by @ericniebler in https://github.com/NVIDIA/cccl/pull/2971
Update docs with recent features by @davebayer in https://github.com/NVIDIA/cccl/pull/2994
Restore thrust single config options. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2977
Document tuning DB comparison scripts by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2968
Build CUB and Thrust tests with assertions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2987
Issue a deprecation warning when compiling with Visual Studio 2017 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2990
Guard forward declarations of extended FP types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2998
[STF] Create dot sections to possibly collapse nodes when displaying large DOT graphs by @caugonnet in https://github.com/NVIDIA/cccl/pull/2988
Remove redundant pre c++11 checks by @davebayer in https://github.com/NVIDIA/cccl/pull/2999
Avoid checking unsigned values for negativity by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2997
Rename thrust example version.cu to print_version.cu by @j3soon in https://github.com/NVIDIA/cccl/pull/3002
don't bother sync-ing a stream with itself by @ericniebler in https://github.com/NVIDIA/cccl/pull/3007
Backport is_scoped_enum by @davebayer in https://github.com/NVIDIA/cccl/pull/3003
Put monostate in <utility> by @davebayer in https://github.com/NVIDIA/cccl/pull/3000
backport std integer comparison functions to C++11 by @davebayer in https://github.com/NVIDIA/cccl/pull/2805
backport forward_like by @davebayer in https://github.com/NVIDIA/cccl/pull/2995
Document how to profile benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3015
Update Thrust examples ReadMe by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3004
Deprecate public CUB/Thrust deprecation macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3010
Fix libcudacxx example by @j3soon in https://github.com/NVIDIA/cccl/pull/3013
Refactor BlockLoad test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3005
Fix NVBench profile flags in docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3016
Update RAPIDS to 25.02. by @bdice in https://github.com/NVIDIA/cccl/pull/2967
Tweak tuning database plot and comparison scripts by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2883
Allow passing debug flags to NVRTC in libcudacxx tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/3020
Add missing template parameter to BlockRadixRank example. by @esoha-nvidia in https://github.com/NVIDIA/cccl/pull/2736
Fix value range overflows in tests by @Artem-B in https://github.com/NVIDIA/cccl/pull/3022
Avoid relative includes that have slipped in by @miscco in https://github.com/NVIDIA/cccl/pull/3042
Fix word count example in Thrust by @caugonnet in https://github.com/NVIDIA/cccl/pull/3014
revise <cuda/std/version> by @davebayer in https://github.com/NVIDIA/cccl/pull/3043
Replace thrust::swap by cuda::std::swap by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2985
add a converting constructor to cudax::stream_ref from cuda::stream_ref by @ericniebler in https://github.com/NVIDIA/cccl/pull/3052
[CUDAX] Remove launch overloads taking dimensions and make everything except make_hierarchy return kernel_config by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2979
move sender support library to __async/sender/ by @ericniebler in https://github.com/NVIDIA/cccl/pull/3056
[cuda.cooperative] Add block.load and block.store. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2693
Backport unreachable by @davebayer in https://github.com/NVIDIA/cccl/pull/3018
Define the destructor of kernel_arg by @miscco in https://github.com/NVIDIA/cccl/pull/3060
Add missing __syncthreads() to test by @miscco in https://github.com/NVIDIA/cccl/pull/3061
Add assertions in the mdspan accessors that we are not out of bounds by @miscco in https://github.com/NVIDIA/cccl/pull/3055
Do not use cudaGetErrorString on GPU. by @Artem-B in https://github.com/NVIDIA/cccl/pull/3059
Reduce number of per-PR CI jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2931
Rework CUDA compiler checks by @davebayer in https://github.com/NVIDIA/cccl/pull/3057
implement C++23 invoke_r by @davebayer in https://github.com/NVIDIA/cccl/pull/3041
Consider NV_TARGET_SM_INTEGER_LIST for ChainedPolicy pruning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2772
Add environment to encapsulate information needed for cudax::vector by @miscco in https://github.com/NVIDIA/cccl/pull/2775
We should not call cudaGetErrorString on device by @miscco in https://github.com/NVIDIA/cccl/pull/3062
Introduce cuda.cooperative overloads not requiring temporary storage by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2528
basic_any: a utility for defining type-erasing wrappers in terms of an interface description by @ericniebler in https://github.com/NVIDIA/cccl/pull/2633
Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes by @wmaxey in https://github.com/NVIDIA/cccl/pull/3066
Don't use exact comparison for FP values. by @Artem-B in https://github.com/NVIDIA/cccl/pull/2742
Use consistent spelling for aliasing select benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3073
Improve handling of language level features by @miscco in https://github.com/NVIDIA/cccl/pull/3069
Only tune streaming DeviceSelect versions for 64-bit offsets by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3072
Disable nvrtc workaround by @miscco in https://github.com/NVIDIA/cccl/pull/1116
fix assorted problems in cudax memory resource equality fns by @ericniebler in https://github.com/NVIDIA/cccl/pull/3079
Support fancy iterators in cuda.parallel by @rwgk in https://github.com/NVIDIA/cccl/pull/2788
fix feature test for operator<=> by @ericniebler in https://github.com/NVIDIA/cccl/pull/3075
Mark test as potentially passing by @miscco in https://github.com/NVIDIA/cccl/pull/3078
Avoid padding warning with MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/3077
Improve CUB tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3058
Optimise tuning compile-time by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3074
Use consistent spelling for CounterT in histogram benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3089
[Improvement] Don't require specifying output type when constructing TransformIterator (cuda.parallel) by @shwina in https://github.com/NVIDIA/cccl/pull/3083
simplify the definition of the basic_any class template by @ericniebler in https://github.com/NVIDIA/cccl/pull/3085
Use only signed offset types in CUB benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3087
Improve readability of DispatchSelectIf parameterization by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3092
[cudax] Simplify implementation of device attributes by @davebayer in https://github.com/NVIDIA/cccl/pull/3084
suppress -Werror=empty-body in char_traits implementation by @ericniebler in https://github.com/NVIDIA/cccl/pull/3098
help older clang and gcc to disambiguate basic_any<__ireference<I>> and basic_any<I&> bases by @ericniebler in https://github.com/NVIDIA/cccl/pull/3102
[PERF] cuda.parallel: Cache intermediate results to improve performance of cudax.reduce_into by @shwina in https://github.com/NVIDIA/cccl/pull/3001
[Improvement] cuda.parallel: Don't require value_type when constructing iterators by @shwina in https://github.com/NVIDIA/cccl/pull/3105
Fix zip and permutation iterator EBO on MSVC by @wmaxey in https://github.com/NVIDIA/cccl/pull/3106
Avoid signed unsigned warnings in annotated_ptr test by @miscco in https://github.com/NVIDIA/cccl/pull/3076
Changes DispatchScan[ByKey] documentation to advise using unsigned offset types by @elstehle in https://github.com/NVIDIA/cccl/pull/3111
[STF] reduce access mode by @caugonnet in https://github.com/NVIDIA/cccl/pull/2830
add support for comparing type-erased wrappers to non-type-erased objects by @ericniebler in https://github.com/NVIDIA/cccl/pull/3100
backport byte by @davebayer in https://github.com/NVIDIA/cccl/pull/3091
Add bound checks for each dimension of mdspan by @fbusato in https://github.com/NVIDIA/cccl/pull/3065
Move some CUB tunings to dedicated headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3096
[CUDAX] Add combine API to kernel_config and allow adding default configuration to kernel functors by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3082
Extend tuning guide by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3117
Densen sm90 policy by @gonidelis in https://github.com/NVIDIA/cccl/pull/3121
Fix a typo in the documentation of cub::DeviceReduce::Reduce by @caugonnet in https://github.com/NVIDIA/cccl/pull/3123
Cleanup select if tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3120
Modularize <cuda/std/cstddef> by @davebayer in https://github.com/NVIDIA/cccl/pull/3119
Use programmatic dependent launch in CUB merge sort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3114
Refactor selecting default tuning for select_if by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3124
Refactor SM90 radix_sort tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3125
[STF] Improved sparse CG example and rename scalar to scalar_view by @caugonnet in https://github.com/NVIDIA/cccl/pull/3112
[CUDAX] Fix the other copy of vector_add after migration to use configs in launch by @pciolkosz in https://github.com/NVIDIA/cccl/pull/3129
Refactor cub histogram tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3128
Refactor RLE tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3127
Make PDL available with CTK 12.0 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3136
Refactor reduce_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3137
Refactor scan tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3138
Fix analyze.py bug by @gonidelis in https://github.com/NVIDIA/cccl/pull/3067
Refactor scan_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3139
Refactor three_way_partition tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3140
Clarify passing ValueT to scan_by_key tuning by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3143
Move remaining CUB policy hubs to tuning headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3141
[Internal Cleanup] pre-commit ruff (excluding docs/tools, libcudacxx/test) by @rwgk in https://github.com/NVIDIA/cccl/pull/3110
Add Python codeowners by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3150
make basic_any compile for device by stubbing out the virtual tables by @ericniebler in https://github.com/NVIDIA/cccl/pull/3109
Refactoring unique by key by @gonidelis in https://github.com/NVIDIA/cccl/pull/3145
Add missing header in bench scan exclusive base header by @gonidelis in https://github.com/NVIDIA/cccl/pull/3157
Use synchronize_optional for device-to-device copy in thrust::copy() by @davidwendt in https://github.com/NVIDIA/cccl/pull/3149
[Internal Cleanup] pre-commit ruff libcudacxx/tests by @rwgk in https://github.com/NVIDIA/cccl/pull/3152
Clarify that unknown tuning axes are ignored by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3156
address portability issue in basic_any with older nvcc versions by @ericniebler in https://github.com/NVIDIA/cccl/pull/3160
Add limited H100 testing for CUB by @jrhemstad in https://github.com/NVIDIA/cccl/pull/3151
Unify policy hub handling and update documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3142
make the _CCCL_REQUIRES_EXPR macro more robust by @ericniebler in https://github.com/NVIDIA/cccl/pull/3164
[Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base by @shwina in https://github.com/NVIDIA/cccl/pull/3118
the streams created by cudax::stream should not synchronize with the null stream by @ericniebler in https://github.com/NVIDIA/cccl/pull/3167
[STF] Implement CUDASTF_DOT_TIMING for the host_launch construct by @caugonnet in https://github.com/NVIDIA/cccl/pull/3170
Add support for sm101 and sm101a to NV_TARGET by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3166
implement C++23 byteswap by @davebayer in https://github.com/NVIDIA/cccl/pull/3093
Unifies large problem test helper infrastructure by @elstehle in https://github.com/NVIDIA/cccl/pull/3171
Deprecate C++11 and C++14 for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/3173
Implement abs and div from cstdlib by @davebayer in https://github.com/NVIDIA/cccl/pull/3153
Fix missing radix sort policies by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3174
Introduces new DeviceReduce::Arg{Min,Max} interface with two output iterators by @elstehle in https://github.com/NVIDIA/cccl/pull/3148
Extend tuning documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3179
Add codespell pre-commit hook, fix typos in CCCL by @bdice in https://github.com/NVIDIA/cccl/pull/3168
Fix parameter space for TUNE_LOAD in scan benchmark by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3176
Fix various old compiler version checks by @davebayer in https://github.com/NVIDIA/cccl/pull/3178
Implement ADL-proof std::projected from C++26 by @davebayer in https://github.com/NVIDIA/cccl/pull/3175
Fix pre-commit config for codespell and remaining typos by @shwina in https://github.com/NVIDIA/cccl/pull/3182
Massive cleanup of our config by @miscco in https://github.com/NVIDIA/cccl/pull/3155
Fix UB in atomics with automatic storage by @wmaxey in https://github.com/NVIDIA/cccl/pull/2586
Refactor the source code layout for cuda.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3177
new type-erased memory resources by @ericniebler in https://github.com/NVIDIA/cccl/pull/2824
rename _LIBCUDACXX_DECLSPEC_EMPTY_BASES to _CCCL_DECLSPEC_EMPTY_BASES by @ericniebler in https://github.com/NVIDIA/cccl/pull/3186
Document address stability of thrust::transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3181
turn off cuda version check for clangd by @ericniebler in https://github.com/NVIDIA/cccl/pull/3194
[STF] jacobi example based on parallel_for by @caugonnet in https://github.com/NVIDIA/cccl/pull/3187
Fixes pre-CTK 11.5 diag suppression issues by @elstehle in https://github.com/NVIDIA/cccl/pull/3189
Prefer c2h::type_name over c2h::demangle by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3195
Fix memcpy_async* tests by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3197
Add type annotations and mypy checks for cuda.parallel by @shwina in https://github.com/NVIDIA/cccl/pull/3180
Fix rendering of cuda.parallel docs by @shwina in https://github.com/NVIDIA/cccl/pull/3192
Enable PDL for DeviceMergeSortBlockSortKernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3199
Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax} by @elstehle in https://github.com/NVIDIA/cccl/pull/2647
Fixes for Python 3.7 docs environment by @shwina in https://github.com/NVIDIA/cccl/pull/3206
Adds support for large number of items to DeviceTransform by @elstehle in https://github.com/NVIDIA/cccl/pull/3172
cp_async_bulk: Fix test by @ahendriksen in https://github.com/NVIDIA/cccl/pull/3198
cudax fixes for msvc 14.41 by @ericniebler in https://github.com/NVIDIA/cccl/pull/3200
avoid instantiating class templates in is_same implementation when possible by @ericniebler in https://github.com/NVIDIA/cccl/pull/3203
Fix: make launchers a CUB detail; make kernel source functions hidden. by @griwes in https://github.com/NVIDIA/cccl/pull/3209
help the ranges concepts recognize standard contiguous iterators in c++14/17 by @ericniebler in https://github.com/NVIDIA/cccl/pull/3202
unify macros and cmake options that control the suppression of deprecation warnings by @ericniebler in https://github.com/NVIDIA/cccl/pull/3220
Fix thread-reduce performance regression by @fbusato in https://github.com/NVIDIA/cccl/pull/3225
cuda.parallel: In-memory caching of cuda.parallel build objects by @shwina in https://github.com/NVIDIA/cccl/pull/3216
clean up the cuda::std::span implementation with minimal c++14 range support by @ericniebler in https://github.com/NVIDIA/cccl/pull/3211
use generalized concepts portability macros to simplify the range concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/3217
Use Ruff to sort imports by @shwina in https://github.com/NVIDIA/cccl/pull/3230
Fix scan / sm90 perf regression by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3236
[STF] Logical token by @caugonnet in https://github.com/NVIDIA/cccl/pull/3196
Fix ReduceByKey tuning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3240
Fix RLE tuning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/3239
cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) by @shwina in https://github.com/NVIDIA/cccl/pull/3233
Backport to 2.8: Make CUB NVRTC commandline arguments come from a cmake template (#3292) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3322
Backport to 2.8: Deprecate GridBarrier and GridBarrierLifetime (#3258) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3288
Backport to 2.8: Deprecate cub::Swap (#3333) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3350
Backport to 2.8: Deprecate Thrust's cpp_compatibility.h macros (#3299) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3321
Backport to 2.8: Deprecate cub::IterateThreadStore (#3337) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3351
Backport to 2.8: Deprecate thrust::null_type (#3367) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3373
Backport to 2.8: Review/Deprecate CUB util.ptx for CCCL 2.x (#3342) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3389
Backport to 2.8: Deprecate thrust::optional (#3307) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3393
Backport to 2.8: Deprecate thrust::numeric_limits (#3366) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3392
Backport to 2.8: Redefine and deprecate thrust::remove_cvref (#3394) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3420
Backport to 2.8: Replace and deprecate thrust::cuda_cub::terminate (#3421) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3425
[BACKPORT]: Deprecate cub::{min, max} and replace internal uses with those from libcu++ (#3419) by @miscco in https://github.com/NVIDIA/cccl/pull/3447
Backport to 2.8: Deprecate thrust::async (#3324) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3388
[BACKPORT]: Moves agents to detail:: namespace by @elstehle in https://github.com/NVIDIA/cccl/pull/3454
Backport to 2.8: Deprecate a few CUB macros (#3456) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3463
[BACKPORT]: Fix assert definition for NVHPC due to constexpr issues (#3418) by @miscco in https://github.com/NVIDIA/cccl/pull/3448
Backport to 2.8: Deprecate cub::DeviceSpmv (#3320) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3374
Backport to 2.8: some FP8 support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3479
Backport to 2.8: Deprecate block/warp algo specializations (#3455) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3481
Backport to 2.8: Refactor limits and climits (#3221) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3488
Backport to 2.8: Fix typo in limits (#3491) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3498
Backport to 2.8: Update upload-pages-artifact to v3 (#3423) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3513
Backport to 2.8: Implement cuda::std::numeric_limits for __half and __nv_bfloat16 (#3361) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3490
Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. by @rwgk in https://github.com/NVIDIA/cccl/pull/3536
[Backport 2.8] work around msvc bug exposed by __type_index in type_list.h (#3487) by @wmaxey in https://github.com/NVIDIA/cccl/pull/3537
[Backport] #3572 to the 2.8.x branch. by @miscco in https://github.com/NVIDIA/cccl/pull/3605
Backport to 2.8: Specialize cuda::std::numeric_limits for FP8 types (#3478) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3492
Backport to 2.8: Deprecate thrust universal iterator categories (#3461) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3471
Backport to 2.8: Deprecate and replace thrust::cuda_cub iterators (#3422) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3510
Backport to 2.8: Deprecate thrust macros from type_deduction.h (#3501) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3511
Backport to 2.8: Deprecate macros from cuda/detail/core/util.h (#3504) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3520
[BACKPORT]: Try to always include the definition of barrier_native_handle when needed (#3556) by @miscco in https://github.com/NVIDIA/cccl/pull/3569
Backport to 2.8: Deprecates tuning policy hubs by @elstehle in https://github.com/NVIDIA/cccl/pull/3531
[Backport 2.8] Add extended data type macro identification by @fbusato in https://github.com/NVIDIA/cccl/pull/3586
Backport to 2.8: Deprecate thrust logical meta functions (#3538) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3567
Backport to 2.8: Refactor (#3561) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3566
Backport to 2.8: Tune cub::DeviceTransform for Blackwell (#3543) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3565
Backport to 2.8: Deprecate and replace CUB_IS_INT128_ENABLED (#3427) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3629
Backport to 2.8: Deprecate CUB iterators existing in Thrust (#3304) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3534
Backport to 2.8: Deprecate thrust event, future and more (#3457) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3512
Backport to 2.8: PTX support for Blackwell by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3624
Backport to 2.8: Support FP16 traits on CTK 12.0 (#3535) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3625
[Backport 2.8] Deprecate AgentSegmentFixupPolicy by @fbusato in https://github.com/NVIDIA/cccl/pull/3638
Backport to 2.8: PTX: fix cp.async.bulk.tensor and mbarrier.arrive (#3628) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3630
Backport to 2.8: Suppress execution checks for vocabulary types (#3578) by @miscco in https://github.com/NVIDIA/cccl/pull/3599
[BACKPORT]: Try and get rapids green (#3503) by @miscco in https://github.com/NVIDIA/cccl/pull/3598
Backport to 2.8: Internalize triple_chevron (#3648) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3650
[BACKPORT]: Ensure that headers in <cuda/*> can be build with a C++ only compiler (#3472) by @miscco in https://github.com/NVIDIA/cccl/pull/3651
Backport to 2.8: __builtin_isfinite is only available above nvrtc 12.2 by @leofang in https://github.com/NVIDIA/cccl/pull/3653
[Backport 2.8.x] Backport #3575 deprecating old ABIs in libcudacxx by @wmaxey in https://github.com/NVIDIA/cccl/pull/3660
[Backport 2.8.x] Backport [nv/target] Add sm_120 macros. (#3550) by @wmaxey in https://github.com/NVIDIA/cccl/pull/3661
Backport to 2.8: Add b200 policies for device.select.if,flagged,unique (#3545) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3667
Backport to 2.8: Add b200 tunings for radix_sort.pairs (#3626) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3668
[Backport branch/2.8.x] PTX: mbarrier.{test,try}_wait: Fix return value by @github-actions in https://github.com/NVIDIA/cccl/pull/3672
Backport to 2.8: Add b200 tunings for radix_sort.keys (#3611) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3655
[Backport branch/2.8.x] Fix issues with nvrtc compilation by @github-actions in https://github.com/NVIDIA/cccl/pull/3674
[Backport branch/2.8.x] Add b200 policies for cub.select.unique_by_key by @github-actions in https://github.com/NVIDIA/cccl/pull/3673
[Backport branch/2.8.x] Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @github-actions in https://github.com/NVIDIA/cccl/pull/3658
Backport to 2.8: Deprecate cub::AliasTemporaries (#3679) and cub::PolicyWrapper (#3681) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3690
[Backport branch/2.8.x] Internalize cub::KernelConfig by @github-actions in https://github.com/NVIDIA/cccl/pull/3688
Backport to 2.8: Fix transform_iterator (#3652) and Deprecate thrust::identity (#3649) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3693
Backport to 2.8: Add b200 policies for cub.device.run_length_encode.encode,non_trivial_runs (#3546) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3704
[BACKPORT] Remove cugraph-ops from RAPIDS 25.04 builds. (#3675) by @miscco in https://github.com/NVIDIA/cccl/pull/3696
Backport to 2.8: Make thrust iterators work with NVRTC (#3676) and replace CUB iterators by Thrust ones (#3480) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3697
[Backport branch/2.8.x] Deprecate cub::RegBoundScaling and cub::MemBoundScaling by @github-actions in https://github.com/NVIDIA/cccl/pull/3706
[backport 2.8] Deprecate and replace Int2Type by @fbusato in https://github.com/NVIDIA/cccl/pull/3705
[Backport branch/2.8.x] Add b200 policies for partition.three_way by @github-actions in https://github.com/NVIDIA/cccl/pull/3710
Backport to 2.8: Deprecate cub::Traits::CATEGORY|PRIMITIVE|NULL_TYPE (#3689) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3703
[Backport branch/2.8.x] Add b200 tunings for scan.exclusive.by_key by @github-actions in https://github.com/NVIDIA/cccl/pull/3719
Backport to 2.8: B200 reduce.by_key tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3726
Backport to 2.8: B200 tunings for histogram by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3728
Backport to 2.8: B200 reduce tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3735
Backport to 2.8: Add b200 policies for cub.device.partition.flagged,if (#3617) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3736
Backport to 2.8: Add b200 tunings for scan.exclusive.sum (#3559) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3738
[Backport branch/2.8.x] fix ::cuda::discard_memory by @github-actions in https://github.com/NVIDIA/cccl/pull/3737
[Backport branch/2.8.x] Fix cub trait deprecations by @github-actions in https://github.com/NVIDIA/cccl/pull/3744
[Backport branch/2.8.x] [Automation] Add release workflow for tagging and testing new RCs by @github-actions in https://github.com/NVIDIA/cccl/pull/3754
Suppress deprecations on logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3795
Revert back to cub::Traits::CATEGORY|PRIMITIVE by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/3866
[2.8.x] Disable [[no_unique_address]] for MSVC (#3757) by @miscco in https://github.com/NVIDIA/cccl/pull/3869
[Backport branch/2.8.x] do not try to use clang-19's support for c++26 pack indexing by @github-actions in https://github.com/NVIDIA/cccl/pull/3903

New Contributors

@Artem-B made their first contribution in https://github.com/NVIDIA/cccl/pull/2420
@Kh4ster made their first contribution in https://github.com/NVIDIA/cccl/pull/2470
@andrewcorrigan made their first contribution in https://github.com/NVIDIA/cccl/pull/2482
@jjacobelli made their first contribution in https://github.com/NVIDIA/cccl/pull/2604
@andralex made their first contribution in https://github.com/NVIDIA/cccl/pull/2622
@Jacobfaib made their first contribution in https://github.com/NVIDIA/cccl/pull/2681
@karthikeyann made their first contribution in https://github.com/NVIDIA/cccl/pull/2769
@davidwendt made their first contribution in https://github.com/NVIDIA/cccl/pull/2857
@hzhangxyz made their first contribution in https://github.com/NVIDIA/cccl/pull/2992
@j3soon made their first contribution in https://github.com/NVIDIA/cccl/pull/3002
@esoha-nvidia made their first contribution in https://github.com/NVIDIA/cccl/pull/2736

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.7.0...v2.8.0

- C++
Published by wmaxey 10 months ago

cccl - CCCL 2.7.0

What’s New

C++

Thrust / CUB

Inclusive scan now supports initial value https://github.com/NVIDIA/cccl/pull/1940
Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements https://github.com/NVIDIA/cccl/pull/2171
New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms https://github.com/NVIDIA/cccl/pull/1817
New thrust::tabulate_output_iterator fancy iterator https://github.com/NVIDIA/cccl/pull/2282
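
To make the first item concrete, here is a minimal host-side sketch of what an inclusive scan with an initial value computes. It is a pure-Python reference model of the semantics only, not the CUB or Thrust API; the helper name `inclusive_scan_with_init` is hypothetical and chosen for illustration.

```python
from itertools import accumulate
import operator


def inclusive_scan_with_init(values, op, init):
    """Reference model: out[i] = init op values[0] op ... op values[i]."""
    # accumulate(..., initial=init) yields `init` itself first; an inclusive
    # scan does not emit that leading element, so it is dropped here.
    return list(accumulate(values, op, initial=init))[1:]


# Hypothetical example data: a running sum seeded with 10.
result = inclusive_scan_with_init([1, 2, 3, 4], operator.add, 10)
print(result)
```

Unlike an exclusive scan, the initial value never appears on its own in the output; it is folded into every element, including the first.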

Libcudacxx

Assertions can now be enabled on host and device, depending on the user's choice
C++26 inplace_vector has been implemented and backported to C++14
Improved support for extended floating point types __half and __nv_bfloat16 both for cmath functions and complex
cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
Reworked our atomics implementation
Improved <cuda/std/bit> conformance
Implemented <cuda/std/bitset> and backported to C++14
Implemented and backported C++20 bit_cast. It is available in all standard modes and is constexpr where the compiler supports it
Various backports and constexpr improvements (bool_constant, cuda::std::max)
Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
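
For readers unfamiliar with bit_cast, the sketch below models its semantics in Python: the bytes of a value of one type are reinterpreted as a value of another type of the same size, without any numeric conversion. This illustrates the concept only and is not the libcu++ API; both helper names are hypothetical.

```python
import struct


def bit_cast_float_to_uint32(x):
    """Reinterpret the 4 bytes of an IEEE-754 binary32 value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bit_cast_uint32_to_float(bits):
    """Inverse direction: reinterpret a 32-bit pattern as a binary32 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


bits = bit_cast_float_to_uint32(1.0)
print(hex(bits))
```

Both directions pack and unpack exactly four bytes, which mirrors the requirement that source and destination types have the same size.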

Python

cuda.cooperative

CCCL's best practices for making CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.

Block- and warp-level cooperative algorithms are now available in Python https://github.com/NVIDIA/cccl/pull/1973. Experimental versions of reduce, scan, merge sort, and radix sort are available in numba.cuda kernels.
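
To illustrate what a block-level reduction with a user-defined operator computes, the following host-side sketch simulates one common pattern, a pairwise tree reduction over one value per thread. It is a pure-Python model and does not call cuda.cooperative or numba.cuda; the helper name `block_reduce_model` is hypothetical, and the tree pattern is an assumption made for illustration, not a description of how the library is implemented.

```python
def block_reduce_model(thread_values, op):
    """Simulate a pairwise tree reduction over one value per thread."""
    values = list(thread_values)
    if not values:
        raise ValueError("a block needs at least one thread")
    stride = 1
    while stride < len(values):
        # In each round, "thread" i combines its value with thread i + stride.
        for i in range(0, len(values) - stride, 2 * stride):
            values[i] = op(values[i], values[i + stride])
        stride *= 2
    # After the last round, thread 0 holds the block-wide result.
    return values[0]


# A user-defined operator written directly in Python, here a maximum.
def max_op(a, b):
    return a if a > b else b


block_max = block_reduce_model([3, 9, 1, 7, 5], max_op)
print(block_max)
```

The model combines neighbours in their original order, so it also works for associative operators that are not commutative, such as string concatenation.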

cuda.parallel

Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.

What's Changed

Fix documentation generation for thrust::pair by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1976
Correct typo in a launch configuration header name by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1972
Fix thrust::sort for large problem sizes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1952
Avoid SIGPIPE when truncating verbose output in CI scripts. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1971
Clarify compiler support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1970
Experimental Python cooperative algorithms by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1973
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/NVIDIA/cccl/pull/1928
Guard against an overflow in sort tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1980
Remove obsolete Thrust function traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1962
Python: Add version string & wheel build command by @leofang in https://github.com/NVIDIA/cccl/pull/1985
Add device inclusive scan with init_value by @gonidelis in https://github.com/NVIDIA/cccl/pull/1845
Fix BWUtil report on early exit by @gonidelis in https://github.com/NVIDIA/cccl/pull/1994
Use libcu++ void_t everywhere by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1977
Drop zipped_binary_op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1988
Clarify PtxVersion and SmVersion by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2004
More simplifications for CUB util_device by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1948
fix some typos in <cuda/stream_ref> by @ericniebler in https://github.com/NVIDIA/cccl/pull/2003
Add CI slack notifications. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1961
Allow nightly workflow to be manually invoked. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2007
Need to use a different approach to reuse secrets in reusable workflows vs. actions. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2008
Enable RAPIDS builds for manually dispatched workflows. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2009
clean up complex.inl by @ZelboK in https://github.com/NVIDIA/cccl/pull/1655
Add github token to nightly workflow-results action. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2012
Remove obsolete build system glue from the Thrust/CUB submodule structure. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2016
Benchmark thrust::copy with non-trivially relocatable type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1989
Make bool_constant available in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1997
Spell value initialization where used in thrust vectors by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1990
Do not redefine __ELF__ macro by @miscco in https://github.com/NVIDIA/cccl/pull/2018
Port thrust::merge[_by_key] to CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1817
Simplify some pointer traits by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2020
Simplify test data setup by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2023
Add tests to ensure that we properly propagate common_type for complex types by @miscco in https://github.com/NVIDIA/cccl/pull/2025
Update Thrust CMake README to use CCCL repo. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2026
Include container toolkit in manual prereqs by @bryevdv in https://github.com/NVIDIA/cccl/pull/2064
Avoid ADL issues with thrust::distance by @miscco in https://github.com/NVIDIA/cccl/pull/2053
Simplify thrust::detail::wrapped_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2019
Add a test for Thrust scan with non-commutative op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2024
Update memory_resource docs by @miscco in https://github.com/NVIDIA/cccl/pull/1883
Temporarily switch nightly H100 CI to build-only. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2060
Do not rely on conversions between float and extended floating point types by @miscco in https://github.com/NVIDIA/cccl/pull/2046
experimental wrapper types for cudaEvent_t that provide a modern C++ interface. by @ericniebler in https://github.com/NVIDIA/cccl/pull/2017
[CUDAX] Add a dummy device struct for now by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2066
Allow (somewhat) different input value types for merge by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2075
Avoid ::result_type for partial sums in TBB reduce_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1998
Fix formatting by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2090
Rename and refactor transform_iterator_base by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1987
Benchmark analysis: Print all top rows when asked for by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2089
Makes user-provided functors in our examples use __device__ instead of CUB_RUNTIME_FUNCTION by @elstehle in https://github.com/NVIDIA/cccl/pull/2088
Separate cuda/experimental when sorting includes by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2094
add support to cudax::device for querying a device's attributes by @ericniebler in https://github.com/NVIDIA/cccl/pull/2084
[CUDAX] Add experimental owning abstraction for cudaStream_t by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2093
Do not query NVRTC for cuda runtime header by @miscco in https://github.com/NVIDIA/cccl/pull/2102
Cleanup CUB block/thread load and exchange by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1946
Improve binary function objects and replace thrust implementation by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/1872
Replace _LIBCUDACXX_CPO_ACCESSIBILITY with _CCCL_GLOBAL_VARIABLE by @miscco in https://github.com/NVIDIA/cccl/pull/1881
Add script to update RAPIDS version. by @bdice in https://github.com/NVIDIA/cccl/pull/2082
Update bad links by @bryevdv in https://github.com/NVIDIA/cccl/pull/2080
Fix line break issues that break doxygen code examples by @miscco in https://github.com/NVIDIA/cccl/pull/2103
Add internal wrapper for cuda driver APIs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2070
Use common_type for complex pow by @miscco in https://github.com/NVIDIA/cccl/pull/1800
[CUDAX] rename device to device_ref, add immovable device as a place to cache properties by @ericniebler in https://github.com/NVIDIA/cccl/pull/2110
Use the float flavors of the cmath functions in the extended floating point fallbacks by @miscco in https://github.com/NVIDIA/cccl/pull/2106
[PoC]: Implement cuda::experimental::uninitialized_buffer by @miscco in https://github.com/NVIDIA/cccl/pull/1831
Ensure that we avoid ABI Version conflicts by @miscco in https://github.com/NVIDIA/cccl/pull/2137
Ensure that cuda_memory_resource allocates memory on the proper device by @miscco in https://github.com/NVIDIA/cccl/pull/2073
Clarify compatibility wrt. template specializations by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2138
Implement a cudax::get_stream CPO by @miscco in https://github.com/NVIDIA/cccl/pull/2135
Make cuda::std::tuple trivially copyable by @miscco in https://github.com/NVIDIA/cccl/pull/2127
Fix missing copy of docs artifacts by @miscco in https://github.com/NVIDIA/cccl/pull/2162
Fix g++-14 warning on uninitialized copying by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2157
Fix flakey heterogeneous tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/2085
Fix multiple definition of InclusiveScanKernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2169
[CUDAX] Add a global constexpr cudax::devices range for all devices in the system by @ericniebler in https://github.com/NVIDIA/cccl/pull/2100
fix use of cudaStream_t as if it were a stream wrapper by @ericniebler in https://github.com/NVIDIA/cccl/pull/2190
Fix uninitialized_buffer self assignment by @miscco in https://github.com/NVIDIA/cccl/pull/2170
Fix trivial_copy_device_to_device execution space by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2164
Clarify libcu++ use by non-CUDA compilers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1969
Warn when using C++14 in CUB and Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2166
Fix the clang-format path in the devcontainers by @miscco in https://github.com/NVIDIA/cccl/pull/2194
Mount a temporary build volume for CCCL projects if WSL is detected by @wmaxey in https://github.com/NVIDIA/cccl/pull/2035
2118 [CUDAX] Change the RAII device swapper to use driver API and add it in places where it was missing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2192
Fix singular vs plural typo in thread scope documentation. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/2198
[CUDAX] fixing some minor issues with device attribute queries by @ericniebler in https://github.com/NVIDIA/cccl/pull/2183
Integrate Python docs by @bryevdv in https://github.com/NVIDIA/cccl/pull/2196
[FEA] Atomics codegen refactor by @wmaxey in https://github.com/NVIDIA/cccl/pull/1993
[CUDAX] add __launch_transform to transform arguments to cudax::launch prior to launching the kernel by @ericniebler in https://github.com/NVIDIA/cccl/pull/2202
Cleanup common testing headers and correct asserts in launch testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2204
[CUDAX] Add an API to get device_ref from stream and add comparison operator to device_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2203
Update devcontainer docs for WSL by @jrhemstad in https://github.com/NVIDIA/cccl/pull/2200
add cudax::distribute<threadsPrBlock>(numElements) by @ericniebler in https://github.com/NVIDIA/cccl/pull/2210
Rework mdspan concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/2213
Un-doc functions taking debug_synchronous by @bryevdv in https://github.com/NVIDIA/cccl/pull/2209
CUDA vector_add sample project by @ericniebler in https://github.com/NVIDIA/cccl/pull/2160
avoid constraint recursion in the resource concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/2215
fix cuda_memory_resource test for properly aligned memory by @ericniebler in https://github.com/NVIDIA/cccl/pull/2227
Fix including <complex> when bad CUDA bfloat/half macros are used. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2226
Add license & fix long_description in setup.py by @leofang in https://github.com/NVIDIA/cccl/pull/2211
Extract reduction kernels into NVRTC-compilable header by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2231
Implement <cuda/std/bitset> by @griwes in https://github.com/NVIDIA/cccl/pull/1496
Refactor Thrust placeholder operators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2233
Add missing annotations for deprecated debug_sync APIs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2212
Test thrust headers for disabled half/bf16 support by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2219
Make cuda::std::max constexpr in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2107
Fix ForEachCopyN for non-contiguous iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2220
Configure CUB/Thrust for C++17 by default by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2217
Allow installing components when downstream by @stephenswat in https://github.com/NVIDIA/cccl/pull/2096
Rename the memory resources to drop the superfluous prefix cuda_ by @miscco in https://github.com/NVIDIA/cccl/pull/2243
Fix and simplify by @wmaxey in https://github.com/NVIDIA/cccl/pull/2197
Proclaim pair and tuple trivially relocatable by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2010
Make cuda::std::min constexpr in C++11 by @miscco in https://github.com/NVIDIA/cccl/pull/2249
Add CCCL_DISABLE_NVTX macro by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2173
Workaround GCC 13 issue with empty histogram decoder op by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2252
Refactor Thrust's logical meta functions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2260
Fix use of doxygen \file command by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2259
Add tests for transform_iterator's reference type by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2221
Small tuning script output improvements by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2262
Fix Thrust::vector ctor selection for int,int by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2261
Adds support for large number of items to DeviceScan by @elstehle in https://github.com/NVIDIA/cccl/pull/2171
Use and test radix sort for int128, half and bfloat16 in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2168
Implement C API for device reduction by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2256
Move cooperative module by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2269
Move compiler version macros into libcu++ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2250
Introduce cuda.parallel module by @gevtushenko in https://github.com/NVIDIA/cccl/pull/2276
Adds thrust::tabulate_output_iterator by @elstehle in https://github.com/NVIDIA/cccl/pull/2282
Drop macos string that lit cannot parse properly by @miscco in https://github.com/NVIDIA/cccl/pull/2283
Flatten forwarding headers by @miscco in https://github.com/NVIDIA/cccl/pull/2284
2270 static compute capabilities queries by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2271
Fix read of dangling reference in thrust placeholders by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2290
Implement any_resource, an owning wrapper around a memory resource by @ericniebler in https://github.com/NVIDIA/cccl/pull/2266
Fixes formatting of tabulate_output_iterator.inl by @elstehle in https://github.com/NVIDIA/cccl/pull/2298
use NV_IF_TARGET to conditionally compile CUDAX tests by @ericniebler in https://github.com/NVIDIA/cccl/pull/2297
Make for_each compatible with NVRTC by @wmaxey in https://github.com/NVIDIA/cccl/pull/2288
refactor cmake so more cudax samples can be easily added by @ericniebler in https://github.com/NVIDIA/cccl/pull/2296
Use the in, out, and inout parameter decorators from cudax::launch by @ericniebler in https://github.com/NVIDIA/cccl/pull/2294
Implement std::bit_cast by @miscco in https://github.com/NVIDIA/cccl/pull/2258
Cleanup the <cuda/std/bit> header by @miscco in https://github.com/NVIDIA/cccl/pull/2299
change cudax::uninitialized_buffer to own its memory resource with cudax::any_resource by @ericniebler in https://github.com/NVIDIA/cccl/pull/2293
Documentation typos by @fbusato in https://github.com/NVIDIA/cccl/pull/2302
Add thrust::inclusive_scan with init_value support by @gonidelis in https://github.com/NVIDIA/cccl/pull/1940
Assure placeholder expressions are semi-regular by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2305
Add documentation for any_resource by @miscco in https://github.com/NVIDIA/cccl/pull/2309
Implement P0843 inplace_vector by @miscco in https://github.com/NVIDIA/cccl/pull/1936
Cleanup __config and unify most visibility macros by @miscco in https://github.com/NVIDIA/cccl/pull/2285
Add a fast, low memory "limited" mode to CUB testing. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2317
[CUDAX] Add event_ref::is_done() and update event tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2304
Minor cleanup to memory resources by @miscco in https://github.com/NVIDIA/cccl/pull/2308
Drop ICC from the cudax support matrix by @miscco in https://github.com/NVIDIA/cccl/pull/2330
Do not hardcode Thrust's host system to cpp. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2332
[CUDAX] Add compute_capability device attribute and handle arch_traits for future architectures by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2328
Disable exec checks on ranges CPOs by @miscco in https://github.com/NVIDIA/cccl/pull/2331
Enable exceptions by default by @miscco in https://github.com/NVIDIA/cccl/pull/2329
Make the thrust dispatch mechanisms configurable by @miscco in https://github.com/NVIDIA/cccl/pull/2310
[CUDAX] give all the cudax headers the .cuh extension by @ericniebler in https://github.com/NVIDIA/cccl/pull/2340
Compiler version improvements by @fbusato in https://github.com/NVIDIA/cccl/pull/2316
Fix hardcoding __THRUST_HOST_SYSTEM_NAMESPACE to cpp by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/2341
Improvements to the Cuda Core C library infrastructure by @miscco in https://github.com/NVIDIA/cccl/pull/2336
Fix bug remaining on thrust::inclusive_scan with init value with CDP by @gonidelis in https://github.com/NVIDIA/cccl/pull/2346
[CUDAX] make uninitialized_buffer usable with launch by @ericniebler in https://github.com/NVIDIA/cccl/pull/2342
Test and fix failing nightly libcudacxx + CUB jobs by @miscco in https://github.com/NVIDIA/cccl/pull/1847
Update Memory Model docs for HMM by @gonzalobg in https://github.com/NVIDIA/cccl/pull/2272
Harden thrust algorithms against evil iterators that overload operator, by @miscco in https://github.com/NVIDIA/cccl/pull/2349
Avoid circular concept definition with memory resources by @miscco in https://github.com/NVIDIA/cccl/pull/2351
add IWYU export pragma on config headers by @ericniebler in https://github.com/NVIDIA/cccl/pull/2352
Add cuda_parallel to CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/2338
[CUDAX] Branch out an experimental version of stream_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2343
Improve visibility macros for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/2337
Add missing cuKernelGetFunction call to reduce by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2355
Move invalid_stream to the proper file by @miscco in https://github.com/NVIDIA/cccl/pull/2360
fix the cudax vector_add sample by @ericniebler in https://github.com/NVIDIA/cccl/pull/2372
Add -Wmissing-field-initializers to cudax by @pciolkosz in https://github.com/NVIDIA/cccl/pull/2373
Update CCCL version to 2.7.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/2364
Backport several fixes into 2.7.x. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2579
[BACKPORT]: Rework head_flags so that we do not rely on the tuple being unevaluated (#2619) by @miscco in https://github.com/NVIDIA/cccl/pull/2620
[Backport] Fix cluster launch error in branch/2.7.x by @wmaxey in https://github.com/NVIDIA/cccl/pull/2866
Disable execution checks for tuple (#2780) by @wmaxey in https://github.com/NVIDIA/cccl/pull/2867
[BACKPORT]: Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes (#3066) by @miscco in https://github.com/NVIDIA/cccl/pull/3068
[Backport] Fix EBO in zip_iterator on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/3107
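
One entry above adds an initial-value form of thrust::inclusive_scan. As a rough illustration of what "inclusive scan with an initial value" computes, here is a host-only sketch built on the C++17 standard library. The helper name `inclusive_scan_with_init` is hypothetical, and the Thrust overload's exact signature and argument order are not shown here and may differ.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Host-side reference for an inclusive scan seeded with an initial value:
//   out[i] = init + in[0] + ... + in[i]
// `inclusive_scan_with_init` is a hypothetical helper name, not a Thrust API.
inline std::vector<int> inclusive_scan_with_init(const std::vector<int>& in, int init)
{
  std::vector<int> out(in.size());
  // C++17 overload of std::inclusive_scan: (first, last, d_first, binary_op, init).
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
  assert(out.size() == in.size()); // a scan produces one output per input
  return out;
}
```

For the input {1, 2, 3} and an initial value of 10, every prefix sum is offset by 10.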

New Contributors

@bryevdv made their first contribution in https://github.com/NVIDIA/cccl/pull/2064
@stephenswat made their first contribution in https://github.com/NVIDIA/cccl/pull/2096

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.6.1...v2.7.0

- C++
Published by wmaxey about 1 year ago

cccl - CCCL 2.6.1

This release includes backports for PRs #2332 and #2341. Please see release 2.6.0 for the full list of changes included in the release.

What's Changed

Backport PR #2332 and #2341 by @wmaxey in https://github.com/NVIDIA/cccl/pull/2368

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.6.0...v2.6.1

- C++
Published by wmaxey over 1 year ago

cccl - CCCL 2.6.0

What's Changed

Restrict active histogram channels to channel count by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1796
Cleanup internal thrust CUDA utils by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1802
Use variadic interfaces in agent launcher by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1804
Use nullptr over NULL by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1805
Rework the documentation to be built with Sphinx by @miscco in https://github.com/NVIDIA/cccl/pull/1753
Let Catch2 report cudaError descriptions by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1808
Check size-querying CUB API invocation in tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1809
Update docs link by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1812
Add missing inline specifiers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1813
Upgrade actions that use node16 to versions that use node20 by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1779
Document NVTX range behavior during graph capture by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1814
Clean up AliasTemporaries by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1815
Drop removed clang-tidy option by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1810
Exclude docs from cccl infra changes. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1821
Clean up thrust merge unit tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1819
Fix atomic performance regressions by avoiding use of memcpy with natively supported atomic types. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1801
Clean up merge_by_key and merge_key_value tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1824
Restore the old thrust api documentation in rst by @miscco in https://github.com/NVIDIA/cccl/pull/1818
Drop all internal implementations of exceptions by @miscco in https://github.com/NVIDIA/cccl/pull/1806
Fix span for non-ranges by @miscco in https://github.com/NVIDIA/cccl/pull/1836
Cleanup thrust test special types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1837
Add inclusive_scan with initial value support (warp/block) by @gonidelis in https://github.com/NVIDIA/cccl/pull/1749
Fix loading from incorrect URI on 404 page. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1843
Port CUB temporary storage layout test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1835
Port CUB thread operators test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1834
Adds ceil_div by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1825
Split workflow into multiple dispatch groups to avoid skipped jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1797
Fix broken CUB doc build and add 404 page to Sphinx. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1846
Port CUB thread sort test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1838
Cleanup CUB temporary storage layout test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1848
Propagate error when docs build fails, add docs build to CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1852
Cleanup CUB util_macro.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1849
Provide libcu++ transparent functors in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1851
Roll back upload-pages-artifact to v2. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1861
Port CUB iterator test to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1822
Symbol visibility is now invariant in regards to __cuda_std__ definition by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1832
Add dimensions description functionality to CUDA Experimental library by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1743
Document Asynchronous Operations by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1781
Remove cpp11_required.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1860
Add workflow to build RAPIDS from source with local CCCL by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1667
Refactor CI matrix. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1844
Adds tests for large number of items in cub::DeviceScan by @elstehle in https://github.com/NVIDIA/cccl/pull/1830
Make CUB test launch wrappers functor instances by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1850
Improve CUB test overview docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1867
Skip devcontainer validation jobs if not needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1853
Improve CUB device-scope documentation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1862
Make integer sequence et al. available in C++11 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1859
Minimize template instantiations in CUB thread_load by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1857
Create major version 2.6.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1880
Drop facilities deprecated in CUB 1.x by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1868
Make thrust::sort use radix sort with more comparators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1884
Make cuda::ptx::*_multicast pass on all architectures by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1874
Replace typedef by alias declarations in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1885
Remove legacy benchmarks and other dvs/p4 remnants by @alliepiper in https://github.com/NVIDIA/cccl/pull/1901
Qualify call to distance in thrust::async_reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1904
Rename CUB uninitialized_copy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1913
Sanitizer fixes by @alliepiper in https://github.com/NVIDIA/cccl/pull/1916
Use c2h::vectors in all non-example CUB tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1914
Renamed overlooked uninitialized_copy by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1920
Add assert implementation for device side testing by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1918
Thrust and CUB: README: Fix copy-paste from libcu++ and links by @pauleonix in https://github.com/NVIDIA/cccl/pull/1878
Follow-up fixes to CUB iterator test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1875
Replace typedef by alias declarations in Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1915
Cleanup CUB util_type.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1863
Fix include for in cub/util_type.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1929
Fix issues with comments in the concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/1931
Deprecate and reduce use of old functional stuff by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1925
Deprecate more nested aliases in thrust functors by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1932
Fix various typos in CUB documentation and comments. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/1933
Add BabelStream flavors as thrust::transform benchmarks by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1921
Some cleanup in Thrust config headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1934
Update to CUDA 12.5 containers by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1935
Check that the current version of CMake supports policy 141 before se… by @alliepiper in https://github.com/NVIDIA/cccl/pull/1924
Fix memmove optimization by @miscco in https://github.com/NVIDIA/cccl/pull/1937
Fixes thrust::unique_by_key examples by @elstehle in https://github.com/NVIDIA/cccl/pull/1943
Use only explicit NVTX3 V1 API in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1751
Suppress a clang warning on array size computation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1942
Add a benchmark for thrust::equal by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1944
Strip prefix paths to improve doc rendering by @bdice in https://github.com/NVIDIA/cccl/pull/1954
Modernize Thrust's alignment.h and triple_chevron_launch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1905
Restore RAPIDS devcontainer by @bdice in https://github.com/NVIDIA/cccl/pull/1955
Fix for in-place DeviceSelect & thrust::remove_if by @elstehle in https://github.com/NVIDIA/cccl/pull/1782
Drop Thrust's cstdint.h by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1959
Use make_devcontainers.sh --clean when validating. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1963
Fix missing binary_pred in thrust::unique_by_key by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1957
cuda::launch and launch configuration object with minimal functionality by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1950
Backport PR #2046 - Fixing FP16 conversions. by @wmaxey in https://github.com/NVIDIA/cccl/pull/2222
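
The "Adds ceil_div" entry above refers to an integer ceiling-division utility. The following is a minimal standard-C++ sketch of that operation, not the library's implementation. The names `ceil_div_sketch` and `blocks_for` are hypothetical, and non-negative dividends with positive divisors are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for illustration only; it is not the library utility.
// Returns the smallest q such that q * b >= a, assuming a >= 0 and b > 0.
constexpr std::int64_t ceil_div_sketch(std::int64_t a, std::int64_t b)
{
  // Quotient plus a remainder check, so that no "a + b - 1" sum can overflow.
  return a / b + (a % b != 0 ? 1 : 0);
}

// Typical use: number of thread blocks needed to cover `num_items` elements.
constexpr std::int64_t blocks_for(std::int64_t num_items, std::int64_t threads_per_block)
{
  return ceil_div_sketch(num_items, threads_per_block);
}

static_assert(ceil_div_sketch(10, 4) == 3, "10 items in groups of 4 need 3 groups");
```

Rounding up rather than down matters when sizing a launch grid: truncating division would leave the final partial block of elements unprocessed.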

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.5.0...v2.6.0

- C++
Published by wmaxey over 1 year ago

cccl - CCCL 2.5.0

What's New

This release includes several notable improvements and new features:

- CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
- We have added a new cub::DeviceSelect::FlaggedIf API, which allows you to select items based on applying a predicate to flags. This addition provides more flexibility and control over item selection.
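
To illustrate the selection rule behind cub::DeviceSelect::FlaggedIf, here is a sequential host-side reference in standard C++. It is only a sketch of the semantics described above (an item is kept when the predicate holds for its flag). `flagged_if_reference` is a hypothetical name, and the real CUB entry point has a different, device-oriented signature that is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side reference for "select by predicate on flags".
// `flagged_if_reference` is a hypothetical name; it is not the CUB entry point
// and takes none of CUB's temporary-storage or stream arguments.
template <typename T, typename Flag, typename Pred>
std::vector<T> flagged_if_reference(const std::vector<T>& items,
                                    const std::vector<Flag>& flags,
                                    Pred pred)
{
  assert(items.size() == flags.size()); // one flag per item
  std::vector<T> selected;
  for (std::size_t i = 0; i < items.size(); ++i)
  {
    if (pred(flags[i])) // the predicate sees the flag, not the item
    {
      selected.push_back(items[i]); // kept items retain their relative order
    }
  }
  return selected;
}
```

With items {10, 20, 30, 40}, flags {1, 4, 3, 8}, and a predicate that accepts even flags, the second and fourth items are selected.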

What's Changed

Clean up libcu++ docs landing page by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1492
PTX: Add cuda::ptx::elect_sync by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1537
Print a summary of all tests sorted by execution time. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1539
Fix unused variable warning for __can_use_complete_tx by @wmaxey in https://github.com/NVIDIA/cccl/pull/1547
Fix usage of naked array with 0 elements in sm90 barrier tests. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1546
Add support for stream operators for complex by @miscco in https://github.com/NVIDIA/cccl/pull/1538
Fix __half for older architectures by @miscco in https://github.com/NVIDIA/cccl/pull/1543
Feat 565 remove redundant thrust dialect conditional by @ZelboK in https://github.com/NVIDIA/cccl/pull/566
fix missing device hint in WarpMergeSort Documentation by @MARD1NO in https://github.com/NVIDIA/cccl/pull/1553
Minor fixes and additions on cub developer guides by @gonidelis in https://github.com/NVIDIA/cccl/pull/1559
Consolidate handling of constexpr and if constexpr by @miscco in https://github.com/NVIDIA/cccl/pull/1562
Ensure that cuda::aligned_size_t is usable in a constexpr context by @miscco in https://github.com/NVIDIA/cccl/pull/1564
Group CUB docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1565
Update toolkit to 12.4 by @miscco in https://github.com/NVIDIA/cccl/pull/1554
Work around change in cuTensorMapEncode by @miscco in https://github.com/NVIDIA/cccl/pull/1567
Remove stdlib arg from .clangd. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1569
Add the DeviceSelect::FlaggedIf algorithm by @gonidelis in https://github.com/NVIDIA/cccl/pull/1533
Catch2 segmented sort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1484
Do not emit diagnostic with extended device lambdas with preserved re… by @Revaj in https://github.com/NVIDIA/cccl/pull/1495
Use absolute includes for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1560
[NFC] Modularize <exception> by @miscco in https://github.com/NVIDIA/cccl/pull/199
Add test support for launching kernels with cluster size > 1 by @ahendriksen in https://github.com/NVIDIA/cccl/pull/416
Fix typo in README.md by @bprb in https://github.com/NVIDIA/cccl/pull/1574
[FEA]: Modularize <cuda/memory_resource> by @miscco in https://github.com/NVIDIA/cccl/pull/1532
Cleanup_complex by @miscco in https://github.com/NVIDIA/cccl/pull/1555
Add missing comma in barrier __try_wait by @miscco in https://github.com/NVIDIA/cccl/pull/1593
Segmented sort test fix by @alliepiper in https://github.com/NVIDIA/cccl/pull/1591
Add pre-commit configuration by @bdice in https://github.com/NVIDIA/cccl/pull/1596
Preserve .devcontainer/img/ when cleaning. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1604
Add some documentation for recent additions to libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1594
Ensure cuda::std::nullopt is visible in device code by @trxcllnt in https://github.com/NVIDIA/cccl/pull/1598
Fix ordering of alignas and __shared__ by @miscco in https://github.com/NVIDIA/cccl/pull/1601
Update Thrust CI tests. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1605
Implement tuple interface for cuda vector types by @miscco in https://github.com/NVIDIA/cccl/pull/1410
Inspect PR changes to determine if subproject builds are needed. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1572
Apply clang-format to cub by @bdice in https://github.com/NVIDIA/cccl/pull/1602
Add missing non-volatile atomic overloads. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1582
Drop unused libcxx files by @miscco in https://github.com/NVIDIA/cccl/pull/1606
Apply formatting to libcudacxx by @miscco in https://github.com/NVIDIA/cccl/pull/1610
Add conda documentation to the README. by @bdice in https://github.com/NVIDIA/cccl/pull/1581
Allow jobs to be skipped. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1611
Make libcu++ work with exceptions by @miscco in https://github.com/NVIDIA/cccl/pull/1607
Implement cuda::mr::cuda_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1578
Implement cuda::mr::managed_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1579
Apply formatting to thrust by @miscco in https://github.com/NVIDIA/cccl/pull/1616
Update example_device_radix_sort.cu by @eriktedhamre in https://github.com/NVIDIA/cccl/pull/1608
Implement cuda::mr::pinned_memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1580
Set the devcontainers to format on save. by @miscco in https://github.com/NVIDIA/cccl/pull/1624
Enable internal use of std::allocator related functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1583
Adds tests for large number of items for cub::DeviceSelect by @elstehle in https://github.com/NVIDIA/cccl/pull/1612
Add pre-commit docs to CONTRIBUTING.md. by @bdice in https://github.com/NVIDIA/cccl/pull/1627
Move visibility attributes to cccl by @miscco in https://github.com/NVIDIA/cccl/pull/1595
Work around thrust/memory.h circular include by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1634
Fix mbarrier.init addressing by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1636
Trim trailing whitespace and normalize newlines. by @bdice in https://github.com/NVIDIA/cccl/pull/1633
Add a git-blame-ignore-revs file by @miscco in https://github.com/NVIDIA/cccl/pull/1629
Revert "PTX: Add cuda::ptx::elect_sync (#1537)" by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1638
Address potential oob in cub when passing in an invalid device counter by @miscco in https://github.com/NVIDIA/cccl/pull/1641
Allow ninja_summary to fail by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1644
Mostly flatten the folder structure of libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1630
Make --cmake-options="" always override others. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1648
Fix invalid _CCCL_CUDACC definition for clang cuda by @miscco in https://github.com/NVIDIA/cccl/pull/1656
Add missing #pragma once in some headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1668
Add NVTX ranges for all CUB algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1657
Implement LWG-3843 and LWG-3940 by @miscco in https://github.com/NVIDIA/cccl/pull/1621
Modularize <memory> by @miscco in https://github.com/NVIDIA/cccl/pull/1639
Expose <cuda/std/numeric> to be publicly available by @miscco in https://github.com/NVIDIA/cccl/pull/1671
Add nsight support for automated debugging by @gonidelis in https://github.com/NVIDIA/cccl/pull/1660
Format core headers by @miscco in https://github.com/NVIDIA/cccl/pull/1670
Guard resource_ref and friends behind feature flag by @miscco in https://github.com/NVIDIA/cccl/pull/1675
Create major version 2.5.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1677
Install CUB headers with .hpp extension by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1687
Update CMakePresets.json by @alliepiper in https://github.com/NVIDIA/cccl/pull/1686
Fix deprecated status by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1692
Test combined internal/user-side use of NVTX by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1690
CI Overhaul, new nightly workflow by @alliepiper in https://github.com/NVIDIA/cccl/pull/1654
Fix CMake option handling. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1698
Fix issues that came up with building cuDF with main by @miscco in https://github.com/NVIDIA/cccl/pull/1643
Drop new properties until we are certain about the design by @miscco in https://github.com/NVIDIA/cccl/pull/1681
Remove more uses of __cuda_std__ by @miscco in https://github.com/NVIDIA/cccl/pull/1669
Fix usage of result_of in thrust by @miscco in https://github.com/NVIDIA/cccl/pull/1705
Fix thrust::optional::emplace() by @Snektron in https://github.com/NVIDIA/cccl/pull/1707
Remove old f(void) function signatures by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1708
Fix code sample in README and docs by @pauleonix in https://github.com/NVIDIA/cccl/pull/1652
Format libcudacxx/include files without extensions by @bdice in https://github.com/NVIDIA/cccl/pull/1676
Several improvements to zip_iterator/zip_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1710
Expose thrust's contiguous iterator unwrap helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1717
Fix flakey heterogeneous tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/1712
Ensure that we can use cuda::std::optional with types that are not __host__ __device__ by @miscco in https://github.com/NVIDIA/cccl/pull/1663
Fix a typo in barrier docs and update the godbolt link by @PointKernel in https://github.com/NVIDIA/cccl/pull/1718
Massively improve test times in heterogeneous atomics tests by @wmaxey in https://github.com/NVIDIA/cccl/pull/1719
Consolidate more common functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1716
Increase timeout for the libcu++ test runs by @miscco in https://github.com/NVIDIA/cccl/pull/1720
Fix nightly CI: H100 runners are not in a testing pool. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1723
Add a new CUDA Next library and a first entry in it with hierarchy_dimensions type template by @pciolkosz in https://github.com/NVIDIA/cccl/pull/1485
Atomics backend refactor by @wmaxey in https://github.com/NVIDIA/cccl/pull/1631
Const-qualify half_t::operator+/* by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1726
Reenable previously failing histogram test for icc by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1725
Enable testing for the other half of the heterogeneous managed memory tests on MSVC. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1729
PTX: mark cp_async_bulk_*_multicast functions sm_90a by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1734
Improve libcu++ documentation a bit more by @miscco in https://github.com/NVIDIA/cccl/pull/1732
Make atomic_ref ctor constexpr. again. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1737
Various and sundry fixes for Thrust's CPP backends. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1722
Avoid ABI issues due to MSVC EBCO issues by @miscco in https://github.com/NVIDIA/cccl/pull/1739
Drop unused header from ptx by @miscco in https://github.com/NVIDIA/cccl/pull/1740
Allow an override matrix to reduce CI workload. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1701
Fix docs generation by @miscco in https://github.com/NVIDIA/cccl/pull/1741
Add docs instructions on how to utilize CMake Presets by @gonidelis in https://github.com/NVIDIA/cccl/pull/1694
Ensure that {cr}begin works with types that pull in namespace std via ADL by @miscco in https://github.com/NVIDIA/cccl/pull/1685
Merge prep jobs for verify-devcontainers CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1754
Fix typo in ci docs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1756
Add runtime + sccache info to CI comment by @alliepiper in https://github.com/NVIDIA/cccl/pull/1744
Add section about SSH signing keys to developer docs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1755
Add sm100 support to for NVCC by @wmaxey in https://github.com/NVIDIA/cccl/pull/1745
Fix_duplicate_job_checks by @alliepiper in https://github.com/NVIDIA/cccl/pull/1759
Const-qualify histogram pointer input parameters by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1762
Return demangled name in c2h::type_name by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1773
Simplify argument forwarding in CUB histogram entry-points by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1776
Add guard against half support by @miscco in https://github.com/NVIDIA/cccl/pull/1735
Refactor CUB test launch helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1770
Replace cub::ArrayWrapper by cuda::std::array and deprecate it by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1764
Fix missing qualification of pow in two instances by @miscco in https://github.com/NVIDIA/cccl/pull/1784
Add mechanism to split project tests into parallel jobs. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1696
Fix __half conversion to float in histogram by @miscco in https://github.com/NVIDIA/cccl/pull/1785
Implement P3029R1: deduction from integral_constant by @miscco in https://github.com/NVIDIA/cccl/pull/1786
Revert to showing skipped jobs to WAR GHA bug. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1794
Port to Catch2 and rework device histogram test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/1695
Add gcc13, clang17, clang18 to CI by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1757
Drop more of thrust type traits by @miscco in https://github.com/NVIDIA/cccl/pull/1721
Show workflow walltime, job max time in CI comment. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1795
Fix span for non-ranges by @miscco in https://github.com/NVIDIA/cccl/pull/1840
Drop all internal implementations of exceptions (#1806) by @miscco in https://github.com/NVIDIA/cccl/pull/1839
Backport atomic regression fix #1801 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1833
[BACKPORT] Symbol visibility is now invariant in regards to __cuda_std__ definition (#1832) by @miscco in https://github.com/NVIDIA/cccl/pull/1864

New Contributors

@MARD1NO made their first contribution in https://github.com/NVIDIA/cccl/pull/1553
@Revaj made their first contribution in https://github.com/NVIDIA/cccl/pull/1495
@bprb made their first contribution in https://github.com/NVIDIA/cccl/pull/1574
@eriktedhamre made their first contribution in https://github.com/NVIDIA/cccl/pull/1608
@Snektron made their first contribution in https://github.com/NVIDIA/cccl/pull/1707

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.4.0...v2.5.0

- C++
Published by wmaxey over 1 year ago

cccl - v2.4.0

What’s New

We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, beyond the performance improvements, it adds support for large problem sizes (64-bit offsets) to these Thrust algorithms.
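To see why 64-bit offsets matter for large inputs, the following host-only C++ sketch checks whether an item count still fits a signed 32-bit offset and picks an offset width from the item-count type. The names `fits_in_int32`, `to_int32_offset`, and `pick_offset_t` are hypothetical and are not part of Thrust or CUB; the selection rule is an assumption made for illustration, not the libraries' actual dispatch logic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not a CCCL API): use a 32-bit offset type when the
// item-count type is at most 4 bytes wide, otherwise a 64-bit one.
template <typename NumItemsT>
using pick_offset_t =
    typename std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, std::uint64_t>::type;

// Returns true when `num_items` can be indexed with a signed 32-bit offset.
inline bool fits_in_int32(std::uint64_t num_items) {
    return num_items <=
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Narrows an item count to a 32-bit offset; asserts instead of silently wrapping.
inline std::int32_t to_int32_offset(std::uint64_t num_items) {
    assert(fits_in_int32(num_items) && "item count needs a 64-bit offset");
    return static_cast<std::int32_t>(num_items);
}

static_assert(std::is_same<pick_offset_t<int>, std::uint32_t>::value,
              "4-byte item counts map to a 32-bit offset");
static_assert(std::is_same<pick_offset_t<std::int64_t>, std::uint64_t>::value,
              "8-byte item counts map to a 64-bit offset");
```

An input of 2^31 items is the first size that no longer fits a signed 32-bit offset, which is the class of problem sizes the 64-bit offset support addresses.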

CUB

cub::DeviceSelect::UniqueByKey now supports an equality operator and large problem sizes.
The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach. For example, cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses.
Many CUB algorithms now support CUDA graph capture mode.

libcudacxx

Added new cuda::ptx namespace with wrappers for inline-PTX instructions
Added cuda::std::complex specializations for the CUDA types bfloat and half.

What's Changed

Implement remaining ranges iterator concepts and modernize array by @miscco in https://github.com/NVIDIA/cccl/pull/627
Fix C++11 support of recently added tests by @ahendriksen in https://github.com/NVIDIA/cccl/pull/651
Update CUDA newest to CTK 12.3 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/629
Add cuda::ptx::* namespace by @ahendriksen in https://github.com/NVIDIA/cccl/pull/574
The test seems to pass just fine by @miscco in https://github.com/NVIDIA/cccl/pull/654
Fixes discard_memory compilation failure for pre-Volta by @elstehle in https://github.com/NVIDIA/cccl/pull/637
Reduce benchmarking time by @gevtushenko in https://github.com/NVIDIA/cccl/pull/657
Add CCCL_VERSION and script for updating version by @jrhemstad in https://github.com/NVIDIA/cccl/pull/652
Fixes compiler error for extended fp type data gen by @elstehle in https://github.com/NVIDIA/cccl/pull/666
fixup ___CUDA_VPTX -> _CUDA_VPTX by @wmaxey in https://github.com/NVIDIA/cccl/pull/664
Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in https://github.com/NVIDIA/cccl/pull/669
Rework our system header approach to be more error proof by @miscco in https://github.com/NVIDIA/cccl/pull/661
Project automation - fix sync action and draft setting step by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/625
Fix fallback when checking git repo by @wmaxey in https://github.com/NVIDIA/cccl/pull/1085
Currently the verbose option does not work because of a typo in the argument handling by @miscco in https://github.com/NVIDIA/cccl/pull/1088
Adds virtual shared memory helper and tests by @elstehle in https://github.com/NVIDIA/cccl/pull/619
Add cuda::ptx::st_async by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1078
Add cuda::ptx::red_async by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1080
Remove libcudacxx symlinks by @wmaxey in https://github.com/NVIDIA/cccl/pull/1075
Move PTX tests that missed the symlink PR by @wmaxey in https://github.com/NVIDIA/cccl/pull/1098
Fix truncation of constant value by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1097
Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/674
Initial CUB/NVRTC support by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1081
Fix cuda::ptx::red.async for int32_t types by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1102
Fix local test runs with lit by @miscco in https://github.com/NVIDIA/cccl/pull/1108
Fix config when only non-CDPv1 arches are enabled. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1109
Do not replace the sccache binary for windows by @miscco in https://github.com/NVIDIA/cccl/pull/1115
Test cuda graph capture by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1112
Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in https://github.com/NVIDIA/cccl/pull/1074
Introduce CUB transform reduce by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1091
Add infrastructure for compile-time CUB tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1124
Fix GCC6 / FP8 warning by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1130
Fix thrust transform reduce bench by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1133
Fix ptx.st.async.compile.pass.cpp failing in C++11. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1132
Fix _LIBCUDACXX_UNREACHABLE for old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1114
Allow filtering P0 benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1135
Update barrier_arrive_tx.md docs by @gonzalobg in https://github.com/NVIDIA/cccl/pull/1147
Update std iterators by @miscco in https://github.com/NVIDIA/cccl/pull/672
Fix argument name in windows CI by @miscco in https://github.com/NVIDIA/cccl/pull/1145
Fix XFAIL condition for subsumption tests by @miscco in https://github.com/NVIDIA/cccl/pull/1144
Project Automation - remove draft automation + reduce permissions by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/1154
Use rst in block-scope docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1150
Fix errors when find_package(CCCL) is called twice. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1157
Fix icc / cub by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1152
Abort testing on unsupported dialect flags by @wmaxey in https://github.com/NVIDIA/cccl/pull/1158
Run with latest nvbench by @robertmaynard in https://github.com/NVIDIA/cccl/pull/583
Set finer-grain workflow permissions by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1163
Port device docs to rst by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1160
CI log improvements by @jrhemstad in https://github.com/NVIDIA/cccl/pull/621
Setup documentation and corresponding github action by @wmaxey in https://github.com/NVIDIA/cccl/pull/1118
Update Docs links in README.md by @wmaxey in https://github.com/NVIDIA/cccl/pull/1169
Fix GCC 13 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1175
Add missing exit from run-as-coder by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1176
Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in https://github.com/NVIDIA/cccl/pull/1117
Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1164
Implement math functions for thrust::complex by @miscco in https://github.com/NVIDIA/cccl/pull/1178
Use anchors in matrix.yaml by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1193
Ensure the targets that Thrust creates are global. by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1182
Fix availability of is_constant_evaluated on old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1180
Enable std::variant for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/1076
Implement enable_borrowed_range by @miscco in https://github.com/NVIDIA/cccl/pull/1196
Reduce thrust benchmarks noise by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1203
Prepare more algorithms by @miscco in https://github.com/NVIDIA/cccl/pull/1161
Add icc compiler to CI matrix by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1159
Unify handling of dialects by @miscco in https://github.com/NVIDIA/cccl/pull/1200
Add argument to build/test scripts for additional cmake options by @jrhemstad in https://github.com/NVIDIA/cccl/pull/620
Move definitions of execution space macros into cccl by @miscco in https://github.com/NVIDIA/cccl/pull/1199
Adds new virtual shared memory facility to DeviceSelect::UniqueByKey by @elstehle in https://github.com/NVIDIA/cccl/pull/1197
Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1214
Fix the example on README.md by @so298 in https://github.com/NVIDIA/cccl/pull/1220
Add missing overloads for thrust::pow by @miscco in https://github.com/NVIDIA/cccl/pull/1222
Fix 'nvc++ -stdpar' by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1224
Fix examples in reduce docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1230
Do not benchmark small problem sizes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1243
Implement enable_view by @miscco in https://github.com/NVIDIA/cccl/pull/1208
Refactors thrust::unique_by_key to use cub::DeviceSelect::UniqueByKey by @elstehle in https://github.com/NVIDIA/cccl/pull/1245
Fix merge conflict from incoming PR by @miscco in https://github.com/NVIDIA/cccl/pull/1250
Disable fast-math for ICC by @miscco in https://github.com/NVIDIA/cccl/pull/1252
Fix a typo in thrust-config.cmake by @valgur in https://github.com/NVIDIA/cccl/pull/1259
Implement ranges::{c}begin and ranges::{c}end by @miscco in https://github.com/NVIDIA/cccl/pull/1256
Switch to entropy-based stopping criterion by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1280
Fix a sync bug in stream_ref::wait by @PointKernel in https://github.com/NVIDIA/cccl/pull/1238
Silence some static asserts in ptx helpers by @miscco in https://github.com/NVIDIA/cccl/pull/1257
Restore docs images by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1285
Clarify Thrust/CUB ABI guarantees by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1269
Fix MSVC issues by @miscco in https://github.com/NVIDIA/cccl/pull/1261
Ensure that cuda::std::pair is potentially trivially copyable by @miscco in https://github.com/NVIDIA/cccl/pull/1249
Update packman to fix CUB docs by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1291
Implement ranges::{c}rbegin by @miscco in https://github.com/NVIDIA/cccl/pull/1295
Make cuda::stream_ref universally available by @miscco in https://github.com/NVIDIA/cccl/pull/1293
Properly test internal headers by @miscco in https://github.com/NVIDIA/cccl/pull/1258
Remove remaining C++03 compatibility from unit tests by @Blonck in https://github.com/NVIDIA/cccl/pull/1228
Add some documentation for memory_resource by @miscco in https://github.com/NVIDIA/cccl/pull/1217
Filter axis values in perf analysis by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1304
Get CCCL revision outside of git repo by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1305
[DOC]: Move ptx.md out of extended API by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1308
Implement ranges::{c}rend by @miscco in https://github.com/NVIDIA/cccl/pull/1301
thrust/mr: fix the case of reusing a block for a smaller alloc. by @griwes in https://github.com/NVIDIA/cccl/pull/1232
Allow offloading samples by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1316
[DOC]: Fix documentation links by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1311
Separate windows and Linux CI matrix by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1206
Revert "Separate windows and Linux CI matrix " by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1324
Introduce CUB ForEach algorithms by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1302
Cleanup transitive includes of <cuda/std/functional> by @miscco in https://github.com/NVIDIA/cccl/pull/1253
Implement ranges::{c}data by @miscco in https://github.com/NVIDIA/cccl/pull/1313
Remove stale comments from README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1328
Ports cub::DeviceMergeSort tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/1319
Implement ranges::size and ranges::ssize by @miscco in https://github.com/NVIDIA/cccl/pull/1330
PTX: Add helper functions for dsmem by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1336
Remove double "ignore" in discard_iterator.h docs by @gonidelis in https://github.com/NVIDIA/cccl/pull/1342
PTX: Add cuda::ptx::fence by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1341
Replace deprecated _VSTD macro with std by @rupprecht in https://github.com/NVIDIA/cccl/pull/1331
PTX: Add cuda::ptx::mapa and cuda::ptx::getctarank by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1345
Cleanup our __cccl_config by @miscco in https://github.com/NVIDIA/cccl/pull/1322
Update to devcontainers 24.04 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1357
♻️📝 Update mode example to use thrust::unique_count by @codereport in https://github.com/NVIDIA/cccl/pull/1354
Switch to NV runners for Windows. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1356
Implement ranges::empty by @miscco in https://github.com/NVIDIA/cccl/pull/1338
PTX: Add cuda::ptx::get_sreg by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1351
Fix godbolt link. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1369
Implement ranges concepts by @miscco in https://github.com/NVIDIA/cccl/pull/1364
Print helpful error message in test scripts when no GPU is found by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1362
Implement ranges::dangling by @miscco in https://github.com/NVIDIA/cccl/pull/1371
Ensure that thrust fancy iterators are trivially_copy_constructible when possible by @miscco in https://github.com/NVIDIA/cccl/pull/1368
Improve compiler detection defines by @Yaraslaut in https://github.com/NVIDIA/cccl/pull/1320
Use relative includes for our public headers by @miscco in https://github.com/NVIDIA/cccl/pull/1325
Implement ranges::view_interface by @miscco in https://github.com/NVIDIA/cccl/pull/1377
Use checked allocators in CUB catch2 tests by @alliepiper in https://github.com/NVIDIA/cccl/pull/1271
small update to docs for CTK by @ZelboK in https://github.com/NVIDIA/cccl/pull/1378
Fix order of system_header suppression and includes by @miscco in https://github.com/NVIDIA/cccl/pull/1323
Hide API accepting kernel pointers by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1395
Refactors ChooseOffsetT to use ::cuda::std and introduces alias template choose_offset_t by @elstehle in https://github.com/NVIDIA/cccl/pull/1405
Cleanup our delegated constructor workaround by @miscco in https://github.com/NVIDIA/cccl/pull/1404
Implement ranges::subrange by @miscco in https://github.com/NVIDIA/cccl/pull/1387
Test large arrays in device radix sort by @alliepiper in https://github.com/NVIDIA/cccl/pull/1349
CMake support absolute CMAKE_INSTALL_LIBDIR values by @robertmaynard in https://github.com/NVIDIA/cccl/pull/1393
Fixes integer overflows in index computation when indexes approach numeric_limits<OffsetT>::max() by @elstehle in https://github.com/NVIDIA/cccl/pull/1419
Fix ptx usage to account for PTX ISA availability by @miscco in https://github.com/NVIDIA/cccl/pull/1359
Refactors thrust::copy_if to use cub::DeviceSelect by @elstehle in https://github.com/NVIDIA/cccl/pull/1379
Fix include of with NVC++ by @dkolsen-pgi in https://github.com/NVIDIA/cccl/pull/1417
Do not use VLAs in cp_async_bulk_tensor_* tests by @miscco in https://github.com/NVIDIA/cccl/pull/1423
Add support for sm_90a in API by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1411
Add additional build job for sm90 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/1428
Rework <span> to be latest revision by @miscco in https://github.com/NVIDIA/cccl/pull/1415
PTX: Add cuda::ptx::cp_async_bulk_* by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1403
Prepare namespace ranges::views by @miscco in https://github.com/NVIDIA/cccl/pull/1434
PTX: Add cuda::ptx::barrier_cluster_{arrive,wait} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1366
Refactor thrust::[stable_]partition[_copy] to use cub::DevicePartition by @elstehle in https://github.com/NVIDIA/cccl/pull/1435
Fix common_reference of pair by @miscco in https://github.com/NVIDIA/cccl/pull/1438
Properly check whether a string is alphanumeric by @miscco in https://github.com/NVIDIA/cccl/pull/1443
Remove cuda::ptx::mapa by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1442
Add cuda::ptx::tensormap_{replace,cp_fenceproxy} by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1441
Enable more algorithms for internal use by @miscco in https://github.com/NVIDIA/cccl/pull/1432
Cleanup diagnostic handling by @miscco in https://github.com/NVIDIA/cccl/pull/1420
Create patch 2.4.0 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1455
Address various issues from internal CI by @miscco in https://github.com/NVIDIA/cccl/pull/1462
Extend gcc miscompilation workaround for replace.cu by @miscco in https://github.com/NVIDIA/cccl/pull/1461
Fix CUB docs image fetcher by @gevtushenko in https://github.com/NVIDIA/cccl/pull/1466
Add cuda::ptx::cp_reduce_async_bulk by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1445
Restore disabling benchmarks from ci scripts (removed in #493) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1458
Add test coverage for SM90 without PTX ISA 8.0 by @miscco in https://github.com/NVIDIA/cccl/pull/1468
Ensure that we can use std::ignore on device by @miscco in https://github.com/NVIDIA/cccl/pull/1470
Move .multicast tests out into their own file by @miscco in https://github.com/NVIDIA/cccl/pull/1478
Ensure that we can test libcu++ against architectures < 70 by @miscco in https://github.com/NVIDIA/cccl/pull/1475
Reduce number of instantiations in set_symmetric_difference tests by @miscco in https://github.com/NVIDIA/cccl/pull/1476
Fix test issues against gcc-6 by @miscco in https://github.com/NVIDIA/cccl/pull/1477
Improve code block CSS in libcu++ docs by @Nyrio in https://github.com/NVIDIA/cccl/pull/1483
Address issues with MSVC2017 by @miscco in https://github.com/NVIDIA/cccl/pull/1479
Remove libcxx tests by @miscco in https://github.com/NVIDIA/cccl/pull/1480
Separate CUB's catch2 test binaries by default for CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/1482
Add Dev Containers guide for WSL by @gonidelis in https://github.com/NVIDIA/cccl/pull/1394
PTX: add cuda::mbarrier_init by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1491
Remove legacy Thrust/CUB CI files. by @bdice in https://github.com/NVIDIA/cccl/pull/1504
Fix issues with ambiguous calls to addressof in thrust::optional by @miscco in https://github.com/NVIDIA/cccl/pull/1499
Ensure that we play nicely with std::iterators by @miscco in https://github.com/NVIDIA/cccl/pull/1511
Try harder to unwrap nested thrust::tuple_of_iterator_references by @miscco in https://github.com/NVIDIA/cccl/pull/1469
Match_any testing single bit by fusing into single LOP3 instruction by @IlyaGrebnov in https://github.com/NVIDIA/cccl/pull/1372
Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in https://github.com/NVIDIA/cccl/pull/1497
Removes arch filtering of sm 90 for rdc builds by @elstehle in https://github.com/NVIDIA/cccl/pull/1506
Adds test for cub::PtxVersion by @elstehle in https://github.com/NVIDIA/cccl/pull/1521
Fix tuple backwards compatibility by @miscco in https://github.com/NVIDIA/cccl/pull/1522
[FEA] Split ptx.h by @ahendriksen in https://github.com/NVIDIA/cccl/pull/1520
Make libcudacxx's codegen part of CI and add it to the project. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1526
Ensure that we can run reduce_by_key with const inputs by @miscco in https://github.com/NVIDIA/cccl/pull/1528
Disallow float offset type in cub::segmented_reduce by @gonidelis in https://github.com/NVIDIA/cccl/pull/1430
cuda::std::complex specializations for half and bfloat by @griwes in https://github.com/NVIDIA/cccl/pull/1140
Rebase 2.4.x with main. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1472
[BACKPORT]: Provide backfills for missing __half functionality by @miscco in https://github.com/NVIDIA/cccl/pull/1544
[BACKPORT] Fix usage of naked array with 0 elements in sm90 barrier tests. (#1546) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1549
[BACKPORT] Fix unused variable warning for _canusecompletetx (#1547) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1550

New Contributors

@djns99 made their first contribution in https://github.com/NVIDIA/cccl/pull/1074
@so298 made their first contribution in https://github.com/NVIDIA/cccl/pull/1220
@valgur made their first contribution in https://github.com/NVIDIA/cccl/pull/1259
@PointKernel made their first contribution in https://github.com/NVIDIA/cccl/pull/1238
@rupprecht made their first contribution in https://github.com/NVIDIA/cccl/pull/1331
@codereport made their first contribution in https://github.com/NVIDIA/cccl/pull/1354
@Yaraslaut made their first contribution in https://github.com/NVIDIA/cccl/pull/1320
@Nyrio made their first contribution in https://github.com/NVIDIA/cccl/pull/1483
@IlyaGrebnov made their first contribution in https://github.com/NVIDIA/cccl/pull/1372

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.2...v2.4.0

- C++
Published by wmaxey over 1 year ago

cccl - v2.3.1

What's Changed

[BACKPORT]: Fix bug in stream_ref::wait by @miscco in https://github.com/NVIDIA/cccl/pull/1283
Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in https://github.com/NVIDIA/cccl/pull/1286
Create patch 2.3.1 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1287

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.0...v2.3.1

- C++
Published by wmaxey over 1 year ago

cccl - v2.3.2

What's Changed

[BACKPORT]: Silence some static asserts in ptx helpers (#1257) by @miscco in https://github.com/NVIDIA/cccl/pull/1284
[BACKPORT]: Ensure that pair is trivially copyable (#1249) by @miscco in https://github.com/NVIDIA/cccl/pull/1292
[BACKPORT]: Properly test internal headers (#1258) by @miscco in https://github.com/NVIDIA/cccl/pull/1299
[Backport]: Fix errors when find_package(CCCL) is called twice. (#1157) by @miscco in https://github.com/NVIDIA/cccl/pull/1298
[BACKPORT] Fix MSVC issues (#1261) by @miscco in https://github.com/NVIDIA/cccl/pull/1297
[backport] thrust/mr: fix the case of reusing a block for a smaller alloc. (#1232) by @griwes in https://github.com/NVIDIA/cccl/pull/1317
[BACKPORT]: Fix ptx usage to account for PTX ISA availability (#1359) by @miscco in https://github.com/NVIDIA/cccl/pull/1421
Create patch 2.3.2 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1530

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.3.1...v2.3.2

- C++
Published by wmaxey almost 2 years ago

cccl - CCCL 2.3.0

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because those headers will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see https://github.com/NVIDIA/cccl/issues/527.

TL;DR: You should never see warnings emitted from a CCCL header ever again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves as segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
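The idea behind the wrapped-namespace workaround can be sketched in plain C++: when each library places the same code inside its own outer namespace, the resulting symbols have different names and can no longer be confused with one another. The macro `DEMO_WRAPPED_NAMESPACE` and the functions `algo::which_build` and `check_distinct_builds` below are hypothetical stand-ins, not the actual Thrust/CUB machinery.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro standing in for a per-library wrapper namespace.
// In a real build, each shared library would be compiled with its own value.
#define DEMO_WRAPPED_NAMESPACE lib_a
namespace DEMO_WRAPPED_NAMESPACE {
namespace algo {
inline std::string which_build() { return "lib_a build"; }
}  // namespace algo
}  // namespace DEMO_WRAPPED_NAMESPACE
#undef DEMO_WRAPPED_NAMESPACE

#define DEMO_WRAPPED_NAMESPACE lib_b
namespace DEMO_WRAPPED_NAMESPACE {
namespace algo {
inline std::string which_build() { return "lib_b build"; }
}  // namespace algo
}  // namespace DEMO_WRAPPED_NAMESPACE
#undef DEMO_WRAPPED_NAMESPACE

// Both definitions coexist because their fully qualified names differ.
inline void check_distinct_builds() {
    assert(lib_a::algo::which_build() != lib_b::algo::which_build());
}
```

Because `lib_a::algo::which_build` and `lib_b::algo::which_build` are distinct entities, each caller reaches the definition it was compiled against. The 2.3 release aims to provide this kind of isolation by default, without requiring users to define a wrapper namespace.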

Thrust

thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.

CUB

Up to 60% performance improvements of cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100.
cub::DeviceSegmentedReduce now supports 64-bit indexing.
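To make concrete what a segmented reduction with 64-bit offsets computes, here is a host-only reference sketch in standard C++. It is not the CUB API: `segmented_sum` is a hypothetical helper, and it assumes that "64-bit indexing" refers to the segment begin/end offsets being 64-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only reference (hypothetical, not cub::DeviceSegmentedReduce):
// sums each half-open segment [begin_offsets[s], end_offsets[s]) of `values`.
inline std::vector<long long> segmented_sum(
    const std::vector<int>& values,
    const std::vector<std::int64_t>& begin_offsets,
    const std::vector<std::int64_t>& end_offsets) {
    assert(begin_offsets.size() == end_offsets.size());
    std::vector<long long> sums(begin_offsets.size(), 0);
    for (std::size_t s = 0; s < begin_offsets.size(); ++s) {
        // 64-bit loop index, so offsets beyond 2^31 - 1 remain representable.
        for (std::int64_t i = begin_offsets[s]; i < end_offsets[s]; ++i) {
            sums[s] += values[static_cast<std::size_t>(i)];
        }
    }
    return sums;
}
```

An empty segment (equal begin and end offsets) contributes the initial value, here 0.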

libcudacxx

The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to inline PTX functions covering various async memcpy and barrier intrinsics.
#379 - Added experimental bulk TMA memcpy under <cuda/barrier>

What's Changed

Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/303
Branch/2.2.x by @gevtushenko in https://github.com/NVIDIA/cccl/pull/305
Tune unique by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/306
Merge branch/2.2.x to main by @jrhemstad in https://github.com/NVIDIA/cccl/pull/308
Add example cmake project by @jrhemstad in https://github.com/NVIDIA/cccl/pull/177
Adds catch2 tests for reduce-by-key by @elstehle in https://github.com/NVIDIA/cccl/pull/311
Tune scan by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/325
Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in https://github.com/NVIDIA/cccl/pull/281
Fix MSVC / CUB tests build by @gevtushenko in https://github.com/NVIDIA/cccl/pull/336
gdb pretty printer: handle non-cuda device vectors by @siboehm in https://github.com/NVIDIA/cccl/pull/264
Add a nvrtc configuration for libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/202
GH Infra: project automation and issue template fixes by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/297
Tune reduce by key on A100 by @gevtushenko in https://github.com/NVIDIA/cccl/pull/346
Merge commits from 2.2 branch by @miscco in https://github.com/NVIDIA/cccl/pull/350
Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in https://github.com/NVIDIA/cccl/pull/334
Assorted fixes for MSVC 2017 by @miscco in https://github.com/NVIDIA/cccl/pull/341
[skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in https://github.com/NVIDIA/cccl/pull/355
Port cub::DeviceScan tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/347
Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in https://github.com/NVIDIA/cccl/pull/349
Project Automation: add conditional steps due to context errors by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/353
Work around strange gcc bug by @miscco in https://github.com/NVIDIA/cccl/pull/363
Implement iter_swap CPO by @miscco in https://github.com/NVIDIA/cccl/pull/332
Replace default, constexpr, and delete macros by original keywords by @Blonck in https://github.com/NVIDIA/cccl/pull/360
Add clang16 devcontainer and CI job by @miscco in https://github.com/NVIDIA/cccl/pull/362
[skip-tests] Skip merge conflict from old iter_swap PR by @miscco in https://github.com/NVIDIA/cccl/pull/369
[skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in https://github.com/NVIDIA/cccl/pull/370
Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in https://github.com/NVIDIA/cccl/pull/368
Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in https://github.com/NVIDIA/cccl/pull/375
Use copy-pr-bot by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/381
Implement the permutable concept by @miscco in https://github.com/NVIDIA/cccl/pull/367
[NFC] We missed some _NOEXCEPT_ macro uses by @miscco in https://github.com/NVIDIA/cccl/pull/371
Implement identity changes for c++20 by @miscco in https://github.com/NVIDIA/cccl/pull/383
Hide third party cmake options in our cmake developer builds. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/300
Port cub::DeviceScanByKey tests to Catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/380
Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in https://github.com/NVIDIA/cccl/pull/399
Add commit information to the test output by @miscco in https://github.com/NVIDIA/cccl/pull/401
Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/387
Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/389
Add support for tests that should fail at runtime by @ahendriksen in https://github.com/NVIDIA/cccl/pull/418
Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/390
Project automation - Fix indentation for continue-on-error by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/425
[BUG] Ensure that all headers build on their own by @miscco in https://github.com/NVIDIA/cccl/pull/200
Remove util_device.cuh from iterator headers to enable online compilation by @leofang in https://github.com/NVIDIA/cccl/pull/412
Fix ci-overview example by @gevtushenko in https://github.com/NVIDIA/cccl/pull/428
Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/411
Add cuda::device::barrier_arrive_tx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/358
Fix CubDebug by @gevtushenko in https://github.com/NVIDIA/cccl/pull/430
Do not use static member functions to initialize static member variables. by @miscco in https://github.com/NVIDIA/cccl/pull/438
Implement the projected helper struct by @miscco in https://github.com/NVIDIA/cccl/pull/385
Add PTX wrapping functions for TMA features by @ahendriksen in https://github.com/NVIDIA/cccl/pull/379
Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in https://github.com/NVIDIA/cccl/pull/320
Enable lit to determine the compute architectures by @miscco in https://github.com/NVIDIA/cccl/pull/447
Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in https://github.com/NVIDIA/cccl/pull/434
Improve documentation of cuda::barrier by @ahendriksen in https://github.com/NVIDIA/cccl/pull/440
Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in https://github.com/NVIDIA/cccl/pull/413
Remove having two install rules for -header-search.cmake by @robertmaynard in https://github.com/NVIDIA/cccl/pull/298
Run .devcontainer/launch.sh with bash + add error checking by @wence- in https://github.com/NVIDIA/cccl/pull/407
Remove C++03 compatibility from unit tests by @Blonck in https://github.com/NVIDIA/cccl/pull/378
[libcu++] Fix use of __ppc64__ by @miscco in https://github.com/NVIDIA/cccl/pull/451
Update the README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/291
[libcu++] Try to avoid gcc miscompilation issues by @miscco in https://github.com/NVIDIA/cccl/pull/452
Consolidate matrix logic into single script/job by @jrhemstad in https://github.com/NVIDIA/cccl/pull/361
Implement the indirectly_comparable concept by @miscco in https://github.com/NVIDIA/cccl/pull/445
Fix compute matrix dropping trailing zeros by @jrhemstad in https://github.com/NVIDIA/cccl/pull/466
Avoid integer promotion warnings with MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/460
Implement ranges comparison objects by @miscco in https://github.com/NVIDIA/cccl/pull/464
Fix CUB/MSVC/RDC tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/469
Fix Thrust/CUB Linkage Issues by @gevtushenko in https://github.com/NVIDIA/cccl/pull/443
Script for Running CUB Benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/472
[skip ci] Add list of CCCL users to README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/474
constexpr all the things by @pb-dseifert in https://github.com/NVIDIA/cccl/pull/476
Add Gonzalo/Allard to trustees by @jrhemstad in https://github.com/NVIDIA/cccl/pull/482
Implement the sortable concept by @miscco in https://github.com/NVIDIA/cccl/pull/471
[libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in https://github.com/NVIDIA/cccl/pull/479
Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in https://github.com/NVIDIA/cccl/pull/454
Add ci scripts for windows by @miscco in https://github.com/NVIDIA/cccl/pull/251
Enable complex interop on MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/490
[skip ci] Add related projects to readme. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/492
Reenable nvrtc tests by @miscco in https://github.com/NVIDIA/cccl/pull/488
Implement the mergeable concept by @miscco in https://github.com/NVIDIA/cccl/pull/484
64-bit indexing for DeviceSegmentedReduce by @jecs in https://github.com/NVIDIA/cccl/pull/414
Implement move_sentinel by @miscco in https://github.com/NVIDIA/cccl/pull/496
Support skipped benches in run script by @gevtushenko in https://github.com/NVIDIA/cccl/pull/508
Implement unreachable_sentinel by @miscco in https://github.com/NVIDIA/cccl/pull/506
Disable flaky barrier tests by @miscco in https://github.com/NVIDIA/cccl/pull/510
Add constant initialization of managed variable to silence gcc warning by @miscco in https://github.com/NVIDIA/cccl/pull/509
Add verbose flag to ninja build. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/491
Add devcontainer readme by @jrhemstad in https://github.com/NVIDIA/cccl/pull/481
Add contributor guide by @jrhemstad in https://github.com/NVIDIA/cccl/pull/500
[skip ci] Fix devcontainer guide link by @jrhemstad in https://github.com/NVIDIA/cccl/pull/518
[skip ci] Add example godbolt link. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/519
Replace cuda::atomic with legacy functions for old arch compatibility. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/516
Simplify examples matrix. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/517
Disable PR workflow triggering on pushes to main. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/532
Add CI job to verify devcontainers are always up to date by @jrhemstad in https://github.com/NVIDIA/cccl/pull/514
[CI] Sink error when git repo is missing from build. by @wmaxey in https://github.com/NVIDIA/cccl/pull/533
Rework our tuple implementation to work with older MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/530
Add jobs using clang as CUDA compiler by @jrhemstad in https://github.com/NVIDIA/cccl/pull/493
Remove cudaDeviceSetSharedMemConfig from CUB tests by @gevtushenko in https://github.com/NVIDIA/cccl/pull/538
Implement __bounded_iter by @miscco in https://github.com/NVIDIA/cccl/pull/540
Fix cub::BlockAdjacentDifference documentation by @pauleonix in https://github.com/NVIDIA/cccl/pull/542
Add cuda::device::memcpy_async_tx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/405
Introduce Thrust benchmarks by @gevtushenko in https://github.com/NVIDIA/cccl/pull/534
Fix MSVC benchmarks build by @gevtushenko in https://github.com/NVIDIA/cccl/pull/536
Fix nvc++ as host compiler by @gevtushenko in https://github.com/NVIDIA/cccl/pull/560
Add missing overload definition of thrust::complex operator!= by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/564
Make template parameters consistent in thrust::complex operators by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/555
Migrate CI configs to CMake presets. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/324
Replace thrust::detail::integral_constant with libcudacxx implementation by @ZelboK in https://github.com/NVIDIA/cccl/pull/561
Add cuda::device::barrier_expect_tx by @ahendriksen in https://github.com/NVIDIA/cccl/pull/498
Add ARM build configs for latest gcc/clang. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/468
Fea/486 Improve thrust::complex operators compile time throughput by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/567
Define compiler env vars for CMake in dev containers. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/576
Revert back to working nvbench commit by @miscco in https://github.com/NVIDIA/cccl/pull/582
use clang-format in dev containers by @miscco in https://github.com/NVIDIA/cccl/pull/513
Introduce CCCL clang-format by @gevtushenko in https://github.com/NVIDIA/cccl/pull/551
Add cp.async.bulk global -> shared support to cuda::memcpy_async by @ahendriksen in https://github.com/NVIDIA/cccl/pull/501
[skip ci] Also update the base image by @miscco in https://github.com/NVIDIA/cccl/pull/584
Replace thrust::tuple implementation with cuda::std::tuple by @miscco in https://github.com/NVIDIA/cccl/pull/262
Fix clangd integration by @gevtushenko in https://github.com/NVIDIA/cccl/pull/588
Always treat CCCL as system headers by @miscco in https://github.com/NVIDIA/cccl/pull/531
Refactor inline comments by @gevtushenko in https://github.com/NVIDIA/cccl/pull/581
Relax Catch2 include order requirements by @gevtushenko in https://github.com/NVIDIA/cccl/pull/601
Project Automation - Fix issue/pr sync workflow by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/504
[skip-tests] Add a preset that builds all configs of all projects. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/580
Implement ranges::advance by @miscco in https://github.com/NVIDIA/cccl/pull/546
Update status check job to check status of precursor jobs by @jrhemstad in https://github.com/NVIDIA/cccl/pull/605
Report times for libcudacxx tests in CI by @jrhemstad in https://github.com/NVIDIA/cccl/pull/606
Fix bug in the construct_at optimization by @miscco in https://github.com/NVIDIA/cccl/pull/608
[skip-tests] Disable rdc tests for windows. by @miscco in https://github.com/NVIDIA/cccl/pull/615
Implement ranges::next by @miscco in https://github.com/NVIDIA/cccl/pull/611
Support FP8 in radix sort by @gevtushenko in https://github.com/NVIDIA/cccl/pull/623
Fix examples/cccl_infra mixup in ci. by @wmaxey in https://github.com/NVIDIA/cccl/pull/633
Fixes block-scope run-length decode one-past-the-end memory access into smem TempStorage by @elstehle in https://github.com/NVIDIA/cccl/pull/626
Harmonize CUB includes by @gevtushenko in https://github.com/NVIDIA/cccl/pull/632
Create NVRTCC, a utility for running tests under NVRTC by @wmaxey in https://github.com/NVIDIA/cccl/pull/494
Fix typo and grammar errors by @VaibhavWakde52 in https://github.com/NVIDIA/cccl/pull/639
[Backport branch/2.3.x] Add CCCL_VERSION and script for updating version by @github-actions in https://github.com/NVIDIA/cccl/pull/667
Backport 574 ptx by @miscco in https://github.com/NVIDIA/cccl/pull/663
[Backport branch/2.3.x] Fix C++11 support of recently added tests by @github-actions in https://github.com/NVIDIA/cccl/pull/658
[Backport branch/2.3.x] Update CUDA newest to CTK 12.3 by @github-actions in https://github.com/NVIDIA/cccl/pull/1072
[Backport to branch/2.3.x] Rework our system header approach to be more error proof (#661) by @miscco in https://github.com/NVIDIA/cccl/pull/675
[Backport branch/2.3.x] Fix fallback when checking git repo by @github-actions in https://github.com/NVIDIA/cccl/pull/1086
[Backport branch/2.3.x] Currently the verbose option does not work because of a typo in the argument handling by @github-actions in https://github.com/NVIDIA/cccl/pull/1090
[Backport branch/2.3.x] Add cuda::ptx::st_async by @github-actions in https://github.com/NVIDIA/cccl/pull/1093
[Backport branch/2.3.x] Add cuda::ptx::red_async by @github-actions in https://github.com/NVIDIA/cccl/pull/1094
Backport PR #1075 by @wmaxey in https://github.com/NVIDIA/cccl/pull/1100
[Backport branch/2.3.x] Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @github-actions in https://github.com/NVIDIA/cccl/pull/1106
[Backport branch/2.3.x] Fix cuda::ptx::red.async for int32_t types by @github-actions in https://github.com/NVIDIA/cccl/pull/1107
[Backport branch/2.3.x] Fix local test runs with lit by @github-actions in https://github.com/NVIDIA/cccl/pull/1110
[Backport branch/2.3.x] Fix config when only non-CDPv1 arches are enabled. by @github-actions in https://github.com/NVIDIA/cccl/pull/1111
[Backport branch/2.3.x] Fix GCC6 / FP8 warning by @github-actions in https://github.com/NVIDIA/cccl/pull/1131
[Backport branch/2.3.x] Fix ptx.st.async.compile.pass.cpp failing in C++11. by @github-actions in https://github.com/NVIDIA/cccl/pull/1136
BACKPORT: Fix _LIBCUDACXX_UNREACHABLE for old MSVC (#1114) by @miscco in https://github.com/NVIDIA/cccl/pull/1143
[2.3.x] Backport benchmarking PRs by @wmaxey in https://github.com/NVIDIA/cccl/pull/1168
Backport P0 filter commit. by @wmaxey in https://github.com/NVIDIA/cccl/pull/1172
[BACKPORT] Implement math functions for thrust::complex by @miscco in https://github.com/NVIDIA/cccl/pull/1191
Backport fix icc / cub (#1152) by @wmaxey in https://github.com/NVIDIA/cccl/pull/1171
[BACKPORT]: Fix availability of is_constant_evaluated on old MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/1198
[BACKPORT] Add icc to the ci matrix by @miscco in https://github.com/NVIDIA/cccl/pull/1209
[BACKPORT]: Add missing overloads for thrust::pow by @miscco in https://github.com/NVIDIA/cccl/pull/1223

New Contributors

@siboehm made their first contribution in https://github.com/NVIDIA/cccl/pull/264
@hageboeck made their first contribution in https://github.com/NVIDIA/cccl/pull/334
@Blonck made their first contribution in https://github.com/NVIDIA/cccl/pull/349
@leofang made their first contribution in https://github.com/NVIDIA/cccl/pull/412
@HapeMask made their first contribution in https://github.com/NVIDIA/cccl/pull/320
@jecs made their first contribution in https://github.com/NVIDIA/cccl/pull/414
@pauleonix made their first contribution in https://github.com/NVIDIA/cccl/pull/542
@srinivasyadav18 made their first contribution in https://github.com/NVIDIA/cccl/pull/564
@ZelboK made their first contribution in https://github.com/NVIDIA/cccl/pull/561
@VaibhavWakde52 made their first contribution in https://github.com/NVIDIA/cccl/pull/639

Full Changelog: https://github.com/NVIDIA/cccl/compare/v2.2.0...2.3.0

- C++
Published by wmaxey almost 2 years ago

cccl - CCCL 2.2.0

What's Changed

Add axis for docker builds by @raydouglass in https://github.com/NVIDIA/cccl/pull/1
Docker: Add support for ICPC and NVC++, install newer CMake, and add curl by @brycelelbach in https://github.com/NVIDIA/cccl/pull/4
Update excludes by @raydouglass in https://github.com/NVIDIA/cccl/pull/5
Docker: OS and CUDA upgrades, support for additional configurations by @brycelelbach in https://github.com/NVIDIA/cccl/pull/9
Docker: Add Thrust/CUB documentation toolchain to Ubuntu docker images by @brycelelbach in https://github.com/NVIDIA/cccl/pull/15
Re-enable CentOS images. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/16
Add sccache to dockerfile by @msadang in https://github.com/NVIDIA/cccl/pull/17
Update base containers. by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/18
Update sccache version by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/19
Build 11.5.1 containers by @ajschmidt8 in https://github.com/NVIDIA/cccl/pull/20
Add ops-bot.yaml by @jrhemstad in https://github.com/NVIDIA/cccl/pull/80
Monorepo workflow by @jrhemstad in https://github.com/NVIDIA/cccl/pull/99
Add devcontainers by @jrhemstad in https://github.com/NVIDIA/cccl/pull/105
Update the libcu++ submodule by @miscco in https://github.com/NVIDIA/cccl/pull/109
Update libcudacxx again by @miscco in https://github.com/NVIDIA/cccl/pull/110
Remove submodules from CI workflow by @jrhemstad in https://github.com/NVIDIA/cccl/pull/115
Fix CUB CI by @senior-zero in https://github.com/NVIDIA/cccl/pull/114
Fix async scan / counting iterator tests by @senior-zero in https://github.com/NVIDIA/cccl/pull/118
Make sccache work locally by @jrhemstad in https://github.com/NVIDIA/cccl/pull/113
Fix compilation of thrust and cub by @miscco in https://github.com/NVIDIA/cccl/pull/120
Fix segfault in cub::CachingDeviceAllocator by @senior-zero in https://github.com/NVIDIA/cccl/pull/119
Initial GH Infra Setup by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/23
Visualize variant space coverage by @senior-zero in https://github.com/NVIDIA/cccl/pull/125
Fix broken issue templates by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/124
Tune scan by key for SM90 by @senior-zero in https://github.com/NVIDIA/cccl/pull/121
Update PR template to more explicitly prompt for a linked issue closed by the PR by @jrhemstad in https://github.com/NVIDIA/cccl/pull/134
Change component section to more general "area" by @jrhemstad in https://github.com/NVIDIA/cccl/pull/132
Try and fix CI for old CTK by @miscco in https://github.com/NVIDIA/cccl/pull/116
Fix tuple_cat for std:: qualified types by @miscco in https://github.com/NVIDIA/cccl/pull/144
Add ccache to lit invocation by @miscco in https://github.com/NVIDIA/cccl/pull/147
Benchmark batched memcpy by @senior-zero in https://github.com/NVIDIA/cccl/pull/136
Properly query CMAKE_CUDA_COMPILER_LAUNCHER for ccache support by @miscco in https://github.com/NVIDIA/cccl/pull/152
Implement Three-Way Partition Tuning / Benchmark by @senior-zero in https://github.com/NVIDIA/cccl/pull/155
Port three-way partition to use Catch2 by @senior-zero in https://github.com/NVIDIA/cccl/pull/156
Add gcc-6 to the test matrix by @miscco in https://github.com/NVIDIA/cccl/pull/160
Tune reduce / unique by key for SM90 by @senior-zero in https://github.com/NVIDIA/cccl/pull/163
Remove unused folders by @miscco in https://github.com/NVIDIA/cccl/pull/145
Fix documentation of atomic_ref by @miscco in https://github.com/NVIDIA/cccl/pull/164
New iterator traits by @miscco in https://github.com/NVIDIA/cccl/pull/158
Improve implementation of destructible by @miscco in https://github.com/NVIDIA/cccl/pull/157
Build script improvements by @jrhemstad in https://github.com/NVIDIA/cccl/pull/149
Fix icpc / denormals by @senior-zero in https://github.com/NVIDIA/cccl/pull/185
Enable tests by @jrhemstad in https://github.com/NVIDIA/cccl/pull/167
Monorepo by @jrhemstad in https://github.com/NVIDIA/cccl/pull/194
Multi-benchmark tuning by @senior-zero in https://github.com/NVIDIA/cccl/pull/208
Fixes universal_vector test failure on CTK 11.1 & gcc-6 by @elstehle in https://github.com/NVIDIA/cccl/pull/209
Delete several directories for older CI infra. by @wmaxey in https://github.com/NVIDIA/cccl/pull/218
Memory-safe radix sort test by @senior-zero in https://github.com/NVIDIA/cccl/pull/222
[FEA] Implement iter_move CPO by @miscco in https://github.com/NVIDIA/cccl/pull/197
Build cub benchmarks in build_cub.sh by @jrhemstad in https://github.com/NVIDIA/cccl/pull/216
[skip-tests] Do not run tests when skip-tests is part of the latest commit message by @miscco in https://github.com/NVIDIA/cccl/pull/224
Factor out build job logic into a "run-as-coder" reusable workflow. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/205
Fix instances of 'scan' copy-pasted into reduction documentation by @milesvant in https://github.com/NVIDIA/cccl/pull/221
Add clangd to devcontainer by @senior-zero in https://github.com/NVIDIA/cccl/pull/225
Add initial CODEOWNERS file by @jrhemstad in https://github.com/NVIDIA/cccl/pull/226
Attempt to fix codeowners by @jrhemstad in https://github.com/NVIDIA/cccl/pull/231
Make libcudacxx respect CMake options for CUDA archs. by @wmaxey in https://github.com/NVIDIA/cccl/pull/235
Optimize Three-Way Partition by @senior-zero in https://github.com/NVIDIA/cccl/pull/228
[BUG] Rework how we handle feature test macros by @miscco in https://github.com/NVIDIA/cccl/pull/195
Enable use of cudaMemcpyAsync for thrust::copy by @miscco in https://github.com/NVIDIA/cccl/pull/211
Enable additional arguments in build_common.sh by @wmaxey in https://github.com/NVIDIA/cccl/pull/236
[BUG] Properly uglify all qualifiers in product headers by @miscco in https://github.com/NVIDIA/cccl/pull/201
Port cub::Device{Select, Partition} tests to catch2 by @miscco in https://github.com/NVIDIA/cccl/pull/229
Fix CUB tests / MSVC 2022 by @senior-zero in https://github.com/NVIDIA/cccl/pull/255
Ensure that any CMake re-rooting doesn't break our find_file by @miscco in https://github.com/NVIDIA/cccl/pull/257
[BUG] Fix compilation issues with MSVC 2017 by @miscco in https://github.com/NVIDIA/cccl/pull/196
Implement iterator concepts by @miscco in https://github.com/NVIDIA/cccl/pull/223
Tune Histogram on H100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/266
Add WarpExchangeAlgorithm customization for WarpExchange class by @pb-dseifert in https://github.com/NVIDIA/cccl/pull/256
[BUG]: Avoid deprecation warning for std::aligned_storage when building with c++23 by @miscco in https://github.com/NVIDIA/cccl/pull/258
Port cub::DeviceReduce tests to catch2 by @elstehle in https://github.com/NVIDIA/cccl/pull/267
Add support for nvcc-specific matrix. by @jrhemstad in https://github.com/NVIDIA/cccl/pull/243
Fix anchor link to cooperative groups in CUDA programming guide by @wence- in https://github.com/NVIDIA/cccl/pull/274
Fix BibTeX syntax in CITATION.md [skip-tests] by @wence- in https://github.com/NVIDIA/cccl/pull/276
Enforce C++17 for benches by @senior-zero in https://github.com/NVIDIA/cccl/pull/275
Project Automation: Move PR and Linked Issues to In Progress by @jarmak-nv in https://github.com/NVIDIA/cccl/pull/170
Update to 23.08 devcontainers and CUDA 12.2 by @jrhemstad in https://github.com/NVIDIA/cccl/pull/270
[skip-tests] CTK 12.2 tuning image by @senior-zero in https://github.com/NVIDIA/cccl/pull/282
Fix single-thread block reduction by @senior-zero in https://github.com/NVIDIA/cccl/pull/287
Tune Select and Partition on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/289
Fix CUB tests / MSVC by @senior-zero in https://github.com/NVIDIA/cccl/pull/292
Allow building CUB tests without cuRand by @senior-zero in https://github.com/NVIDIA/cccl/pull/250
Fixup to CUB build - s/curand/cudart/ by @wmaxey in https://github.com/NVIDIA/cccl/pull/301
Fix OOB in cub::DeviceRunLengthEncode::NonTrivialRuns by @senior-zero in https://github.com/NVIDIA/cccl/pull/294
Tune RLE on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/295
Tune scan on A100 by @senior-zero in https://github.com/NVIDIA/cccl/pull/302
Add new CCCL:: CMake targets by @allisonvacanti in https://github.com/NVIDIA/cccl/pull/244
Fix cudacc and nvcc mixup. by @wmaxey in https://github.com/NVIDIA/cccl/pull/329
[skip-tests] Use builtin for destructible concept on MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/333
Fix merge conflict from two inflight PRs by @miscco in https://github.com/NVIDIA/cccl/pull/338

New Contributors

@raydouglass made their first contribution in https://github.com/NVIDIA/cccl/pull/1
@brycelelbach made their first contribution in https://github.com/NVIDIA/cccl/pull/4
@msadang made their first contribution in https://github.com/NVIDIA/cccl/pull/17
@wmaxey made their first contribution in https://github.com/NVIDIA/cccl/pull/218
@milesvant made their first contribution in https://github.com/NVIDIA/cccl/pull/221
@pb-dseifert made their first contribution in https://github.com/NVIDIA/cccl/pull/256
@wence- made their first contribution in https://github.com/NVIDIA/cccl/pull/274

Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0

- C++
Published by jrhemstad over 2 years ago
