Recent Releases of cutlass

cutlass - CUTLASS 4.1.0

CuTe DSL

  • Add aarch64 support; you can now pip install nvidia-cutlass-dsl on GB200 systems!
  • More examples demonstrating how to use CuTe DSL to write peak-performance kernels:
    • Blackwell Mamba2 SSD
    • Blackwell SM100 persistent dense blockscaled GEMM with static scheduling
  • API updates; please refer to FUNCTIONALITY.md for details.

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77:
    • Add variable sequence length support for the FMHA backward kernel.
    • Add varlen test support to the backward runner.
    • Support empty batch sequences.
  • Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays.
  • CuTe changes:
    • Rewrite ArithTuple and ScaledBasis for robustness and clarity.
    • Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
    • Factor out and rewrite print_latex and friends.
    • Factor out and rewrite print_svg and friends.
  • Support Blackwell SM100 SIMT packed fp32x2 kernels.
  • Support residual add for implicit GEMM kernels.
  • Various fixes for the CUTLASS C++ Python interface's EVT tracer:
    • Add a verifier for SM90 to report invalid input.
    • When adding an edge to the graph, if the edge already exists, insert an identity compute node to avoid multiple parallel edges.
    • Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
    • Replace the NotImplementedError with a fallback that packs all nodes into a single topological visitor node.
  • Fix profiler bugs in exhaustive performance search:
    • Fix incorrect cluster shape output when doing exhaustive search.
    • Fix a bug in profiler grouped GEMM when setting tile scheduler swizzles, cluster shapes, and raster orders.
  • Other profiler fixes:
    • Complete the reference for Blackwell blockwise GEMM kernels.
    • Fix incorrect regex logic for the L1 test.

- C++
Published by hwu36 7 months ago


cutlass - CUTLASS 4.0.0

CuTe DSL

CuTe DSL is a Python DSL centered on CuTe's abstractions.

  • Enables authoring kernels in Python that reach peak performance on NVIDIA GPUs
  • Core DSL implementation files
  • DSL quick start
  • DSL overview
  • Educational notebooks for getting started with CuTe DSL

CUTLASS C++

- C++
Published by kerrmudgeon 8 months ago

cutlass - CUTLASS 3.9.2

  • Fixed Blockwise and Groupwise GEMM hang issue when problem size K is 128.
  • Optimal code generation with CUDA toolkit version 12.9.

- C++
Published by hwu36 10 months ago

cutlass - CUTLASS 3.9.1

  • Fixed a grouped GEMM hang issue in CUTLASS 3.x.
  • Improved Hopper Blockwise and Groupwise GEMM performance.

- C++
Published by hwu36 10 months ago

cutlass - CUTLASS 3.9.0

- C++
Published by hwu36 10 months ago

cutlass - CUTLASS 3.8.0

CUTLASS 3.8 is the first release to support the NVIDIA Blackwell SM100 architecture. For background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.

- C++
Published by hwu36 about 1 year ago

cutlass - CUTLASS 3.7.0

  • A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
  • Distributed GEMM is an experimental pipelined tensor-parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide most of the communication behind computation.
  • Improved persistent grid launch for Hopper kernels with large cluster sizes (size 4 or greater) using the new make_kernel_hardware_info API, as shown in example 48.
  • Enabled high precision accumulation for Hopper FP8 Sparse GEMM.

- C++
Published by hwu36 about 1 year ago

cutlass - CUTLASS 3.6.0

- C++
Published by hwu36 about 1 year ago

cutlass - CUTLASS 3.5.1

- C++
Published by hwu36 over 1 year ago

cutlass - CUTLASS 3.5.0

  • Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
    • Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
    • Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
    • Support for Fprop, Dgrad, and Wgrad algorithms.
    • CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
    • NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
  • Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
  • Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
    • Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
    • Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
  • 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
  • Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series.
  • Extensions to CuTe to support L2 prefetching and TMA store+reductions.
  • Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
  • Fixes to greatly reduce build warnings.
  • Updates and bugfixes from the community (thanks!)
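The implicit GEMM convolutions above compute the same result as an explicit im2col gather followed by a GEMM, except that the gather happens on the fly inside the mainloop rather than being materialized in memory. A minimal NumPy sketch of the underlying math (illustrative only: unit stride, no padding; the function name is hypothetical and this is not the CUTLASS implementation):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """Fprop 2D convolution (NHWC activations, KRSC filters) expressed as
    explicit im2col + GEMM -- the math implicit-GEMM kernels compute on the fly."""
    n, h, wd, c = x.shape
    k, r, s, _ = w.shape
    p, q = h - r + 1, wd - s + 1          # output extent (no padding, unit stride)
    # Gather every (r, s, c) receptive field into one row of the A matrix.
    cols = np.empty((n * p * q, r * s * c), dtype=x.dtype)
    idx = 0
    for ni in range(n):
        for pi in range(p):
            for qi in range(q):
                cols[idx] = x[ni, pi:pi + r, qi:qi + s, :].ravel()
                idx += 1
    # GEMM: (N*P*Q, R*S*C) @ (R*S*C, K) -> (N*P*Q, K), then fold back to NPQK.
    out = cols @ w.reshape(k, -1).T
    return out.reshape(n, p, q, k)
```

The implicit formulation never materializes the cols matrix; instead it computes the gather addresses while streaming tiles of the activation tensor through the GEMM mainloop.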

- C++
Published by hwu36 almost 2 years ago

cutlass - CUTLASS 3.4.1

- C++
Published by hwu36 about 2 years ago

cutlass - CUTLASS 3.4.0

  • Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
  • Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
  • Beta release of Group-GEMM, commonly used in the optimization of Mixture-of-Experts models, now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above.)
  • Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
  • Improvements to NamedBarriers including details of ReservedNamedBarriers used within the CUTLASS library.
  • Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.

- C++
Published by hwu36 about 2 years ago

cutlass - CUTLASS 3.3.0

  • New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
  • New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B ({fp16, bf16} x {s8, u8}) and upcast on operand A ({s8, u8} x {fp16, bf16}). They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
  • New Copy Async based Hopper GEMMs, which support input tensors with alignment lower than 16B (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and the SM80_CP_ASYNC_CACHE_* copy ops were also added.
  • EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
  • Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
  • Support for Clang as a host compiler.
  • Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface

- C++
Published by hwu36 about 2 years ago

cutlass - CUTLASS 3.2.2

Bug fix for illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.

- C++
Published by hwu36 over 2 years ago

cutlass - CUTLASS 3.2.1

  • Python support for the SM90 Epilogue Visitor Tree (EVT), on top of the C++ support released in 3.2.0.
  • SM80 EVT support in C++ and Python.
  • Other SM90 epilogue improvements.
  • Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See https://github.com/NVIDIA/cutlass/discussions/1105 for details.
  • Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
  • SM90 TF32 kernel improvements for all layouts.
  • SM90 rasterization direction support in the CUTLASS profiler.
  • Improvement for CUTLASS profiler build times.
  • Remove Python-C++ bindings.

- C++
Published by hwu36 over 2 years ago

cutlass - CUTLASS 3.2

  • New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters, along with an example showcasing Hopper warp-specialized FP8 GEMMs.
  • New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVT allows user-defined customized epilogue fusion patterns without having to write a new epilogue.
  • Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
  • Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
  • Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
  • New CUTLASS 2D Convolution Python interface. New example here.
  • Support for Windows (MSVC) builds.
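The Epilogue Visitor Tree mentioned above describes the epilogue as a tree of elementwise operations applied to the GEMM accumulator before it is stored. As a rough NumPy analogue of one fusion pattern such a tree could express (the function name and the specific pattern D = ReLU(alpha * acc + beta * C + bias) are illustrative, not the CUTLASS API):

```python
import numpy as np

def fused_epilogue(acc, c, bias, alpha=1.0, beta=1.0):
    """One example of an epilogue fusion an EVT can describe:
    D = ReLU(alpha * acc + beta * C + bias), applied elementwise to the
    accumulator tile before it is written to global memory."""
    d = alpha * acc + beta * c + bias   # linear combination plus row bias
    return np.maximum(d, 0.0)           # ReLU activation node of the tree
```

The point of EVT is that the whole tree runs inside the kernel's epilogue, so no intermediate tensor ever round-trips through global memory.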

- C++
Published by hwu36 over 2 years ago

cutlass - CUTLASS 3.1

  • New CUTLASS Python interface that aims to provide an easy-to-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and in new examples.
  • New efficient epilogues using TMA for Hopper.
  • Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
  • New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
  • An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
  • Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
  • Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
  • Performance optimizations for the warp-specialized persistent ping-pong kernel.
  • Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
  • FMHA Backward Pass from Meta xFormers.
  • Stream-K GEMM with Broadcast enables epilogue broadcast with stream-K GEMM.
  • Batched B2B GEMM can now run multiple back-to-back GEMMs with the same problem size in parallel.
  • Batched Strided GEMV supports both row-major and column-major input matrices.
  • Permute + GEMM fusion can now fuse Permute with a following GEMM. Previously, only fusing GEMM with Permute in the epilogue was supported.
  • Row Broadcast can be fused in the epilogue.
  • The GitHub branch is renamed from master to main in this release.
  • Optimal performance using CUDA 12.1
  • Updates and bugfixes from the community (thanks!)

- C++
Published by hwu36 almost 3 years ago

cutlass - CUTLASS 3.0

3.0.0 (2023-01-23)

- C++
Published by hwu36 almost 3 years ago

cutlass - CUTLASS 2.11

2.11.0 (2022-11-19)

  • Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
  • Fused multi-head attention Kernel. It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel.
  • Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
  • Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
  • BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.
  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.
  • Optimized Group Conv for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.
  • Optimized DepthWise Conv. Two new modes are added:
    • kOptimized - uses direct convolution to compute instead of implicit GEMM.
      • Restrictions: 1) the input channel count, output channel count, and group number must each be a multiple of (128 / sizeof(input element)); 2) the input filter size must match the template parameter configuration.
    • kFixedStrideDilation - puts the stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs resident in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
      • Restrictions: 1) the input channel count, output channel count, and group number must each be a multiple of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation must match the template parameter configuration.
  • Scripts to fuse multiple back-to-back GEMMs. The implementation was discussed in a GTC'22 Spring talk.
  • FP8 data type definition and conversion routines.
  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
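To see why stream-K reduces the need for tile-size profiling: split-K divides every output tile's K loop by one fixed factor, while stream-K hands each CTA an equal-length contiguous span of the flattened tile-by-K iteration space, so the load balances for any problem shape. A schematic Python sketch of that partitioning arithmetic (illustrative only, not the CUTLASS scheduler; tiles whose K iterations straddle a boundary are fixed up by a cross-CTA reduction):

```python
def stream_k_partition(num_tiles, k_iters, num_ctas):
    """Assign each CTA a contiguous [start, end) range of the flattened
    iteration space of num_tiles * k_iters MAC-loop iterations."""
    total = num_tiles * k_iters
    per_cta = total // num_ctas
    extra = total % num_ctas            # first `extra` CTAs take one more iteration
    ranges, start = [], 0
    for cta in range(num_ctas):
        end = start + per_cta + (1 if cta < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Because every CTA gets within one iteration of the same amount of work regardless of the tile count, one tile size performs well across many problem sizes.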

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2

- C++
Published by kerrmudgeon about 3 years ago

cutlass - CUTLASS 2.10.0

CUTLASS 2.10.0

  • CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors.
  • Optimizations for CUTLASS's Grouped GEMM kernel. It can move some scheduling into the host side when applicable.
  • Optimizations for GEMM+Softmax.
  • Grouped GEMM for Multihead Attention: a general MHA that does not require equal sequence lengths in every GEMM.
  • GEMM + layer norm fusion for Ampere: can fuse the layernorm into the GEMMs before and after it.
  • GEMM epilogue permutation fusion: can permute the GEMM output before storing.
  • Grouped convolution targeting implicit GEMM: the first group convolution implementation in CUTLASS. It is an Analytical implementation, not an Optimized one.
  • Depthwise separable convolution: the first depthwise convolution, which is also Analytical for now.
  • Standalone Layernorm and Pooling kernels.
  • Back-to-back GEMM enhancements.
  • Updates and bugfixes from the community (thanks!)

- C++
Published by hwu36 over 3 years ago

cutlass - CUTLASS 2.9.1

Bug fixes, performance tuning, and enhancements to documentation.

- C++
Published by kerrmudgeon over 3 years ago

cutlass - CUTLASS 2.9.0

CUTLASS 2.9.0

  • First-layer convolution kernels specialized for small channel counts and reduced alignment:
    • Few-channels specialization for reduced alignment capabilities
    • Fixed-channels specialization, further specialized for when the channel count perfectly matches the access vector size
    • Unit tests
  • Python-based instance emitter in the CUTLASS Library and support in the Profiler
  • BLAS3 operators accelerated by Tensor Cores:
    • Supported types: f32, cf32, f64, cf64
    • HERK, SYRK, SYMM, and TRMM, each with an emitter
    • Unit tests
  • CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python:
    • Python-based runtime interoperable with existing emitters
  • GEMM + Softmax example
  • Optimal performance using CUDA 11.6u2
  • Updates and bugfixes from the community (thanks!)

- C++
Published by kerrmudgeon almost 4 years ago

cutlass - CUTLASS 2.8

- C++
Published by kerrmudgeon about 4 years ago

cutlass - CUTLASS 2.7

2.7.0

- C++
Published by kerrmudgeon over 4 years ago

cutlass - CUTLASS 2.6.1

  • Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
  • Tuning for GEMMs fused with partial reductions
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

- C++
Published by kerrmudgeon over 4 years ago

cutlass - CUTLASS 2.6.0

CUTLASS 2.6.0

  • Optimal performance when compiled with the CUDA 11.4 Toolkit
  • Fused operators with GEMM and Convolution
  • 64b tensor strides and leading dimensions support for GEMMs
  • Affine rank=2 matrix layouts
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Accelerates the previous implementation by cutting down redundant math by 4x
    • Supported via the new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
  • Many improvements to the epilogue.
    • Provide an option to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels for Pascal and Ampere.
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

- C++
Published by kerrmudgeon over 4 years ago

cutlass - CUTLASS 2.5.0

CUTLASS 2.5 is a minor release contributing:

  • Tensor reductions:
    • m-to-n reductions of tensors with affine layout
    • Specializations for reductions including the contiguous dimension
    • Specializations for reductions excluding the contiguous dimension
    • Custom reduction functors such as cutlass::logical_and
    • Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
  • Optimizations for 3-D convolution:
    • Optimized tile iterators using a precomputed delta table for 3-D convolution
    • Full coverage of forward and backward passes for 3-D convolution
  • Fused Convolution+Convolution example
  • Corrections and bug fixes reported by the CUTLASS community (thank you for filing these issues!)
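As a rough NumPy analogue of the reduction features above: an m-to-n reduction collapses selected modes of a tensor under a binary operator, which may be a custom functor such as logical AND (NumPy stands in for the CUTLASS device-side reduction here):

```python
import numpy as np

# Reduce a rank-3 tensor to rank-2 along the contiguous (last) dimension,
# first with the usual addition functor, then with logical_and -- the NumPy
# analogue of supplying cutlass::logical_and as a custom reduction operator.
x = np.arange(24).reshape(2, 3, 4)
sums = x.sum(axis=-1)                            # 3-to-2 reduction, op = plus
mask = (x % 2 == 0)
all_even = np.logical_and.reduce(mask, axis=-1)  # 3-to-2 reduction, op = logical_and
```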

- C++
Published by kerrmudgeon almost 5 years ago

cutlass - CUTLASS 2.4.0

CUTLASS 2.4

  • Implicit GEMM convolution kernels supporting CUDA Cores and Tensor Cores on NVIDIA GPUs
    • Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
    • Data types: FP32, complex, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
    • Spatial dimensions: 1-D, 2-D, and 3-D
    • Layout: NHWC, NCxHWx
  • Implicit GEMM convolution components:
    • Global memory iterators supporting Fprop, Dgrad, and Wgrad
    • MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
    • MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
    • Documentation describing Implicit GEMM Convolution algorithm and implementation

- C++
Published by kerrmudgeon about 5 years ago

cutlass - CUTLASS 2.3

CUTLASS 2.3

  • NVIDIA Ampere Architecture features:
    • Sparse Tensor Core GEMM kernels: direct access to Sparse Tensor Cores and maximum performance via mma.sp.sync
    • Fast SGEMM targeting GeForce RTX 30-series CUDA Cores
  • Minor features:
    • Activation functions such as GeLU and Sigmoid
    • Small matrix and quaternion template classes in device code
    • Floating-point constants
  • NVIDIA Ampere GPU Architecture examples and documentation:
    • Tensor Float 32
    • Sparse Tensor Cores
  • Documentation added on the CUTLASS efficient row-major epilogue

- C++
Published by kerrmudgeon over 5 years ago

cutlass - CUTLASS 2.2

  • NVIDIA Ampere Architecture features
    • Fast Tensor Core operations:
    • Maximum performance via mma.sync
    • Tensor Float 32, BFloat16, and double-precision data types
    • Mixed integer data types (int8, int4, bin1)
    • Asynchronous copy for deep software pipelines via cp.async
    • Described in GTC 2020 Webinar (SR 21745) (free registration required)
  • Features:
    • SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
    • Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
    • Gaussian complex GEMMs using 3m complex multiply algorithm
    • Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
  • Policy updates:
    • CUDA 11 Toolkit needed to enable NVIDIA Ampere Architecture features
    • Disabled F16C by default for compatibility - enable on cmake command line with -DCUTLASS_ENABLE_F16C=ON
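The Gaussian complex GEMMs above rely on the 3m identity, which forms a complex product with three real multiplications instead of the naive four, at the cost of extra additions. A scalar Python sketch (the same identity applies elementwise to the real-valued GEMMs underlying a complex GEMM):

```python
def complex_mul_3m(ar, ai, br, bi):
    """Gaussian / 3m complex multiplication: three real multiplies
    instead of four.  real = t1 - t2, imag = t3 - t1 - t2."""
    t1 = ar * br
    t2 = ai * bi
    t3 = (ar + ai) * (br + bi)
    return t1 - t2, t3 - t1 - t2
```

For GEMM, each real multiply becomes a full real-valued matrix product, so the trick trades one GEMM for a few cheap matrix additions.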

- C++
Published by kerrmudgeon over 5 years ago

cutlass - CUTLASS 2.1

Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores

  • Computes complex matrix products on matrices stored as disjoint real and imaginary parts
  • SDK examples of planar complex GEMMs
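With planar complex storage, the real and imaginary parts live in disjoint real-valued tensors, so the complex product decomposes into four real GEMMs. A NumPy sketch of that decomposition (illustrative; the function name is hypothetical, not the CUTLASS API):

```python
import numpy as np

def planar_complex_gemm(a_re, a_im, b_re, b_im):
    """Complex matrix product on planar-complex operands (real and imaginary
    parts stored as disjoint real-valued matrices):
    C = (Ar + i*Ai) @ (Br + i*Bi), computed as four real GEMMs."""
    c_re = a_re @ b_re - a_im @ b_im
    c_im = a_re @ b_im + a_im @ b_re
    return c_re, c_im
```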

BLAS-style host-side API added to the CUTLASS Library

  • API to launch compiled kernel instances for GEMM and planar complex GEMM

Minor enhancements and bug fixes

- C++
Published by kerrmudgeon almost 6 years ago

cutlass - CUTLASS 2.0

Substantially refactored for:

  • Better performance, particularly for native Turing Tensor Cores
  • Robust and durable templates spanning the design space
  • Encapsulated functionality embodying modern C++11 programming techniques
  • Optimized containers and data types for efficient, generic, portable device code

Updates to:

  • Quick start guide
  • Documentation
  • Utilities
  • CUTLASS Profiler

Native Turing Tensor Cores:

  • Efficient GEMM kernels targeting Turing Tensor Cores
  • Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:

  • GEMM kernels targeting CUDA Cores and Tensor Cores in NVIDIA GPUs
  • Volta Tensor Cores through native mma.sync and through the WMMA API
  • Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
  • Batched GEMM operations
  • Complex-valued GEMMs

Note: a host compiler supporting C++11 or greater is required.

- C++
Published by kerrmudgeon over 6 years ago

cutlass - CUTLASS 1.3.3

Final tagged release of CUTLASS 1.x branch.

- C++
Published by kerrmudgeon over 6 years ago

cutlass - CUTLASS 1.3.2

Performance enhancement for the Volta Tensor Cores TN layout:

  • Fixed a performance defect with indirect access to the pointer array for the Volta Tensor Cores TN arrangement.

- C++
Published by kerrmudgeon over 6 years ago

cutlass - CUTLASS 1.3.0

CUTLASS 1.3 adds efficient GEMM kernels targeting Volta Tensor Cores via the mma.sync instruction added in CUDA 10.1.

- C++
Published by kerrmudgeon almost 7 years ago

cutlass - CUTLASS 1.2

CUTLASS 1.2.0 (2018-10-26)

  • Parallelized reductions across threadblocks ("Split-K")
  • Improved IGEMM performance
  • Batched strided WMMA GEMMs

- C++
Published by kerrmudgeon over 7 years ago

cutlass - CUTLASS 1.1

The CUTLASS 1.1.0 release adds:

  • Documentation
  • Examples
  • Turing features
  • Batched strided GEMM
  • Threadblock rasterization strategies
  • Extended CUTLASS core components
  • Enhanced CUTLASS utilities

- C++
Published by kerrmudgeon over 7 years ago

cutlass - CUTLASS 1.0.1

CUTLASS 1.0.1.

Intra-threadblock reduction added for small threadblock tile sizes:

  • sgemm64x128x16, sgemm128x128x16, sgemm128x64x16, sgemm128x32x16, sgemm64x64x16, sgemm64x32x16
  • igemm_32x32x128
  • GEMM K residue handled during the prologue, prior to the mainloop

Replaced the bundled copy of Google Test with a git submodule. Initialize it with git submodule init.

- C++
Published by kerrmudgeon over 7 years ago

cutlass - CUTLASS 1.0.0

CUTLASS v1.0.0

- C++
Published by kerrmudgeon almost 8 years ago

cutlass - CUTLASS 0.1.1

Final patch of CUTLASS v0.1.

- C++
Published by kerrmudgeon almost 8 years ago

cutlass - CUTLASS 0.1.0

CUTLASS initial release.

- C++
Published by kerrmudgeon about 8 years ago