Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Ivorchu
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 20.5 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 11 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

Sparse GEMM Profiler with CUTLASS and AITemplate

This project implements a custom 2:4 structured sparse GEMM (General Matrix Multiplication) profiler using NVIDIA's CUTLASS library and Meta's AITemplate compiler framework. The profiler builds and benchmarks sparse tensor core GEMM kernels on Ampere GPUs, integrating metadata preprocessing and kernel configuration into AITemplate’s workflow.

📌 Key Features

  • Structured 2:4 sparsity with Tensor Core acceleration
  • Uses cutlass::gemm::device::SparseGemm for sparse kernel execution
  • Metadata reordering using cutlass::reorder_meta
  • Automatic profiler generation and kernel selection with AITemplate
  • Split-K support for parallelism along the reduction (K) dimension
  • Tensor inspection/debug tools via CUDA host/device memory copy

📁 Project Structure

```
AITemplate/
├── 3rdparty/
│   └── cutlass/                      # CUTLASS 2.x source
├── examples/sparse_test/
│   ├── sparse_test.py                # Entry point for benchmarking
│   └── [generated profiler .cu files]
├── python/aitemplate/
│   └── compiler/transform/profile/   # Profile logic and hooks
```

🚀 Getting Started

1. Build Docker Image

```bash
cd docker
./build.sh cuda
```

2. Launch the Container

```bash
./run.sh
```

3. Run Sparse Profiler

```bash
cd examples/sparse_test
python sparse_test.py
```

This will generate and run sparse GEMM profiler binaries with specific shape and split-K configs.

🧠 Technical Details

CUTLASS Sparse GEMM

CUTLASS provides SparseGemm, a class for structured sparse matrix multiplication on NVIDIA Ampere GPUs. It requires:

  • Operand A: dense activation
  • Operand B: 2:4 sparse weights (split into values and metadata)
  • Operand E: metadata tensor in CUTLASS-native format (reordered)
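To make the "values and metadata" split concrete, here is a minimal pure-Python sketch of 2:4 pruning and compression. It illustrates the logical format only: the function names are hypothetical, the magnitude-based pruning rule is an assumption, and the output is not the packed, reordered layout that CUTLASS expects for operand E.

```python
from typing import List, Tuple


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank positions by magnitude and keep the two largest.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def compress_2_of_4(row: List[float]) -> Tuple[List[float], List[int]]:
    """Split a 2:4 sparse row into kept values and their positions (0..3)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values: List[float] = []
    indices: List[int] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzero = [i for i, v in enumerate(group) if v != 0.0]
        if len(nonzero) > 2:
            raise ValueError("group violates the 2:4 pattern")
        # Pad with zero-valued positions so every group yields two entries.
        padding = [i for i in range(4) if i not in nonzero]
        for i in sorted((nonzero + padding)[:2]):
            values.append(group[i])
            indices.append(i)  # each position fits in 2 bits
    return values, indices


def decompress_2_of_4(values: List[float], indices: List[int]) -> List[float]:
    """Rebuild the sparse row from values and positions."""
    row: List[float] = []
    for pair in range(0, len(values), 2):
        group = [0.0] * 4
        for v, i in zip(values[pair:pair + 2], indices[pair:pair + 2]):
            group[i] = v
        row.extend(group)
    return row
```

The compressed row holds half as many values as the original, and each kept value carries a 2-bit position. That is why the sparse operand is stored with half the K extent plus a separate metadata tensor.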

Metadata Reordering

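```cpp
// Source and destination views; both wrap m_ptr with a row-major layout.
cutlass::TensorRef<ElementE, cutlass::layout::RowMajor> meta_src((ElementE*)m_ptr, meta_stride);
cutlass::TensorRef<ElementE, cutlass::layout::RowMajor> meta_dst((ElementE*)m_ptr, meta_stride);

// The K extent is halved by 2:4 compression, then divided by the number of
// elements that a single ElementE word describes.
cutlass::gemm::GemmCoord meta_extent(M, N, K / 2 / kElementsPerElementE);

cutlass::reorder_meta(meta_dst, meta_src, meta_extent);
```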
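The `K / 2 / kElementsPerElementE` expression above can be checked with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the default of 8 elements per metadata word assumes fp16 operands with a 16-bit `ElementE`. The real value should be read from the kernel type. Unlike C++ integer division, the helper rejects K values that do not divide evenly, which is a deliberate design choice for debugging.

```python
from typing import Tuple


def metadata_extent(m: int, n: int, k: int,
                    sparse_ratio: int = 2,
                    elements_per_e: int = 8) -> Tuple[int, int, int]:
    """Return the (M, N, K) extent passed to the metadata reorder step.

    sparse_ratio   -- 2 for 2:4 sparsity (half of K is kept)
    elements_per_e -- ASSUMED 8 (fp16, 16-bit metadata word); take the real
                      value from the kernel's kElementsPerElementE
    """
    divisor = sparse_ratio * elements_per_e
    if k % divisor != 0:
        raise ValueError(f"K={k} must be a multiple of {divisor}")
    return (m, n, k // sparse_ratio // elements_per_e)


# Example: a 128 x 256 x 512 problem needs 32 metadata words along K.
extent = metadata_extent(128, 256, 512)
```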

Kernel Arguments Setup

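```cpp
Gemm::Arguments arguments{
    cutlass::gemm::GemmCoord{M, N, K},  // problem size
    {a_ptr, a_stride},                  // operand A
    {b_ptr, b_stride},                  // operand B
    {c_ptr, c_stride},                  // source C
    {d_ptr, d_stride},                  // destination D
    {meta_ptr, meta_stride},            // metadata E
    {alpha, beta},                      // epilogue scalars
    split_k_slices                      // number of split-K slices
};
```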
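The `split_k_slices` argument controls how the reduction dimension is partitioned. The following pure-Python sketch shows the idea on small dense matrices: each slice computes a partial product over its own range of K, and the partial results are summed. The function names are hypothetical and the serial loop only stands in for work the GPU kernel would run in parallel. It is not how CUTLASS implements the reduction.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference product of an M x K matrix and a K x N matrix."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a: Matrix, b: Matrix, split_k_slices: int) -> Matrix:
    """Compute A @ B by summing partial products over slices of K."""
    if split_k_slices < 1:
        raise ValueError("split_k_slices must be at least 1")
    m, k, n = len(a), len(b), len(b[0])
    slice_k = -(-k // split_k_slices)  # ceiling division
    out = [[0.0] * n for _ in range(m)]
    for s in range(split_k_slices):
        k_begin = s * slice_k
        k_end = min(k_begin + slice_k, k)  # trailing slices may be empty
        for i in range(m):
            for j in range(n):
                out[i][j] += sum(a[i][p] * b[p][j]
                                 for p in range(k_begin, k_end))
    return out
```

Splitting K helps when M and N are too small to fill the GPU on their own, at the cost of an extra reduction over the partial results.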

Tensor Debugging

```cpp
std::vector<cutlass::half_t> host_B(b_size);
cudaMemcpy(host_B.data(), b_ptr, b_size * sizeof(cutlass::half_t), cudaMemcpyDeviceToHost);
for (int i = 0; i < b_size; ++i) {
    std::cout << __half2float(static_cast<__half>(host_B[i])) << " ";
}
```

📈 Profiling Output

AITemplate generates profiler binaries and stores them in:

/tmp/aitemplate_cache/<hash>/

Kernel selection reuses cached profiling results unless force_profile() returns True.
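The cache-or-profile decision can be pictured with a short sketch. Everything here is hypothetical and for illustration only: `select_kernel`, the dictionary cache, and the `force` flag stand in for whatever AITemplate does internally, with `force` playing the role of the `force_profile()` result.

```python
from typing import Callable, Dict


def select_kernel(cache: Dict[str, str],
                  key: str,
                  run_profiler: Callable[[str], str],
                  force: bool = False) -> str:
    """Return the best kernel for `key`, profiling only when needed.

    cache        -- maps a problem key (e.g. shape + split-K) to a kernel name
    run_profiler -- hypothetical callback that benchmarks and returns a name
    force        -- when True, ignore the cache and profile again
    """
    if not force and key in cache:
        return cache[key]
    best = run_profiler(key)
    cache[key] = best  # refresh the cache with the new result
    return best
```

Forcing a re-profile is useful after changing the generated kernels, since a stale cache entry would otherwise hide the new timings.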

🛠 Common Issues

| Issue | Fix |
|------------------------------------------|----------------------------------------------------------------------|
| cudaFuncSetAttribute failed | Ensure shared memory usage < 48 KB or compile with -maxrregcount |
| Illegal memory access in profiling | Check metadata pointer and meta_stride validity |
| `cutlass::half_t` to `__half` conversion | Use `.raw()` or explicit cast to `__half` |
| Invalid tensor layout | Make sure B is column-major if using RCR kernel |

🙋 Author

This project is developed by Ivor, a Computer Engineering student at Purdue University, as part of research on structured sparsity, AI acceleration, and compiler-accelerated inference.

📄 License

This project is for research and educational purposes. CUTLASS and AITemplate are licensed under their respective open-source licenses.

Owner

  • Name: Ivor Chu
  • Login: Ivorchu
  • Kind: user
  • Company: IvorChu

GitHub Events

Total
  • Push event: 6
  • Create event: 1
Last Year
  • Push event: 6
  • Create event: 1

Dependencies

.github/workflows/docs.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/pages.yaml actions
  • actions/checkout v3 composite
  • actions/configure-pages v2 composite
  • actions/deploy-pages v1 composite
  • actions/setup-python v2 composite
  • actions/upload-pages-artifact v1 composite
.github/workflows/pylint.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/rocm_ci.yml actions
  • actions/checkout v2 composite
docker/install/rocm_dev-requirements.txt pypi
  • ROCmSoftwarePlatform * development
  • danmar * development
fx2ait/setup.py pypi
  • torch *
python/setup.py pypi
  • jinja2 *