aitemplate-nm-pruning

https://github.com/ivorchu/aitemplate-nm-pruning

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low (9.7%)

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: Ivorchu
License: apache-2.0
Language: Python
Default Branch: main
Size: 20.5 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 12 months ago · Last pushed 11 months ago

Metadata Files

Readme, Contributing, License, Code of conduct, Citation

Sparse GEMM Profiler with CUTLASS and AITemplate

This project implements a custom 2:4 structured sparse GEMM (General Matrix Multiplication) profiler using NVIDIA's CUTLASS library and Meta's AITemplate compiler framework. The profiler builds and benchmarks sparse tensor core GEMM kernels on Ampere GPUs, integrating metadata preprocessing and kernel configuration into AITemplate’s workflow.

📌 Key Features

Structured 2:4 sparsity with Tensor Core acceleration
Uses cutlass::gemm::device::SparseGemm for sparse kernel execution
Metadata reordering using cutlass::reorder_meta
Automatic profiler generation and kernel selection with AITemplate
Split-K support for parallelism along the reduction (K) dimension
Tensor inspection/debug tools via CUDA host/device memory copy
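Structured 2:4 sparsity means that in every group of four consecutive values along the reduction (K) dimension, at most two are non-zero. The sketch below prunes a dense row to that pattern by keeping the two largest magnitudes per group. Magnitude-based selection is an assumption for illustration (this README does not say how the weights are pruned), and `prune_2_4` is a hypothetical helper, not part of CUTLASS or AITemplate.

```python
from typing import List, Sequence


def prune_2_4(row: Sequence[float]) -> List[float]:
    """Zero all but the two largest-magnitude values in each group of four.

    `row` is one matrix row walked along K; its length must be a multiple of 4.
    """
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    pruned = [0.0] * len(row)
    for start in range(0, len(row), 4):
        group = range(start, start + 4)
        # Keep the two positions with the largest |value|; the sort is stable,
        # so on ties the earlier position wins.
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in keep:
            pruned[i] = row[i]
    return pruned


w = [1.0, -4.0, 3.0, 2.0, 0.5, 0.1, -0.2, 0.3]
print(prune_2_4(w))  # [0.0, -4.0, 3.0, 0.0, 0.5, 0.0, 0.0, 0.3]
```

Pruning alone does not save any work; the speedup comes from storing only the kept values plus their positions, which is what the sparse Tensor Core kernels consume.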

📁 Project Structure

```
AITemplate/
├── 3rdparty/
│   └── cutlass/                          # CUTLASS 2.x source
├── examples/sparse_test/
│   ├── sparse_test.py                    # Entry point for benchmarking
│   └── [generated profiler .cu files]
├── python/aitemplate/
│   └── compiler/transform/profile/       # Profile logic and hooks
```

🚀 Getting Started

1. Build Docker Image

```bash
cd docker
./build.sh cuda
```

2. Launch the Container

```bash
./run.sh
```

3. Run Sparse Profiler

```bash
cd examples/sparse_test
python sparse_test.py
```

This generates and runs sparse GEMM profiler binaries for the problem shapes and split-K settings defined in the test script.

🧠 Technical Details

CUTLASS Sparse GEMM

CUTLASS provides SparseGemm, a class for structured sparse matrix multiplication on NVIDIA Ampere GPUs. It requires:

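Operand A: the 2:4 structured-sparse matrix, stored compressed (M × K/2 kept values)
Operand B: the dense matrix (K × N)
Operand E: metadata recording which two of every four A values were kept, reordered into CUTLASS's native layout

In CUTLASS's SparseGemm the sparse operand is A, so 2:4-pruned weights have to be placed in the A position (for example by computing the transposed product), with the activations as the dense B operand.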
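Concretely, a 2:4 sparse matrix is stored as two pieces: the kept values (half of K per row) and, for each group of four, the positions of the two kept values. The sketch below performs that split for one row. The bit layout (one 4-bit code per group, lower position in bits 0-1 and higher position in bits 2-3, codes packed from the least-significant end of a 16-bit word) is an assumption for illustration; verify it against CUTLASS's host reference utilities (for example `cutlass/util/host_uncompress.h`) before relying on it. The output corresponds to the plain row-by-row metadata that is fed into reordering, not the reordered tensor. `compress_2_4` and `pack_codes` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple


def compress_2_4(row: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Split a 2:4-sparse row into kept values and one 4-bit code per group."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    values: List[float] = []
    codes: List[int] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nz = [i for i, v in enumerate(group) if v != 0]
        if len(nz) > 2:
            raise ValueError(f"group at column {start} is not 2:4 sparse")
        # Pad groups with fewer than two non-zeros using unused positions.
        for i in range(4):
            if len(nz) == 2:
                break
            if i not in nz:
                nz.append(i)
        idx0, idx1 = sorted(nz)  # positions must be increasing within a group
        values += [group[idx0], group[idx1]]
        codes.append(idx0 | (idx1 << 2))  # assumed layout: idx0 low, idx1 high
    return values, codes


def pack_codes(codes: Sequence[int], e_bits: int = 16) -> List[int]:
    """Pack 4-bit codes into e_bits-wide words, first code in the lowest bits."""
    per_word = e_bits // 4
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for j, code in enumerate(codes[start:start + per_word]):
            word |= code << (4 * j)
        words.append(word)
    return words


vals, codes = compress_2_4([0.0, 3.0, 0.0, 5.0, 7.0, 0.0, 2.0, 0.0])
print(vals, codes, pack_codes(codes))  # [3.0, 5.0, 7.0, 2.0] [13, 8] [141]
```

Groups with fewer than two non-zeros are padded with zero values at unused positions, so every group still contributes exactly two entries and each compressed row is always K/2 long.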

Metadata Reordering

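```cpp
#include "cutlass/util/host_reorder.h"
// reorder_meta reads src and writes dst element by element on the host, so the two
// TensorRefs must point to distinct host buffers (host_meta* are illustrative names).
using ElementE = typename Gemm::ElementE;
int const meta_cols = K / Gemm::kSparse / Gemm::kElementsPerElementE;
cutlass::TensorRef<ElementE, cutlass::layout::RowMajor> meta_src(host_meta, meta_stride);
cutlass::TensorRef<ElementE, typename Gemm::LayoutE> meta_dst(
    host_meta_reordered, Gemm::LayoutE::packed({M, meta_cols}));  // kernel's metadata layout
cutlass::reorder_meta(meta_dst, meta_src, cutlass::gemm::GemmCoord(M, N, meta_cols));
// Copy host_meta_reordered to device memory and pass that pointer as operand E.
```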
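The metadata extent in the reorder call follows from simple arithmetic: the compressed operand keeps K / kSparse columns, and each metadata element stores position indices for kElementsPerElementE of them. The sketch below computes the three tensor shapes. Its defaults (2-bit indices in 16-bit metadata elements, hence eight indices per element) are assumptions for fp16 operands; in real code, read `Gemm::kSparse` and `Gemm::kElementsPerElementE` from the instantiated CUTLASS kernel instead. `sparse_operand_shapes` is a hypothetical helper.

```python
from typing import Dict, Tuple


def sparse_operand_shapes(m: int, k: int, k_sparse: int = 2, e_bits: int = 16,
                          meta_bits: int = 2) -> Dict[str, Tuple[int, int]]:
    """Return (rows, cols) of the logical, compressed and metadata tensors.

    k_sparse: factor by which K is compressed (2 for 2:4).
    e_bits / meta_bits: bits per metadata element / bits per position index.
    """
    elems_per_e = e_bits // meta_bits  # plays the role of kElementsPerElementE
    if k % (k_sparse * elems_per_e):
        raise ValueError("K must be a multiple of k_sparse * elems_per_e")
    return {
        "logical": (m, k),
        "compressed": (m, k // k_sparse),
        "metadata": (m, k // k_sparse // elems_per_e),
    }


print(sparse_operand_shapes(128, 256))
# {'logical': (128, 256), 'compressed': (128, 128), 'metadata': (128, 16)}
```

Because reorder_meta only permutes elements, the reordered metadata buffer needs the same number of elements as its input; it is the layout, not the size, that changes.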

Kernel Arguments Setup

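```cpp
Gemm::Arguments arguments{
    cutlass::gemm::GemmCoord{M, N, K},  // logical (uncompressed) problem size
    {a_ptr, a_stride},                  // operand A
    {b_ptr, b_stride},                  // operand B
    {c_ptr, c_stride},                  // operand C (source for the beta term)
    {d_ptr, d_stride},                  // operand D (output)
    {meta_ptr, meta_stride},            // operand E: reordered metadata in device memory
    {alpha, beta},                      // epilogue scalars
    split_k_slices                      // number of partitions along K
};
```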
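The `split_k_slices` argument divides the K loop into independent partial products that are summed before the epilogue. A small plain-Python reference like the one below can be used to check profiler output on tiny shapes. It assumes both operands are dense (decompress the sparse one first) and the usual linear-combination epilogue D = alpha * A·B + beta * C; the contiguous, equally sized K chunks are an illustrative choice, and CUTLASS's own partitioning and reduction may differ. `gemm_split_k` is a hypothetical helper.

```python
from typing import List

Matrix = List[List[float]]


def gemm_split_k(a: Matrix, b: Matrix, c: Matrix, alpha: float, beta: float,
                 split_k_slices: int = 1) -> Matrix:
    """Reference D = alpha * (A @ B) + beta * C with K accumulated in slices."""
    if split_k_slices < 1:
        raise ValueError("split_k_slices must be >= 1")
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // split_k_slices)  # ceil(K / split_k_slices) columns per slice
    acc = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, chunk):  # each iteration is one independent K slice
        k1 = min(k0 + chunk, k)
        for i in range(m):
            for j in range(n):
                acc[i][j] += sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
    # Epilogue: applied once, after all slice partials have been summed.
    return [[alpha * acc[i][j] + beta * c[i][j] for j in range(n)] for i in range(m)]


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, -1]]
C = [[1, 1], [1, 1]]
print(gemm_split_k(A, B, C, alpha=2.0, beta=1.0, split_k_slices=2))
# [[25.0, 3.0], [57.0, 11.0]]
```

Results for different slice counts should agree up to floating-point reordering; with integer-valued inputs, as here, they agree exactly.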

Tensor Debugging

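```cpp
#include <iostream>
#include <vector>
#include "cutlass/half.h"

std::vector<cutlass::half_t> host_B(b_size);
cudaError_t err = cudaMemcpy(host_B.data(), b_ptr, b_size * sizeof(cutlass::half_t),
                             cudaMemcpyDeviceToHost);
if (err != cudaSuccess) std::cerr << "cudaMemcpy: " << cudaGetErrorString(err) << "\n";
// cutlass::half_t converts to float directly; no detour through __half is needed.
for (size_t i = 0; err == cudaSuccess && i < host_B.size(); ++i) {
  std::cout << static_cast<float>(host_B[i]) << " ";
}
```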
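Printing every element stops being practical for realistic shapes. Once the data is on the host, a targeted check is often more useful, for example locating groups that break the 2:4 pattern. The helper below is hypothetical and assumes the host copy holds the uncompressed sparse operand in row-major order with K along each row and leading dimension `ld`; for a column-major operand, swap the index arithmetic.

```python
from typing import List, Sequence, Tuple


def find_2_4_violations(buf: Sequence[float], rows: int, cols: int,
                        ld: int) -> List[Tuple[int, int]]:
    """Return (row, first_col) of every 4-wide group with more than two non-zeros.

    buf is a flat row-major host copy; ld >= cols is its leading dimension,
    so any padding between rows is skipped.
    """
    bad = []
    for r in range(rows):
        base = r * ld
        for c0 in range(0, cols, 4):
            group = buf[base + c0: base + min(c0 + 4, cols)]
            if sum(1 for v in group if v != 0) > 2:
                bad.append((r, c0))
    return bad


row0 = [1, 0, 2, 0, 1, 1, 1, 0]  # second group has three non-zeros
row1 = [0, 0, 0, 0, 3, 0, 0, 3]
print(find_2_4_violations(row0 + row1, rows=2, cols=8, ld=8))  # [(0, 4)]
```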

📈 Profiling Output

AITemplate generates profiler binaries and stores them in:

/tmp/aitemplate_cache/<hash>/

Kernel selection reuses cached profiling results unless profiling is forced, i.e., unless force_profile() returns True.
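The caching rule can be summarized with a toy model: a lookup keyed by the problem configuration that runs the profiler only on a cache miss or when profiling is forced. This is not AITemplate's implementation; `select_kernel`, the key tuple, and the kernel name are hypothetical, and the `force` flag stands in for the force_profile() check.

```python
from typing import Callable, Dict, Hashable


def select_kernel(key: Hashable, cache: Dict[Hashable, str],
                  run_profiler: Callable[[Hashable], str],
                  force: bool = False) -> str:
    """Return the best kernel for key, re-profiling only on a miss or when forced."""
    if not force and key in cache:
        return cache[key]
    best = run_profiler(key)  # stands in for building and running profiler binaries
    cache[key] = best
    return best


calls = []


def fake_profiler(key):
    calls.append(key)
    return "kernel_a"


cache = {}
key = ("sparse_gemm", 1024, 1024, 1024, 2)  # hypothetical (op, M, N, K, split_k)
select_kernel(key, cache, fake_profiler)              # miss: profiles
select_kernel(key, cache, fake_profiler)              # hit: served from the cache
select_kernel(key, cache, fake_profiler, force=True)  # forced: profiles again
print(len(calls))  # 2
```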

🛠 Common Issues

| Issue | Fix |
|---|---|
| `cudaFuncSetAttribute` failed | The kernel requests more shared memory than the device allows. Choose a configuration with smaller tiles or fewer pipeline stages; kernels using less than 48 KB need no opt-in. (`-maxrregcount` limits registers, not shared memory.) |
| Illegal memory access in profiling | Check that the metadata pointer and `meta_stride` are valid |
| `cutlass::half_t` to `half` conversion | Use `.raw()` or an explicit cast to `half` |
| Invalid tensor layout | Make sure B is column-major when using an RCR kernel |
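In AITemplate's GEMM naming, the three letters give the layouts of A, B and C, so RCR means row-major A, column-major B and row-major C. A K × N matrix B held in row-major order therefore has to be re-packed; its column-major storage is the same as a row-major N × K transpose. The sketch below shows the offset arithmetic; `offset` and `to_column_major` are hypothetical helpers.

```python
from typing import List, Sequence


def offset(row: int, col: int, ld: int, layout: str) -> int:
    """Flat offset of (row, col) in a row-major ('R') or column-major ('C') matrix."""
    if layout == "R":
        return row * ld + col
    if layout == "C":
        return col * ld + row
    raise ValueError(f"unknown layout {layout!r}")


def to_column_major(b_row_major: Sequence[float], k: int, n: int) -> List[float]:
    """Re-pack a K x N row-major matrix (ld = N) into column-major order (ld = K)."""
    out = [0.0] * (k * n)
    for r in range(k):
        for c in range(n):
            out[offset(r, c, k, "C")] = b_row_major[offset(r, c, n, "R")]
    return out


# B = [[1, 2, 3],
#      [4, 5, 6]]  (K = 2, N = 3)
print(to_column_major([1, 2, 3, 4, 5, 6], k=2, n=3))  # [1, 4, 2, 5, 3, 6]
```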

🙋 Author

This project is developed by Ivor, a Computer Engineering student at Purdue University, as part of research on structured sparsity, AI acceleration, and compiler-accelerated inference.

📄 License

This project is for research and educational purposes. CUTLASS and AITemplate are licensed under their respective open-source licenses.

Owner

Name: Ivor Chu
Login: Ivorchu
Kind: user
Company: IvorChu

Repositories: 15
Profile: https://github.com/Ivorchu

GitHub Events

Total

Push event: 6
Create event: 1

Last Year

Push event: 6
Create event: 1

Dependencies

.github/workflows/docs.yaml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/pages.yaml actions

actions/checkout v3 composite
actions/configure-pages v2 composite
actions/deploy-pages v1 composite
actions/setup-python v2 composite
actions/upload-pages-artifact v1 composite

.github/workflows/pylint.yaml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/rocm_ci.yml actions

actions/checkout v2 composite

docker/install/rocm_dev-requirements.txt pypi

ROCmSoftwarePlatform * development
danmar * development

fx2ait/setup.py pypi

torch *

python/setup.py pypi

jinja2 *
