cuBLASLt Grouped GEMM
By grouping independent GEMM operations, cuBLASLt packs multiple problems into a single kernel grid, maximizing GPU utilization and minimizing launch and communication overhead.
Key Features of cuBLASLt Grouped GEMM
Unlike the legacy cublasGemmStridedBatchedEx, which requires all matrices in a batch to have the same dimensions, cuBLASLt Grouped GEMM supports variable dimensions per group.
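To make "variable dimensions per group" concrete, here is a minimal CPU reference sketch in plain C. It is not the cuBLASLt API: the GemmProblem struct and grouped_gemm_reference function are hypothetical names introduced only for illustration, and row-major storage is assumed. It shows only the semantics a grouped GEMM computes, where each problem carries its own m, n, and k.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-problem descriptor (illustration only, not a cuBLASLt type). */
typedef struct {
    int m, n, k;      /* A is m x k, B is k x n, C is m x n */
    const float *a;   /* row-major, leading dimension k */
    const float *b;   /* row-major, leading dimension n */
    float *c;         /* row-major, leading dimension n */
} GemmProblem;

/* Reference semantics: C_g = alpha * A_g * B_g + beta * C_g for every problem g.
   Each problem may have different dimensions, which a strided batch cannot express. */
void grouped_gemm_reference(const GemmProblem *problems, size_t count,
                            float alpha, float beta)
{
    for (size_t g = 0; g < count; ++g) {
        const GemmProblem *p = &problems[g];
        assert(p->m > 0 && p->n > 0 && p->k > 0);
        for (int i = 0; i < p->m; ++i) {
            for (int j = 0; j < p->n; ++j) {
                float acc = 0.0f;
                for (int l = 0; l < p->k; ++l)
                    acc += p->a[i * p->k + l] * p->b[l * p->n + j];
                p->c[i * p->n + j] = alpha * acc + beta * p->c[i * p->n + j];
            }
        }
    }
}
```

A GPU implementation would process all problems within one launch rather than looping on the host; the loop here only documents what result each group should produce.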
cublasLtMatmulDesc_t matmulDesc;
/* Arguments: compute type, then scale type (FP32 accumulation for FP16 inputs). */
cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
By handling many differently sized matrices in a single call, you eliminate the need for hundreds of individual kernel launches, whose overhead is a major bottleneck for small-batch inference.
A single kernel launch for 1,024 GEMMs replaces 1,024 separate launches. On GPUs, kernel launch latency is on the order of microseconds, but over thousands of operations it adds up. Grouped GEMM cuts this overhead to roughly the cost of one launch.
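The launch-overhead arithmetic can be sketched as a one-line helper. The function name launch_overhead_us is hypothetical, and the per-launch latency is an input you would have to measure on your own system; the 5 microsecond figure used below is an illustrative assumption, not a benchmark result.

```c
/* Total host-side launch overhead in microseconds.
   per_launch_us is an assumed, system-dependent value supplied by the caller. */
double launch_overhead_us(int launches, double per_launch_us)
{
    return (double)launches * per_launch_us;
}
```

Under the assumed 5 microseconds per launch, launch_overhead_us(1024, 5.0) gives 5,120 microseconds of pure launch overhead for separate calls, against 5 microseconds for one grouped launch.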
Why Grouped GEMM?
Grouped GEMM removes the same-size restriction of strided batched GEMM, enabling developers to process "irregular" workloads, such as those found in Mixture-of-Experts (MoE) models or LoRA (Low-Rank Adaptation) fine-tuning, with significantly higher GPU efficiency.
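As an illustration of why MoE workloads are irregular, the sketch below counts how many tokens are routed to each expert. That count becomes the row dimension of that expert's GEMM, so the problems in one layer naturally differ in size. The function name and the top-1 routing assumption (one expert per token) are hypothetical choices for this example, not part of any library.

```c
#include <stddef.h>

/* Counts tokens per expert, assuming top-1 routing (one expert id per token).
   rows_per_expert[e] is the M dimension of expert e's GEMM.
   Returns 0 on success, -1 if an expert id is out of range. */
int count_rows_per_expert(const int *expert_of_token, size_t num_tokens,
                          int num_experts, int *rows_per_expert)
{
    for (int e = 0; e < num_experts; ++e)
        rows_per_expert[e] = 0;
    for (size_t t = 0; t < num_tokens; ++t) {
        int e = expert_of_token[t];
        if (e < 0 || e >= num_experts)
            return -1;
        rows_per_expert[e] += 1;
    }
    return 0;
}
```

With routing {2, 0, 2, 1, 2, 0} over three experts, the per-expert row counts are 2, 1, and 3: three GEMMs of different heights that a uniform strided batch could only handle by padding.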
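/* Strided-batched layout for A: every matrix in the batch shares the same M x K shape. */
cublasLtMatrixLayout_t Adesc;
cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, M, K, lda);
cublasLtMatrixLayoutSetAttribute(Adesc, CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT, &batchCount, sizeof(batchCount)); /* batchCount: int32_t */
cublasLtMatrixLayoutSetAttribute(Adesc, CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET, &strideA, sizeof(strideA)); /* strideA: int64_t */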