This document summarizes how ggml’s CUDA/HIP backend executes inference on different GPU families, which code paths are used, and at what numeric precision the major compute happens. It also provides rough workload composition percentages to relate paths to each architecture’s FLOPS/TOPs.
References are to files under ggml/src/ggml-cuda unless noted.
- Matmul (quantized): mmq.cu, mmq.cuh, vecdotq.cuh, quantize.cu/.cuh
- Matmul (float): mmf.cu, mmvf.cu, cuBLAS/hipBLAS calls in ggml-cuda.cu
- FlashAttention: fattn*.cu/.cuh
- Softmax: softmax.cu
- Norms: norm.cu
- RoPE: rope.cu
- Embedding/time-step embed: getrows.cu, tsembd.cu
- Feature detection + arch gates: common.cuh, mma.cuh, vendors/hip.h
The workload composition percentages below are model- and context-length dependent. As a rule of thumb for standard LLMs (e.g., LLaMA family) in quantized inference:
- GEMMs with quantized weights (Q, K, V, output projection, MLP up/gate/down): ~85–95% of total compute
  - Executed via INT8 kernels (MMQ) using DP4A / INT8 Tensor Cores / MFMA as available
  - INT32 partial sums, scaled and accumulated in FP32; outputs FP32
- Attention kernels (Q·K^T, softmax, P·V): ~3–12%
  - Q·K^T and P·V tiles often run with FP16 MMAs (if supported) or FP16/FP32 vector/tile kernels
  - Softmax is FP32
- “Glue” ops (RoPE, norms, masking, embeddings, residual adds, elementwise): ~1–5% combined, memory-bound
- On-the-fly activation quantization (q8_1 tiling for MMQ): typically a low single-digit %
For short contexts, attention share is near the low end; for long contexts (large K/V window), attention grows toward the high end.
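As a back-of-the-envelope illustration of that trend, the sketch below counts raw multiply-accumulate operations per decoded token for a hypothetical LLaMA-7B-like layer (d_model 4096, gated FFN 11008; these dimensions are assumptions, not taken from the code). Raw op counts are not runtime shares, but they show why the attention fraction grows linearly with context while the weight GEMMs stay constant.

```cuda
#include <cstdio>

int main() {
    // Assumed (hypothetical) per-layer dimensions for a 7B-class model.
    const double d = 4096, d_ff = 11008;

    // Weight GEMM MACs per layer per decoded token:
    // Q/K/V/O projections (4*d*d) + gated MLP up/gate/down (3*d*d_ff).
    const double gemm_macs = 4*d*d + 3*d*d_ff;   // ~2.0e8, independent of context

    const long ctxs[] = {512, 4096, 32768};
    for (long n_ctx : ctxs) {
        // Attention MACs per layer per token: Q·K^T and P·V each cost ~d MACs
        // per cached position, so together ~2*d*n_ctx.
        const double attn_macs = 2.0 * d * n_ctx;
        printf("n_ctx=%6ld  attention/weight-GEMM MAC ratio ~ %.2f\n",
               n_ctx, attn_macs / gemm_macs);
    }
    return 0;
}
```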
- Weight GEMM (activation × quantized weight)
  - Path: MMQ kernels (mmq.cu/.cuh)
  - Activations are re-tiled/quantized to q8_1 per tile (quantize.cu); weights remain in their quant format (Q4/Q5/Q8/k-quants/IQ*)
  - Compute: INT8×INT8 dot (DP4A / Tensor Core MMA int8 / AMD MFMA int8)
  - Accumulation: INT32 partials → scaled to FP32 outputs
 
- Float GEMM (if unquantized or fallback): cuBLAS/hipBLAS or custom MMF/MMVF
  - Precision: F16/BF16/F32 depending on tensor types and hardware
 
- FlashAttention (Q·K^T, softmax, P·V)
  - FP16 MMA when available on the platform (NVIDIA Volta and newer; AMD via rocWMMA when enabled)
  - Fallback vector/tile kernels in F16/F32
  - Softmax is always FP32 (softmax.cu)
 
- RoPE: FP32 math (rope.cu); a minimal rotation sketch follows this list
- RMSNorm/LayerNorm: FP32 (norm.cu)
- Embedding/getrows/timestep: FP32 (getrows.cu, tsembd.cu)
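For reference, a minimal sketch of the FP32 rotation RoPE applies to each even/odd pair of dimensions. ggml's rope.cu supports several modes (NeoX ordering, frequency scaling, multi-section RoPE) that are omitted here; the kernel shape and parameter names are illustrative only.

```cuda
#include <math.h>

// x: [n_tokens, n_dims] activations rotated in place, FP32 throughout.
// Launch (illustrative): rope_f32_sketch<<<n_tokens, n_dims/2>>>(x, pos, n_dims, 10000.0f);
__global__ void rope_f32_sketch(float * x, const int * positions,
                                int n_dims, float theta_base) {
    const int token = blockIdx.x;
    const int i     = 2 * threadIdx.x;            // dimension pair (i, i+1)
    if (i + 1 >= n_dims) return;

    // Angle = position * base^(-i/n_dims), computed in FP32.
    const float p     = (float) positions[token];
    const float theta = p * powf(theta_base, -(float) i / (float) n_dims);
    const float c = cosf(theta);
    const float s = sinf(theta);

    float * v = x + (size_t) token * n_dims + i;
    const float x0 = v[0];
    const float x1 = v[1];
    v[0] = x0 * c - x1 * s;                       // 2D rotation of the pair
    v[1] = x0 * s + x1 * c;
}
```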
Entry points select among:
- Quantized GEMM (MMQ): ggml_cuda_mul_mat_q / ggml_cuda_op_mul_mat_q (mmq.cu)
- Float GEMM (MMF/MMVF) or cuBLAS/hipBLAS: ggml_cuda_op_mul_mat_f/vec_f or ggml_cuda_mul_mat*_cublas (ggml-cuda.cu)
Heuristics for MMQ usage: bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) (mmq.cu); a simplified sketch of this decision follows the list below.
- Enabled for supported quant types (Q4_0/1, Q5_0/1, Q8_0, MXFP4, k-quants, IQ*)
- NVIDIA:
  - If INT8 Tensor Cores are present (Turing+), prefer MMQ
  - If only FP16 tensor cores are present and the batch is large, float paths may be chosen; if FP16 tensor cores are absent or the batch is small (ne11 < 64), choose DP4A MMQ
- AMD CDNA:
  - Prefer MMQ (uses MFMA int8). On CDNA3 (MI300), MMQ is forced due to rocBLAS/hipBLASLt issues noted in the code
  - For small/medium batches and certain quant types, MMQ is preferred
- AMD RDNA2/3/4:
  - Uses MMQ via DP4A-like sdot/sudot only for small batches (ne11 < 64); otherwise tends toward BLAS
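A simplified sketch of the decision logic described above, under the stated rules only; it is not the actual ggml_cuda_should_use_mmq() implementation, and the boolean capability flags are illustrative placeholders derived from the compute capability in the real code.

```cuda
#include <stdint.h>

static bool should_use_mmq_sketch(bool quant_type_supported,
                                  bool has_int8_tensor_cores,   // NVIDIA Turing+
                                  bool has_fp16_tensor_cores,   // NVIDIA Volta+
                                  bool is_cdna, bool is_rdna,
                                  int64_t ne11 /* batch/token count */) {
    if (!quant_type_supported) return false;      // dequantize + float GEMM instead
    if (is_cdna)               return true;       // MFMA int8 preferred; forced on CDNA3
    if (is_rdna)               return ne11 < 64;  // sdot/sudot for small batches, else BLAS
    if (has_int8_tensor_cores) return true;       // Turing+: int8 MMA MMQ
    // Pre-Turing NVIDIA: DP4A MMQ for small batches; at large batch, FP16
    // tensor-core GEMMs (when present) may win, so prefer the float path.
    return !(has_fp16_tensor_cores && ne11 >= 64);
}
```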
 
| Architecture | Weight GEMM (quant) | Attention | Float GEMM (unquant/fallback) | Notes | 
|---|---|---|---|---|
| Pascal (SM 61) | MMQ via DP4A (INT8) | FA fallback vector/tile F16/F32; Softmax F32 | cuBLAS F32/F16 (no TCs) | No Tensor Cores for INT8/F16. DP4A available. | 
| Volta (SM 70) | MMQ via DP4A (INT8) | FA FP16 MMA (Tensor Cores); Softmax F32 | cuBLAS F16/F32 (TCs for FP16) | INT8 TCs not available; FP16 MMA available. | 
| Turing (SM 75) | MMQ via INT8 Tensor Cores (mma.sync s8) or DP4A for small shapes | FA FP16 MMA; Softmax F32 | cuBLAS TC paths | INT8 TCs available; heuristics may switch by batch. | 
| Ampere (SM 80) | MMQ via INT8 Tensor Cores | FA FP16 MMA; Softmax F32 | cuBLAS TC paths (TF32, FP16) | cp.async etc. not user‑visible in MMQ selection. | 
| Ada (SM 89) | MMQ via INT8 Tensor Cores | FA FP16 MMA; Softmax F32 | cuBLAS TC paths | Similar to Ampere, higher clocks/throughput. | 
Precision summary (NVIDIA): weight GEMMs primarily INT8 compute, FP32 accum; attention FP16 MMA; softmax/norm/rope FP32.
| Architecture | Weight GEMM (quant) | Attention | Float GEMM (unquant/fallback) | Notes | 
|---|---|---|---|---|
| CDNA1/2 (MI100/MI210) | MMQ via MFMA INT8 | FA FP16 MMA via rocWMMA if enabled; otherwise vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | MFMA INT8 fast; FA WMMA requires GGML_HIP_ROCWMMA_FATTN. | 
| CDNA3 (MI300) | MMQ via MFMA INT8 (preferred for stability/perf) | FA FP16 MMA via rocWMMA if enabled; else vector | hipBLAS/rocBLASLt noted unstable in code; MMQ forced | Code comments prefer MMQ on CDNA3 regardless. | 
| RDNA2 (RX 6000) | MMQ via DP4A-like (__builtin_amdgcn_sdot4) for small batches; else BLAS | FA WMMA not generally enabled; vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | dp4a-equivalent for INT8; no MFMA. | 
| RDNA3/4 (RX 7000/RX 9000) | MMQ via DP4A-like (__builtin_amdgcn_sudot4) only when ne11 < 64; else BLAS | FA FP16 WMMA via rocWMMA if compiled with GGML_HIP_ROCWMMA_FATTN; else vector | hipBLAS/rocBLAS | WMMA available via rocWMMA on RDNA3/4 when enabled; INT8 MFMA not on RDNA. | 
| RDNA1 | MMQ functional but DP4A not native (emulated), performance limited | FA vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | Old arch; limited INT8 support. | 
HIP/ROCm INT8 support: Yes. Implemented via AMD MFMA int8 on CDNA and via sdot/sudot dot products on RDNA2/3/4, with fallbacks/emulation where needed (common.cuh: ggml_cuda_dp4a).
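A minimal sketch of what such a portable 8-bit dot-product shim can look like; the real ggml_cuda_dp4a in common.cuh differs in its architecture gates and fallback details, and the gfx macros below are illustrative.

```cuda
// acc += dot(a, b) over four packed signed 8-bit lanes.
static __device__ __forceinline__ int dp4a_sketch(int a, int b, int acc) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    return __dp4a(a, b, acc);                        // NVIDIA SM 6.1+: DP4A instruction
#elif defined(__gfx1030__) || defined(__gfx1100__)   // illustrative RDNA2/RDNA3 gates
    return __builtin_amdgcn_sdot4(a, b, acc, false); // AMD v_dot4_i32_i8
#else
    // Scalar fallback/emulation for architectures without a native 8-bit dot.
    const char4 va = *reinterpret_cast<const char4 *>(&a);
    const char4 vb = *reinterpret_cast<const char4 *>(&b);
    return acc + va.x*vb.x + va.y*vb.y + va.z*vb.z + va.w*vb.w;
#endif
}
```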
- MMQ (quant GEMM)
  - Tiles activations into block_q8_1_mmq blocks (quantized int8 data plus per-block scales/partial sums), loaded into shared memory (quantize.cu); see the quantization sketch after this list
  - Loads quantized weight blocks (Q4/Q5/Q8/k/IQ variants) and performs INT8×INT8→INT32 partial dot products (vecdotq.cuh) using:
    - NVIDIA: DP4A or mma.sync.*.s8.s8.s32
    - AMD: __builtin_amdgcn_mfma_i32_*i8 (CDNA) or __builtin_amdgcn_sdot4/sudot4 (RDNA)
  - Applies scales and merges to FP32 outputs
- FlashAttention (FA)
  - If FP16 MMA is available, uses tensor-core tiles for Q·K^T and P·V; otherwise vector/tile kernels in F16/F32 (fattn-*.cu)
  - Softmax is always FP32 (softmax.cu)
- Float GEMM (unquantized/fallback)
  - cuBLAS/hipBLAS calls (F16/BF16/F32) or custom MMF/MMVF kernels for small shapes

- Other ops
  - Norms, RoPE, embeddings, masking: FP32
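The quantization sketch referenced above: a simplified, host-side version of the MMQ idea for a single 32-value block, i.e., quantize activations to int8 with a per-block scale, take the INT8×INT8→INT32 dot, and merge back to FP32 with both scales. The block layout here is hypothetical; ggml's block_q8_1_mmq and vecdotq.cuh use different tiling/packing and carry additional per-block sums for some formats.

```cuda
#include <math.h>
#include <stdint.h>

struct block_q8_sketch {        // hypothetical layout, not ggml's block_q8_1_mmq
    float  d;                   // per-block scale
    int8_t qs[32];              // quantized values
};

// Quantize 32 floats to int8 with scale d = amax/127 (per-tile, as quantize.cu does).
static void quantize_block_q8_sketch(const float * x, block_q8_sketch * dst) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(x[i]));
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    dst->d = d;
    for (int i = 0; i < 32; ++i) dst->qs[i] = (int8_t) roundf(x[i] * id);
}

// INT8 x INT8 -> INT32 dot over the block, then merged to FP32 with both scales.
static float vec_dot_q8_sketch(const block_q8_sketch * w, const block_q8_sketch * a) {
    int32_t sumi = 0;
    for (int i = 0; i < 32; ++i) sumi += (int32_t) w->qs[i] * (int32_t) a->qs[i];
    return w->d * a->d * (float) sumi;   // FP32 partial output
}
```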
 
- Top-level mul_mat dispatch: ggml-cuda.cu
  - Chooses between quantized (MMQ), float kernels (MMF/MMVF), and BLAS based on tensor types, transposition, sizes, and per-arch feature checks
  - For quantized weights and FP32 activations, MMQ is preferred when ggml_cuda_should_use_mmq(...) is true

- MMQ device heuristics: ggml_cuda_should_use_mmq (mmq.cu)
  - NVIDIA: prefer MMQ on Turing+; on parts without INT8 tensor cores (Pascal/Volta) use DP4A when the batch is small, else it may switch to FP16 TC GEMMs when faster
  - AMD CDNA: prefer MMQ (MFMA); CDNA3 forces MMQ
  - AMD RDNA2/3/4: MMQ used for small batches; BLAS otherwise

- Attention:
  - FP16 MMA path enabled when FP16_MMA_AVAILABLE (NVIDIA Volta+ or HIP with rocWMMA on supported arch/flags)
  - Otherwise F16/F32 vector/tile fallbacks are used
These ranges assume common transformer configs with quantized weights and FP16/FP32 activations, single‑token decode. Percentages shift with sequence length and model size.
- NVIDIA (Pascal → Ada)
  - INT8 MMQ (weight GEMMs): 85–95%
  - FlashAttention (Q·K^T, P·V): 3–10% (FP16 MMA on Volta+; vector on Pascal)
  - Softmax/Mask/Norm/RoPE/Embedding: 1–5% total (FP32)
  - Quantize/dequantize overhead: ~1–3%

- AMD CDNA (MI100/MI210/MI300)
  - INT8 MMQ via MFMA: 85–95%
  - FlashAttention: 3–10% (FP16 WMMA if rocWMMA; else vector)
  - Other FP32 ops: 1–5%
  - Quantize overhead: ~1–3%

- AMD RDNA2/3/4
  - For small batches (ne11 < 64), INT8 MMQ (sdot/sudot) dominates similarly (80–90%)
  - For larger batches, BLAS float GEMMs dominate; FA vector or WMMA (RDNA3/4 with rocWMMA) adds 5–12%
  - Other FP32 ops: 1–5%
Caveats: exact shares depend on batch size, head dims, FFN width, sequence length, cache layout, and kernel tiling.
- INT8 MFMA in MMQ requires CDNA and !GGML_HIP_NO_MMQ_MFMA (default enabled)
- The FA tensor-core path on HIP requires GGML_HIP_ROCWMMA_FATTN (and rocwmma present); enabled for CDNA and RDNA3/4 (optionally gated for RDNA4 via GGML_HIP_ROCWMMA_FATTN_GFX12)
- For RDNA2/3/4, the INT8 dot uses __builtin_amdgcn_sdot4/sudot4; RDNA1 uses a slower fallback/emulation
- For quantized models on NVIDIA Turing/Ampere/Ada and AMD CDNA, virtually all heavy GEMMs run as INT8 (tensor core/MFMA), so achievable TOPs are the key limiter; FA runs in FP16 on tensor cores where possible
- On NVIDIA Pascal or AMD RDNA without MFMA, MMQ uses DP4A‑class dot where available or emulation; throughput is lower, and BLAS float GEMMs may be chosen for larger batches on RDNA
- Softmax, norms, rotaries, and embeddings remain FP32 across architectures and contribute modestly to total compute (a minimal FP32 softmax sketch follows this list)
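For illustration, a minimal numerically stable FP32 row softmax of the kind described above, one warp per row; ggml's softmax.cu additionally handles masks, scaling, multi-row blocks, and larger block-level reductions.

```cuda
#include <math.h>

// Launch (illustrative): one 32-thread block (one warp) per row.
__global__ void softmax_f32_sketch(const float * x, float * y, int ncols) {
    const float * row_in  = x + (size_t) blockIdx.x * ncols;
    float       * row_out = y + (size_t) blockIdx.x * ncols;

    // 1) Row maximum, for numerical stability.
    float maxval = -INFINITY;
    for (int i = threadIdx.x; i < ncols; i += warpSize)
        maxval = fmaxf(maxval, row_in[i]);
    for (int off = warpSize/2; off > 0; off >>= 1)
        maxval = fmaxf(maxval, __shfl_xor_sync(0xffffffff, maxval, off));

    // 2) Exponentials and their sum, all in FP32.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols; i += warpSize) {
        const float e = expf(row_in[i] - maxval);
        row_out[i] = e;
        sum += e;
    }
    for (int off = warpSize/2; off > 0; off >>= 1)
        sum += __shfl_xor_sync(0xffffffff, sum, off);

    // 3) Normalize.
    for (int i = threadIdx.x; i < ncols; i += warpSize)
        row_out[i] /= sum;
}
```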
Observation: On RDNA3, short context decode (e.g., pp512) may be ~2× slower than a comparable NVIDIA GPU, but at long context (e.g., pp512+d32768) the gap can widen to ~5×. The primary reason is that the attention portion grows with context length and RDNA3’s default attention path is significantly slower unless rocWMMA is enabled and used by FlashAttention (FA).
Why it grows with context
- For single-token decode, the FFNs (weight GEMMs) are roughly constant work per token, while attention work increases with the amount of K/V you attend to (more memory traffic and dot products along K; FA reduces traffic but still scales with sequence length); a rough K/V-traffic estimate is sketched after this list.
- On NVIDIA Turing/Ampere/Ada, FA uses FP16 tensor cores, keeping the attention portion fast as it grows. On RDNA3, if FA falls back to vector/tile kernels (no WMMA), attention throughput is much lower, so the attention share dominates total time at long context, widening the gap.
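The traffic estimate referenced above: a rough calculation of K/V-cache bytes read per decoded token, assuming a hypothetical 32-layer model with 32 KV heads of dimension 128 and an FP16 cache (all assumed numbers). Real kernels overlap reads with compute and FA avoids redundant passes, but the linear growth with context is what makes a slow attention path increasingly dominant.

```cuda
#include <cstdio>

int main() {
    // Assumed (hypothetical) model: 32 layers, 32 KV heads, head dim 128, FP16 cache.
    const long n_layers = 32, n_kv_heads = 32, head_dim = 128, bytes_per_elt = 2;
    const long bytes_per_pos = 2 /*K and V*/ * n_layers * n_kv_heads * head_dim * bytes_per_elt;

    const long ctxs[] = {512, 4096, 32768};
    for (long n_ctx : ctxs) {
        // Weight reads per decoded token are constant; K/V reads grow with context.
        const double gib = (double) bytes_per_pos * n_ctx / (1024.0 * 1024.0 * 1024.0);
        printf("n_ctx=%6ld  K/V cache read per decoded token ~ %.2f GiB\n", n_ctx, gib);
    }
    return 0;
}
```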
RDNA3 specific factors
- INT8 for weight GEMMs is fine: MMQ uses RDNA sdot/sudot (DP4A‑class) for INT8 and stays mostly constant vs context; the widening gap implicates attention, not MMQ.
- FA path: WMMA on RDNA3 is not used unless GGML_HIP_ROCWMMA_FATTN is enabled at build time. Without it, RDNA3 uses the non-WMMA FA kernels (F16/F32 vector), which are much slower at long K.
- BLAS and shape heuristics: For larger batches/tiles, RDNA backends may fall back to BLAS or less optimal kernels; NVIDIA has mature TC kernels for FA and GEMM across many shapes.
- Memory/cache sensitivity: Long‑context FA touches more K/V cache. Differences in cache hierarchy and compiler scheduling can hurt the fallback FA more on RDNA3 than the TC/WMMA path on NVIDIA.
What to do: optimizations that materially help RDNA3
- Enable rocWMMA FlashAttention on RDNA3
  - Build with -DGGML_HIP_ROCWMMA_FATTN=ON and ensure the rocwmma headers are available (CMake will check for rocwmma/rocwmma.hpp).
  - This turns on FP16_MMA_AVAILABLE for HIP on RDNA3 in common.cuh, selecting the FA WMMA path in fattn.cu.
  - Expect 2–3× speedups for attention-heavy segments; this directly addresses the long-context gap.
- Verify FA is actually taken and dimension-matched
  - FA MMA cases are specialized for common head dims: 64, 80, 96, 112, 128, 256, 576. Ensure your model's head dimension matches one of these so the MMA kernel is selected (a small check is sketched after this list).
  - Check that GQA settings trigger the optimized branches (fattn.cu uses gqa_ratio divisibility checks).
- Keep MMQ enabled for quantized weights (no change with context, but necessary for overall performance)
  - RDNA3 MMQ relies on sdot/sudot; it remains the dominant compute at short context. Confirm ggml_cuda_should_use_mmq(...) returns true for your quant types and batch sizes.
- Toolchain and flags
  - Use ROCm ≥ 6.1 (as required by ggml-hip/CMakeLists.txt).
  - If building for RDNA4, add -DGGML_HIP_ROCWMMA_FATTN_GFX12=ON so WMMA FA is allowed on GFX12.
  - Ensure CMAKE_PREFIX_PATH includes the ROCm CMake dirs and the rocWMMA installation path.
- Secondary tunings (smaller gains)
  - Ensure the softmax kernel launches with enough rows per block for your ncols; it is FP32 and can be memory-bound at large n_heads/ncols, but it is rarely the top bottleneck.
  - Keep the K/V cache layout contiguous and avoid host<->device syncs in the decode loop.
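As referenced above, a small host-side pre-flight check one could run against a model's hyperparameters; the supported head-dim list is taken from the text above, while the function and parameter names are illustrative and not part of ggml's API.

```cuda
#include <cstdio>

// Confirms the head dim is one of the FA MMA-specialized sizes and that the
// Q/KV head ratio is an integer, so the optimized FA branches can be taken.
static bool fa_mma_friendly(int head_dim, int n_head, int n_head_kv) {
    const int supported[] = {64, 80, 96, 112, 128, 256, 576};
    bool dim_ok = false;
    for (int d : supported) dim_ok = dim_ok || (head_dim == d);
    const bool gqa_ok = n_head_kv > 0 && n_head % n_head_kv == 0;  // integer gqa_ratio
    return dim_ok && gqa_ok;
}

int main() {
    // Example: a hypothetical model with 32 query heads, 8 KV heads, head dim 128.
    printf("FA MMA friendly: %s\n", fa_mma_friendly(128, 32, 8) ? "yes" : "no");
    return 0;
}
```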
 
Why NVIDIA scales better at long context
- NVIDIA’s FA uses FP16 tensor cores broadly, with kernels tuned across many shapes and large on‑chip bandwidth (ldmatrix→mma pipelines). As attention dominates with context, these kernels retain a high fraction of theoretical throughput.
- RDNA3 without WMMA falls back to scalar/vector code, which amplifies the gap as attention workload increases. Enabling rocWMMA narrows this gap substantially.
Summary
- The widening 2×→5× gap at long contexts is primarily a software path issue on RDNA3: FA not using WMMA by default. Enabling GGML_HIP_ROCWMMA_FATTN and ensuring supported head dims typically yields 2–3× FA speedups and materially reduces the long-context penalty.