@lhl
Created October 24, 2025 06:56
Codex 5 High analysis of GGML CUDA paths

GGML CUDA/HIP Inference Paths and Precision by Architecture

This document summarizes how ggml’s CUDA/HIP backend executes inference on different GPU families, which code paths are used, and at what numeric precision the major compute happens. It also provides rough workload composition percentages to relate paths to each architecture’s FLOPS/TOPs.

References are to files under ggml/src/ggml-cuda unless noted.

  • Matmul (quantized): mmq.cu, mmq.cuh, vecdotq.cuh, quantize.cu/.cuh
  • Matmul (float): mmf.cu, mmvf.cu, cuBLAS/hipBLAS calls in ggml-cuda.cu
  • FlashAttention: fattn*.cu/.cuh
  • Softmax: softmax.cu
  • Norms: norm.cu
  • RoPE: rope.cu
  • Embedding/time-step embed: getrows.cu, tsembd.cu
  • Feature detection + arch gates: common.cuh, mma.cuh, vendors/hip.h

Workload Composition (typical decode, single token)

These are model- and context-length dependent. As a rule of thumb for standard LLMs (e.g., LLaMA family) in quantized inference:

  • GEMMs with quantized weights (Q, K, V, output projection, MLP up/gate/down): ~85–95% of total compute
    • Executed via INT8 kernels (MMQ) using DP4A/TensorCore/MFMA as available
    • Accumulation in FP32; outputs FP32
  • Attention kernels (Q·K^T, softmax, P·V): ~3–12%
    • Q·K^T and P·V tiles often run with FP16 MMAs (if supported) or FP16/FP32 vector/tile kernels
    • Softmax is FP32
  • “Glue” ops (RoPE, norms, masking, embeddings, residual adds, elementwise): ~1–5% combined, memory‑bound
  • On‑the‑fly activation quantization (q8_1 tiling for MMQ): typically a low single‑digit percentage

For short contexts, attention share is near the low end; for long contexts (large K/V window), attention grows toward the high end.
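
As a rough sanity check on these shares, the split can be estimated directly from model dimensions. The sketch below uses assumed LLaMA‑7B‑like dimensions (hidden size, FFN width, and layer count are illustrative values, not measurements) and compares the roughly constant per‑token weight‑GEMM FLOPs against attention FLOPs that scale with the attended K/V length.

```cpp
// Back-of-envelope per-token FLOP split for single-token decode.
// All model dimensions are assumed LLaMA-7B-like values, used only to
// illustrate why the attention share grows with context length.
#include <cstdio>
#include <initializer_list>

int main() {
    const double d_model = 4096;    // hidden size (assumed)
    const double d_ff    = 11008;   // FFN intermediate size (assumed)
    const double n_layer = 32;      // transformer blocks (assumed)

    // Weight GEMMs per layer: QKV + output projection (~4*d^2) plus a gated
    // FFN up/gate/down (~3*d*d_ff); 2 FLOPs per multiply-add.
    const double gemm_flops = n_layer * 2.0 * (4.0*d_model*d_model + 3.0*d_model*d_ff);

    for (double n_ctx : {512.0, 32768.0}) {
        // Attention per layer: Q*K^T and P*V each cost ~d_model*n_ctx multiply-adds.
        const double attn_flops = n_layer * 2.0 * (2.0*d_model*n_ctx);
        const double total      = gemm_flops + attn_flops;
        std::printf("n_ctx=%6.0f  weight GEMMs %4.1f%%  attention %4.1f%%\n",
                    n_ctx, 100.0*gemm_flops/total, 100.0*attn_flops/total);
    }
    return 0;
}
```

FLOP counts understate attention at long context (the K/V traffic is memory‑bound), but the trend matches the ranges above: a few percent at short context, approaching parity with the GEMMs at very long context.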

Operation Precision Map (by op type)

  • Weight GEMM (activation × quantized weight)
    • Path: MMQ kernels (mmq.cu/.cuh)
    • Activations are re‑tiled/quantized to q8_1 per tile (quantize.cu), weights remain in their quant format (Q4/Q5/Q8/k‑quants/IQ*)
    • Compute: INT8×INT8 dot (DP4A / TensorCore MMA int8 / AMD MFMA int8)
    • Accumulation: INT32 partial → scaled to FP32 outputs
  • Float GEMM (if unquantized or fallback): cuBLAS/hipBLAS or custom MMF/MMVF
    • Precision: F16/BF16/F32 depending on tensor types and hardware
  • FlashAttention (Q·K^T, softmax, P·V)
    • FP16 MMA when available on the platform (NVIDIA Volta and newer; AMD via rocWMMA when enabled)
    • Fallback vector/tile kernels in F16/F32
    • Softmax is always FP32 (softmax.cu)
  • RoPE: FP32 math (rope.cu)
  • RMSNorm/LayerNorm: FP32 (norm.cu)
  • Embedding/getrows/timestep: FP32 (getrows.cu, tsembd.cu)
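
To make the "INT8 compute, FP32 accumulate" pattern concrete, here is a minimal CUDA sketch of one Q8×Q8 block dot product with per‑block scales, in the spirit of the routines in vecdotq.cuh; the block layout and names are illustrative, not ggml's actual structs.

```cpp
// Minimal sketch of an INT8 x INT8 block dot product with per-block scales:
// INT8 multiplies, INT32 partial sums, FP32 result. The struct layout and
// names are illustrative only, not ggml's actual block types.
#include <cstdint>

struct block_q8_sketch {
    float  d;       // per-block scale
    int8_t qs[32];  // 32 quantized values
};

__device__ float vec_dot_q8_q8_sketch(const block_q8_sketch *w, const block_q8_sketch *x) {
    const int *wq = (const int *) w->qs;   // 8 packed groups of 4 x int8
    const int *xq = (const int *) x->qs;
    int sumi = 0;                          // INT32 accumulator
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        sumi = __dp4a(wq[i], xq[i], sumi); // 4-way INT8 dot product (SM 6.1+ DP4A)
    }
    return w->d * x->d * (float) sumi;     // apply the two scales, emit FP32
}
```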

How the Matmul Path Is Chosen

Entry points select among:

  • Quantized GEMM (MMQ): ggml_cuda_mul_mat_q / ggml_cuda_op_mul_mat_q (mmq.cu)
  • Float GEMM (MMF/MMVF) or cuBLAS/hipBLAS: ggml_cuda_op_mul_mat_f/vec_f or ggml_cuda_mul_mat*_cublas (ggml-cuda.cu)

Heuristics for MMQ usage: bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) (mmq.cu)

  • Enabled for supported quant types (Q4_0/1, Q5_0/1, Q8_0, MXFP4, k‑quants, IQ*)
  • NVIDIA:
    • With INT8 Tensor Cores (Turing+), MMQ is preferred; if FP16 tensor cores are present and the batch is large, the float paths may be chosen instead; without FP16 tensor cores, or for small batches (ne11 < 64), DP4A MMQ is used
  • AMD CDNA:
    • Prefer MMQ (uses MFMA int8). On CDNA3 (MI300), MMQ is forced due to rocBLAS/hipBLASLt issues noted in code
    • For small/medium batches and certain quant types, MMQ is preferred
  • AMD RDNA2/3/4:
    • Uses MMQ via DP4A‑like sdot/sudot only for small batches (ne11 < 64); otherwise tends toward BLAS
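
Condensed into a sketch, the decision logic described above looks roughly like this; the boolean arguments stand in for the arch/feature checks in common.cuh, and this is a paraphrase of the documented behavior, not the actual ggml_cuda_should_use_mmq source.

```cpp
// Condensed paraphrase of the MMQ-vs-float heuristics described above.
// The predicates are stand-ins for the arch checks in common.cuh; this is
// not the actual ggml_cuda_should_use_mmq implementation.
#include <cstdint>

static bool should_use_mmq_sketch(bool quant_type_has_mmq_kernel,
                                  bool is_nvidia, bool has_fp16_tensor_cores,
                                  bool is_amd_cdna, bool is_amd_rdna2_plus,
                                  int64_t ne11 /* activation columns (batch size) */) {
    if (!quant_type_has_mmq_kernel) {
        return false;                              // unsupported quant type -> float/BLAS path
    }
    if (is_nvidia) {
        // MMQ via INT8 tensor cores (Turing+) or DP4A, unless FP16 tensor-core
        // GEMMs are expected to win at large batch.
        return !(has_fp16_tensor_cores && ne11 >= 64);
    }
    if (is_amd_cdna) {
        return true;                               // MFMA INT8 MMQ preferred; forced on CDNA3
    }
    if (is_amd_rdna2_plus) {
        return ne11 < 64;                          // sdot/sudot MMQ only for small batches
    }
    return ne11 < 64;                              // older/unknown archs: small batches only
}
```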

Architecture Tables (execution paths and precisions)

NVIDIA GPUs

| Architecture | Weight GEMM (quant) | Attention | Float GEMM (unquant/fallback) | Notes |
| --- | --- | --- | --- | --- |
| Pascal (SM 61) | MMQ via DP4A (INT8) | FA fallback vector/tile F16/F32; Softmax F32 | cuBLAS F32/F16 (no TCs) | No Tensor Cores for INT8/F16. DP4A available. |
| Volta (SM 70) | MMQ via DP4A (INT8) | FA FP16 MMA (Tensor Cores); Softmax F32 | cuBLAS F16/F32 (TCs for FP16) | INT8 TCs not available; FP16 MMA available. |
| Turing (SM 75) | MMQ via INT8 Tensor Cores (mma.sync s8) or DP4A for small shapes | FA FP16 MMA; Softmax F32 | cuBLAS TC paths | INT8 TCs available; heuristics may switch by batch. |
| Ampere (SM 80) | MMQ via INT8 Tensor Cores | FA FP16 MMA; Softmax F32 | cuBLAS TC paths (TF32, FP16) | cp.async etc. not user‑visible in MMQ selection. |
| Ada (SM 89) | MMQ via INT8 Tensor Cores | FA FP16 MMA; Softmax F32 | cuBLAS TC paths | Similar to Ampere, higher clocks/throughput. |

Precision summary (NVIDIA): weights GEMMs primarily INT8 compute, FP32 accum; attention FP16 MMA; softmax/norm/rope FP32.

AMD GPUs (HIP/ROCm)

| Architecture | Weight GEMM (quant) | Attention | Float GEMM (unquant/fallback) | Notes |
| --- | --- | --- | --- | --- |
| CDNA1/2 (MI100/MI210) | MMQ via MFMA INT8 | FA FP16 MMA via rocWMMA if enabled; otherwise vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | MFMA INT8 fast; FA WMMA requires GGML_HIP_ROCWMMA_FATTN. |
| CDNA3 (MI300) | MMQ via MFMA INT8 (preferred for stability/perf) | FA FP16 MMA via rocWMMA if enabled; else vector | hipBLAS/rocBLASLt noted unstable in code; MMQ forced | Code comments prefer MMQ on CDNA3 regardless. |
| RDNA2 (RX 6000) | MMQ via DP4A‑like (__builtin_amdgcn_sdot4) for small batches; else BLAS | FA WMMA not generally enabled; vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | dp4a‑equivalent for INT8; no MFMA. |
| RDNA3/4 (RX 7000/RX 9000) | MMQ via DP4A‑like (__builtin_amdgcn_sudot4) only when ne11 < 64; else BLAS | FA FP16 WMMA via rocWMMA if compiled with GGML_HIP_ROCWMMA_FATTN; else vector | hipBLAS/rocBLAS | WMMA available via rocWMMA on RDNA3/4 when enabled; INT8 MFMA not on RDNA. |
| RDNA1 | MMQ functional, but DP4A not native (emulated); performance limited | FA vector F16/F32; Softmax F32 | hipBLAS/rocBLAS | Old arch; limited INT8 support. |

HIP/ROCm INT8 support: Yes. Implemented via AMD MFMA int8 on CDNA and via sdot/sudot dot products on RDNA2/3/4, with fallbacks/emulation where needed (common.cuh: ggml_cuda_dp4a).
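
The portable primitive referenced here (ggml_cuda_dp4a in common.cuh) can be pictured roughly as follows. This is a simplified illustration rather than the actual ggml code; the RDNA3/4 sudot4 variant, which takes additional sign/clamp arguments, is omitted for brevity.

```cpp
// Illustrative sketch of a portable 4-way INT8 dot-product-accumulate, in the
// spirit of ggml_cuda_dp4a in common.cuh (simplified; not the actual ggml code).
#include <cstdint>

static __device__ __forceinline__ int dp4a_sketch(const int a, const int b, int acc) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
    return __dp4a(a, b, acc);                         // NVIDIA Pascal+ hardware DP4A
#elif defined(__HIP_DEVICE_COMPILE__) && defined(__gfx1030__)
    return __builtin_amdgcn_sdot4(a, b, acc, false);  // RDNA2 packed signed INT8 dot
#else
    // Scalar fallback/emulation: unpack the four int8 lanes and accumulate.
    const int8_t *pa = (const int8_t *) &a;
    const int8_t *pb = (const int8_t *) &b;
    return acc + pa[0]*pb[0] + pa[1]*pb[1] + pa[2]*pb[2] + pa[3]*pb[3];
#endif
}
```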

What each path computes

  • MMQ (quant GEMM)

    • Tiles activations to block_q8_1_mmq (contains quantized int8 data and per‑block scales/partials), loaded into shared memory (quantize.cu)
    • Loads quantized weight blocks (Q4/Q5/Q8/k/IQ variants) and performs INT8×INT8→INT32 partial dot products (vecdotq.cuh) using:
      • NVIDIA: DP4A or mma.sync.*.s8.s8.s32
      • AMD: __builtin_amdgcn_mfma_i32_*i8 (CDNA) or __builtin_amdgcn_sdot4/sudot4 (RDNA)
    • Applies scales and merges to FP32 outputs
  • FlashAttention (FA)

    • If FP16 MMA available, uses tensor core tiles for Q·K^T and P·V; otherwise vector/tile kernels in F16/F32 (fattn-*.cu)
    • Softmax is FP32 always (softmax.cu)
  • Float GEMM (unquantized/fallback)

    • cuBLAS/hipBLAS calls (F16/BF16/F32) or custom MMF/MMVF for small shapes
  • Other ops

    • Norms, RoPE, embeddings, masking: FP32
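
The activation re‑tiling step above (quantize.cu producing block_q8_1_mmq tiles) amounts to symmetric per‑block INT8 quantization of the FP32 activations plus a per‑block scale and sum; the sketch below uses an illustrative layout, not ggml's actual block_q8_1_mmq struct.

```cpp
// Minimal sketch of per-block symmetric INT8 quantization of FP32 activations,
// in the spirit of the q8_1 tiling in quantize.cu. Layout and names are
// illustrative only, not ggml's actual block_q8_1_mmq.
#include <cstdint>

struct block_q8_1_sketch {
    float  d;       // scale: amax / 127
    float  s;       // d * sum(qs), used by the *_1 dot products
    int8_t qs[32];  // quantized activations
};

__device__ void quantize_block_q8_1_sketch(const float *x, block_q8_1_sketch *dst) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) {
        amax = fmaxf(amax, fabsf(x[i]));           // absolute max of the block
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        const int q = (int) roundf(x[i]*id);       // symmetric quantization into [-127, 127]
        dst->qs[i] = (int8_t) q;
        sum += q;
    }
    dst->d = d;
    dst->s = d * (float) sum;
}
```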

Path selection details (high level)

  • Top‑level mul_mat: ggml-cuda.cu
    • Chooses between quantized (MMQ), float kernels (MMF/MMVF), and BLAS based on tensor types, transposition, sizes, and per‑arch feature checks
    • For quantized weights and FP32 activations, MMQ is preferred when ggml_cuda_should_use_mmq(...) is true
  • MMQ device heuristics: mmq.cu: ggml_cuda_should_use_mmq
    • NVIDIA: prefer MMQ on Turing+; on parts without INT8 tensor cores (Pascal/Volta), use DP4A for small batches, otherwise FP16 tensor‑core GEMMs may be chosen when faster
    • AMD CDNA: prefer MMQ (MFMA); CDNA3 forces MMQ
    • AMD RDNA2/3/4: MMQ used for small batches; BLAS otherwise
  • Attention:
    • FP16 MMA path enabled when FP16_MMA_AVAILABLE (NVIDIA Volta+ or HIP with rocWMMA on supported arch/flags)
    • Otherwise use F16/F32 vector/tile fallbacks
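
Schematically, the attention dispatch described above (together with the head‑dimension specialization noted later in this document) reduces to something like the following; the function and kernel names are placeholders, not ggml's actual symbols.

```cpp
// Schematic FlashAttention path selection, paraphrasing the gating described
// above. Names are placeholders, not ggml's actual symbols.
static bool head_dim_has_mma_case(int head_dim) {
    // MMA kernels are specialized for common head dims (see the list later in
    // this document).
    switch (head_dim) {
        case 64: case 80: case 96: case 112: case 128: case 256: case 576: return true;
        default:                                                           return false;
    }
}

enum class fattn_path_sketch { MMA_FP16, VEC_F16, VEC_F32 };

static fattn_path_sketch select_fattn_path(bool fp16_mma_available,   // Volta+ or HIP + rocWMMA
                                           bool prefer_f16_fallback,  // arch/precision dependent
                                           int  head_dim) {
    if (fp16_mma_available && head_dim_has_mma_case(head_dim)) {
        return fattn_path_sketch::MMA_FP16;  // tensor-core/WMMA tiles for Q*K^T and P*V
    }
    return prefer_f16_fallback ? fattn_path_sketch::VEC_F16 : fattn_path_sketch::VEC_F32;
}
```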

Approximate compute % by architecture (decode step)

These ranges assume common transformer configs with quantized weights and FP16/FP32 activations, single‑token decode. Percentages shift with sequence length and model size.

  • NVIDIA (Pascal → Ada)

    • INT8 MMQ (weight GEMMs): 85–95%
    • FlashAttention (Q·K^T, P·V): 3–10% (FP16 MMA on Volta+; vector on Pascal)
    • Softmax/Mask/Norm/RoPE/Embedding: 1–5% total (FP32)
    • Quantize/dequantize overhead: ~1–3%
  • AMD CDNA (MI100/MI210/MI300)

    • INT8 MMQ via MFMA: 85–95%
    • FlashAttention: 3–10% (FP16 WMMA if rocWMMA; else vector)
    • Other FP32 ops: 1–5%
    • Quantize overhead: ~1–3%
  • AMD RDNA2/3/4

    • For small batches (ne11 < 64), INT8 MMQ (sdot/sudot) dominates similarly (80–90%)
    • For larger batches, BLAS float GEMMs dominate; FA vector or WMMA (RDNA3/4 with rocWMMA) adds 5–12%
    • Other FP32 ops: 1–5%

Caveats: exact shares depend on batch size, head dims, FFN width, sequence length, cache layout, and kernel tiling.

Build/flag notes (HIP)

  • INT8 MFMA in MMQ requires CDNA and that GGML_HIP_NO_MMQ_MFMA is not set (i.e., MFMA use is enabled by default)
  • FA tensor core path on HIP requires GGML_HIP_ROCWMMA_FATTN (and rocwmma present); enabled for CDNA and RDNA3/4 (optionally gated for RDNA4 via GGML_HIP_ROCWMMA_FATTN_GFX12)
  • For RDNA2/3/4, INT8 dot uses __builtin_amdgcn_sdot4/sudot4; RDNA1 uses a slower fallback/emulation
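
How these build options typically combine at compile time can be sketched as below. The option names match those above, but the architecture‑family macros (CDNA, RDNA3, RDNA4) and the guard structure are illustrative assumptions, not a copy of the actual gating in common.cuh.

```cpp
// Schematic compile-time gating for the HIP build options listed above. The
// guard structure and the architecture-family macros are illustrative
// assumptions, not the actual logic in common.cuh.

// INT8 MFMA path inside MMQ: CDNA only, and only if not explicitly disabled.
#if defined(GGML_USE_HIP) && defined(CDNA) && !defined(GGML_HIP_NO_MMQ_MFMA)
#define MMQ_MFMA_SKETCH_AVAILABLE
#endif

// FlashAttention via rocWMMA: opt-in; CDNA and RDNA3 when enabled, RDNA4 (GFX12)
// only with the additional opt-in flag.
#if defined(GGML_HIP_ROCWMMA_FATTN) && \
    (defined(CDNA) || defined(RDNA3) || \
     (defined(RDNA4) && defined(GGML_HIP_ROCWMMA_FATTN_GFX12)))
#define FATTN_WMMA_SKETCH_AVAILABLE
#endif
```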

Practical takeaways

  • For quantized models on NVIDIA Turing/Ampere/Ada and AMD CDNA, virtually all heavy GEMMs run as INT8 (tensor core/MFMA), so achievable TOPs are the key limiter; FA runs in FP16 on tensor cores where possible
  • On NVIDIA Pascal or AMD RDNA without MFMA, MMQ uses DP4A‑class dot where available or emulation; throughput is lower, and BLAS float GEMMs may be chosen for larger batches on RDNA
  • Softmax, norms, rotaries, and embeddings remain FP32 across architectures and contribute modestly to total compute

RDNA3 long‑context slowdown vs NVIDIA

Observation: On RDNA3, short context decode (e.g., pp512) may be ~2× slower than a comparable NVIDIA GPU, but at long context (e.g., pp512+d32768) the gap can widen to ~5×. The primary reason is that the attention portion grows with context length and RDNA3’s default attention path is significantly slower unless rocWMMA is enabled and used by FlashAttention (FA).

Why it grows with context

  • For single‑token decode, the FFNs (weight GEMMs) are roughly constant work per token, while attention work increases with the amount of K/V you attend to (more memory traffic and dot products along K; FA reduces traffic but still scales with sequence length).
  • On NVIDIA Turing/Ampere/Ada, FA uses FP16 tensor cores, keeping the attention portion fast as it grows. On RDNA3, if FA falls back to vector/tile kernels (no WMMA), attention throughput is much lower, so the attention share dominates total time at long context, widening the gap.
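
A small latency model makes the widening gap concrete. All constants below are assumptions chosen only to show the shape of the effect (they are not measurements): per‑token weight‑GEMM time is roughly constant on both GPUs, while the fallback (non‑WMMA) attention path is assumed to cost far more per unit of context than NVIDIA's tensor‑core FA.

```cpp
// Illustrative latency model for the widening long-context gap. Every constant
// below is an assumption chosen to show the shape of the effect, not a measurement.
#include <cstdio>
#include <initializer_list>

int main() {
    const double nv_gemm_ms         = 1.0;   // assumed NVIDIA weight-GEMM time per token
    const double amd_gemm_ms        = 2.0;   // assumed RDNA3 weight-GEMM time (~2x slower)
    const double nv_attn_ms_per_1k  = 0.02;  // assumed tensor-core FA cost per 1k of context
    const double amd_attn_ms_per_1k = 0.20;  // assumed fallback (non-WMMA) FA cost per 1k of context

    for (double ctx : {512.0, 32768.0}) {
        const double nv  = nv_gemm_ms  + nv_attn_ms_per_1k  * ctx / 1024.0;
        const double amd = amd_gemm_ms + amd_attn_ms_per_1k * ctx / 1024.0;
        std::printf("ctx=%6.0f  AMD/NVIDIA per-token time ratio ~ %.1fx\n", ctx, amd/nv);
    }
    return 0;
}
```

Once attention dominates the per‑token time, the ratio is set almost entirely by relative attention throughput, which is why enabling the WMMA FA path matters most at long context.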

RDNA3 specific factors

  • INT8 for weight GEMMs is fine: MMQ uses RDNA sdot/sudot (DP4A‑class) for INT8 and stays mostly constant vs context; the widening gap implicates attention, not MMQ.
  • FA path: WMMA on RDNA3 is not used unless GGML_HIP_ROCWMMA_FATTN is enabled at build time. Without it, RDNA3 uses the non‑WMMA FA kernels (F16/F32 vector), which are much slower at long K.
  • BLAS and shape heuristics: For larger batches/tiles, RDNA backends may fall back to BLAS or less optimal kernels; NVIDIA has mature TC kernels for FA and GEMM across many shapes.
  • Memory/cache sensitivity: Long‑context FA touches more K/V cache. Differences in cache hierarchy and compiler scheduling can hurt the fallback FA more on RDNA3 than the TC/WMMA path on NVIDIA.

What to do: optimizations that materially help RDNA3

  1. Enable rocWMMA FlashAttention on RDNA3

    • Build with -DGGML_HIP_ROCWMMA_FATTN=ON and ensure rocwmma headers are available (CMake will check rocwmma/rocwmma.hpp).
    • This turns on FP16_MMA_AVAILABLE for HIP on RDNA3 in common.cuh, selecting the FA WMMA path in fattn.cu.
    • Expect 2–3× speedups for attention‑heavy segments; this directly addresses the long‑context gap.
  2. Verify FA is actually taken and dimension‑matched

    • FA MMA cases are specialized for common head dims: 64, 80, 96, 112, 128, 256, 576. Ensure your model’s head dimension matches one of these so the MMA kernel is selected.
    • Check that GQA settings trigger the optimized branches (fattn.cu uses gqa_ratio divisibility checks).
  3. Keep MMQ enabled for quantized weights (no change with context, but necessary for overall perf)

    • RDNA3 MMQ relies on sdot/sudot; it remains the dominant compute at short context. Confirm ggml_cuda_should_use_mmq(...) returns true for your quant types and batch sizes.
  4. Toolchain and flags

    • Use ROCm ≥ 6.1 (as required by ggml-hip/CMakeLists.txt).
    • If building for RDNA4, add -DGGML_HIP_ROCWMMA_FATTN_GFX12=ON so WMMA FA is allowed on GFX12.
    • Ensure CMAKE_PREFIX_PATH includes ROCm cmake dirs and rocWMMA installation paths.
  5. Secondary tunings (smaller gains)

    • Ensure softmax kernel launches with enough rows per block for your ncols; it is FP32 and can be memory‑bound at large nheads/ncols, but it is rarely the top bottleneck.
    • Keep K/V cache layout contiguous and avoid host<->device syncs in the decode loop.

Why NVIDIA scales better at long context

  • NVIDIA’s FA uses FP16 tensor cores broadly, with kernels tuned across many shapes and large on‑chip bandwidth (ldmatrix→mma pipelines). As attention dominates with context, these kernels retain a high fraction of theoretical throughput.
  • RDNA3 without WMMA falls back to scalar/vector code, which amplifies the gap as attention workload increases. Enabling rocWMMA narrows this gap substantially.

Summary

  • The widening 2×→5× gap at long contexts is primarily a software path issue on RDNA3: FA not using WMMA by default. Enabling GGML_HIP_ROCWMMA_FATTN and ensuring supported head dims typically yields 2–3× FA speedups and materially reduces the long‑context penalty.