This document summarizes how ggml’s CUDA/HIP backend executes inference on different GPU families: which code paths are used, and at what numeric precision the major compute runs. It also gives rough workload-composition percentages to relate those code paths to each architecture’s FLOPS/TOPS ratings.
References are to files under ggml/src/ggml-cuda unless otherwise noted. The major operations map to source files as follows (a simplified dispatch sketch follows the list):
- Matmul (quantized): mmq.cu, mmq.cuh, vecdotq.cuh, quantize.cu/.cuh
- Matmul (float): mmf.cu, mmvf.cu, cuBLAS/hipBLAS calls in ggml-cuda.cu
- FlashAttention: fattn*.cu/.cuh
- Softmax: softmax.cu
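
To make the matmul split concrete, here is a minimal, hypothetical C++ sketch of the kind of path selection the backend performs. `TensorType`, `mul_mat_path`, and the batch-size heuristic are illustrative assumptions, not ggml identifiers; the real selection logic in ggml-cuda.cu also weighs GPU architecture, tensor-core availability, and tuned batch-size thresholds.

```cpp
// Illustrative sketch only -- not ggml's actual dispatch code.
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for ggml's tensor type enum.
enum class TensorType { F32, F16, Q4_0, Q8_0 /* ... other quant formats */ };

// Picks a matmul path mirroring the file split listed above.
const char * mul_mat_path(TensorType weight_type, int64_t batch_size) {
    const bool quantized = weight_type != TensorType::F32 &&
                           weight_type != TensorType::F16;
    if (quantized) {
        // mmq.cu / vecdotq.cuh: integer dot products on quantized weight
        // blocks; quantize.cu first converts float activations to a block
        // format the kernels can consume.
        return "MMQ (mmq.cu)";
    }
    if (batch_size == 1) {
        // mmvf.cu: float matrix-vector kernel, typical for single-token decode.
        return "MMVF (mmvf.cu)";
    }
    // mmf.cu for small batches, otherwise a cuBLAS/hipBLAS GEMM invoked
    // from ggml-cuda.cu.
    return "MMF or cuBLAS/hipBLAS (mmf.cu / ggml-cuda.cu)";
}

int main() {
    std::printf("%s\n", mul_mat_path(TensorType::Q4_0, 1));   // quantized decode
    std::printf("%s\n", mul_mat_path(TensorType::F16, 512));  // float prefill
}
```

The point of the split is that quantized weights take an integer dot-product path (MMQ), while float weights either use a dedicated matrix-vector kernel for single-token decode or fall through to a general BLAS GEMM.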