This document summarizes how ggml’s CUDA/HIP backend executes inference on different GPU families: which code paths are used, and at what numeric precision the major compute runs. It also gives rough workload-composition percentages to relate those code paths to each architecture’s FLOPS/TOPS ratings.
References are to files under ggml/src/ggml-cuda unless otherwise noted. The major operations map to source files as follows (a simplified dispatch sketch follows the list):
- Matmul (quantized): mmq.cu, mmq.cuh, vecdotq.cuh, quantize.cu/.cuh
- Matmul (float): mmf.cu, mmvf.cu, cuBLAS/hipBLAS calls in ggml-cuda.cu
- FlashAttention: fattn*.cu/.cuh
- Softmax: softmax.cu
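
To make the matmul split concrete, here is a minimal, hypothetical C++ sketch of the kind of path selection the backend performs. `TensorType`, `mul_mat_path`, and the batch-size heuristic are illustrative assumptions, not ggml identifiers; the real selection logic in ggml-cuda.cu also weighs GPU architecture, tensor-core availability, and tuned batch-size thresholds.

```cpp
// Illustrative sketch only -- not ggml's actual dispatch code.
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for ggml's tensor type enum.
enum class TensorType { F32, F16, Q4_0, Q8_0 /* ... other quant formats */ };

// Picks a matmul path mirroring the file split listed above.
const char * mul_mat_path(TensorType weight_type, int64_t batch_size) {
    const bool quantized = weight_type != TensorType::F32 &&
                           weight_type != TensorType::F16;
    if (quantized) {
        // mmq.cu / vecdotq.cuh: integer dot products on quantized weight
        // blocks; quantize.cu first converts float activations to a block
        // format the kernels can consume.
        return "MMQ (mmq.cu)";
    }
    if (batch_size == 1) {
        // mmvf.cu: float matrix-vector kernel, typical for single-token decode.
        return "MMVF (mmvf.cu)";
    }
    // mmf.cu for small batches, otherwise a cuBLAS/hipBLAS GEMM invoked
    // from ggml-cuda.cu.
    return "MMF or cuBLAS/hipBLAS (mmf.cu / ggml-cuda.cu)";
}

int main() {
    std::printf("%s\n", mul_mat_path(TensorType::Q4_0, 1));   // quantized decode
    std::printf("%s\n", mul_mat_path(TensorType::F16, 512));  // float prefill
}
```

The point of the split is that quantized weights take an integer dot-product path (MMQ), while float weights either use a dedicated matrix-vector kernel for single-token decode or fall through to a general BLAS GEMM.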