Bootstrap knowledge of LLMs ASAP, with a bias toward GPT.
Avoid being a link dump: aim to provide only valuable, well-tuned information.
Neural network links come first, before starting on transformers.
from bitsandbytes.nn.modules import Linear8bitLt, Linear4bit
from contextlib import contextmanager
from torch.nn import init  # provides kaiming_uniform_, patched below

def noop(x=None, *args, **kwargs):
    "Do nothing"
    return x

@contextmanager
def no_kaiming():
    # Temporarily replace Kaiming initialization with a no-op,
    # restoring the original on exit.
    old_iku = init.kaiming_uniform_
    init.kaiming_uniform_ = noop
    try:
        yield
    finally:
        init.kaiming_uniform_ = old_iku
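The monkey-patching pattern behind `no_kaiming` can be shown self-contained, using a toy `init` namespace in place of `torch.nn.init` so it runs without torch (the names below are illustrative stand-ins, not from the original):

```python
from contextlib import contextmanager

class init:
    "Stand-in namespace mimicking torch.nn.init."
    @staticmethod
    def kaiming_uniform_(t):
        t[:] = [1.0] * len(t)  # pretend in-place initialization
        return t

def noop(x=None, *args, **kwargs):
    "Do nothing"
    return x

@contextmanager
def no_kaiming():
    # Swap the initializer for a no-op, restoring it on exit.
    old = init.kaiming_uniform_
    init.kaiming_uniform_ = noop
    try:
        yield
    finally:
        init.kaiming_uniform_ = old

t = [0.0, 0.0]
with no_kaiming():
    init.kaiming_uniform_(t)   # does nothing inside the block
assert t == [0.0, 0.0]
init.kaiming_uniform_(t)       # restored afterwards
assert t == [1.0, 1.0]
```

Layers constructed inside the `with` block skip their (expensive) random initialization — useful when the weights will be overwritten by a checkpoint load anyway.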
{-# LANGUAGE TypeSynonymInstances #-}

data Dual d = D Float d deriving Show

type Float' = Float

diff :: (Dual Float' -> Dual Float') -> Float -> Float'
diff f x = y'
  where D y y' = f (D x 1)

class VectorSpace v where
  zero :: v
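The forward-mode trick behind `diff` (seed the derivative slot with 1, apply the function, read the derivative back out) translates line-for-line into other languages; here is a minimal Python sketch, where the `Dual` class is an illustrative analogue of the Haskell `Dual d`, not part of the original code:

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0; b carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule: (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def diff(f, x):
    # Seed the derivative component with 1, as in `f (D x 1)`.
    return f(Dual(x, 1.0)).b

# d/dx (x*x + 3x) = 2x + 3, so at x = 2 this gives 7.
assert diff(lambda x: x * x + 3 * x, 2.0) == 7.0
```

No symbolic manipulation or finite differencing is involved: the derivative falls out of ordinary operator overloading, which is the whole appeal of the dual-number formulation.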
The Salesforce CodeGen models are a family of large language models trained on a large amount of natural-language data and then fine-tuned on specialized datasets of code. Models of 350M, 2B, 6B, and 16B parameters are provided in three flavors:
Twitter thread: https://twitter.com/theshawwn/status/1456925974919004165
Hacker News thread: https://news.ycombinator.com/item?id=29128998
November 6, 2021
jnp.device_put(1) is deceptively simple to write in JAX. But on a TPU, what actually happens? How does a tensor containing the value 1 actually get onto a TPU?
Turns out, the answer is "C++", and a lot of it.
#!/bin/bash
# Attempt to set up the Nvidia GeForce GT 710 on a Pi CM4.
#
# I have tried both armv7l and aarch64 versions of the proprietary driver, in
# addition to the nouveau open source driver (which needs to be compiled into
# a custom Raspberry Pi kernel).
#
# tl;dr - None of the drivers worked :P
This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
Mostly based upon the RISC-V ISA spec v2.0; some updates have been made for v2.2.
The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Applicative
import Data.Attoparsec.Text as A
WAYLAND_PROTOCOLS=/usr/share/wayland-protocols

# wayland-scanner is a tool which generates C headers and rigging for Wayland
# protocols, which are specified in XML. wlroots requires you to rig these up
# to your build system yourself and provide them in the include path.
xdg-shell-protocol.h:
	wayland-scanner server-header \
		$(WAYLAND_PROTOCOLS)/stable/xdg-shell/xdg-shell.xml $@

xdg-shell-protocol.c: xdg-shell-protocol.h
{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector as V
import System.CPUTime
import System.Environment
import Text.Printf

{- Implementation of the WROM algorithm for finding all
   free trees of a given order. The algorithm is explained
   here: