$$ \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi) \end{bmatrix} $$

$$ A' = \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi) \end{bmatrix} \begin{bmatrix} a_{2m}\\ a_{2m+1} \end{bmatrix} $$

$\quad\quad \phi_m(p) = p \cdot B^{-\frac{2m}{d}}$

  • $\phi_m(p)$: the rotation angle at position $p$ for dimension pair $m$; the sine/cosine frequency is $B^{-\frac{2m}{d}}$
  • $p$: token position
  • $B$: the base constant, 10000 by default
  • $m$: dimension pair index ($2m$ for even and $2m+1$ for odd)
  • $d$: dimension size of the model or one attention head
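
As a concrete illustration, here is a minimal NumPy sketch of the rotation above: it computes $\phi_m(p)$ for every dimension pair and rotates each $(a_{2m}, a_{2m+1})$ pair by that angle. The function name `rope_rotate` and the toy sizes are assumptions for this example, not any particular library's API.

```python
import numpy as np

def rope_rotate(x: np.ndarray, p: int, B: float = 10000.0) -> np.ndarray:
    """Rotate each (2m, 2m+1) dimension pair of x by phi_m(p) = p * B**(-2m/d)."""
    d = x.shape[-1]                     # dimension size (must be even)
    m = np.arange(d // 2)               # dimension pair index
    phi = p * B ** (-2.0 * m / d)       # rotation angle per pair
    cos, sin = np.cos(phi), np.sin(phi)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin   # first row of the rotation matrix
    out[..., 1::2] = x_even * sin + x_odd * cos   # second row of the rotation matrix
    return out

# toy usage: rotate an 8-dimensional vector as if it sat at token position 3
a = np.random.randn(8)
a_rot = rope_rotate(a, p=3)
```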

$$ \frac{1}{10000^{\frac{2i}{d}}} = 10000^{-\frac{2i}{d}} = e^{\ln\left(10000^{-\frac{2i}{d}}\right)} = e^{-\frac{2i}{d}\ln 10000} $$

$$ \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$ $$ \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$
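
A short NumPy sketch of these sinusoidal encodings, computing the inverse frequencies in log space exactly as in the identity above; `sinusoidal_pe` and the toy sizes are just for illustration.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings of shape (seq_len, d), with d even."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                        # (1, d/2)
    inv_freq = np.exp(-(2.0 * i / d) * np.log(10000.0))   # = 10000**(-2i/d)
    angles = pos * inv_freq                               # (seq_len, d/2)
    pe = np.empty((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=16, d=8)   # e.g. 16 positions, model dim 8
```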

$$ RN(x_i) = \frac{x_i}{\text{RMS}(x)}\, g_i, \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2} $$

$g \in \mathbb{R}^n$ is a learned scaling parameter.
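
A minimal NumPy sketch of this normalization over the last dimension. The small epsilon inside the square root is a common implementation detail for numerical stability, not part of the formula above.

```python
import numpy as np

def rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last dimension: x / RMS(x) * g."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

d = 8
x = np.random.randn(4, d)   # e.g. 4 tokens of dimension 8
g = np.ones(d)              # learned gain, typically initialized to 1
y = rms_norm(x, g)
```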

$$
\begin{align}
c_t^{Q} &= W^{DQ} h_t \\
[q_{t,1}^{C}; q_{t,2}^{C}; \ldots; q_{t,n_h}^{C}] = q_t^{C} &= W^{UQ} c_t^{Q} \\
[q_{t,1}^{R}; q_{t,2}^{R}; \ldots; q_{t,n_h}^{R}] = q_t^{R} &= \mathrm{RoPE}(W^{QR} c_t^{Q}) \\
q_{t,i} &= [\, q_{t,i}^{C}; q_{t,i}^{R} \,] \\
c_t^{KV} &= W^{DKV} h_t \\
[k_{t,1}^{C}; k_{t,2}^{C}; \ldots; k_{t,n_h}^{C}] = k_t^{C} &= W^{UK} c_t^{KV} \\
k_t^{R} &= \mathrm{RoPE}(W^{KR} h_t)
\end{align}
$$
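
These are the query/key projections of multi-head latent attention: queries and keys are up-projected from the low-rank latents $c_t^{Q}$ and $c_t^{KV}$, with a separate RoPE-carrying component. Below is a rough NumPy shape-check sketch of that flow under assumed toy dimensions; the weight matrices are random and the RoPE rotation is stubbed out (a real implementation would rotate pairs as in the RoPE sketch above), so this only illustrates the projections and shapes, not any paper's actual configuration.

```python
import numpy as np

# assumed toy dimensions, not real model values
d, d_c_q, d_c_kv = 64, 24, 16        # model dim, query latent dim, KV latent dim
n_h, d_h, d_h_rope = 4, 8, 4         # heads, per-head content dim, per-head RoPE dim

rope = lambda x: x                   # placeholder: real RoPE rotates pairs as sketched above

rng = np.random.default_rng(0)
h_t = rng.standard_normal(d)                          # hidden state of token t

W_DQ  = rng.standard_normal((d_c_q, d))               # down-projection for queries
W_UQ  = rng.standard_normal((n_h * d_h, d_c_q))       # up-projection for content queries
W_QR  = rng.standard_normal((n_h * d_h_rope, d_c_q))  # projection for RoPE queries
W_DKV = rng.standard_normal((d_c_kv, d))              # down-projection for keys/values
W_UK  = rng.standard_normal((n_h * d_h, d_c_kv))      # up-projection for content keys
W_KR  = rng.standard_normal((d_h_rope, d))            # RoPE key projection, shared by heads

c_q  = W_DQ @ h_t                                     # compressed query latent, (d_c_q,)
q_c  = (W_UQ @ c_q).reshape(n_h, d_h)                 # per-head content queries
q_r  = rope(W_QR @ c_q).reshape(n_h, d_h_rope)        # per-head RoPE queries
q    = np.concatenate([q_c, q_r], axis=-1)            # q_{t,i} = [q^C ; q^R]

c_kv = W_DKV @ h_t                                    # compressed KV latent, (d_c_kv,), cached
k_c  = (W_UK @ c_kv).reshape(n_h, d_h)                # per-head content keys
k_r  = rope(W_KR @ h_t)                               # single RoPE key shared across heads
k    = np.concatenate([k_c, np.broadcast_to(k_r, (n_h, d_h_rope))], axis=-1)
```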
@furixturi
furixturi / masked_scaled_dot_product_attn_dropout.md
Created July 28, 2025 23:43
masked scaled dot-product with dropout formula

$Attention(Q, K, V, M, D) = \left[D \odot Softmax\left(\frac{QK^T}{\sqrt{d_k}} + mask(M)\right)\right]V$

  • Q: query matrix of shape (batch_size, seq_len, $d_q$)
  • K: key matrix of shape (batch_size, seq_len, $d_k$)
  • V: value matrix of shape (batch_size, seq_len, $d_v$)
  • M: mask matrix of shape (seq_len, seq_len), 0 for masked positions and 1 for allowed positions; $mask(M)$ maps masked positions to $-\infty$ (in practice a large negative number) so they receive zero weight after the softmax
  • D: dropout matrix of shape (seq_len, seq_len). With dropout probability $p$, each element is set to 0 with probability $p$ and kept with probability $1-p$, scaled up by $1/(1-p)$ to compensate for the removed units and preserve the expected sum

$$Attention(Q,K,V)=Softmax(\frac{QK^T}{\sqrt{d_k}})V$$
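
A minimal NumPy sketch of the masked, dropout-regularized attention above; passing `mask=None` and `dropout_p=0.0` recovers the plain formula. The function name and signature are assumptions for this example.

```python
import numpy as np

def attention(Q, K, V, mask=None, dropout_p=0.0, rng=None):
    """Scaled dot-product attention with optional mask and (inverted) dropout."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (batch, seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)       # mask(M): 0 -> large negative
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key axis
    if dropout_p > 0.0:
        rng = rng or np.random.default_rng()
        keep = rng.random(weights.shape) >= dropout_p    # D: keep/drop mask
        weights = weights * keep / (1.0 - dropout_p)     # scale kept units by 1/(1-p)
    return weights @ V

# toy usage with a causal mask
batch, seq_len, d_k = 2, 5, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, seq_len, d_k))
K = rng.standard_normal((batch, seq_len, d_k))
V = rng.standard_normal((batch, seq_len, d_k))
causal = np.tril(np.ones((seq_len, seq_len)))            # 1 = allowed, 0 = masked
out = attention(Q, K, V, mask=causal, dropout_p=0.1, rng=rng)
```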

@furixturi
furixturi / benchmarks.md
Last active August 4, 2024 10:47
benchmarks-llama-3_1

Benchmarks used to evaluate Llama 3.1

Pre-training

| Category | Benchmark | Full Name | Authors/Institution | Description | Example |
| --- | --- | --- | --- | --- | --- |
| Reading Comprehension | SQuAD V2 (2018) | Stanford Question Answering Dataset 2.0 | Pranav Rajpurkar et al., Stanford University | Combines 100,000 questions from SQuAD 1.1 with 50,000 unanswerable questions. | "When were the Normans in Normandy?" Answer: "10th and 11th centuries". |

Check who is authenticated with GitHub

Temporarily switch user and reauthenticate

git config credential.username "otherName"