$$ \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi) \end{bmatrix} $$

$$ A' = \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi) \end{bmatrix} \begin{bmatrix} a_{2m}\\ a_{2m+1} \end{bmatrix} $$

$\quad\quad \phi_m(p) = p \cdot B^{-\frac{2m}{d}}$

  • $\phi_m(p)$: the rotation angle at position $p$ for dimension pair $m$; the sine/cosine frequency is $B^{-\frac{2m}{d}}$
  • $p$: token position
  • $B$: the base constant, 10000 by default
  • $m$: dimension pair index ($2m$ for even and $2m+1$ for odd)
  • $d$: dimension size of the model or one attention head
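
As a concrete illustration, here is a minimal NumPy sketch of the rotation above: it computes $\phi_m(p)$ for every dimension pair and rotates each $(a_{2m}, a_{2m+1})$ pair by that angle. The function name `rope_rotate` and the toy sizes are assumptions for this example, not any particular library's API.

```python
import numpy as np

def rope_rotate(x: np.ndarray, p: int, B: float = 10000.0) -> np.ndarray:
    """Rotate each (2m, 2m+1) dimension pair of x by phi_m(p) = p * B**(-2m/d)."""
    d = x.shape[-1]                     # dimension size (must be even)
    m = np.arange(d // 2)               # dimension pair index
    phi = p * B ** (-2.0 * m / d)       # rotation angle per pair
    cos, sin = np.cos(phi), np.sin(phi)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin   # first row of the rotation matrix
    out[..., 1::2] = x_even * sin + x_odd * cos   # second row of the rotation matrix
    return out

# toy usage: rotate an 8-dimensional vector as if it sat at token position 3
a = np.random.randn(8)
a_rot = rope_rotate(a, p=3)
```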

$$ \frac{1}{10000^{\frac{2i}{d}}} = 10000^{-\frac{2i}{d}} = e^{\ln\left(10000^{-\frac{2i}{d}}\right)} = e^{-\frac{2i}{d}\ln 10000} $$

$$ \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$ $$ \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$
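
A short NumPy sketch of these sinusoidal encodings, computing the inverse frequencies in log space exactly as in the identity above; `sinusoidal_pe` and the toy sizes are just for illustration.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal positional encodings of shape (seq_len, d), with d even."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                        # (1, d/2)
    inv_freq = np.exp(-(2.0 * i / d) * np.log(10000.0))   # = 10000**(-2i/d)
    angles = pos * inv_freq                               # (seq_len, d/2)
    pe = np.empty((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=16, d=8)   # e.g. 16 positions, model dim 8
```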

$$ RN(x_i) = \frac{x_i}{\text{RMS}(x)}\, g_i, \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2} $$

$g \in \mathbb{R}^n$ is a learned scaling parameter.
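
A minimal NumPy sketch of this normalization over the last dimension. The small epsilon inside the square root is a common implementation detail for numerical stability, not part of the formula above.

```python
import numpy as np

def rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last dimension: x / RMS(x) * g."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g

d = 8
x = np.random.randn(4, d)   # e.g. 4 tokens of dimension 8
g = np.ones(d)              # learned gain, typically initialized to 1
y = rms_norm(x, g)
```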

$$
\begin{align}
c_t^{Q} &= W^{DQ} h_t \\
[q_{t,1}^{C}; q_{t,2}^{C}; \ldots; q_{t,n_h}^{C}] = q_t^{C} &= W^{UQ} c_t^{Q} \\
[q_{t,1}^{R}; q_{t,2}^{R}; \ldots; q_{t,n_h}^{R}] = q_t^{R} &= \mathrm{RoPE}(W^{QR} c_t^{Q}) \\
q_{t,i} &= [\, q_{t,i}^{C}; q_{t,i}^{R} \,] \\
c_t^{KV} &= W^{DKV} h_t \\
[k_{t,1}^{C}; k_{t,2}^{C}; \ldots; k_{t,n_h}^{C}] = k_t^{C} &= W^{UK} c_t^{KV} \\
k_t^{R} &= \mathrm{RoPE}(W^{KR} h_t)
\end{align}
$$
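
These are the query/key projections of multi-head latent attention: queries and keys are up-projected from the low-rank latents $c_t^{Q}$ and $c_t^{KV}$, with a separate RoPE-carrying component. Below is a rough NumPy shape-check sketch of that flow under assumed toy dimensions; the weight matrices are random and the RoPE rotation is stubbed out (a real implementation would rotate pairs as in the RoPE sketch above), so this only illustrates the projections and shapes, not any paper's actual configuration.

```python
import numpy as np

# assumed toy dimensions, not real model values
d, d_c_q, d_c_kv = 64, 24, 16        # model dim, query latent dim, KV latent dim
n_h, d_h, d_h_rope = 4, 8, 4         # heads, per-head content dim, per-head RoPE dim

rope = lambda x: x                   # placeholder: real RoPE rotates pairs as sketched above

rng = np.random.default_rng(0)
h_t = rng.standard_normal(d)                          # hidden state of token t

W_DQ  = rng.standard_normal((d_c_q, d))               # down-projection for queries
W_UQ  = rng.standard_normal((n_h * d_h, d_c_q))       # up-projection for content queries
W_QR  = rng.standard_normal((n_h * d_h_rope, d_c_q))  # projection for RoPE queries
W_DKV = rng.standard_normal((d_c_kv, d))              # down-projection for keys/values
W_UK  = rng.standard_normal((n_h * d_h, d_c_kv))      # up-projection for content keys
W_KR  = rng.standard_normal((d_h_rope, d))            # RoPE key projection, shared by heads

c_q  = W_DQ @ h_t                                     # compressed query latent, (d_c_q,)
q_c  = (W_UQ @ c_q).reshape(n_h, d_h)                 # per-head content queries
q_r  = rope(W_QR @ c_q).reshape(n_h, d_h_rope)        # per-head RoPE queries
q    = np.concatenate([q_c, q_r], axis=-1)            # q_{t,i} = [q^C ; q^R]

c_kv = W_DKV @ h_t                                    # compressed KV latent, (d_c_kv,), cached
k_c  = (W_UK @ c_kv).reshape(n_h, d_h)                # per-head content keys
k_r  = rope(W_KR @ h_t)                               # single RoPE key shared across heads
k    = np.concatenate([k_c, np.broadcast_to(k_r, (n_h, d_h_rope))], axis=-1)
```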
@furixturi
furixturi / masked_scaled_dot_product_attn_dropout.md
Created July 28, 2025 23:43
masked scaled dot-product with dropout formula

$Attention(Q, K, V, M, D) = \left[D \odot Softmax\left(\frac{QK^T}{\sqrt{d_k}} + mask(M)\right)\right]V$

  • Q: query matrix of shape (batch_size, seq_len, $d_q$)
  • K: key matrix of shape (batch_size, seq_len, $d_k$)
  • V: value matrix of shape (batch_size, seq_len, $d_v$)
  • M: mask matrix of shape (seq_len, seq_len), 0 for masked positions and 1 for allowed positions; $mask(M)$ maps masked positions to $-\infty$ (in practice a large negative number) so they receive zero weight after the softmax
  • D: dropout matrix of shape (seq_len, seq_len). With dropout probability $p$, each element is set to 0 with probability $p$ and kept with probability $1-p$, scaled up by $1/(1-p)$ to compensate for the removed units and preserve the expected sum

$$Attention(Q,K,V)=Softmax(\frac{QK^T}{\sqrt{d_k}})V$$
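
A minimal NumPy sketch of the masked, dropout-regularized attention above; passing `mask=None` and `dropout_p=0.0` recovers the plain formula. The function name and signature are assumptions for this example.

```python
import numpy as np

def attention(Q, K, V, mask=None, dropout_p=0.0, rng=None):
    """Scaled dot-product attention with optional mask and (inverted) dropout."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (batch, seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)       # mask(M): 0 -> large negative
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key axis
    if dropout_p > 0.0:
        rng = rng or np.random.default_rng()
        keep = rng.random(weights.shape) >= dropout_p    # D: keep/drop mask
        weights = weights * keep / (1.0 - dropout_p)     # scale kept units by 1/(1-p)
    return weights @ V

# toy usage with a causal mask
batch, seq_len, d_k = 2, 5, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, seq_len, d_k))
K = rng.standard_normal((batch, seq_len, d_k))
V = rng.standard_normal((batch, seq_len, d_k))
causal = np.tril(np.ones((seq_len, seq_len)))            # 1 = allowed, 0 = masked
out = attention(Q, K, V, mask=causal, dropout_p=0.1, rng=rng)
```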

@furixturi
furixturi / benchmarks.md
Last active August 4, 2024 10:47
benchmarks-llama-3_1

Benchmarks used to evaluate Llama 3.1

Pre-training

| Category | Benchmark | Full Name | Authors/Institution | Description | Example |
| --- | --- | --- | --- | --- | --- |
| Reading Comprehension | SQuAD V2 (2018) | Stanford Question Answering Dataset 2.0 | Pranav Rajpurkar et al., Stanford University | Combines 100,000 questions from SQuAD 1.1 with 50,000 unanswerable questions. | "When were the Normans in Normandy?" Answer: "10th and 11th centuries". |

Check who is authenticated with GitHub

Temporarily switch user and reauthenticate

git config credential.username "otherName"