$$ A' = \begin{bmatrix} \cos(p\phi_m) & -\sin(p\phi_m) \\ \sin(p\phi_m) & \cos(p\phi_m) \end{bmatrix}, \qquad \phi_m = B^{-2m/d} $$

where:
- $\phi$ : sine/cosine frequency
- $p$ : token position
- $B$ : the base constant, 10000 by default
- $m$ : dimension pair index ($2m$ for even and $2m+1$ for odd dimensions)
- $d$ : dimension size of the model or of one attention head
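To make the rotation concrete, here is a minimal NumPy sketch that applies the pairwise rotation above to a single token vector; the function name `rope_rotate` and the per-token NumPy formulation are illustrative assumptions, not taken from the source.

```python
import numpy as np

def rope_rotate(x: np.ndarray, p: int, B: float = 10000.0) -> np.ndarray:
    """Rotate token vector x (length d, d even) to encode position p.

    Each dimension pair (2m, 2m+1) is rotated by the angle p * phi_m,
    where phi_m = B ** (-2m / d), matching the legend above.
    """
    d = x.shape[-1]
    m = np.arange(d // 2)
    phi = B ** (-2.0 * m / d)          # per-pair frequency phi_m
    angle = p * phi                    # rotation angle at position p
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]   # x_{2m} and x_{2m+1}
    out = np.empty_like(x, dtype=float)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out
```

Because the same rotation is applied to queries and keys, the dot product between two rotated vectors depends only on their relative position, which is the point of rotary embeddings.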
- Q: query matrix of shape (batch_size, seq_len, $d_q$)
- K: key matrix of shape (batch_size, seq_len, $d_k$)
- V: value matrix of shape (batch_size, seq_len, $d_v$)
- M: mask matrix of shape (seq_len, seq_len); 0 for masked positions, 1 for allowed positions
- D: dropout matrix of shape (seq_len, seq_len); with dropout probability $p$, each element is set to 0 with probability $p$, and with probability $1-p$ it is kept and scaled up by $1/(1-p)$ to compensate for the removed units and preserve the expected sum
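Putting the five matrices together, below is a minimal NumPy sketch of masked scaled dot-product attention with inverted dropout applied to the attention weights; the function name `masked_attention` and the `p_drop` parameter are illustrative assumptions, not the source's implementation.

```python
import numpy as np

def masked_attention(Q, K, V, M, p_drop=0.0, rng=None):
    """softmax(Q K^T / sqrt(d_k)) with mask M, then dropout D, then @ V.

    Q: (batch, seq_len, d_q), K: (batch, seq_len, d_k), V: (batch, seq_len, d_v),
    with d_q == d_k.  M: (seq_len, seq_len), 1 = allowed, 0 = masked.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, seq, seq)
    scores = np.where(M == 1, scores, -np.inf)            # block masked positions
    # numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    if p_drop > 0.0:                                      # inverted dropout D
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(weights.shape) >= p_drop        # keep with prob 1 - p
        weights = weights * keep / (1.0 - p_drop)         # rescale by 1/(1 - p)
    return weights @ V                                    # (batch, seq, d_v)
```

With `p_drop=0.0` this reduces to plain masked attention, so the dropout path can be checked separately.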
- Source: Meta's Llama 3.1 paper, "The Llama 3 Herd of Models"
- Benchmark details generated by ChatGPT using the GPT-4o model
| Category | Benchmark | Full Name | Authors/Institution | Description | Example | 
|---|---|---|---|---|---|
| Reading Comprehension | SQuAD V2 (2018) | Stanford Question Answering Dataset 2.0 | Pranav Rajpurkar et al., Stanford University | Combines 100,000 questions from SQuAD 1.1 with 50,000 unanswerable questions. | "When were the Normans in Normandy?" Answer: "10th and 11th centuries". | 