
@notpolomarco
Forked from kalomaze/pref_model.md
Created April 5, 2025 10:56

Revisions

  1. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions pref_model.md
    @@ -8,10 +8,10 @@ The Bradley-Terry model works like this:
    ![](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/tfkuz9.webp)

    # what part is new about what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the seq to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.
    # what parts are new when it comes to what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the sequence to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.
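    Roughly, the chunking and averaging looks like this (just a sketch; `score_chunk` is a hypothetical stand-in for one reward-model judgement call, and the stride being equal to the chunk size is an assumption, not a confirmed detail):

    ```python
    # sketch of the sliding-window chunk scoring described above
    from typing import Callable, List

    CHUNK_TOKENS = 64

    def sequence_reward(token_ids: List[int],
                        score_chunk: Callable[[List[int], List[int]], float],
                        stride: int = CHUNK_TOKENS) -> float:
        scores = []
        for end in range(CHUNK_TOKENS, len(token_ids) + 1, stride):
            context = token_ids[:end - CHUNK_TOKENS]     # earlier chunks, given as context
            window = token_ids[end - CHUNK_TOKENS:end]   # the 64-token region being judged
            scores.append(score_chunk(context, window))
        # the whole longform generation gets the mean of the per-chunk judgements
        return sum(scores) / len(scores) if scores else 0.0
    ```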

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.
    In addition to this, I'm making synthetic preferred/unpreferred data via the Qwen2.5 7b base model at varying temperatures. For future revisions, I want to experiment with intentionally making the text worse in more diverse ways, such as translating to and from another language.
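    A sketch of how the rejected side can be sampled with vllm at varying temperatures (the prompt list and the temperature grid here are placeholders, not the actual settings):

    ```python
    # sketch: sampling 'rejected' continuations from the Qwen2.5 7B base model
    # at varying temperatures with vLLM; prompts and temperatures are placeholders
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B")          # base model, not the instruct variant

    prompts = ["..."]                            # prefixes taken from the preferred data
    temperatures = [0.7, 1.0, 1.3, 1.6]          # hotter sampling -> noisier, worse text

    rejected = []
    for temp in temperatures:
        params = SamplingParams(temperature=temp, max_tokens=512)
        for out in llm.generate(prompts, params):
            rejected.append({"temperature": temp, "text": out.outputs[0].text})
    ```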

    This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.

  2. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 7 additions and 1 deletion.
    8 changes: 7 additions & 1 deletion pref_model.md
    @@ -29,4 +29,10 @@ The model expects input in this precise format:

    In my setup, the `letter` is A (chosen) or B (rejected).

    I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.

    # link to the model and dataset

    https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b

    https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement
  3. @kalomaze renamed this gist Mar 18, 2025. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions gistfile1.md → pref_model.md
    @@ -13,6 +13,8 @@ For my experimental setup I am doing chunks of the last 64 tokens in the seq to

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

    This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.

    The model expects input in this precise format:

    ```
  4. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 25 additions and 3 deletions.
    28 changes: 25 additions & 3 deletions gistfile1.md
    @@ -1,8 +1,30 @@
    # Bradley-Terry Model
    # the generic basics of preference reward modeling

    The Bradley-Terry model works like this:
    - It's based on a chosen/rejected split
    - The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
    - The log ratio between preferred and dispreferred can be used as the natural reward signal
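
    In code, the generic pieces look roughly like this (a PyTorch sketch; the function names are illustrative, not from any actual training script):

    ```python
    # sketch of the generic Bradley-Terry pieces
    import torch
    import torch.nn.functional as F

    def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # under Bradley-Terry, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
        # training minimizes the negative log-likelihood of the binary judgements
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    def log_ratio_reward(p_preferred: torch.Tensor, p_dispreferred: torch.Tensor) -> torch.Tensor:
        # the log ratio between preferred and dispreferred probabilities is the
        # natural scalar reward signal
        return torch.log(p_preferred) - torch.log(p_dispreferred)
    ```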

    ![Caption: For my experimental setup I am doing chunks of tokens that it evaluates on my trained reward model](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/tfkuz9.webp)

    # what part is new about what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the seq to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

    The model expects input in this precise format:

    ```
    [Original text from previous 64-token chunks]...
    <<JUDGEMENT_REGION>>
    [Next 64-token chunk to evaluate]
    <</JUDGEMENT_REGION>>
    <<JUDGEMENT>>letter
    ```
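
    A small helper sketching how one evaluation prompt might be assembled in that format (illustrative; the prompt stops right after `<<JUDGEMENT>>` so the model's next token is the letter):

    ```python
    # sketch: building one judgement prompt in the format above
    def build_judgement_prompt(previous_chunks_text: str, chunk_text: str) -> str:
        return (
            f"{previous_chunks_text}\n"
            "<<JUDGEMENT_REGION>>\n"
            f"{chunk_text}\n"
            "<</JUDGEMENT_REGION>>\n"
            "<<JUDGEMENT>>"
        )
    ```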

    In my setup, the `letter` is A (chosen) or B (rejected).

    I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.
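
    Roughly, reading that A/B distribution per chunk looks like this (a sketch, not the actual eval script; it assumes a recent vLLM version where per-position logprobs map token ids to `Logprob` objects, and uses the reward model linked earlier):

    ```python
    # sketch: reading the A/B next-token distribution with vLLM for one chunk prompt
    import math
    from vllm import LLM, SamplingParams

    llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")
    tok = llm.get_tokenizer()
    A_ID = tok.encode("A", add_special_tokens=False)[0]
    B_ID = tok.encode("B", add_special_tokens=False)[0]

    def chunk_log_ratio(prompt: str) -> float:
        params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)
        out = llm.generate([prompt], params)[0]
        # dict of {token_id: Logprob} for the position right after <<JUDGEMENT>>
        dist = out.outputs[0].logprobs[0]
        logp_a = dist[A_ID].logprob if A_ID in dist else -math.inf
        logp_b = dist[B_ID].logprob if B_ID in dist else -math.inf
        return logp_a - logp_b    # log P(A) - log P(B), one judgement per chunk
    ```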
  5. @kalomaze renamed this gist Mar 18, 2025. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  6. @kalomaze created this gist Mar 18, 2025.
    8 changes: 8 additions & 0 deletions gistfile1.txt
    @@ -0,0 +1,8 @@
    # Bradley-Terry Model

    The Bradley-Terry model works like this:
    - It's based on a chosen/rejected split
    - The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
    - The log ratio between preferred and dispreferred can be used as the natural reward signal

    ![Caption: For my experimental setup I am doing chunks of tokens that it evaluates on my trained reward model](https://files.catbox.moe/jmib0e.png)