# the generic basics of preference reward modeling
The Bradley-Terry model works like this:
- It's based on a chosen/rejected split
- The model is trained on binary judgements that label specific content/samples as either 'preferred' or 'dispreferred'
- The log-odds of 'preferred' versus 'dispreferred' give a natural reward signal (sketched just below)
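
As a concrete reference, here is a minimal sketch of that pairwise objective in PyTorch. The `bradley_terry_loss` helper and the dummy reward values are illustrative only, not the training code for the model linked below.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected),
    # so maximizing that probability means minimizing -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy scalar rewards for a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -0.5])
print(bradley_terry_loss(r_chosen, r_rejected))
```

The log-odds in the last bullet are exactly the reward difference `r_chosen - r_rejected` inside the sigmoid.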


# what parts are new when it comes to what i am trying to do
For my experimental setup, I train my reward model to judge the last 64 tokens of the sequence at a time: each 64-token chunk is evaluated in the context of the text that precedes it, with a sliding window over the sequence. I then take the average of these per-chunk judgements as the reward for the whole longform generation.
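Here is a minimal sketch of that chunk-and-average scheme, assuming non-overlapping 64-token steps, the Hugging Face Qwen2.5 tokenizer, and a `score_chunk(context, chunk)` scorer like the one sketched at the end of this section.

```python
from transformers import AutoTokenizer

CHUNK = 64  # tokens per judged chunk

def sequence_reward(text: str, score_chunk, tokenizer) -> float:
    # Walk the generation in 64-token steps, judge each chunk given the text
    # that precedes it, and average the per-chunk scores into a single reward.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    scores = []
    for start in range(0, len(ids), CHUNK):
        context = tokenizer.decode(ids[:start])
        chunk = tokenizer.decode(ids[start:start + CHUNK])
        scores.append(score_chunk(context, chunk))
    return sum(scores) / max(len(scores), 1)

# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
```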
In addition to this, I'm generating synthetic preferred/dispreferred data with the Qwen2.5 7B base model at varying sampling temperatures. For future revisions, I want to experiment with intentionally degrading the text in more diverse ways, such as round-trip translating it through another language.
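A rough sketch of the synthetic generation step with vLLM, assuming the dispreferred side comes from continuing a corpus prefix with the base model at an elevated temperature while the original corpus continuation serves as the preferred side; the temperature value and the example prefix are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B")
# Illustrative: a high temperature degrades the continuation relative to the corpus text.
params = SamplingParams(temperature=1.3, max_tokens=64)

prefixes = ["The committee reviewed the proposal and concluded that"]
outputs = llm.generate(prefixes, params)
rejected_chunks = [out.outputs[0].text for out in outputs]
```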
Because every judgement covers a fixed-size 64-token window, this creates a preference modeling baseline that is normalized across positions by default: the model is always judging roughly the same relative "volume" of information at a time, on average.
The model expects input in this precise format:
```
[Original text from previous 64-token chunks]...
<>
[Next 64-token chunk to evaluate]
<>
<>letter
```
In my setup, `letter` is either `A` (chosen) or `B` (rejected).
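A small helper that assembles that prompt for one chunk; the exact newline placement around the `<>` separators is my assumption about the format above.

```python
def build_prompt(previous_chunks: str, next_chunk: str) -> str:
    # The model is asked to produce the letter (A or B) right after the final '<>'.
    return f"{previous_chunks}\n<>\n{next_chunk}\n<>\n<>"
```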
I use vLLM to read out the probability distribution over the `A`/`B` tokens at that final position, for every chunk.
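A minimal sketch of that readout, reusing `build_prompt` from above and the reward model linked below; pulling `A`/`B` out of vLLM's top-token logprobs and renormalizing over the two letters is my assumption about the scoring step.

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")
params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=20)

def score_chunk(previous_chunks: str, next_chunk: str) -> float:
    # Returns P(A) renormalized over {A, B} for one 64-token chunk.
    prompt = build_prompt(previous_chunks, next_chunk)
    out = llm.generate([prompt], params)[0]
    top_logprobs = out.outputs[0].logprobs[0]  # top tokens at the single generated position
    probs = {"A": 0.0, "B": 0.0}
    for lp in top_logprobs.values():
        token = (lp.decoded_token or "").strip()
        if token in probs:
            probs[token] += math.exp(lp.logprob)
    total = probs["A"] + probs["B"]
    return probs["A"] / total if total > 0 else 0.5
```

This is the `score_chunk` assumed by the averaging sketch earlier in the section.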
# link to the model and dataset
https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b
https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement