# the generic basics of preference reward modeling

The Bradley-Terry model works like this:

- It's based on a chosen/rejected split
- The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
- The log ratio between preferred and dispreferred can be used as the natural reward signal (written out in the appendix below)

![](https://files.catbox.moe/jmib0e.png)

![](https://files.catbox.moe/tfkuz9.webp)

# what parts are new when it comes to what i am trying to do

For my experimental setup, I train my reward model on 64-token chunks: each judged chunk is the last 64 tokens of a window that slides across the sequence. I then take the average of these judgements across the sequence as the reward for the whole longform generation.

In addition to this, I'm making synthetic preferred/dispreferred data via the Qwen2.5 7B base model at varying temperatures. For future revisions, I want to experiment with intentionally degrading the text in more diverse ways, such as translating it to and from another language.

This creates a preference-modeling baseline that is normalized across positions by default, and that on average always judges the same relative "volume" of information at a time.

The model expects input in this precise format:

```
[Original text from previous 64-token chunks]...
<>
[Next 64-token chunk to evaluate]
<>
<>letter
```

In my setup, the `letter` is A (chosen) or B (rejected). I use vLLM to evaluate the probability distribution over the A/B comparison for every chunk (a rough sketch of this scoring loop is in the appendix below).

# link to the model and dataset

https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b

https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement
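
# appendix: sketches of the pieces above

For reference, the standard Bradley-Terry objective that the first section describes, with $r_\theta$ the reward model, $y_c$/$y_r$ the chosen/rejected samples, and $\sigma$ the logistic function:

$$
P(y_c \succ y_r \mid x) = \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big),
\qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}\big[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\big]
$$

In the A/B-token formulation used here, the per-chunk reward can be read off as the log ratio $\log p(\text{A}) - \log p(\text{B})$, which is the "natural reward signal" from the bullet list.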
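
Next, a minimal sketch of the synthetic pair construction. The post only says the pairs come from the Qwen2.5 7B base model at varying temperatures, so the pairing scheme here (lower temperature treated as preferred, higher temperature as dispreferred), the exact temperature values, and the generation lengths are all assumptions for illustration:

```python
# Illustrative only: builds chosen/rejected pairs by resampling a prompt at two
# temperatures. Which side counts as "preferred" is an assumption, not something
# the post specifies.
from vllm import LLM, SamplingParams

base = LLM(model="Qwen/Qwen2.5-7B")  # base model named in the post

def make_pair(prompt: str) -> dict:
    low = SamplingParams(temperature=0.7, max_tokens=512)   # placeholder values
    high = SamplingParams(temperature=1.6, max_tokens=512)  # placeholder values
    preferred = base.generate([prompt], low)[0].outputs[0].text
    dispreferred = base.generate([prompt], high)[0].outputs[0].text
    return {"prompt": prompt, "chosen": preferred, "rejected": dispreferred}
```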
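
And a minimal sketch of the per-chunk scoring loop, assuming the reward model loads into vLLM as an ordinary causal LM, that "A"/"B" each map to a single token, and that the separators are laid out exactly as in the format block above. The whitespace around `<>` and the handling of the very first chunk (which has no preceding context) are guesses, and the logprob container varies a bit across vLLM versions:

```python
# Sketch of scoring a longform generation: split into 64-token chunks, judge each
# chunk given all prior text, and average P(A) over the windows.
from math import exp

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Quest-AI/pretrain-rm-baseline-7b"
CHUNK = 64  # tokens per judged chunk

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)
# One generated token, greedy, with top logprobs so we can read off P(A) vs P(B).
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

A_ID = tokenizer.encode("A", add_special_tokens=False)[0]
B_ID = tokenizer.encode("B", add_special_tokens=False)[0]


def _prob(logprob_dict, token_id):
    """Probability of a token from a vLLM logprob dict (Logprob objects or raw floats)."""
    if token_id not in logprob_dict:
        return 0.0
    entry = logprob_dict[token_id]
    return exp(getattr(entry, "logprob", entry))


def sequence_reward(text: str) -> float:
    """Average per-chunk P(A) over sliding 64-token windows."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    prompts = []
    for start in range(0, len(ids), CHUNK):
        context = tokenizer.decode(ids[:start])             # all previous chunks
        chunk = tokenizer.decode(ids[start:start + CHUNK])   # next chunk to judge
        prompts.append(f"{context}\n<>\n{chunk}\n<>\n<>")    # separator layout is an assumption
    if not prompts:
        return 0.0

    outputs = llm.generate(prompts, params)
    scores = []
    for out in outputs:
        lp = out.outputs[0].logprobs[0]  # token_id -> logprob at the judgement position
        pa, pb = _prob(lp, A_ID), _prob(lp, B_ID)
        scores.append(pa / (pa + pb) if (pa + pb) > 0 else 0.5)
    return sum(scores) / len(scores)
```

Averaging P(A) over the windows gives the whole-sequence reward; taking log p(A) - log p(B) per chunk instead recovers the log-ratio form.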