# the generic basics of preference reward modeling

The Bradley-Terry model works like this:

- It's based on a chosen/rejected split
- The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
- The log ratio between preferred and dispreferred can be used as the natural reward signal (written out in the appendix below)

![](https://files.catbox.moe/jmib0e.png)

![](https://files.catbox.moe/tfkuz9.webp)

# what parts are new when it comes to what i am trying to do

For my experimental setup, I train my reward model on 64-token chunks: each judged chunk is the last 64 tokens of a window that slides across the sequence. I then take the average of these judgements across the sequence as the reward for the whole longform generation.

In addition to this, I'm making synthetic preferred/dispreferred data via the Qwen2.5 7B base model at varying temperatures. For future revisions, I want to experiment with intentionally degrading the text in more diverse ways, such as translating it to and from another language.

This creates a preference-modeling baseline that is normalized across positions by default, and that on average always judges the same relative "volume" of information at a time.

The model expects input in this precise format:

```
[Original text from previous 64-token chunks]...
<>
[Next 64-token chunk to evaluate]
<>
<>letter
```

In my setup, the `letter` is A (chosen) or B (rejected). I use vLLM to evaluate the probability distribution over the A/B comparison for every chunk (a rough sketch of this scoring loop is in the appendix below).

# link to the model and dataset

https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b

https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement
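
# appendix: sketches of the pieces above

For reference, the standard Bradley-Terry objective that the first section describes, with $r_\theta$ the reward model, $y_c$/$y_r$ the chosen/rejected samples, and $\sigma$ the logistic function:

$$
P(y_c \succ y_r \mid x) = \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big),
\qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}\big[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\big]
$$

In the A/B-token formulation used here, the per-chunk reward can be read off as the log ratio $\log p(\text{A}) - \log p(\text{B})$, which is the "natural reward signal" from the bullet list.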
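
Next, a minimal sketch of the synthetic pair construction. The post only says the pairs come from the Qwen2.5 7B base model at varying temperatures, so the pairing scheme here (lower temperature treated as preferred, higher temperature as dispreferred), the exact temperature values, and the generation lengths are all assumptions for illustration:

```python
# Illustrative only: builds chosen/rejected pairs by resampling a prompt at two
# temperatures. Which side counts as "preferred" is an assumption, not something
# the post specifies.
from vllm import LLM, SamplingParams

base = LLM(model="Qwen/Qwen2.5-7B")  # base model named in the post

def make_pair(prompt: str) -> dict:
    low = SamplingParams(temperature=0.7, max_tokens=512)   # placeholder values
    high = SamplingParams(temperature=1.6, max_tokens=512)  # placeholder values
    preferred = base.generate([prompt], low)[0].outputs[0].text
    dispreferred = base.generate([prompt], high)[0].outputs[0].text
    return {"prompt": prompt, "chosen": preferred, "rejected": dispreferred}
```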
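
And a minimal sketch of the per-chunk scoring loop, assuming the reward model loads into vLLM as an ordinary causal LM, that "A"/"B" each map to a single token, and that the separators are laid out exactly as in the format block above. The whitespace around `<>` and the handling of the very first chunk (which has no preceding context) are guesses, and the logprob container varies a bit across vLLM versions:

```python
# Sketch of scoring a longform generation: split into 64-token chunks, judge each
# chunk given all prior text, and average P(A) over the windows.
from math import exp

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Quest-AI/pretrain-rm-baseline-7b"
CHUNK = 64  # tokens per judged chunk

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)
# One generated token, greedy, with top logprobs so we can read off P(A) vs P(B).
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

A_ID = tokenizer.encode("A", add_special_tokens=False)[0]
B_ID = tokenizer.encode("B", add_special_tokens=False)[0]


def _prob(logprob_dict, token_id):
    """Probability of a token from a vLLM logprob dict (Logprob objects or raw floats)."""
    if token_id not in logprob_dict:
        return 0.0
    entry = logprob_dict[token_id]
    return exp(getattr(entry, "logprob", entry))


def sequence_reward(text: str) -> float:
    """Average per-chunk P(A) over sliding 64-token windows."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    prompts = []
    for start in range(0, len(ids), CHUNK):
        context = tokenizer.decode(ids[:start])             # all previous chunks
        chunk = tokenizer.decode(ids[start:start + CHUNK])   # next chunk to judge
        prompts.append(f"{context}\n<>\n{chunk}\n<>\n<>")    # separator layout is an assumption
    if not prompts:
        return 0.0

    outputs = llm.generate(prompts, params)
    scores = []
    for out in outputs:
        lp = out.outputs[0].logprobs[0]  # token_id -> logprob at the judgement position
        pa, pb = _prob(lp, A_ID), _prob(lp, B_ID)
        scores.append(pa / (pa + pb) if (pa + pb) > 0 else 0.5)
    return sum(scores) / len(scores)
```

Averaging P(A) over the windows gives the whole-sequence reward; taking log p(A) - log p(B) per chunk instead recovers the log-ratio form.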