
@notpolomarco
Forked from kalomaze/pref_model.md
Created April 5, 2025 10:56

Revisions

  1. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions pref_model.md
    @@ -8,10 +8,10 @@ The Bradley-Terry model works like this:
    ![](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/tfkuz9.webp)

    # what part is new about what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the seq to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.
    # what parts are new when it comes to what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the sequence to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.
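    Roughly, the chunking and averaging looks like this (just a sketch; `score_chunk` is a hypothetical stand-in for one reward-model judgement call, and the stride being equal to the chunk size is an assumption, not a confirmed detail):

    ```python
    # sketch of the sliding-window chunk scoring described above
    from typing import Callable, List

    CHUNK_TOKENS = 64

    def sequence_reward(token_ids: List[int],
                        score_chunk: Callable[[List[int], List[int]], float],
                        stride: int = CHUNK_TOKENS) -> float:
        scores = []
        for end in range(CHUNK_TOKENS, len(token_ids) + 1, stride):
            context = token_ids[:end - CHUNK_TOKENS]     # earlier chunks, given as context
            window = token_ids[end - CHUNK_TOKENS:end]   # the 64-token region being judged
            scores.append(score_chunk(context, window))
        # the whole longform generation gets the mean of the per-chunk judgements
        return sum(scores) / len(scores) if scores else 0.0
    ```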

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.
    In addition to this, I'm making synthetic preferred/unpreferred data via the Qwen2.5 7b base model at varying temperatures. For future revisions, I want to experiment with intentionally making the text worse in more diverse ways, such as translating to and from another language.
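    A sketch of how the rejected side can be sampled with vllm at varying temperatures (the prompt list and the temperature grid here are placeholders, not the actual settings):

    ```python
    # sketch: sampling 'rejected' continuations from the Qwen2.5 7B base model
    # at varying temperatures with vLLM; prompts and temperatures are placeholders
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B")          # base model, not the instruct variant

    prompts = ["..."]                            # prefixes taken from the preferred data
    temperatures = [0.7, 1.0, 1.3, 1.6]          # hotter sampling -> noisier, worse text

    rejected = []
    for temp in temperatures:
        params = SamplingParams(temperature=temp, max_tokens=512)
        for out in llm.generate(prompts, params):
            rejected.append({"temperature": temp, "text": out.outputs[0].text})
    ```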

    This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.

  2. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 7 additions and 1 deletion.
    8 changes: 7 additions & 1 deletion pref_model.md
    @@ -29,4 +29,10 @@ The model expects input in this precise format:

    In my setup, the `letter` is A (chosen) or B (rejected).

    I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.

    # link to the model and dataset

    https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b

    https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement
  3. @kalomaze renamed this gist Mar 18, 2025. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions gistfile1.md → pref_model.md
    @@ -13,6 +13,8 @@ For my experimental setup I am doing chunks of the last 64 tokens in the seq to

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

    This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.

    The model expects input in this precise format:

    ```
  4. @kalomaze revised this gist Mar 18, 2025. 1 changed file with 25 additions and 3 deletions.
    28 changes: 25 additions & 3 deletions gistfile1.md
    @@ -1,8 +1,30 @@
    # Bradley-Terry Model
    # the generic basics of preference reward modeling

    The Bradley-Terry model works like this:
    - It's based on a chosen/rejected split
    - The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
    - The log ratio between preferred and dispreferred can be used as the natural reward signal
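
    In code, the generic pieces look roughly like this (a PyTorch sketch; the function names are illustrative, not from any actual training script):

    ```python
    # sketch of the generic Bradley-Terry pieces
    import torch
    import torch.nn.functional as F

    def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # under Bradley-Terry, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
        # training minimizes the negative log-likelihood of the binary judgements
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    def log_ratio_reward(p_preferred: torch.Tensor, p_dispreferred: torch.Tensor) -> torch.Tensor:
        # the log ratio between preferred and dispreferred probabilities is the
        # natural scalar reward signal
        return torch.log(p_preferred) - torch.log(p_dispreferred)
    ```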

    ![Caption: For my experimental setup I am doing chunks of tokens that it evaluates on my trained reward model](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/jmib0e.png)
    ![](https://files.catbox.moe/tfkuz9.webp)

    # what part is new about what i am trying to do
    For my experimental setup I am doing chunks of the last 64 tokens in the seq to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.

    In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

    The model expects input in this precise format:

    ```
    [Original text from previous 64-token chunks]...
    <<JUDGEMENT_REGION>>
    [Next 64-token chunk to evaluate]
    <</JUDGEMENT_REGION>>
    <<JUDGEMENT>>letter
    ```
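
    A small helper sketching how one evaluation prompt might be assembled in that format (illustrative; the prompt stops right after `<<JUDGEMENT>>` so the model's next token is the letter):

    ```python
    # sketch: building one judgement prompt in the format above
    def build_judgement_prompt(previous_chunks_text: str, chunk_text: str) -> str:
        return (
            f"{previous_chunks_text}\n"
            "<<JUDGEMENT_REGION>>\n"
            f"{chunk_text}\n"
            "<</JUDGEMENT_REGION>>\n"
            "<<JUDGEMENT>>"
        )
    ```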

    In my setup, the `letter` is A (chosen) or B (rejected).

    I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.
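
    Roughly, reading that A/B distribution per chunk looks like this (a sketch, not the actual eval script; it assumes a recent vLLM version where per-position logprobs map token ids to `Logprob` objects, and uses the reward model linked earlier):

    ```python
    # sketch: reading the A/B next-token distribution with vLLM for one chunk prompt
    import math
    from vllm import LLM, SamplingParams

    llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")
    tok = llm.get_tokenizer()
    A_ID = tok.encode("A", add_special_tokens=False)[0]
    B_ID = tok.encode("B", add_special_tokens=False)[0]

    def chunk_log_ratio(prompt: str) -> float:
        params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)
        out = llm.generate([prompt], params)[0]
        # dict of {token_id: Logprob} for the position right after <<JUDGEMENT>>
        dist = out.outputs[0].logprobs[0]
        logp_a = dist[A_ID].logprob if A_ID in dist else -math.inf
        logp_b = dist[B_ID].logprob if B_ID in dist else -math.inf
        return logp_a - logp_b    # log P(A) - log P(B), one judgement per chunk
    ```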
  5. @kalomaze renamed this gist Mar 18, 2025. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  6. @kalomaze created this gist Mar 18, 2025.
    8 changes: 8 additions & 0 deletions gistfile1.txt
    @@ -0,0 +1,8 @@
    # Bradley-Terry Model

    The Bradley-Terry model works like this:
    - It's based on a chosen/rejected split
    - The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
    - The log ratio between preferred and dispreferred can be used as the natural reward signal

    ![Caption: For my experimental setup I am doing chunks of tokens that it evaluates on my trained reward model](https://files.catbox.moe/jmib0e.png)