Revisions
kalomaze revised this gist
Mar 18, 2025. 1 changed file with 3 additions and 3 deletions.
@@ -8,10 +8,10 @@ The Bradley-Terry model works like this:

# what parts are new when it comes to what i am trying to do

For my experimental setup, I take chunks of the last 64 tokens in the sequence to train my reward model, evaluating each chunk on a sliding window. Then I take the average of these judgements across the sequence as the reward for the whole longform generation.

In addition to this, I'm making synthetic preferred/unpreferred data via the Qwen2.5 7b base model at varying temperatures. For future revisions, I want to experiment with intentionally making the text worse in more diverse ways, such as translating to and from another language.

This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.
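A rough sketch of the sliding-window averaging described above; the helper names and the `score_chunk` callback are illustrative assumptions, not code from the gist:

```python
# Illustrative sketch only: split a generation into 64-token chunks and
# average the per-chunk judgements into one sequence-level reward.
# `score_chunk(context_ids, chunk_ids)` is a hypothetical callback that
# returns the reward model's judgement for one chunk given its preceding context.
from typing import Callable, List

CHUNK_SIZE = 64

def sequence_reward(token_ids: List[int],
                    score_chunk: Callable[[List[int], List[int]], float]) -> float:
    scores = []
    for start in range(0, len(token_ids), CHUNK_SIZE):
        context = token_ids[:start]                  # everything before this chunk
        chunk = token_ids[start:start + CHUNK_SIZE]  # the up-to-64-token window being judged
        scores.append(score_chunk(context, chunk))
    return sum(scores) / len(scores) if scores else 0.0
```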
kalomaze revised this gist
Mar 18, 2025. 1 changed file with 7 additions and 1 deletion.
@@ -29,4 +29,10 @@ The model expects input in this precise format:

In my setup, the `letter` is A (chosen) or B (rejected). I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.

# link to the model and dataset

https://huggingface.co/Quest-AI/pretrain-rm-baseline-7b

https://huggingface.co/datasets/Quest-AI/quest-270k-chunked-64-judgement
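A hedged sketch of how the A/B probabilities could be read out with vllm. The model path is taken from the link above; the sampling settings and token handling are assumptions, not the released evaluation code:

```python
# Sketch under assumptions: request the top logprobs of the single token that
# follows "<<JUDGEMENT>>" and compare the probability mass on "A" (chosen)
# vs "B" (rejected). The log ratio then serves as the chunk-level reward.
from math import exp, log
from vllm import LLM, SamplingParams

llm = LLM(model="Quest-AI/pretrain-rm-baseline-7b")

def ab_log_ratio(prompt_ending_in_judgement_tag: str) -> float:
    params = SamplingParams(max_tokens=1, logprobs=20)
    output = llm.generate([prompt_ending_in_judgement_tag], params)[0]
    top = output.outputs[0].logprobs[0]  # token_id -> Logprob for the first generated token
    prob = {"A": 0.0, "B": 0.0}
    for lp in top.values():
        tok = (lp.decoded_token or "").strip()
        if tok in prob:
            prob[tok] += exp(lp.logprob)
    eps = 1e-9  # avoid log(0) if a letter never appears in the top logprobs
    return log((prob["A"] + eps) / (prob["B"] + eps))
```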
kalomaze renamed this gist
Mar 18, 2025. 1 changed file with 2 additions and 0 deletions.
@@ -13,6 +13,8 @@ For my experimental setup I am doing chunks of the last 64 tokens in the seq to

In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

This creates a preference modeling baseline that by default is normalized at different positions, and is always judging the same relative "volume" of information at a time on average.

The model expects input in this precise format:
kalomaze revised this gist
Mar 18, 2025. 1 changed file with 25 additions and 3 deletions.
@@ -1,8 +1,30 @@

# the generic basics of preference reward modeling

The Bradley-Terry model works like this:

- It's based on a chosen/rejected split
- The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
- The log ratio between preferred and dispreferred can be used as the natural reward signal

# what part is new about what i am trying to do

For my experimental setup I am doing chunks of the last 64 tokens in the seq to train my reward model, and evaluating each chunk on a sliding window. Then, I am taking the average of these judgements across the sequence as the reward for the whole longform generation.

In addition to this I am creating synthetic rejected/unpreferred data via the Qwen2.5 7b base model at varying temperatures.

The model expects input in this precise format:

```
[Original text from previous 64-token chunks]...
<<JUDGEMENT_REGION>>
[Next 64-token chunk to evaluate]
<</JUDGEMENT_REGION>>
<<JUDGEMENT>>letter
```

In my setup, the `letter` is A (chosen) or B (rejected). I use vllm to evaluate the probability distribution for the A/B comparison, for every chunk.
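For concreteness, a minimal helper that assembles a prompt in the format above. The tag strings come from the gist; the helper itself and the exact newline placement are illustrative:

```python
# Illustrative helper: wrap the next 64-token chunk in the judgement tags so the
# reward model can predict the letter immediately after "<<JUDGEMENT>>".
def build_judgement_prompt(previous_text: str, chunk_text: str) -> str:
    return (
        f"{previous_text}\n"          # text from the previous 64-token chunks
        "<<JUDGEMENT_REGION>>\n"
        f"{chunk_text}\n"             # the chunk being evaluated
        "<</JUDGEMENT_REGION>>\n"
        "<<JUDGEMENT>>"               # model continues with "A" or "B"
    )
```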
kalomaze renamed this gist
Mar 18, 2025. 1 changed file with 0 additions and 0 deletions.
File renamed without changes.
kalomaze created this gist
Mar 18, 2025.
@@ -0,0 +1,8 @@

# Bradley-Terry Model

The Bradley-Terry model works like this:

- It's based on a chosen/rejected split
- The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
- The log ratio between preferred and dispreferred can be used as the natural reward signal
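For reference, the standard Bradley-Terry pairwise objective those bullets describe, written as a small PyTorch snippet (generic formulation, not code from this gist):

```python
# Standard Bradley-Terry pairwise loss: P(chosen beats rejected) = sigmoid(r_c - r_r),
# so training minimizes -log sigmoid(r_c - r_r); the reward difference (the log-odds
# of the modeled preference) is the natural reward signal.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```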