notpolomarco

the generic basics of preference reward modeling

The Bradley-Terry model works like this:

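It's based on a chosen/rejected split: each training example pairs one preferred sample with one dispreferred sample
The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
Each sample gets a scalar reward r, and the probability that the preferred sample wins is sigmoid(r_preferred - r_dispreferred)
The reward difference is therefore the log ratio (log-odds) of preferred over dispreferred, which is why it can be used as the natural reward signal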
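
A minimal sketch of that relationship in plain Python, with no training loop. The function names `bt_preference_prob` and `bt_loss` are made up for illustration and do not come from any library; the rewards are assumed to be plain floats produced by some reward model.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen sample beats the rejected one.

    This is sigmoid(r_chosen - r_rejected), computed in a way that avoids
    overflow for large margins of either sign.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference.

    -log(sigmoid(margin)) == softplus(-margin); the max/log1p form below
    is the usual numerically stable way to write softplus.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # hypothetical reward scores for one chosen/rejected pair
    p = bt_preference_prob(2.0, 0.5)
    log_odds = math.log(p / (1.0 - p))  # recovers the reward margin
    print(p, log_odds, bt_loss(2.0, 0.5))
```

The loss shrinks as the chosen reward pulls ahead of the rejected one, and the log-odds of the predicted preference recover the reward margin, which is the "log ratio" described above.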

	# train_grpo.py
	#
	# See https://github.com/willccbb/verifiers for ongoing developments
	#
	import re
	import torch
	from datasets import load_dataset, Dataset
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import LoraConfig
	from trl import GRPOConfig, GRPOTrainer

	# the "verifiers" repository is a clean implementation of templated GRPO reinforcement learning training environments
	# below is a generic set of "install from scratch" commands, complete with a DeepSpeed ZeRO-3 (z3) config, that i have been using when i spin up nodes
	# it will run the gsm8k example w/ the default batch size & generation size (8); the 8th GPU is used for vllm generations
	# full finetuning of qwen 14b will run on this configuration too, without LoRA and without CUDA OOM, at least for the gsm8k task's context sizes + generation lengths
	# hyperparameters are controlled by `verifiers/utils/config_utils.py`; i have been preferring aggressive grad clipping (between 0.001 and 0.01) and a low beta (under 0.01)

	# NOTE FEB 27: examples have moved into `verifiers/examples` not `/examples`

	cd /root
	mkdir boom