notpolomarco / grpo_demo.py
Created April 5, 2025 10:59 — forked from willccbb/grpo_demo.py
GRPO Llama-1B
# train_grpo.py
#
# See https://github.com/willccbb/verifiers for ongoing developments
#
import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer
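The gist preview cuts off after the imports. Below is a hedged sketch, not the original script, of how these imports are typically wired together with trl's GRPOTrainer; the model name, dataset mapping, hyperparameter values, and the placeholder reward function are illustrative assumptions rather than the gist's actual contents.

```python
# Hedged sketch of a minimal GRPO setup using the imports above.
# Model id, hyperparameters, and reward logic are assumptions, not the truncated gist.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})  # GRPOTrainer expects a "prompt" column

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # assumed; the gist title only says "Llama-1B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def dummy_reward(completions, **kwargs):
    # Placeholder reward: 1.0 if the completion contains a digit, else 0.0.
    # A real GSM8K reward would parse and check the final numeric answer.
    return [1.0 if re.search(r"\d", c) else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-llama-1b",
    num_generations=8,          # completions sampled per prompt
    max_completion_length=256,
    learning_rate=1e-6,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[dummy_reward],
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```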

notpolomarco / gist:3596fa6520219e11d5e28e1323207562
Created April 5, 2025 10:57 — forked from kalomaze/gist:37c70e022cb1e9428ebb1ee7a4b52275
GRPO Reinforcement Learning - 7b GSM8k on 8xH100 / 8xA100
# the "verifiers" repository is a clean implementation of templated GRPO reinforcement learning training environments
# this is a generic set of "install from scratch" commands complete with a deepspeed z3 config that i have been using when i spin up nodes
# it will run on the gsm8k example w/ default batch size & generation size (8), and the 8th GPU is used for vllm generations
# qwen 14b full finetuning will run on this configuration too without LoRA or CUDA OOM, at least for the gsm8k task's context sizes + generation lengths
# hyperparameters are controlled by `verifiers/utils/config_utils.py`; i have been preferring extreme grad clipping (between 0.001 and 0.01) and low beta (under 0.01); see the config sketch after these commands
# NOTE FEB 27: examples have moved into `verifiers/examples` not `/examples`
cd /root
mkdir boom
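As a concrete illustration of the grad-clipping and beta preferences noted in the comments above, here is a hedged sketch of how those values would look if expressed through trl's GRPOConfig, which the verifiers repo builds on; the exact key names and location in `verifiers/utils/config_utils.py` may differ, and the numbers are just examples inside the stated ranges.

```python
# Hedged sketch only: "extreme grad clipping" and "low beta" expressed as trl GRPOConfig
# fields; the verifiers repo's own config keys may differ.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-gsm8k-7b",
    max_grad_norm=0.01,   # aggressive gradient clipping, within the 0.001-0.01 range above
    beta=0.005,           # KL penalty coefficient kept under 0.01
    num_generations=8,    # matches the default generation size mentioned above
)
```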

notpolomarco / pref_model.md
Created April 5, 2025 10:56 — forked from kalomaze/pref_model.md
pref modeling overview

the generic basics of preference reward modeling

The Bradley-Terry model works like this:

  • It's based on a chosen/rejected split
  • The model is trained on binary judgements that label specific content/samples as either 'preferred' or 'dispreferred'
  • The log-ratio between the preferred and dispreferred probabilities (equivalently, the difference between the model's scalar scores) can be used as a natural reward signal; a minimal loss sketch follows below
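To make the chosen/rejected objective concrete, here is a minimal PyTorch sketch (not part of the original note) of the standard Bradley-Terry pairwise loss, assuming the reward model has already produced one scalar score per sequence; the tensor values in the usage example are made up.

```python
# Minimal Bradley-Terry pairwise loss sketch; scores are assumed to be per-sequence
# scalars produced by a reward model head.
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(s_chosen - s_rejected); maximize its log-likelihood.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Usage with hypothetical scores for a batch of 3 preference pairs:
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
loss = bradley_terry_loss(chosen, rejected)  # lower when chosen scores exceed rejected scores
```

The trained score difference s_chosen - s_rejected is exactly the log-odds of the preference, which is why the model's scalar output can be reused directly as a reward signal for RL fine-tuning.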