Paper: https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf
arXiv: https://arxiv.org/abs/2410.12832
Official SynthLabs blog post: https://www.synthlabs.ai/research/generative-reward-models
Rentry: https://rentry.org/genrm
synthlabs proposes Generative Reward Models (GenRM): instead of training a separate scalar reward head (e.g., Bradley–Terry), they use the LLM itself as the reward model, prompting it to generate a decision token (and optionally a chain of thought) that picks the preferred response. two variants: GenRM, a direct classifier that emits the answer indicator, and CoT-GenRM, which generates reasoning first and then the indicator. trained with STaR-style bootstrapping and a DPO objective (STaR-DPO), the judge matches classical reward models in-distribution and generalizes better out-of-distribution, with the strongest OOD gains coming from the reasoning-based STaR-DPO setup. (arXiv: 2410.12832)
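a minimal sketch of the decision-token idea, assuming a generic Hugging Face causal LM; the judge prompt template and model name below are illustrative placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder judge model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# hypothetical judge prompt ending right before the decision token
JUDGE_TEMPLATE = (
    "Question: {question}\n\n"
    "Response A: {response_a}\n\n"
    "Response B: {response_b}\n\n"
    "Which response is better? Answer with a single letter.\n"
    "Answer: "
)

def genrm_preference(question: str, response_a: str, response_b: str) -> float:
    """Return p(A preferred) read off the judge's decision-token logits."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    # decision tokens: single-token encodings of the indicators "A" and "B"
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[id_a, id_b]], dim=-1)
    return probs[0].item()  # probability mass on the "A" indicator
```

for the CoT-GenRM / STaR-DPO side, the judge would instead generate its reasoning before the indicator; STaR-style bootstrapping keeps only the chains whose final verdict agrees with the labeled preference, and those kept/rejected generations become the pairs for the DPO objective.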