Paper: https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf
arXiv: https://arxiv.org/abs/2410.12832
Official SynthLabs blog post: https://www.synthlabs.ai/research/generative-reward-models
Rentry: https://rentry.org/genrm
synthlabs proposes Generative Reward Models (GenRM): instead of training a separate scalar reward head (e.g., Bradley–Terry), they use the LLM itself as the reward model, prompting it to generate a decision token (and optionally a chain of thought) that picks the preferred response. two variants: GenRM, a direct classifier that emits the answer indicator, and CoT-GenRM, which generates reasoning first and then the indicator. trained with STaR-style bootstrapping and a DPO objective (STaR-DPO), the judge matches classical reward models in-distribution and generalizes better out-of-distribution, with the strongest OOD gains coming from the reasoning-based STaR-DPO setup. (arXiv: 2410.12832)
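a minimal sketch of the decision-token idea, assuming a generic Hugging Face causal LM; the judge prompt template and model name below are illustrative placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder judge model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# hypothetical judge prompt ending right before the decision token
JUDGE_TEMPLATE = (
    "Question: {question}\n\n"
    "Response A: {response_a}\n\n"
    "Response B: {response_b}\n\n"
    "Which response is better? Answer with a single letter.\n"
    "Answer: "
)

def genrm_preference(question: str, response_a: str, response_b: str) -> float:
    """Return p(A preferred) read off the judge's decision-token logits."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    # decision tokens: single-token encodings of the indicators "A" and "B"
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[id_a, id_b]], dim=-1)
    return probs[0].item()  # probability mass on the "A" indicator
```

for the CoT-GenRM / STaR-DPO side, the judge would instead generate its reasoning before the indicator; STaR-style bootstrapping keeps only the chains whose final verdict agrees with the labeled preference, and those kept/rejected generations become the pairs for the DPO objective.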