kemolo / llama_flash.py
Created August 18, 2023 03:39 — forked from Birch-san/llama_flash.py
Loading llama with Flash Attention
from transformers import (
    AutoConfig,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    AutoModelForCausalLM,
    LlamaTokenizerFast,
    PreTrainedModel,
    TextIteratorStreamer,
    StoppingCriteria,
)
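
The preview above cuts off after the import list. As a point of reference only, here is a minimal sketch of loading a LLaMA checkpoint with Flash Attention on a recent Transformers release; the checkpoint name and quantization settings are placeholders, and the attn_implementation="flash_attention_2" flag is a newer convenience than what the original script relies on, so treat this as an illustration rather than the gist's actual method.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder checkpoint; substitute whichever LLaMA weights you have access to.
model_name = "meta-llama/Llama-2-7b-hf"

# Optional 4-bit quantization so the model fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # needs the flash-attn package and a supported GPU
    device_map="auto",
)

prompt = "Flash Attention reduces memory use by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))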

Training open-source LLMs on ChatGPT output is a really bad idea.

Everyone is now racing to create open-source alternatives to compete with GPT-3.5/GPT-4. A common shortcut some teams use to bootstrap their effort is to fine-tune their model on ChatGPT output. I used to think this was a good idea and totally fair play. Actually, I still think it's fair play. OpenAI effectively distilled the entire web into its models, and they themselves say they use (mostly) publicly accessible information. So distilling their model is, in effect, distilling the public open web, and minor Terms of Service details aside, I don't see major ethical problems with that. Right? Well, it's not entirely true, and I realize now that, even setting the ethical considerations aside, using their output is a really bad idea.

First of all, from a purely technical point of view, as @yoavgo explains beautifully in his recent post, there is no way to align LLMs correctly without the RLHF component. I encourage you to read it.
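
To make that distinction concrete, here is a minimal toy sketch contrasting learning from demonstrations with an RL-style objective; the names (imitation_loss, rlhf_style_loss, reward_fn) are illustrative and not any library's API, and the RL side uses plain REINFORCE rather than the PPO-with-KL-penalty setup used in real RLHF. The point is only that the supervised loss can reproduce demonstration tokens, while the RL loss scores the model's own samples with a reward signal.

import torch
import torch.nn.functional as F

# Toy next-token policy: an embedding followed by a linear head over the vocabulary.
vocab_size, hidden = 100, 16
policy = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)

def imitation_loss(demo_ids):
    # Supervised fine-tuning: push the policy to reproduce a human-written
    # demonstration token by token (cross-entropy on the next token).
    logits = policy(demo_ids[:-1])
    return F.cross_entropy(logits, demo_ids[1:])

def rlhf_style_loss(prompt_ids, reward_fn):
    # RL-style objective: sample tokens from the policy itself, score the
    # sample with a reward signal, and reinforce high-reward samples.
    logits = policy(prompt_ids)
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()
    reward = reward_fn(sample)  # scalar judgment of the sampled tokens
    return -(dist.log_prob(sample).sum() * reward)

demo = torch.randint(0, vocab_size, (12,))
print(imitation_loss(demo))
print(rlhf_style_loss(demo[:4], lambda s: 1.0))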

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language-model terminology, "instruction fine-tuning", learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much