ChatGPT appeared like an explosion on all my social media timelines in early December 2022. While I keep up with machine learning as an industry, I wasn't focused so much on this particular corner, and all the screenshots seemed like they came out of nowehre. What was this model? How did the chat prompting work? What was the context of OpenAI doing this work and collecting my prompts for training data?
I decided to do a quick investigation. Here's all the information I've found so far. I'm aggregating and synthesizing it as I go, so it's currently changing pretty frequently.
Source: https://github.com/lvwerra/trlChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response. We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. ChatGPT is fine-tuned from a model in the GPT-3.5 series.
There are some important high-level concepts to understand here. First is that GPT-3, Generative Pre-Trained Transformer is a model developed by OpenAI to perform the task of chat completion. So, given a prompt, it will finish the prompt.
GPT-3 is an autoregressive model, which means it predicts a future outcome of a sequence based on previously-observed outcomes in that sequence.. Or, otherwise stated:
LLMs are generative mathematical models of the statistical distribution of tokens in the vast public corpus of human-generated text, where the tokens in question include words, parts of words, or individual characters including punctuation marks. They are generative because we can sample from them, which means we can ask them questions. But the questions are of the following very specific kind. “Here’s a fragment of text. Tell me how this fragment might go on. According to your model of the statistics of human language, what words are likely to come next?” It is very important to bear in mind that this is what large language models really do. Suppose we give an LLM the prompt “The first person to walk on the Moon was ”, and suppose it responds with “Neil Armstrong”. What are we really asking here? In an important sense, we are not really asking who was the first person to walk on the Moon. What we are really asking the model is the following question: Given the statistical distribution of words in the vast public corpus of (English) text, what words are most likely to follow the sequence “The first person to walk on the Moon was ”? A good reply to this question is “Neil Armstrong.”
Second, we don't know for sure how ChatGPT was developed, but we know its high-level details and the lower-level details behind a similar model, InstructGPT. We know that ChatGPT is an ensemble and multi-stage model: The base model of this is a un unsupervised large language model, GPT-3. This model is then fine-tuned using reinforcement learning, a technique in machine learning that looks to guide an agent (in this case the model) to take the correct action by learning a function that rewards previous correct actions (weighing them more heavily) and disincentivizes incorrect actions.
To inform the reward,
we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
Links:
- ChatGPT Announcement Post
- Language Models are Few-Shot Learners: GPT3 - GPT3 Paper
- Talking about Large Language Models
- InstructGPT Blog Post - Companion Model to ChatGPT
- InstructGPT Model Card
- My Twitter Thread Question on the Model
- Model- reinforcement learning for language models
Summaries:
- Illutstrating Reinforcement Learning from Human Feedback Tutorial
- OpenGPT Jailbreak? - Good Overview video
text-davinci-003 is an improvement on text-davinci-002

Links:
OpenAI as a non-profit was founded by Elon Musk, Infosys - an IT consulting and outsourcing firm , Y Combinator’s Sam Altman (who worked for Paul Graham), former head of LinkedIn Reid Hoffman, Peter Thiel, and Amazon Web Services. It’s headed from the research side by Ilya Sutskever, who worked with AI luminaries Geoffrey Hinton and Andrew Ng before a stint at Google. Altman is currently the head of OpenAI.
In the summer of 2019, Microsoft invested $1 billion dollars into OpenAI, effectively rendering the non-profit as an arm of Microsoft, which meant that OpenAI is now tied both to GitHub and to Azure services.
OpenAPI also separately sells many of its models as APIs.
Links:
- OpenAI in 2019
- OpenAI models as a service - this will likely be offered as an API service by Azure AI
- The model data is recent as of 2021 and does offline inference (aka it doesn't know anything about, for example, the death of Queen Elizabeth 2).
Originally I asked about this on Twitter and didn't come up with much. My Twitter Thread Question on Training Data. But since then, independent researchers have been discussing and verifying the very opaque training data behind the OpenAI models.
A key component of GPT-3x models are Books1 and Books2, both of which are shrouded in mystery. Researchers have attempted to recrate the data using OpenBooks1 and 2.
The model was trained on:
- Books1 - also known as BookCorpus. Here's a paper on BookCorpus, which maintains that it's free books scraped from smashwords.com.
- Books2 - No one knows exactly what this is, people suspect it's libgen
- Common Crawl
- WebText2 - an internet dataset created by scraping URLs extracted from Reddit submissions with a minimum score of 3 as a proxy for quality, deduplicated at the document level with MinHash
- What's in MyAI Paper, Source - Detailed dive into these datasets.
The policy model was evaluated by humans,
InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test. We collaborated closely with the labelers over the course of the project. We had an onboarding process to train labelers on the project, wrote detailed instructions for each task, and answered labeler questions in a shared chat room.
It runs in Azure.
A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node.
We have very little HTTPS traffic, with no need for A/B testing, blue/green, or canaries. Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints. Service “discovery” is limited; we just do a one-time lookup for which pods are participating in MPI at job startup time.
Links:
- My Twitter question on potential use-cases. So far:
- content summarization
- writing marketing copy
- writing SEO copy
- support + chatbot
- bootstrapping a new writing sample from scratch
 
- Code completion
- Semantic search
- Creative writing, aka backstories for characters
- A lot of potential prompts
The likely use-cases point to it being bundled as a chatbot/support bot and sold to corporations and also as a potential bundle into CoPilot.











