Large Language Models (LLMs) work by taking a piece of text (e.g. a user prompt) and predicting the next word, or, in more technical terms, the next token. LLMs have a vocabulary, a dictionary of valid tokens, which they reference during both training and inference (the process of generating text). More on that below.
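To make that predict-append-repeat loop concrete, here is a minimal sketch in Python. The tiny vocabulary and the `toy_model` scoring function are made up purely for illustration (they stand in for a real tokenizer's vocabulary and a real neural network), so treat this as the shape of generation, not an actual implementation:

```python
# Toy sketch of next-token generation. VOCAB and toy_model are hypothetical
# stand-ins for a real tokenizer's vocabulary and a real neural network.

VOCAB = ["the", " cat", " sat", " on", " mat", "."]

def toy_model(tokens):
    """Stand-in for an LLM: returns one raw score (a logit) per vocabulary entry."""
    # A real model computes these scores from the entire input sequence;
    # here we just fabricate them so the loop below runs.
    return [len(t) - 0.5 * i for i, t in enumerate(VOCAB)]

def generate(prompt_tokens, steps):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = toy_model(tokens)        # one score per vocabulary entry
        best = logits.index(max(logits))  # greedy: take the highest score
        tokens.append(VOCAB[best])        # append the chosen token and repeat
    return tokens

print(generate(["the"], steps=3))  # ['the', ' cat', ' cat', ' cat']
```

A real model would push the growing token sequence through billions of parameters at every step; the only point here is the structure of the loop.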
Before we get to training and inference, you first need to understand why we use tokens (sub-words) instead of whole words or individual letters. And before that, a short glossary of some technical terms that the sections below don't explain in depth:

Logits: The raw, unnormalized scores the model outputs for each token in its vocabulary. Higher logits indicate tokens the model considers more likely to come next.

Softmax: A mathematical function that converts logits into a proper probability distribution: values between 0 and 1 that sum to 1.

Entropy: A measure of uncertainty or randomness in a probability distribution. Higher entropy means the model is less certain about which token comes next.
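Since these three terms come up repeatedly, here is a small numeric sketch of how they relate (the logit values are made up purely for illustration):

```python
import math

logits = [2.0, 1.0, 0.1]  # made-up raw scores for a three-token vocabulary

# Softmax: exponentiate and normalize. Subtracting the max first is a
# standard trick for numerical stability and doesn't change the result.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]
print(probs)  # ~[0.66, 0.24, 0.10], each between 0 and 1, summing to 1

# Entropy (in bits): higher means more uncertainty about the next token.
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # ~1.22 bits; a uniform 3-way split would give log2(3) ~ 1.58
```

Note how the largest logit (2.0) becomes the largest probability (~0.66): softmax preserves the ranking while turning raw scores into something you can sample from or measure uncertainty over.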