How LLMs Work: The Definitive, Surprising Truth

How LLMs work is easier to understand when you stop treating them as magic and start treating them as the outcome of a sequence modeling problem that became computationally tractable at massive scale.

At the highest level, a large language model is trained to assign probabilities to sequences of tokens and then predict the next token given the tokens that came before it. That sounds narrow. It is narrow. But it turns out to be enough to learn syntax, many statistical regularities of meaning, long-range dependencies, and reusable patterns for tasks like summarization, translation, coding, and question answering when the model, data, and compute are scaled far enough. The key is not that the objective changed into “understanding.” The key is that the representation, architecture, training recipe, and scale changed dramatically over time.

Why the history matters

A lot of confusion about how LLMs work comes from compressing 75 years of language modeling into one oversimplified idea: “they just predict the next word.” That phrase is directionally true but technically incomplete.

Modern models usually predict the next token, not the next word. A token may be a whole word, part of a word, punctuation, whitespace, or a short character sequence depending on the tokenizer. More importantly, current models do not do this with a small table of counts or a short Markov window. They do it with enormous learned parameter sets, distributed vector representations, attention mechanisms, and multi-stage training pipelines. The continuity is real, but so is the leap in complexity.

The mathematical starting point: probability over sequences

To explain how LLMs work, start with the basic probability fact that a sequence can be decomposed into conditional probabilities. In plain English, the probability of an entire sentence can be written as the product of the probability of each next item given the previous items.
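In symbols, this is the standard chain-rule decomposition of a sequence probability (a general identity, not specific to any one model):

```latex
P(w_1, w_2, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

Each factor on the right is exactly a next-item prediction given everything that came before.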

That framing sits close to the foundation of modern language modeling. Shannon’s 1948 work on communication theory established a quantitative view of information and statistical structure in messages. Later language models operationalized that idea by estimating the probability of text sequences. Bengio’s 2003 neural language model paper states the central idea directly: a statistical model of language can be represented by the conditional probability of the next word given the previous ones, and the joint probability of a sentence can be decomposed into those conditional terms.

This is the first important correction to the usual shorthand. The next-token objective is not a toy trick bolted onto modern AI. It is the operational form of sequence probability modeling. The real questions are how much context the model can use, how it represents symbols, how it shares statistical strength across similar contexts, and how efficiently it can be trained.

Before neural networks: characters, Markov assumptions, and n-grams

Early language modeling methods were grounded in explicit statistics. One practical version was character prediction: given the previous few characters, estimate the next one. Another was word prediction: given the previous one, two, or three words, estimate the next word. These models are often described as Markov-style approximations because they assume the next item depends only on a limited recent history rather than the entire past.
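A word-level bigram model of this kind can be sketched in a few lines. The corpus and the helper name `next_word_prob` here are toy illustrations, not any historical system; real n-gram models were estimated over far larger text with smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus; real n-gram models were trained on vastly more text.
corpus = "the dog ran into the yard and the dog slept".split()

# Count bigrams: how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    """Estimate P(next word | previous word) from raw counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# "dog" is followed by "ran" once and "slept" once in this corpus.
print(next_word_prob("dog", "ran"))    # 0.5
# An unseen context word gets no probability mass at all.
print(next_word_prob("puppy", "ran"))  # 0.0
```

The second call illustrates the brittleness discussed below: a context never observed verbatim yields nothing, no matter how similar it is to one that was observed.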

That assumption was useful because full language distributions are intractable. If vocabulary is large and context length grows, the number of possible sequences explodes. Bengio’s paper made this point starkly, noting that modeling even 10 consecutive words with a vocabulary of 100,000 creates an astronomically large parameter space. Traditional n-gram approaches worked anyway because they used short overlapping fragments observed in training data. They were effective, but they generalized poorly when the exact phrasing had not been seen before.

This is the real limit of early next-word systems. They were not wrong. They were brittle. If the model had seen “the dog ran into the yard” and not “the puppy ran into the yard,” the statistical machinery had no principled way to infer that “dog” and “puppy” belong in similar contexts unless someone manually engineered that structure.

The breakthrough: distributed representations

The major conceptual shift in how LLMs work came when language models stopped treating words as isolated symbols and started treating them as points in a learned vector space.

Bengio and colleagues proposed learning a distributed representation for each word along with the probability function over word sequences. In that setup, similar words can end up with nearby vector representations, so experience with one phrasing helps the model assign reasonable probability to related phrasings it never saw exactly. That is why embeddings matter. They let the model generalize by similarity rather than by exact lookup.

This is where many modern explanations get fuzzy, so it helps to be concrete. An embedding is not a dictionary definition stored inside the model. It is a learned numeric representation. During training, the model adjusts those numbers so that words or subword pieces that behave similarly in context tend to be represented in ways that let later layers make useful predictions. Meaning is not inserted manually. It is induced from prediction pressure across many examples.
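The generalize-by-similarity idea can be made concrete with cosine similarity over vectors. The vectors below are hand-made stand-ins purely for illustration; real embeddings have hundreds or thousands of dimensions and are learned, not written by hand.

```python
import math

# Hand-made toy vectors standing in for learned embeddings.
embeddings = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "yard":  [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Nearby vectors let experience with "dog" carry over to "puppy".
print(cosine(embeddings["dog"], embeddings["puppy"]) >
      cosine(embeddings["dog"], embeddings["yard"]))  # True
```

Nothing in the model stores “puppy is a young dog”; the geometry simply places the two words near each other because they behave similarly under prediction pressure.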

That is also why weights matter. The learned weights in a neural network are the parameters that transform one representation into another. Some weights live in embedding matrices. Others live in attention projections and feed-forward layers. They do not “store facts” in a clean human-readable way. They encode statistical regularities that help reduce prediction error during training.

Why tokens replaced words

If you want a practical understanding of how LLMs work, tokenization is unavoidable.

Word-level vocabularies have a hard problem: open vocabulary. New words, rare words, names, misspellings, code identifiers, and morphological variations constantly appear. A fixed pure word vocabulary either explodes in size or fails badly on unseen forms.

Subword tokenization solved much of that. In 2016, Sennrich, Haddow, and Birch showed that neural systems could operate on subword units and use byte pair encoding to represent an open vocabulary with a fixed-size vocabulary of variable-length character sequences. That let models handle rare and unseen words far more effectively than rigid word-level schemes.
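A single byte pair encoding merge step can be sketched as follows. This is a toy on three words; a real tokenizer repeats the merge loop many thousands of times over a huge corpus and stores the resulting merge table.

```python
from collections import Counter

# Each word starts as a list of characters.
words = [list("lower"), list("lowest"), list("newer")]

def most_frequent_pair(tokenized_words):
    """Find the adjacent symbol pair that occurs most often."""
    pairs = Counter()
    for word in tokenized_words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(tokenized_words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for word in tokenized_words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(words)  # ("w", "e") in this toy corpus
words = merge_pair(words, pair)
```

After enough merges, frequent words become single tokens while rare or novel strings fall back to smaller pieces, which is exactly the open-vocabulary behavior described above.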

This is why a modern model may split “unbelievability” into multiple tokens or break an unfamiliar product name into pieces. It is not a bug. It is a design choice that makes vocabulary management tractable and improves generalization.

It also explains a common misconception. People often say LLMs predict the next word because that sounds intuitive. In practice, they usually predict the next token. That difference matters because token boundaries affect context length, efficiency, multilingual behavior, cost, and how well a model can handle rare strings such as URLs, code, and specialized terminology. For anyone building or buying AI systems, this is one of the first operational realities to understand.

The transformer changed the scaling ceiling

A second major shift in how LLMs work came from architecture.

Before transformers, recurrent architectures were widely used for sequence modeling. They could, in principle, process long sequences, but training them efficiently at scale was difficult. The transformer paper, “Attention Is All You Need,” proposed an architecture based solely on attention, removing recurrence and convolution from the core sequence model. The authors showed that this design was more parallelizable and could train more efficiently while achieving strong translation results.

Why did that matter so much?

Because attention gives the model a direct way to compute how strongly each token in the current context should influence the representation of another token. Instead of compressing everything through a narrow recurrent state, the model can compare tokens across a window and learn which previous pieces matter most for the current prediction.

That does not mean attention is “understanding.” It means the model has a flexible weighting mechanism over context. If the current token depends on a subject mentioned 20 tokens earlier, or a function definition 200 tokens earlier, attention creates a path for that dependency to matter. Modern transformers stack many such layers, so the model progressively builds richer representations of text structure, syntax, reference, and task cues.

What attention is actually doing

Attention is frequently explained with metaphors, but the mechanics are straightforward enough at a high level.

Each token is turned into a vector. The model projects those vectors into queries, keys, and values. Query-key similarity determines how much one token attends to another, and that weighting is used to mix value vectors into a new representation. Multi-head attention repeats this process in several learned subspaces, allowing different heads to capture different relationships.
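The query-key-value mixing above can be written down directly. This is a minimal single-head sketch with random stand-in vectors; a real transformer adds learned projection matrices, masking, multiple heads, and many stacked layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weight value vectors by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                     # mix values by weight

# Three tokens, four dimensions: random stand-ins for projected embeddings.
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)

# Each row of weights is a probability distribution over the three tokens.
print(weights.sum(axis=-1))  # [1. 1. 1.]
```

The `weights` matrix is the learned relevance mechanism in numeric form: row i says how much token i draws on each other token when building its new representation.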

For a non-specialist, the important point is not the linear algebra details. It is that attention is a learned relevance mechanism. It lets the model decide that a pronoun should connect to an earlier noun, that a closing bracket should relate to an opening bracket, that the current sentence should reflect a style instruction from the prompt, or that the next code token should align with a variable name defined much earlier. That is a large part of why next-token prediction started producing behavior that looked qualitatively more capable than classic autocomplete.

From embeddings to layers to logits

A practical mental model of how LLMs work looks like this:

First, raw text is broken into tokens.

Second, each token is mapped to an embedding vector, usually combined with positional information so the model knows order still matters.

Third, the sequence moves through many transformer layers. Each layer uses attention to mix information across positions and feed-forward blocks to transform representations further.

Fourth, the final representation at the current position is projected into a score over the vocabulary. Those scores, often called logits, become probabilities after normalization. The model then predicts the next token or samples from that distribution.
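The fourth step can be sketched with a toy vocabulary. The logit values here are made up for illustration; a real model produces one score per entry in a vocabulary of tens of thousands of tokens.

```python
import math

# Hypothetical logits over a toy five-token vocabulary.
vocab = ["the", "dog", "ran", "yard", "."]
logits = [2.1, 0.3, 1.4, -0.5, 0.0]

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy pick: "the"
```

Greedy selection is only one decoding choice; sampling from `probs` instead produces the more varied outputs familiar from chat interfaces.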

That entire pipeline is differentiable, which means the system can be trained end-to-end by comparing predicted next-token probabilities with the actual next token and adjusting weights to reduce error. The training signal is simple. The machinery that makes it effective is not.

What the weights are learning

The phrase “billions of parameters” gets repeated so often that it stops meaning anything. In practice, the weights are the model.

When people ask how LLMs work, they are often really asking what those parameters are doing. The honest answer is that they are learning a compressed statistical program for mapping contexts to next-token distributions. Some parameters help represent word pieces. Some help identify syntactic patterns. Some help route information through attention. Some appear to support task structure, factual associations, or common reasoning patterns. But there is no clean table where one neuron equals one concept. The learned system is distributed.

This matters operationally. You should not think of a model as a database with perfect retrieval. Training does not guarantee exact storage of every source fact, and next-token generation does not guarantee faithful recall. What training does produce is a parameterized function that is often very good at modeling linguistic and task regularities from its data distribution.

Why scaling changed behavior

The next major step in how LLMs work was not a brand-new objective. It was scale.

Kaplan and colleagues showed that transformer language model performance on cross-entropy loss follows empirical power-law scaling with model size, dataset size, and compute. In practical terms, that meant performance improved predictably as systems got larger and were trained longer on more data. Later, Hoffmann and colleagues showed that many large models were undertrained relative to their size and that compute-optimal training required scaling model size and training tokens together more carefully.
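As a rough sketch of the form those empirical fits take, loss as a function of parameter count N behaves approximately as a power law (the constants N_c and the exponent depend on the setup and are fit from experiments):

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```

with analogous power laws reported for dataset size and training compute. The practical consequence is predictability: you can estimate the return on a larger training run before paying for it.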

This was a crucial result because it reframed progress. Instead of asking whether next-token prediction was too weak as an objective, the field increasingly asked how far that objective could go with better architecture, more data, more compute, and better training ratios.

Brown and colleagues then demonstrated that scaling an autoregressive language model to 175 billion parameters substantially improved few-shot performance across many NLP tasks without gradient updates at inference time. That did not prove models reason like humans. It did show that broad task competence could emerge from next-token pretraining plus in-context examples at sufficient scale.

This is one of the most important points in the entire topic. The surprising behavior did not appear because researchers changed the system from prediction to reasoning. It appeared because prediction at scale forced the model to internalize enough structure that prompt-based task performance became possible.

Why next-token prediction can look like reasoning

This is where explanations often drift into either hype or dismissal.

The hype version says the model is obviously thinking.

The dismissive version says it is only autocomplete, so none of the behavior matters.

Both are incomplete.

A better view is that next-token prediction on a huge corpus rewards the model for learning many latent structures that are useful across language tasks: syntax, discourse flow, genre conventions, question-answer patterns, code structure, chain-like procedural templates, and statistical associations between concepts. If you can compress those patterns well enough, your next-token distribution starts to support behaviors that look like summarizing, translating, classifying, drafting, or stepwise problem solving.

That still does not justify strong claims about consciousness, intent, or human-like understanding. But it does explain why “just prediction” is not a rebuttal. Prediction is the training objective. It is not a full description of the capabilities that can emerge from optimizing that objective at scale.

Pretraining is only the first stage

Many people who want to understand how LLMs work stop at pretraining. That is not enough for modern deployed systems.

Pretraining usually teaches the base model to predict tokens over large corpora. That creates a broadly capable but not necessarily helpful or well-behaved system. A raw base model may complete text fluently while still being poor at following instructions, staying on task, or avoiding undesirable outputs.

Post-training changed that. Instruct-style fine-tuning and reinforcement learning from human feedback were used to push models toward more helpful and aligned behavior. Ouyang and colleagues showed that a 1.3B parameter InstructGPT model was preferred by human evaluators over the much larger 175B GPT-3 on their prompt distribution, despite the huge size difference. That result is a reminder that capability and usability are not the same thing. Training objective and post-training target matter.

For operators and product teams, this is not trivia. It explains why two models with similar raw pretraining pedigrees can feel very different in practice. The post-training recipe strongly shapes obedience to instructions, conversational behavior, refusal patterns, and output style.

What a modern training pipeline usually includes

A simplified modern pipeline for how LLMs work looks like this:

Large text corpora are collected, filtered, deduplicated, and tokenized.

A transformer model is initialized with billions of parameters.

The model is pretrained on next-token prediction over massive token sequences.

The resulting base model may then be instruction-tuned on curated examples.

A further stage may use preference data or related alignment methods to make outputs more useful, safer, and more consistent with user intent.

The important thing to notice is that the public-facing chatbot experience is the product of all these stages together. When someone says a model “learned from the internet,” that is only a partial description. Data curation, tokenization, architecture, optimizer choices, scaling decisions, and post-training objectives all materially shape the end result.

The role of context windows

Another practical point in how LLMs work is context length. The model does not have unrestricted access to everything ever seen. At inference time, it operates over a bounded context window of tokens. What it can use directly depends on what is inside that window.

That is why prompting matters. Instructions, examples, retrieved documents, conversation history, and formatting all change the token sequence the model conditions on. The model is not pulling answers from an unlimited hidden store each time. It is computing the next-token distribution from the current context plus what its weights learned during training.
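A common operational pattern is budgeting the window explicitly. The names `CONTEXT_WINDOW` and `build_context` below are hypothetical, and the window size is far smaller than any real model's, just to make the truncation visible.

```python
# Hypothetical token budget; real windows range from a few thousand
# to hundreds of thousands of tokens depending on the model.
CONTEXT_WINDOW = 8

def build_context(system_tokens, history_tokens, window=CONTEXT_WINDOW):
    """Always keep instructions, then fill the rest with recent history."""
    budget = window - len(system_tokens)
    return system_tokens + history_tokens[-budget:]

system = ["[sys]", "be", "brief"]
history = ["hi", "there", "how", "do", "llms", "work", "?"]
context = build_context(system, history)
# The oldest history tokens fall out; the instructions survive.
print(context)
```

Whatever falls outside `context` simply does not exist for the model on this call, which is why long conversations can silently lose early details.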

This distinction matters for product design. If you need factual precision on a bounded source set, retrieval and grounding are often more reliable than hoping the model’s pretrained parameters will reproduce the needed details exactly.

Limits that follow from the mechanism

Understanding how LLMs work also means understanding what the mechanism does not guarantee.

Next-token training does not guarantee truth. A token sequence can be probable without being correct.

Distributed weights do not guarantee transparent reasoning. The model can produce a strong answer without exposing the internal basis for it.

Scale does not eliminate data bias, benchmark contamination risk, or hallucination risk.

Instruction tuning improves behavior but does not solve all alignment or reliability problems.

These are not side notes. They follow directly from the training setup. A model trained to continue text plausibly will often be useful, but usefulness is not the same as verified correctness. For high-stakes applications, external validation, constrained generation, retrieval, and careful evaluation remain necessary.

The cleanest way to think about how LLMs work

If you need one concise mental model, use this:

An LLM is a large parameterized function that converts token context into a probability distribution over the next token.

It became powerful because researchers improved four things at once:

representation, through embeddings and subword tokenization

architecture, through transformers and attention

scale, through more data, model capacity, and compute

training, through better pretraining ratios and post-training alignment

That is the progression from early mathematical language models to current systems. The through-line is sequence probability. The step change came from learning richer representations and context-sensitive weighting at unprecedented scale.
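That mental model can be written as a short loop. The lookup table below is a toy stand-in for the trained function; a real LLM computes each next-token distribution with the full transformer stack rather than a fixed table.

```python
import random

# A toy "model": the last token maps to a next-token distribution.
table = {
    "<s>":  {"the": 0.7, "a": 0.3},
    "the":  {"dog": 0.6, "yard": 0.4},
    "a":    {"dog": 1.0},
    "dog":  {"ran": 0.8, ".": 0.2},
    "yard": {".": 1.0},
    "ran":  {".": 1.0},
    ".":    {},
}

def generate(start="<s>", max_tokens=5, seed=0):
    """Autoregressive loop: sample a token, append it, condition on it."""
    rng = random.Random(seed)
    token, out = start, []
    for _ in range(max_tokens):
        dist = table.get(token, {})
        if not dist:
            break  # no continuation defined: stop generating
        tokens, weights = zip(*dist.items())
        token = rng.choices(tokens, weights=weights)[0]
        out.append(token)
    return out

print(generate())
```

Everything a deployed system does, from chat replies to code completion, is some elaboration of this loop: condition on context, produce a distribution, emit a token, repeat.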

Why this matters to anyone evaluating AI

If you are evaluating AI products, the practical lesson is simple: do not ask whether a system is “just predictive text.” Ask what kind of predictive system it is.

How LLMs work in production depends on tokenization, context handling, architecture, scale, post-training, and whether the product adds retrieval, tools, or workflow constraints around the model. Two systems that both use next-token prediction can behave very differently depending on these choices.

That is why serious evaluation should focus on grounded performance, failure modes, cost, latency, controllability, and fit for the task. The right question is not whether next-token prediction sounds humble. The right question is how far this particular implementation pushes the basic idea, and where its limits still show.

Where to go deeper

For readers who want the original technical sources, the most useful path is chronological.

Start with Shannon for the information-theoretic frame.

Read Bengio for the neural language-model turn and distributed representations.

Read Sennrich for why subword tokenization became so important.

Read Vaswani for the transformer shift.

Read Kaplan and Hoffmann for scaling and compute-optimal training.

Read Brown for the scaling-era capabilities jump.

Read Ouyang for why post-training changed the usability of modern assistants.

That sequence gives you a grounded picture of how LLMs work without hype and without pretending the field started in 2022.

FAQ

What is the simplest explanation of how LLMs work?

The simplest correct explanation is that an LLM takes a sequence of tokens, computes probabilities for what token should come next, and repeats that process token by token. The hard part is that it does this using learned vector representations, transformer layers, attention, and very large parameter sets trained on massive corpora.

Do LLMs predict words or tokens?

Modern LLMs usually predict tokens, not whole words. A token can be a word, part of a word, punctuation mark, or other text fragment depending on the tokenizer. Subword tokenization became important because it handles rare and unseen words more effectively than rigid word-level vocabularies.

Why does next-token prediction produce useful behavior?

Because learning to predict the next token across very large datasets forces the model to absorb many reusable structures in language, including syntax, style, discourse, and task patterns. At sufficient scale, those learned regularities support behaviors like summarization, translation, and few-shot task completion.

What do attention and transformer layers add?

Attention gives the model a flexible way to weight which earlier tokens matter for the current prediction. Transformer layers stack that mechanism repeatedly, making it easier to model long-range dependencies and train efficiently at scale.

Are LLMs just autocomplete?

That description is too shallow to be useful. The training objective is autoregressive next-token prediction, but the resulting system is a large neural sequence model with learned embeddings, attention, and post-training. Calling it autocomplete is directionally true and still incomplete.

Does a larger model automatically mean a better assistant?

No. Scaling improves many capabilities, but post-training matters a lot. InstructGPT showed that a much smaller post-trained model could be preferred by humans over a much larger raw pretrained model for instruction following.
