How LLMs Work: Essential Guide for Builders

Lesson

How LLMs work for builders and operators

Learning Objectives

  • Explain how LLMs work using a practical builder’s mental model.
  • Define tokens and context windows in operational terms.
  • Describe next-token prediction without drifting into hype or false anthropomorphism.
  • Identify why hallucinations happen and why grounding improves reliability.
  • Understand why outputs vary across runs and prompts.
  • Apply that mental model to prompting, API usage, and workflow design.

Prerequisites

No deep machine learning background is required. You will get more value from this lesson if you already understand basic APIs, request and response patterns, and the difference between application logic and model output. Some familiarity with prompts is helpful but not necessary.


How LLMs work is easier to understand when you stop treating large language models as mysterious thinking machines and start treating them as probabilistic systems that turn tokenized context into likely next-token choices. That description sounds narrow, but it is the most useful mental model for builders. It explains why these systems can be impressively fluent, why they can fail confidently, why missing context causes fabrication, why phrasing changes outputs, and why good implementation depends on grounding and validation rather than prompt optimism alone.

A builder does not need a full graduate course in machine learning to work productively with LLMs. But a builder does need the right conceptual frame. If your mental model is “the model understands like a person,” you will make bad design decisions. If your mental model is “the model predicts likely continuations from token context, then the surrounding application adds grounding, constraints, and checks,” you can design systems that are much more reliable in practice. That is the core lesson of this article.

How LLMs work starts with tokens

The first practical fact about how LLMs work is that models do not read raw text the way humans do. They process tokens. OpenAI defines tokens as the building blocks of text that models process, noting that tokens can be as short as a single character or as long as a full word depending on language and context. OpenAI also gives a useful English rule of thumb: roughly one token is about four characters or about three quarters of a word.

That matters because everything operational runs through tokens. Cost is measured in tokens. Context limits are measured in tokens. Output limits are measured in tokens. Latency often increases with token volume. If you send a long prompt, long history, many retrieval snippets, or verbose tool schemas, you are not just sending “more text.” You are consuming more of the model’s working budget.

For builders, the important takeaway is simple: words are what users see, but tokens are what the model actually budgets against. That is why two prompts that look similar in characters can differ meaningfully in token count, especially across formatting, punctuation, code, JSON, or non-English text.

Here is a practical token-count example using Anthropic’s documented token counting API pattern:

import anthropic

client = anthropic.Anthropic()

response = client.messages.count_tokens(
    model="claude-opus-4-7",
    system="You are a scientist",
    messages=[{"role": "user", "content": "Hello, Claude"}],
)

print(response.json())  # Example from docs returns {"input_tokens": 14}

This is useful because it lets you estimate prompt size before inference rather than discovering token growth after deployment. Anthropic documents this exact pattern and shows an example that returns 14 input tokens for a short system plus user message.
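A light pre-flight check built on that same call can catch oversized prompts before they reach production. The budget value and helper name below are illustrative choices, not anything from the documentation:

import anthropic

client = anthropic.Anthropic()

PROMPT_TOKEN_BUDGET = 2000  # illustrative per-request input budget, not an API limit

def fits_budget(system, messages, model="claude-opus-4-7"):
    # Count input tokens before inference, then compare against our own budget.
    count = client.messages.count_tokens(model=model, system=system, messages=messages)
    return count.input_tokens <= PROMPT_TOKEN_BUDGET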

How LLMs work depends on a bounded context window

The next practical piece of how LLMs work is the context window. OpenAI describes the context window as the maximum number of tokens that can be used in a single request, including input, output, and, for some models, reasoning tokens. Anthropic similarly describes the context window as all the text a model can reference when generating a response, calling it a kind of working memory rather than the full corpus seen during training.

That distinction is critical. Builders often confuse training knowledge with live working context. The model may have been trained on massive text corpora, but at inference time it can only directly use what is in the current context window. If the needed fact, instruction, example, or schema is not in that effective working context, the model cannot look it up the way an application would query a database. It has to continue based on what is present in context plus whatever patterns it learned during training.

This is one reason missing context causes fabrication. Suppose you ask for a customer-specific answer without providing account data. Or you ask for a policy-based response without supplying the current policy. The model may still produce a fluent answer because fluent continuation is exactly what it is trained to do. But fluency is not evidence that the answer was grounded in the right source.

Anthropic’s context-window documentation also makes an important operational point: more context is not automatically better. As token count grows, accuracy and recall can degrade, a degradation Anthropic refers to as context rot. In practice, that means large windows are useful, but sending everything is usually worse than curating the right subset.
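Here is a minimal sketch of that curation step, assuming you already have relevance scores for your candidate snippets and using the rough four-characters-per-token rule of thumb from earlier (both are assumptions, not part of any API):

def select_context(snippets, scores, token_budget):
    # Keep the highest-relevance snippets that fit in the budget,
    # instead of sending everything and relying on a large window.
    chosen = []
    used = 0
    ranked = sorted(zip(snippets, scores), key=lambda pair: pair[1], reverse=True)
    for snippet, _score in ranked:
        estimated = len(snippet) // 4  # rough rule of thumb: ~4 characters per token
        if used + estimated > token_budget:
            continue
        chosen.append(snippet)
        used += estimated
    return chosen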

Next-token prediction is the core mechanism

At the heart of how LLMs work is next-token prediction. Kyle Beyke’s article on the topic states the clean mental model directly: an LLM is a large parameterized function that converts token context into a probability distribution over the next token. The transformer paper, “Attention Is All You Need,” explains the architecture shift that made this style of sequence modeling far more powerful and parallelizable by relying on attention mechanisms instead of recurrence.

The important builder-level point is not every architectural detail. The important point is that the model repeatedly chooses what token is likely to come next given the tokens already present. Then it adds that token to the running sequence and repeats the process. That loop is why the system can produce paragraphs, code, summaries, classifications, and JSON. It is still one token at a time under the hood.
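That loop can be sketched in a few lines of illustrative Python. The model, tokenize, and detokenize callables below are stand-ins for machinery you would never write by hand; the point is the shape of the loop, not the implementation:

import random

def generate(prompt, model, tokenize, detokenize, max_new_tokens=50):
    # model(tokens) returns a mapping of candidate next tokens to probabilities.
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)
        candidates, weights = zip(*probs.items())
        next_token = random.choices(candidates, weights=weights, k=1)[0]  # sample one token
        tokens.append(next_token)  # append it and predict again from the longer context
    return detokenize(tokens)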

This sounds underwhelming until you notice what scale changes. The GPT-3 paper showed that large autoregressive language models could perform strong zero-shot, one-shot, and few-shot task behavior from text conditioning alone, without gradient updates at inference time. In other words, the same basic predictive mechanism became capable enough that examples and instructions inside the prompt could shape behavior across many tasks.

That is why builders should not dismiss LLMs as “just autocomplete” in a trivial sense, but they also should not overstate what is happening. The system is not searching a symbolic truth table. It is not guaranteed to reason transparently. It is using learned statistical structure to predict plausible continuations from context. Sometimes that produces very strong task performance. Sometimes it produces elegant nonsense.

Why fluency is not understanding

One of the most important lessons in how LLMs work is that fluency is not understanding.

LLMs are trained to produce likely continuations that fit patterns in data. That can look like deep understanding because human language itself carries structure, constraints, relationships, and common task formats. The model learns many of those regularities well enough to answer questions, summarize documents, write code, and follow examples. But none of that guarantees that the model has stable human-style understanding of truth, reference, intent, or causality.

This is why a model can sound authoritative while being wrong. Next-token training does not guarantee truth. Kyle Beyke’s article puts this directly: a token sequence can be probable without being correct. Anthropic’s hallucination guidance makes the same operational point from another angle by recommending techniques like allowing the model to say “I don’t know,” requiring quotes, verifying with citations, and restricting it to provided documents for factual work.

For builders and operators, this is not an abstract philosophical issue. It changes product design. If a model is fluent but not inherently grounded, you do not ship it as an oracle. You wrap it with retrieval, tool access, structured outputs, policy constraints, and evaluation. You treat it as a capable probabilistic component, not a final authority.

Why hallucinations happen

Hallucinations happen because the model’s job is to continue the sequence plausibly, not to refuse every time the evidence is incomplete.

If a prompt strongly implies that an answer should exist, and the model lacks sufficient grounding, it may still generate one. That is not a moral failure or a strange bug. It is a consequence of the mechanism. The system has learned that questions are often followed by answers, citations are often followed by URLs, biographies are often followed by dates, and structured requests are often followed by filled-in fields. When the needed fact is absent from context, the model may still complete the pattern.

Missing context is one major cause. Ambiguous prompts are another. Overly broad tasks are another. Long contexts with poor relevance selection can also contribute, because the model may have enough material to remain fluent but not enough well-prioritized material to stay accurate. Anthropic’s docs recommend direct quotes, citations, and explicit restrictions to provided documents specifically because those patterns reduce unsupported generation. They also state clearly that these methods significantly reduce hallucinations but do not eliminate them entirely.

A practical example:

Prompt A: “Write a reply explaining whether this customer qualifies for a refund.”

If you do not include the company’s refund policy, order status, and account details, the model may still write a polished answer. But it may invent policy logic or assume facts not in evidence.

Prompt B: “Using only the policy excerpt and order data below, determine refund eligibility. If the evidence is insufficient, say ‘insufficient information.’”

The second prompt works better not because of magic wording alone, but because it changes the task and the available grounding.
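A minimal sketch of how Prompt B might be assembled in application code follows. The function and variable names are illustrative, not a documented API:

def build_refund_prompt(policy_excerpt, order_data):
    # Ground the request in the policy and order data, and give the model
    # an explicit way to decline instead of forcing an answer.
    return (
        "Using only the policy excerpt and order data below, determine refund eligibility. "
        "If the evidence is insufficient, say 'insufficient information.'\n\n"
        f"Policy excerpt:\n{policy_excerpt}\n\n"
        f"Order data:\n{order_data}"
    )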

Why models need grounding

Grounding means tying the model’s output to data, documents, tools, or system state that are relevant to the task at hand.

OpenAI’s function-calling guide states that function calling gives models access to external systems and data outside their training data. That is one of the clearest practical statements of why grounding matters: training data is not enough for many business tasks, and the model often needs current or system-specific information to respond correctly.

Grounding can take several forms:

  • retrieval from documents
  • database lookups
  • tool calls to internal systems
  • structured inputs from your application
  • citations or quotes from source material
  • schema-constrained outputs that make unsupported answers easier to reject

Anthropic’s hallucination guidance recommends grounding with direct quotes and external-knowledge restriction when factual precision matters. OpenAI’s function-calling guidance shows the same broader pattern from the application side: the model suggests a tool call, the application executes it, and the final answer is generated with access to tool output.
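In application code, that loop often reduces to something like the following sketch, where llm and tools are stand-ins for your model client and your registered tool functions:

def answer_with_grounding(question, llm, tools):
    # Step 1: the model proposes either a direct answer or a tool call.
    proposal = llm(question)
    if proposal.get("tool_call"):
        name = proposal["tool_call"]["name"]
        args = proposal["tool_call"]["arguments"]
        # Step 2: the application, not the model, executes the tool.
        tool_output = tools[name](**args)
        # Step 3: the final answer is generated with the tool output in context.
        return llm(question, tool_result=tool_output)
    return proposal["answer"]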

For builders, the lesson is simple: if the answer should depend on current data or internal truth, do not ask the base model to guess. Feed the relevant context or let it call a tool. That is how you move from “plausible text generator” toward “useful system component.”

Why outputs vary

Another important part of how LLMs work is that outputs vary. OpenAI’s text generation guide explicitly states that content generated from a model is non-deterministic. OpenAI’s API reference also explains that temperature changes sampling behavior, with higher temperatures making output more random and lower temperatures making it more focused and deterministic, while still noting that determinism is not guaranteed even when a seed is used.

That means there is no single cause of variation. Outputs vary because:

  • the system samples from probability distributions over possible next tokens
  • temperature and top_p change how broad or narrow that sampling is
  • small prompt changes can shift probability mass across candidate continuations
  • model updates and backend changes can affect results
  • long conversational context changes what tokens are most likely next

This is normal behavior, not a defect. But it has practical consequences. If your workflow needs exact structure every time, free-form prompting is often not enough. Anthropic’s consistency guide recommends structured outputs when guaranteed schema compliance is needed. That is a control-layer solution to a probabilistic generation problem.

Here is a simple prompt-variation example:

Prompt 1: “Summarize this meeting.”

Prompt 2: “Summarize this meeting in 5 bullets, each under 12 words, covering decisions, risks, and next steps only.”

Both prompts ask for a summary, but the second prompt narrows the target distribution. It makes length, structure, and relevance constraints more explicit. Anthropic’s prompt engineering guide emphasizes clarity and examples for exactly this reason.

Same prompt, different outputs

To see output variability operationally, imagine this simple API call pattern:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input="Write a one-sentence explanation of what an LLM does."
)

print(response.output_text)

This exact Python example pattern is documented in OpenAI’s text generation guide. In practice, repeated runs can differ in wording, emphasis, or level of abstraction because generation is non-deterministic by default.

One run might say, “An LLM predicts likely next tokens from context to generate useful text.”

Another might say, “A large language model uses patterns learned from data to continue text one token at a time.”

Both could be acceptable. Neither proves that the model has a single stable internal paraphrase; it simply shows that multiple high-probability continuations fit the prompt.
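One way to see this for yourself is to repeat the same request a few times and compare, as in this sketch built on the example above. It assumes the same model string and uses the temperature sampling control described earlier:

from openai import OpenAI

client = OpenAI()

outputs = []
for _ in range(3):
    # Identical requests; wording can still differ from run to run.
    response = client.responses.create(
        model="gpt-5.4",
        input="Write a one-sentence explanation of what an LLM does.",
        temperature=1.0,  # higher values broaden sampling, lower values narrow it
    )
    outputs.append(response.output_text)

for text in outputs:
    print(text)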

Why phrasing changes results

Prompt phrasing changes results because prompts are part of the token context the model conditions on. A different phrase changes the token sequence. A different token sequence changes the next-token probabilities. That is the underlying reason prompt engineering matters.

This does not mean prompt engineering is mystical. It means language is the interface to a probabilistic system. Clear instructions, good examples, precise constraints, and explicit output formats help because they shape the model’s probability landscape more effectively than vague requests. Anthropic’s prompting guide frames prompt engineering as writing effective instructions, and OpenAI’s text guide describes it as a mix of art and science because content generation is non-deterministic even when best practices help.

A useful mental model is this: prompting does not install knowledge into the model at runtime. It steers a pre-trained system toward one region of possible behavior rather than another. That is why prompt quality matters, but it is also why prompting alone is not enough for high-stakes accuracy.

A builder’s implementation pattern

For most real products, a good implementation pattern looks like this:

  1. Keep the prompt narrow.
  2. Add the minimum relevant context.
  3. Use retrieval or tool calls for facts outside the prompt.
  4. Constrain outputs with schemas when possible.
  5. Validate outputs before taking action.
  6. Log failures and edge cases for prompt and system improvement.

This pattern fits the documented direction from both OpenAI and Anthropic. OpenAI’s function-calling guide describes the application-controlled tool loop; Anthropic’s guides emphasize clarity, grounding, and consistency controls; OpenAI’s text generation guide points builders toward structured outputs and reusable prompts when consistency matters.
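Here is a hedged end-to-end sketch of steps 3 through 5, assuming hypothetical retrieve and call_llm helpers supplied by your application:

import json

def answer_policy_question(question, retrieve, call_llm):
    # Step 3: fetch grounding material instead of relying on training data alone.
    policy_chunks = retrieve(question)
    # Steps 1, 2, and 4: a narrow prompt, minimal context, and an explicit output schema.
    prompt = (
        "Answer using only the policy excerpts below. "
        "Respond as JSON with keys 'eligible' and 'reason'. "
        "If the excerpts are insufficient, set 'eligible' to null.\n\n"
        + "\n\n".join(policy_chunks)
        + f"\n\nQuestion: {question}"
    )
    raw = call_llm(prompt)
    # Step 5: validate before acting; route failures to logging and fallback handling.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if "eligible" not in parsed or "reason" not in parsed:
        return None
    return parsed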

Common mistakes builders make

The first mistake is anthropomorphizing the model. If you assume it “knows” in a human sense, you will trust fluent answers too easily.

The second mistake is ignoring token economics. Builders who think only in characters or paragraphs often discover too late that long prompts, large histories, and oversized retrieval blocks raise cost and latency.

The third mistake is treating prompt writing as the whole job. Prompting matters, but retrieval, tool access, schema control, and validation usually matter more in production.

The fourth mistake is confusing long context with guaranteed recall. More tokens can help, but both OpenAI and Anthropic document hard context limits, and Anthropic explicitly warns that more context is not automatically better.

The practical mental model to keep

If you need one durable mental model for how LLMs work, use this:

An LLM is a probabilistic next-token predictor operating over tokenized context inside a bounded context window. It can be extremely useful because training at scale teaches it many reusable linguistic and task patterns. But fluency is not the same as understanding, probability is not the same as truth, and good outputs usually depend on well-designed context, grounding, and validation.

That is the version of how LLMs work that builders and operators need. It is simple enough to guide implementation, but accurate enough to prevent the most expensive mistakes.


Key Takeaways

  • How LLMs work is best understood as probabilistic next-token prediction over tokenized context.
  • Tokens are the real budgeting unit for cost, context, and output length.
  • Context windows are bounded working memory, not the model’s entire training knowledge.
  • Fluency is not proof of understanding or truth.
  • Hallucinations happen when the model continues patterns without enough grounding.
  • Better implementation comes from grounding, tool use, structured outputs, and validation, not from prompt cleverness alone.

Practical Exercise

Objective: Build a better mental model by testing how context and phrasing affect outputs.

Task:

  1. Choose a simple LLM API or playground.
  2. Run the same prompt three times: “Explain what a context window is.”
  3. Compare the answers for wording, length, and emphasis.
  4. Now change the prompt to: “Explain what a context window is in 3 bullet points for a software engineer. Mention tokens and truncation.”
  5. Compare again.
  6. Finally, ask a policy question without policy text, then rerun it with a short policy excerpt included and the instruction: “Use only the policy below. If uncertain, say ‘insufficient information.’”

Starter instructions:

  • Record the prompt text exactly.
  • Note token count if your platform exposes it.
  • Save the outputs side by side.
  • Mark where variation came from prompt shape versus missing context.

What success looks like:

  • You can explain why the repeated outputs were not identical.
  • You can show how the more specific prompt narrowed the result shape.
  • You can show that adding source context reduced unsupported guessing.
  • You can state one concrete implementation rule you would use in production, such as “never answer policy questions without grounded policy text.”

Stretch goal:
Wrap one of the prompts in a structured-output or schema-constrained version and compare how much easier it is to validate.

FAQ

What is the simplest correct explanation of how LLMs work?

How LLMs work can be summarized simply: the model converts token context into probabilities over the next token and repeats that process one token at a time.

Do LLMs understand meaning like humans do?

They learn many useful patterns in language and tasks, but fluent output does not guarantee human-style understanding or truth.

Why do LLMs hallucinate?

They may generate plausible continuations even when the needed evidence is missing or weak, especially without grounding or explicit uncertainty handling.

Why does the same prompt give different answers?

Because generation is non-deterministic by default, and parameters such as temperature affect sampling behavior.

What is grounding?

Grounding means connecting the model to relevant documents, data, or tools so answers depend on actual source material rather than unsupported completion.

Is prompt engineering enough?

No. Prompting helps, but reliable systems usually also need context design, retrieval or tool access, output constraints, and validation.
