How Large Language Models Work
No math. No CS degree. Just the working knowledge you need to design with them.
What Is a Language Model?
Not artificial intelligence. Something more specific and more interesting.
Most of the discourse around AI is noise. Sentient machines, the singularity, robots taking your job. None of that helps you understand what’s actually happening when you talk to ChatGPT or Claude.
Here’s what a large language model actually is: a machine that got very good at one task. Predicting what comes next in a sequence of text. That’s it. The conversations, the code, the apparent reasoning? All of that emerges from that single ability. It’s autocomplete, scaled to a point where the output starts to feel like thinking.
The autocomplete frame
You already use a language model every day. Your phone suggests the next word as you type. It’s seen enough text to know that “looking forward to” is probably followed by “seeing” or “hearing” or “meeting.”
An LLM does the same thing at a different scale. Imagine your phone’s autocomplete had read a significant chunk of the internet, all of Wikipedia, and millions of books. And instead of predicting one word, it could hold thousands of words in context.
The result: a system that doesn’t just predict “seeing” after “looking forward to.” It can produce a coherent paragraph about quantum mechanics, or a working React component, or a poem in the style of Mary Oliver. Not because it understands any of those things the way you do. Because it’s learned the patterns of how humans express them in text.
Why this matters for design
You shape how people interact with technology. LLMs are becoming a fundamental layer of that interaction. You don’t need to build one. But you need a working mental model of what it can and can’t do, where it’s confident and where it’s guessing, and why it sometimes produces output that looks right but is completely wrong.
Here’s how I think about it. You don’t need to know how a CPU works to design a great app. But you do need to understand latency, state, and network constraints. Same principle. You don’t need to train a model. You need to understand what shapes its behavior.
A language model is not a database of facts. It’s a compression of patterns. It doesn’t “know” things. It has learned statistical relationships between words that often correspond to real knowledge, and sometimes don’t.
The prediction loop
Here’s how generation actually works. You give the model a prompt: “The capital of France is.” The model looks at those tokens (more on tokens in the next chapter) and produces a probability distribution over every possible next token. “Paris” might get 97%. “Lyon” might get 0.5%. “Banana” gets effectively zero.
The model picks one (usually the most probable, with some controlled randomness). That token gets appended to the sequence. Then the whole thing runs again. “The capital of France is Paris” becomes the new input, and the model predicts what comes after “Paris.” Maybe a period. Maybe “, which.”
This loop (predict, append, repeat) is how every response you’ve ever gotten from an LLM was generated. One token at a time. No planning ahead, no outline, no rough draft. Just the next most probable token, given everything that came before it.
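Here's the whole loop as a toy Python sketch. The next_token_probs function is a stand-in for a real model; it hard-codes a few plausible continuations so you can see the predict-append-repeat shape.

```python
import random

def next_token_probs(tokens):
    # A real model scores every token in its vocabulary. This toy version
    # hard-codes a few plausible continuations.
    if tokens[-1] == "is":
        return {"Paris": 0.97, "Lyon": 0.02, "a": 0.01}
    return {".": 0.6, ",": 0.3, "which": 0.1}

tokens = ["The", "capital", "of", "France", "is"]
for _ in range(3):
    probs = next_token_probs(tokens)                            # probability for each candidate
    choices, weights = zip(*probs.items())
    tokens.append(random.choices(choices, weights=weights)[0])  # pick one, append, repeat
print(" ".join(tokens))
```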
Tokenization
How text becomes numbers. The first transformation.
Computers don’t read text. They process numbers. Every letter, word, and sentence has to be converted into a numerical form before a model can touch it. That conversion is called tokenization, and everything downstream depends on it.
You might assume tokenization splits text into words. It doesn’t. Modern language models use subword tokenization, which breaks text into chunks that are sometimes words, sometimes parts of words, sometimes individual characters. The word don’t becomes two tokens: don and ’t. A German compound word like Lebensversicherung gets split into several pieces. Common words like the stay whole. The algorithm learns these splits from massive amounts of text, optimizing for a vocabulary that covers language efficiently.
A type designer’s way of thinking about it
If you’ve worked with typography, think of tokens like glyphs in a font. A typeface doesn’t store one shape per letter. It has ligatures (fi, fl), alternate forms, and composed characters. The glyph set is a practical vocabulary, not a one-to-one map of the alphabet. Subword tokens work the same way. The model’s vocabulary is a set of useful text fragments, chosen because they show up frequently enough to earn their own entry.
Why tokenization matters for design
Pricing. API costs are measured in tokens, not words. The conversion rate varies by language. Japanese text produces roughly twice as many tokens as English for the same meaning, because the vocabulary was trained mostly on English. If you’re designing a multilingual product, your cost model needs to account for this.
Context limits. When a model says it supports 128k context, that’s 128,000 tokens, not words. English runs about 0.75 words per token on average. So 128k tokens is closer to 96,000 words, though the exact ratio depends on your content.
Character-level reasoning. Ask a model how many rs are in strawberry and it might get it wrong. Not because it’s bad at counting, but because strawberry is split into subword chunks (something like straw and berry). The model never sees individual letters. It works with token-level representations and has limited ability to reason about the characters inside them. Any task requiring letter-by-letter analysis is working against the grain of how tokenization works.
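If you want to see tokenization for yourself, OpenAI's tiktoken library exposes one of these vocabularies directly. A quick sketch; the exact splits and counts depend on which tokenizer you load.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one common vocabulary; others split differently

for text in ["don't", "Lebensversicherung", "looking forward to seeing you"]:
    ids = enc.encode(text)                   # the token integers
    pieces = [enc.decode([i]) for i in ids]  # the text fragment each token covers
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```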
The vocabulary
A model’s token vocabulary is a fixed set, typically between 30,000 and 100,000 entries. Each token maps to an integer. The sentence The cat sat might become [464, 3857, 3290]. Three flat numbers. They tell the model which tokens are present, but nothing about what those tokens mean.
A flat integer has no structure. 464 is not “closer” to 3857 in any meaningful sense. To capture meaning, we need a richer representation. That’s where embeddings come in.
Embeddings
Meaning as geometry. The leap that makes everything else possible.
After tokenization, every token is an integer. Integers are easy for computers, but they don’t encode relationships. The number for dog has no inherent connection to the number for puppy. To give the model a sense of meaning, we need to place each token in a space where distance corresponds to similarity.
That’s the core idea behind embeddings: meaning as a point in space. Literally. Each token gets mapped to a list of numbers (a vector) that positions it in a high-dimensional coordinate system. Words with similar meanings end up near each other. Unrelated words end up far apart.
The color analogy
If you’ve worked with color, you already get this. Think of LAB color space. Every color is a point defined by three coordinates: lightness, green-red, blue-yellow. Two colors that look similar sit close together. Two that look different sit far apart. Distance maps to perceived difference.
Embeddings do the same thing for meaning, just with way more dimensions (typically 768 to 4,096 instead of three). Dog and puppy sit near each other. Dog and spreadsheet sit far apart. The geometry encodes what the model knows about how concepts relate.
Directions have meaning too
It goes deeper than proximity. Directions in embedding space encode relationships. The classic example: take the vector for king, subtract man, add woman, and you land near queen. The direction from man to woman captures something about gender, and that direction is reusable across concepts.
In LAB terms, it’s like shifting along the lightness axis. Moving from a dark red to a light red is the same operation as moving from a dark blue to a light blue. The axis means something, independent of which color you start from.
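You can play with the geometry using hand-made toy vectors. These three-dimensional numbers are invented for illustration; real embeddings are learned, not written by hand, and run to thousands of dimensions.

```python
import numpy as np

vec = {
    "king":  np.array([0.9,  0.3, 0.0]),    # invented dims: royalty, masculine, animal
    "queen": np.array([0.9, -0.3, 0.0]),
    "man":   np.array([0.1,  0.3, 0.0]),
    "woman": np.array([0.1, -0.3, 0.0]),
    "dog":   np.array([0.0,  0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec["king"], vec["queen"]))    # high: related concepts sit close together
print(cosine(vec["king"], vec["dog"]))      # zero: unrelated concepts sit far apart
print(cosine(vec["king"] - vec["man"] + vec["woman"], vec["queen"]))  # ~1.0: lands on queen
```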
From tokens to understanding
In the model’s pipeline, embeddings are a lookup step. Each token integer gets swapped for its embedding vector. Like replacing an index number with a full address. The flat sequence [464, 3857, 3290] becomes three rich vectors, each carrying information about meaning.
But there’s a limitation. At this stage, each token’s embedding is fixed. The word bank gets the same vector whether it appears in river bank or bank account. The embedding captures the average of all the ways bank is used, not the specific meaning in this sentence. To resolve that ambiguity, the model needs something more.
Design implications
Embeddings power more than language models. Anywhere you need to compare meaning programmatically, embeddings are the tool. Semantic search finds results by meaning rather than keyword matching, so a query for cozy places to eat can surface results about intimate restaurants. Recommendation systems use embedding similarity to find related content. Clustering algorithms group documents by topic without anyone defining what the topics are.
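The mechanics of semantic search are simple once you have embeddings. A sketch, assuming an embed function backed by whatever embedding model your product uses; it's passed in rather than assumed here.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query: str, documents: list[str], embed, top_n: int = 3):
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    scored.sort(reverse=True)                  # highest similarity first
    return [doc for _, doc in scored[:top_n]]

# semantic_search("cozy places to eat", restaurant_descriptions, embed) can surface
# "intimate candlelit bistro" even when no keywords overlap.
```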
If you’re designing a search experience, a feed, or a recommendation surface, the quality of the underlying embeddings directly shapes what your users see.
Attention
How words relate to each other. The mechanism that lets context reshape meaning.
Consider the sentence: The cat sat on the mat because it was tired. You know it refers to the cat, not the mat. You resolved that reference instantly, using context. A language model needs a mechanism to do the same thing. That mechanism is called attention.
After embedding, each token has a vector that captures its general meaning. But meaning is context-dependent. The word it means nothing on its own. Attention lets each token look at every other token in the sequence and adjust its representation based on what it finds. By the time attention finishes, the representation of it carries information about the cat, because the model learned that those two positions are strongly connected.
A design analogy
Think of attention as dynamic page hierarchy. In a static layout, every element has a fixed relationship to every other element. But imagine a layout where the relationships recompute for every piece of content. A heading might relate strongly to the paragraph below it, weakly to a sidebar, and not at all to the footer. Attention works like this, except it computes these relationships fresh for every input, across every position in the sequence.
Query, key, value
The mechanics involve three transformations applied to each token. Every token produces a query (what am I looking for?), a key (what do I contain?), and a value (what information do I carry?). The model compares each token’s query against every other token’s key to compute a relevance score. High scores mean strong relationships. Those scores then weight the values, so each token’s output is a blend of information from the tokens most relevant to it.
For it was tired, the query from it matches strongly with the key from cat. So the output representation of it absorbs information from cat. The ambiguity that embeddings alone couldn’t resolve gets resolved here.
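Here's a single attention head in miniature, with random numbers standing in for real learned weights. The point is the shape of the computation, not the values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # four token embeddings, 8 dimensions each

W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # learned in a real model
Q = x @ W_q                                     # what am I looking for?
K = x @ W_k                                     # what do I contain?
V = x @ W_v                                     # what information do I carry?

scores = Q @ K.T / np.sqrt(K.shape[-1])         # every token's relevance to every other
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                            # each token's output blends the relevant values

print(weights.round(2))                         # row i: how much token i attends to each token
```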
Multi-head attention
A single attention pass captures one type of relationship. But language encodes many types at once: syntactic (subject-verb agreement), semantic (pronoun reference), positional (nearby words), and more. To handle this, transformers run 32 to 128 attention heads in parallel, each learning to focus on a different kind of relationship.
One head might specialize in pronoun references. Another in adjective-noun pairs. A third in long-range dependencies between the opening and closing of a sentence. Different lenses, same text, different structures revealed.
Why attention changed everything
Before transformers, language models processed text sequentially. One token at a time. Information had to pass through a chain of steps to travel from one end of a sentence to the other. Long-range dependencies degraded. By the time the model reached tired at the end of a long passage, it might have lost track of what the pronoun it referred to.
Attention eliminates that bottleneck. Every token can attend to every other token directly, regardless of distance. A word at position 1 and a word at position 10,000 have the same access to each other as two adjacent words. That’s the core innovation of the transformer, and the reason it displaced everything that came before.
The Stack
Layers of understanding. How simple operations compose into comprehension.
Attention is powerful, but a single round only gets you so far. It can figure out that “bank” relates to “river” in one sentence and “money” in another. Going from that kind of local disambiguation to something resembling comprehension takes repetition. Dozens or hundreds of times.
A large language model is a stack of identical layers, each performing attention followed by a feed-forward network. The first layers tend to handle grammar and syntax. Middle layers build up meaning. Final layers resolve task intent, figuring out what you actually want from the prompt.
Non-destructive editing
If you’ve used Photoshop or Figma, you know adjustment layers. You don’t paint directly on the image. You stack transformations on top of it: curves, color balance, sharpen. Each one adds its effect without destroying what came before. You can reorder them, toggle them, blend them.
Transformer layers work the same way, through residual connections. Each layer doesn’t replace the previous representation. It adds to it. The output of layer 12 is the output of layer 11 plus whatever layer 12 contributed. Like blend modes in a design tool, each layer refines the image without ever erasing the original.
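In code, the whole stack is a short loop. The attention and feed-forward functions below are toy stand-ins; what matters is that each layer adds to the representation rather than replacing it.

```python
import numpy as np

def attention(x):                  # stand-in for a real attention block
    return 0.1 * x

def feed_forward(x):               # stand-in for a real feed-forward block
    return 0.1 * x

def transformer_stack(x, num_layers=12):
    for _ in range(num_layers):
        x = x + attention(x)       # residual connection: add the contribution...
        x = x + feed_forward(x)    # ...never overwrite what earlier layers built
    return x

tokens = np.ones((4, 8))           # four token vectors flowing through the stack
print(transformer_stack(tokens).shape)
```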
The residual stream
Think of a shared design file where different specialists leave annotations. The file flows through the entire model from first layer to last, accumulating information along the way. Early layers write notes about syntax: “this is a noun,” “this verb is past tense.” Middle layers add semantic annotations: “this paragraph is about climate policy,” “the user is being sarcastic.” Late layers write task-level conclusions: “this is a question that wants a list of three items.”
That shared workspace is the residual stream. No single layer owns the representation. Each one reads from the stream, does its work, and writes its contribution back. The final representation is the sum of every layer’s contribution.
Scale
GPT-4 is reported to have roughly 120 layers (OpenAI hasn’t confirmed the architecture). Each layer contains an attention mechanism and a feed-forward network, both with millions of parameters. More layers means more capacity for nuance, more room for the model to build up subtle representations before committing to output. It also means more compute, more memory, and more cost per token.
There’s no fixed rule for how many layers a model needs. Smaller models (7 billion parameters, 32 layers) handle straightforward tasks well. Larger models need the depth for harder problems: multi-step reasoning, long-range coherence, nuanced tone. The stack is where capacity lives.
The Prediction
How output actually happens. One token at a time, from a probability distribution.
After your prompt passes through every layer, the model arrives at a single task: predict the next token. Not the next sentence. Not a paragraph. One token. It does this by producing a probability distribution over its entire vocabulary, which typically contains 50,000 to 100,000 candidates.
Every token in the vocabulary gets a score. “The capital of France is” might give “Paris” a 97% probability, “Lyon” 0.8%, “the” 0.3%, and “Banana” something vanishingly close to zero. The model doesn’t pick from a lookup table. It computes these probabilities fresh every single time.
Autoregressive generation
The model picks a token, appends it to the sequence, and runs the entire model again with the extended input. That’s autoregressive generation. Each token requires a full forward pass through every layer of the stack.
There is no planning. No outline. No rough draft that gets refined. The model can’t look ahead to see where a sentence is going before it starts writing it. Each token is locally optimal, the best next move given everything so far, with no guarantee the sequence as a whole will be coherent. The fact that it usually is coherent is a testament to how much structure the training process baked into the weights.
Temperature
Temperature controls how the model samples from its probability distribution. Think of it as a snap-to-grid setting in a design tool.
At temperature 0, the model always picks the highest-probability token. Rigid, deterministic, identical output every time. At temperature 1, the model samples proportionally from the full distribution, so a token with 10% probability gets picked 10% of the time. Loose, varied, occasionally surprising. At temperature 2, the distribution flattens further and unlikely tokens get real chances. Output becomes chaotic, often incoherent.
Most applications use a temperature between 0.3 and 0.9. Lower for factual tasks. Higher for creative ones.
Top-k and top-p
Temperature alone is a blunt instrument. Two additional filters give finer control.
Top-k limits the model to the k most probable tokens before sampling. If k is 50, the model ignores everything outside the top 50 candidates, no matter what temperature you set. Simple, but it treats every prediction the same, whether the model is confident or uncertain.
Top-p (also called nucleus sampling) is adaptive. Instead of a fixed count, it takes the smallest set of tokens whose probabilities sum to p. If the model is 95% sure about one token and you set top_p=0.95, it considers only that one token. If probability is spread across twenty tokens, it considers all twenty. The filter tightens when the model is confident and loosens when it isn’t.
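Here's a toy sampler that applies both knobs to a handful of made-up scores. Real models do the same thing over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

rng = np.random.default_rng()
logits = {"Paris": 5.0, "Lyon": 1.0, "the": 0.5, "Banana": -4.0}   # made-up raw scores

def sample(logits, temperature=0.7, top_p=0.95):
    tokens = list(logits)
    scores = np.array([logits[t] for t in tokens]) / temperature   # sharpen or flatten
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                           # softmax: scores become probabilities
    order = probs.argsort()[::-1]                                  # most probable first
    cutoff = int(np.searchsorted(probs[order].cumsum(), top_p)) + 1
    keep = order[:cutoff]                                          # smallest set summing to >= top_p
    return tokens[rng.choice(keep, p=probs[keep] / probs[keep].sum())]

print(sample(logits, temperature=0.01))   # near-deterministic: almost always "Paris"
print(sample(logits, temperature=1.5))    # looser: other tokens get real chances
```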
Same prompt, different outputs. Not because the model thinks differently each time, but because it samples from a distribution. The apparent creativity is controlled randomness.
The Memory Wall
Context windows. The hard boundary on what an LLM can hold at once.
Every LLM has a context window. A fixed number of tokens it can process in a single pass. This is not a scrolling page that grows as the conversation continues. It’s a fixed-size canvas. Your system prompt, the full conversation history, your latest message, and the model’s response all have to fit on that canvas at the same time.
If they don’t fit, something gets cut.
What lives on the canvas
When you send a message to an LLM-powered app, the actual input the model sees is much larger than what you typed. It typically includes a system prompt (instructions from the developer), the entire conversation so far, and your latest message. All of these compete for the same limited space.
A long system prompt leaves less room for conversation. A long conversation leaves less room for the response. Every token allocated to context is a token the model has to attend to. More compute, more latency, more cost.
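Every chat product runs some version of this budgeting before each request. A sketch, with a rough characters-per-token heuristic standing in for the provider's real tokenizer:

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)            # rough heuristic: ~4 characters per token in English

def fit_to_window(system_prompt, history, user_message, window=8000, reserve_for_reply=1000):
    budget = (window - reserve_for_reply
              - count_tokens(system_prompt) - count_tokens(user_message))
    kept = []
    for message in reversed(history):        # walk backward: most recent messages first
        cost = count_tokens(message)
        if cost > budget:
            break                            # everything older is dropped entirely
        kept.append(message)
        budget -= cost
    return [system_prompt, *reversed(kept), user_message]
```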
Why long conversations degrade
When the conversation exceeds the context window, older messages get truncated or dropped entirely. The model doesn’t “forget” gradually, the way you might. One moment a message is in context. The next it’s gone. The model has no idea it was ever there.
Even within the window, not all positions are equal. Research has identified a “lost in the middle” problem: models attend more strongly to the beginning and end of their context, and underweight information buried in the middle. If the critical detail is in message 47 of a 100-message thread, the model is more likely to miss it than if it showed up in message 2 or message 99.
The size race
Context windows have expanded fast. GPT-3 launched with 2,048 tokens (roughly 1,500 words). GPT-4 pushed to 128,000. By 2024, Gemini had reached 1 million tokens, with Claude close behind. In 2026, most frontier models support at least 1 million, and Gemini 3 Pro advertises 10 million.
But bigger is not free. Attention scales quadratically with context length. Doubling the window doesn’t double the cost. It quadruples it. A model processing 200,000 tokens is doing orders of magnitude more work than one processing 4,000. That’s why long-context requests are slower, more expensive, and why providers charge per token in both directions.
Design implications
If you’re designing an LLM-powered product, the context window is one of your most important constraints. How much conversation history do you retain? When do you summarize older messages instead of keeping them verbatim? Do you show users how much context remains, the way a character count shows remaining space in a tweet? What happens when the window fills up?
Graceful degradation matters. Users don’t expect to hit a wall mid-conversation, but they will. The apps that handle this well (summarizing, archiving, notifying) will feel dramatically more reliable than the ones that silently start dropping context and producing confused responses.
A context window is not memory. It’s a workspace. Every session starts blank. The continuity you experience is an application-level illusion.
How It Learned
Training. The loop that turns random weights into something useful.
The core loop is almost boring. Predict the next token. Check the answer. Adjust. Repeat. The magic isn’t in the mechanism. It’s in doing it trillions of times.
The data
When people say a model was “trained on the internet,” they mean something specific: Common Crawl (a snapshot of billions of web pages), digitized books, Wikipedia, academic papers, GitHub repositories, forums, and more. GPT-4 was reportedly trained on around 13 trillion tokens. The data is filtered for quality, but it’s still messy. The model learns from all of it, including the mistakes.
What learning means
A model starts as billions of parameters set to random values. It sees a sequence of tokens and predicts the next one. A loss function measures how wrong the prediction was. Backpropagation traces that error backward through the network and nudges each parameter to be slightly less wrong next time. Over billions of iterations, the loss drops. The model gets better.
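The loop fits in a few lines. This sketch uses PyTorch and a toy bigram model (predict the next token from the current one); real training runs the same loop over a transformer and trillions of tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50                                   # toy vocabulary
data = torch.randint(0, vocab_size, (1000,))      # stand-in for real token IDs

model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(200):
    i = torch.randint(0, len(data) - 1, (64,))    # pick 64 random positions
    inputs, targets = data[i], data[i + 1]        # current token -> next token
    logits = model(inputs)                        # a score for every token in the vocabulary
    loss = F.cross_entropy(logits, targets)       # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()                               # trace the error back through the network
    optimizer.step()                              # nudge every parameter to be slightly less wrong
```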
What gets stored
The model primarily stores statistical tendencies, not verbatim text. But memorization of specific sequences does happen, especially for text that appears many times in the training data. Think of a chef who has cooked 10,000 different dishes. They develop intuition about flavor combinations, technique, and timing. They don’t consult a recipe for most things. But ask them to recite one they’ve read a hundred times, and they probably can. Knowledge in a neural network is distributed the same way: spread across millions of parameters, with some well-worn paths carved deeper than others.
Scale and emergence
Here’s the part that surprised everyone, including the researchers. Bigger models don’t just get incrementally better. They develop capabilities nobody programmed. GPT-2 (1.5 billion parameters) was barely coherent past a few sentences. GPT-3 (175 billion) could write essays and translate languages. GPT-4, reported to have over a trillion parameters across a mixture-of-experts architecture (though OpenAI has never confirmed the details), can pass the bar exam. Same architecture. More scale.
A design analogy: imagine training a junior designer by having them critique 10 million designs. No lectures, no theory, no reading list. Just look at a design, predict what element comes next, check the answer, repeat. At some point they stop mimicking and start understanding composition.
Training doesn’t teach the model facts. It teaches the model the shape of how humans express things in text. The facts come along for the ride.
Becoming Useful
Fine-tuning and RLHF. From raw capability to helpful assistant.
After pre-training, a model is good at one thing: predicting the next token. It has no concept of being an assistant. Ask it a question and it might continue your sentence, or generate another question, or produce something that looks like a Wikipedia article. It’s a text-completion engine, not a conversationalist.
Fine-tuning
Fine-tuning means continued training on a smaller, carefully curated dataset. Question-and-answer pairs. Instructions and responses. Multi-turn conversations. The model’s weights shift to favor patterns that look like helpful dialogue instead of raw internet text.
Think of it as onboarding a generalist designer. They already know typography, layout, and color theory. Fine-tuning is showing them the brand guidelines, the component library, and examples of past work. You’re not teaching them design. You’re teaching them how this team designs.
RLHF: reinforcement learning from human feedback
Fine-tuning gets the model into the right neighborhood. RLHF refines it further. The process: humans are shown pairs of model outputs for the same prompt and asked which one is better. Those preferences train a separate reward model that learns to score outputs the way a human would. Then the language model is trained to maximize that reward score.
If you’ve ever run a preference test (showing users two design options and asking which feels better), you already understand the core idea. RLHF is preference testing at scale.
What RLHF actually changes
This is where the model learns that “how to pick a lock?” should get declined rather than answered. Not because it has morals, but because its weights have been shaped so that refusal is the highest-scoring continuation for that kind of prompt. It’s also why models are sometimes too cautious or too verbose. The reward signal optimized for helpfulness and safety, and the model overshot in places. Calibration is ongoing.
Alignment is a design problem
What does “helpful” mean? Helpful to whom? Who decides what counts as harmful? These aren’t engineering questions. They’re the same questions designers face when defining success metrics, writing content policies, or building moderation systems. The people shaping RLHF datasets are making design decisions, whether they call them that or not.
Steering the Ship
System prompts. The hidden brief that shapes every interaction.
Every time you talk to ChatGPT, Claude, or any other LLM-powered product, there is text you never see. The system prompt is prepended to every conversation before your first message arrives. It sets the model’s role, tone, constraints, and behavior. It’s the design brief the model reads before every interaction.
Not a command, a brief
A system prompt doesn’t work like a command line. The model doesn’t obey instructions the way software executes code. Instead, it generates text that is consistent with a world where the system prompt is true. If the prompt says “you are a helpful medical assistant,” the model shifts its probability distribution toward the kind of language a helpful medical assistant would produce. It doesn’t gain medical knowledge it didn’t have before. It doesn’t lose capabilities. The landscape of likely outputs shifts.
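Concretely, the model sees something like this on every turn, in the message format most chat APIs share. The prompt text here is invented for illustration.

```python
messages = [
    {"role": "system", "content": (
        "You are a helpful medical information assistant. Be concise. "
        "Do not diagnose. Suggest seeing a clinician for anything urgent."
    )},
    {"role": "user", "content": "I've had a headache for three days."},
]
# Every later turn gets appended to this same list, so the system prompt
# is re-read by the model on every single request.
```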
If you maintain a design system, the parallel is close. A design system is a collection of reusable components, guided by clear standards, that can be assembled to build any number of applications: color palettes, typography scales, spacing rules, component libraries. A shared language between designers and developers. A system prompt plays the same role for a model. Shared constraints that shape every output without dictating each one.
Why wording matters
Every word in a system prompt shifts the probability landscape. “Respond concisely” and “respond briefly” and “respond in under 50 words” all pull from different contexts in the training data and produce different behavior. Prompt engineering feels more like writing a creative brief than writing code. The system prompt is a design surface. The words you choose shape the product.
Few-shot examples
One of the most effective techniques: include input-output pairs directly in the prompt. Instead of describing the format you want, show it. Three examples of the desired behavior consistently outperform a paragraph of description. Same reason a UI mockup communicates more than a written spec. The model can pattern-match on concrete examples rather than interpret abstract instructions.
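In message form, few-shot examples are just fabricated turns placed ahead of the real input. These pairs are illustrative, not from any real product.

```python
few_shot = [
    {"role": "system", "content": "Rewrite feature announcements as one friendly sentence."},
    {"role": "user", "content": "Added: export to PDF"},
    {"role": "assistant", "content": "You can now export any document straight to PDF."},
    {"role": "user", "content": "Added: dark mode"},
    {"role": "assistant", "content": "Dark mode is here, and it's easier on the eyes at night."},
    {"role": "user", "content": "Added: offline sync"},   # the real input goes last
]
```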
The probability landscape
A good system prompt reshapes the terrain so that the natural, high-probability paths lead to outputs you actually want. Too restrictive and the model sounds robotic, forced into narrow corridors. Too loose and it wanders, generating plausible but unfocused text. Same tension at the heart of every design system: enough structure for consistency, enough freedom to stay natural.
Reaching Out
Tool use and agents. When language models act in the world.
Language models generate text. That’s all they do. But tool use lets them call APIs, query databases, search the web, and run code. The important distinction: the model never executes anything itself. It emits structured text describing what it wants done, and a separate system carries out the action.
The component system analogy
When a designer writes <Button variant="primary" /> in a React file, they’re not rendering pixels. They’re referencing a component, and the component handles the rendering. Tool use works the same way. The model emits something like search(query: "current weather in Denver"), and a runtime layer executes the actual search. The model describes intent. The system handles execution.
The loop
Tool use follows a cycle. The model generates a function call based on the conversation so far. The system executes that function and returns the result. The result goes back into the model’s context window as new text. The model continues generating, now informed by the result. This loop can repeat multiple times in a single interaction.
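Stripped down, the loop looks like this. The call_model function stands in for whatever client you use, and the reply shape is illustrative; every provider has its own structured format for tool calls.

```python
import json

def run_agent_turn(call_model, tools, messages, max_steps=5):
    for _ in range(max_steps):
        reply = call_model(messages)                  # model describes intent: text or a tool call
        if reply.get("tool") is None:
            return reply["content"]                   # plain text: the turn is finished
        result = tools[reply["tool"]](**reply["arguments"])   # the runtime executes, not the model
        messages.append({"role": "tool", "content": json.dumps(result)})  # result re-enters the context
    return "Stopped: too many tool calls in one turn."
```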
Agents
An agent is a model that runs this loop autonomously: think, act, observe, repeat. It decides which tool to call, interprets the result, and decides what to do next. The model is the conductor of an orchestra. It can’t play any of the instruments, but it knows which one should play next, and when to bring them in together.
The frontier
A text-only model is an advisor. A tool-using model is an employee. Employees can take actions with consequences, and the stakes multiply accordingly. A hallucinated fact in a conversation is annoying. A hallucinated API call that modifies a production database is a disaster. Designing the guardrails, confirmation flows, and permission boundaries around tool use is critical design work. Some of the most important interface design happening right now.
What LLMs Can’t Do
Constraints worth designing around.
No memory between sessions
The model’s weights don’t change at inference time. Nothing you say in a conversation alters the model itself. When a product like ChatGPT appears to remember you, that’s application-level engineering: your previous messages stored in a database, injected into the context window. The model is stateless. Memory is a product feature, not a model capability.
Reasoning, sort of
Whether LLMs truly reason is an active debate. What’s clear is that the mechanism is different from human reasoning. Models like o1 and Claude use extended thinking to work through problems step by step, and the results can be impressive. But the failure modes are alien. A model solves nine logic puzzles perfectly, then fails catastrophically on the tenth in ways no human would. That inconsistency is the tell. Design for a system that can approximate reasoning most of the time, not one you can trust to reason all of the time.
Hallucination is a feature
The model is always doing the same thing: generating the most plausible next token. There is no internal fact-checker, no separate system verifying claims before outputting them. True statements and false statements are produced by the exact same mechanism. Hallucination is not a bug to be fixed. It’s an inherent property of how the system works.
No knowledge of its own limits
A model can’t know what it doesn’t know. When it hedges (“I’m not sure, but...”), that hedging is a probable continuation of the sequence, not an actual measurement of uncertainty. It can sound maximally confident while being completely wrong. The confidence in the text tells you nothing about the reliability of the content.
Frozen weights at inference
If you correct the model mid-conversation, it adjusts within that context window. But the correction doesn’t persist. Next conversation starts from the same weights, and the same mistake is just as likely. Learning happens during training, not during use.
Why this makes you better
Designers work with constraints. These constraints are your material. Design verification into the interface so users don’t have to trust the model blindly. Build citation systems that let people check sources. Architect memory at the application layer instead of expecting the model to remember. Test edge cases the way you would for a pattern matcher, not for a human.
These systems are not magic and they are not conscious. They are pattern machines built on math, data, and scale. The more clearly you see the machinery, the better you can design around it.