
How LLMs Work: An Interactive Visual Guide Explains It All

Manaal Khan · 24 April 2026 · 6 min read

Key Takeaways

  • LLMs train on roughly 15 trillion tokens filtered from 44 terabytes of web text
  • GPT-4 uses a vocabulary of 100,277 tokens built through Byte Pair Encoding
  • Base models are autocomplete engines, not assistants. The assistant behavior comes from fine-tuning

A Visual Walkthrough of Large Language Models

A new interactive guide posted on Hacker News breaks down exactly how large language models like ChatGPT are built. The project, based on Andrej Karpathy's technical deep dive, walks through each step of the process with visual explanations that make the architecture accessible without sacrificing technical accuracy.

The guide covers the complete pipeline. It starts with data collection and ends with the sampling process that generates text one token at a time. For anyone who has wondered what actually happens inside these systems, it is among the clearest explanations currently available online.

Step One: Collecting the Training Data

Everything begins with text. Organizations like Common Crawl have been indexing the web since 2007. By 2024, they had crawled 2.7 billion pages. But raw web data is messy. It includes spam, duplicates, low-quality content, and pages that would poison a model's training.

The raw data goes through aggressive filtering to produce datasets like FineWeb. The goal is a large quantity of high-quality, diverse documents. After filtering, you end up with about 44 terabytes of text, roughly ten consumer hard drives' worth. This filtered dataset represents approximately 15 trillion tokens.

15 trillion tokens
The approximate size of filtered training data after processing 44 terabytes of web text through quality filters

Tokenization: Turning Text Into Numbers

Neural networks cannot process raw text. They work with numbers. The solution is tokenization, which breaks text into sub-word chunks and assigns each chunk an ID.

GPT-4 uses a vocabulary of 100,277 tokens. These tokens are built using the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols). It then iteratively merges the most frequent adjacent pairs. This compresses the sequence length while expanding the vocabulary.
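The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration on a made-up string, not GPT-4's actual tokenizer: it starts from raw bytes and performs three merges, shrinking the sequence while growing the vocabulary.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent token pairs and return the most common one.
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair, new_id):
    # Replace every occurrence of `pair` with a single new token id.
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "aaabdaaabac"                 # arbitrary example string
tokens = list(text.encode("utf-8"))  # start from raw bytes (ids 0-255)
vocab_size = 256
for _ in range(3):                   # three merge rounds
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, vocab_size)
    vocab_size += 1
```

After three merges the 11-byte input compresses to 5 tokens while the vocabulary grows from 256 to 259 symbols, which is exactly the trade-off BPE makes at scale.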

The result is a system where common words might be single tokens, while rare words get split into multiple pieces. The word "tokenization" might become three or four tokens. The word "the" is just one.

Training the Transformer

The Transformer neural network starts with random parameters. These are billions of numerical values that the guide calls "knobs." Training adjusts these knobs so the network gets better at one specific task: predicting the next token in any sequence.

Every training step follows the same pattern. Sample a window of tokens. Feed them to the network. Compare the prediction to the actual next token. Nudge all parameters slightly in the right direction. Repeat this process billions of times.
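That loop can be sketched with a deliberately tiny stand-in for the Transformer: a bigram logits table (the "knobs") trained by softmax cross-entropy on a toy 4-token sequence. The data, learning rate, and model are all made up for illustration; only the step pattern (sample, predict, compare, nudge) mirrors real training.

```python
import math
import random

random.seed(0)
V = 4                                      # toy vocabulary size
data = [0, 1, 2, 3] * 3                    # a perfectly periodic token sequence

# The "knobs": a V x V table of logits, randomly initialised.
W = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def train_step(lr=0.5):
    # Sample a (context, next-token) window from the data.
    i = random.randrange(len(data) - 1)
    ctx, nxt = data[i], data[i + 1]
    probs = softmax(W[ctx])
    loss = -math.log(probs[nxt])           # cross-entropy on the true next token
    for j in range(V):
        # Gradient of cross-entropy w.r.t. the logits: p_j - 1[j == nxt].
        grad = probs[j] - (1.0 if j == nxt else 0.0)
        W[ctx][j] -= lr * grad             # nudge each knob slightly
    return loss

first = train_step()
for _ in range(2000):
    last = train_step()
```

The first step's loss sits near ln(4) ≈ 1.39 (a random guess over four tokens); after a couple of thousand nudges it falls close to zero, which is the same steady decline the guide visualizes at vastly larger scale.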

The loss is a single number measuring prediction error. It falls steadily as the model learns the statistical patterns of human language. The guide includes visual representations of how model output quality improves as training progresses.

How Generation Actually Works

Once trained, the network generates text through a process called autoregressive generation. Feed a sequence of tokens. Get a probability distribution over all 100,277 tokens in the vocabulary. Sample one. Append it. Repeat.

This process is stochastic. The same prompt generates different outputs every time because the model is effectively flipping a biased coin. Higher-probability tokens are more likely to be chosen but not guaranteed.
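The feed-sample-append loop can be sketched with hard-coded next-token probabilities standing in for the network. The 3-token vocabulary and the `NEXT` table are hypothetical; a real model would compute a fresh distribution from the whole sequence at each step.

```python
import random

# Toy next-token distributions for a hypothetical 3-token vocabulary.
# A real model computes these with a Transformer; here they are fixed
# just to illustrate the sampling loop.
NEXT = {
    0: [0.1, 0.6, 0.3],
    1: [0.5, 0.2, 0.3],
    2: [0.4, 0.4, 0.2],
}

def generate(prompt, n_tokens, rng):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = NEXT[tokens[-1]]                        # feed, get a distribution
        nxt = rng.choices([0, 1, 2], weights=probs)[0]  # flip the biased coin
        tokens.append(nxt)                              # append and repeat
    return tokens

# Same prompt, different random draws: different continuations.
outputs = {tuple(generate([0], 8, random.Random(seed))) for seed in range(20)}
```

Running the same one-token prompt twenty times yields multiple distinct continuations, which is the stochasticity described above: higher-probability tokens win more often, but nothing is guaranteed.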

Temperature controls the randomness. Low temperature (e.g., 0.1) makes the model almost always pick the top token; only temperature 0, known as greedy decoding, is fully deterministic. High temperature (e.g., 2.0) flattens the distribution toward uniform, so almost any token might be selected. The sweet spot for coherent but creative text is typically 0.7 to 1.0.
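Temperature scaling is just a division applied to the logits before softmax. The sketch below uses made-up logits for a 4-token vocabulary to show the effect at both extremes:

```python
import math
import random

def sample(logits, temperature, rng):
    # Divide logits by temperature, apply softmax, then sample one token id.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for token_id, p in enumerate(probs):
        acc += p
        if r < acc:
            return token_id
    return len(probs) - 1                     # guard against rounding error

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 0.1]                # toy scores for 4 tokens

low = [sample(logits, 0.1, rng) for _ in range(100)]   # near-greedy
high = [sample(logits, 5.0, rng) for _ in range(100)]  # near-uniform
```

At temperature 0.1 essentially every draw returns token 0, the top-scoring token; at 5.0 all four tokens show up, because dividing by a large temperature squashes the score differences.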

Base Models Are Not Assistants

This is where most explanations get it wrong. After pre-training, you have a base model. It's a sophisticated autocomplete engine. It does not answer questions. It does not follow instructions. It continues token sequences based on what it saw on the internet.

Give it a Wikipedia sentence and it will complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent. Whatever was statistically common in its training data.

The guide explains that the base model's knowledge lives in its parameters. For frontier models, that's around 405 billion parameters. These parameters act as a lossy compression of the internet. The guide compares it to a zip file that approximates rather than perfectly reproduces its source material.


Logicity's Take

Why This Matters for Practitioners

Understanding this pipeline clarifies several practical questions. Why do LLMs sometimes produce nonsense? Because they're sampling from probability distributions, not retrieving facts. Why do they "hallucinate"? Because they're trained to predict plausible continuations, not to verify truth.

Why does prompt engineering work? Because the model's predictions depend heavily on the token sequence you provide. A different prompt shifts the probability distribution toward different outputs.

The guide notes that the figures shown are representative of frontier models circa 2024. Exact numbers shift with every release. The scale is the point, not the precision. What matters is understanding the architecture, not memorizing specific parameter counts.

The Fine-Tuning Step

The source text cuts off before covering fine-tuning in detail, but the distinction between base models and assistants hints at what comes next. The helpful, instruction-following behavior of ChatGPT and similar products comes from additional training steps after pre-training.

This typically involves supervised fine-tuning on examples of good assistant behavior, followed by reinforcement learning from human feedback (RLHF). The base model learns to predict text. Fine-tuning teaches it to be helpful.

Frequently Asked Questions

How much data is used to train large language models?

Frontier models train on roughly 15 trillion tokens, filtered from about 44 terabytes of web text. This comes from billions of crawled web pages that go through aggressive quality filtering.

What is tokenization in LLMs?

Tokenization breaks text into sub-word chunks and assigns each chunk a numerical ID. GPT-4 uses 100,277 tokens built through Byte Pair Encoding. Common words are single tokens while rare words split into multiple pieces.

Why do LLMs give different answers to the same prompt?

LLMs generate text by sampling from probability distributions. They pick the next token based on probabilities, not deterministic rules. Temperature settings control how random this sampling is.

What is the difference between a base model and an assistant?

Base models are autocomplete engines trained to predict the next token. They don't follow instructions or answer questions directly. Assistant behavior comes from additional fine-tuning after pre-training.

What does temperature mean in LLM generation?

Temperature controls randomness in token selection. Low temperature (0.1) always picks the highest-probability token. High temperature (2.0) makes selection nearly random. Values between 0.7 and 1.0 balance coherence with creativity.


Source: Hacker News: Best

