How LLMs Work: An Interactive Visual Guide Explains It All
Key Takeaways
- LLMs train on roughly 15 trillion tokens filtered from 44 terabytes of web text
- GPT-4 uses a vocabulary of 100,277 tokens built through Byte Pair Encoding
- Base models are autocomplete engines, not assistants. The assistant behavior comes from fine-tuning
A Visual Walkthrough of Large Language Models
A new interactive guide posted on Hacker News breaks down exactly how large language models like ChatGPT are built. The project, based on Andrej Karpathy's technical deep dive, walks through each step of the process with visual explanations that make the architecture accessible without sacrificing technical accuracy.
The guide covers the complete pipeline. It starts with data collection and ends with the sampling process that generates text one token at a time. For anyone who has wondered what actually happens inside these systems, this is the clearest explanation currently available online.
Step One: Collecting the Training Data
Everything begins with text. Organizations like Common Crawl have been indexing the web since 2007. By 2024, they had crawled 2.7 billion pages. But raw web data is messy. It includes spam, duplicates, low-quality content, and pages that would poison a model's training.
The raw data goes through aggressive filtering to produce datasets like FineWeb. The goal is a large quantity of high-quality, diverse documents. After filtering, you end up with about 44 terabytes of text, roughly 10 consumer hard drives' worth. This filtered dataset represents approximately 15 trillion tokens.
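A quick back-of-envelope check ties those two figures together. Assuming plain UTF-8 text, where one byte is roughly one character, the numbers imply about three characters per token:

```python
# Sanity check: 44 TB of filtered text vs. ~15 trillion tokens
# implies roughly 3 bytes (~3 characters) per token on average.
dataset_bytes = 44 * 10**12   # ~44 terabytes of filtered text
num_tokens = 15 * 10**12      # ~15 trillion tokens

bytes_per_token = dataset_bytes / num_tokens
print(f"{bytes_per_token:.1f} bytes per token")  # ~2.9
```

That ratio lines up with the next section: tokenizers compress text so that a typical token covers a few characters, not one.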
Tokenization: Turning Text Into Numbers
Neural networks cannot process raw text. They work with numbers. The solution is tokenization, which breaks text into sub-word chunks and assigns each chunk an ID.
GPT-4 uses a vocabulary of 100,277 tokens. These tokens are built using the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols). It then iteratively merges the most frequent adjacent pairs. This compresses the sequence length while expanding the vocabulary.
The result is a system where common words might be single tokens, while rare words get split into multiple pieces. The word "tokenization" might become three or four tokens. The word "the" is just one.
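The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration on a single string, not GPT-4's actual tokenizer, which learns its merges from a large corpus:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Find the most common adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "the theme of the thesis"
ids = list(text.encode("utf-8"))   # start from raw bytes (IDs 0-255)
next_id = 256                      # new tokens get IDs past the byte range
for _ in range(3):                 # a few merge rounds for illustration
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, next_id)
    next_id += 1

print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Each round shortens the sequence and adds one entry to the vocabulary; run enough rounds on enough text and you arrive at a vocabulary like GPT-4's 100,277 tokens.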
Training the Transformer
The Transformer neural network starts with random parameters. These are billions of numerical values that the guide calls "knobs." Training adjusts these knobs so the network gets better at one specific task: predicting the next token in any sequence.
Every training step follows the same pattern. Sample a window of tokens. Feed them to the network. Compare the prediction to the actual next token. Nudge all parameters slightly in the right direction. Repeat this process billions of times.
The loss is a single number measuring prediction error. It falls steadily as the model learns the statistical patterns of human language. The guide includes visual representations of how model output quality improves as training progresses.
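The loop above can be made concrete with a minimal sketch, assuming a toy bigram model (one row of logits per current token) in place of a real Transformer, trained with plain SGD on cross-entropy loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 5
# The "knobs": random at first, one row of next-token logits per current token.
W = rng.normal(size=(vocab, vocab)) * 0.01

data = [0, 1, 2, 3, 1, 2, 4, 0, 1, 2]   # toy token stream
lr = 0.5

for step in range(200):
    i = rng.integers(len(data) - 1)
    x, y = data[i], data[i + 1]          # sample a (context, next-token) pair
    logits = W[x]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    loss = -np.log(probs[y])             # cross-entropy: the single loss number
    grad = probs.copy()
    grad[y] -= 1.0                       # gradient of loss w.r.t. the logits
    W[x] -= lr * grad                    # nudge the knobs in the right direction
```

After a couple hundred steps, the model assigns high probability to the continuations that actually occur in the data. Real training does exactly this, just with billions of parameters, batched windows, and backpropagation through many layers.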
How Generation Actually Works
Once trained, the network generates text through a process called autoregressive generation. Feed in a sequence of tokens. Get back a probability distribution over all 100,277 tokens in the vocabulary. Sample one. Append it. Repeat.
This process is stochastic. The same prompt can generate different outputs each time because the model is effectively flipping a biased coin: higher-probability tokens are more likely to be chosen, but none is guaranteed.
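That loop fits in a few lines. The `model` function below is a hypothetical stand-in for the trained network (a real model's logits depend on the entire token sequence, not just the last token):

```python
import numpy as np

rng = np.random.default_rng()
vocab_size = 50     # tiny stand-in; GPT-4's vocabulary has 100,277 tokens

def model(tokens):
    """Hypothetical stand-in for the trained network: returns one logit
    per vocabulary entry. A real model conditions on the whole sequence."""
    logits = rng.normal(size=vocab_size)
    logits[tokens[-1]] += 2.0    # arbitrary bias, just so output has structure
    return logits

def generate(prompt, n_tokens):
    tokens = list(prompt)
    for _ in range(n_tokens):
        logits = model(tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # distribution over next tokens
        next_token = rng.choice(vocab_size, p=probs)  # flip the biased coin
        tokens.append(int(next_token))                # append and repeat
    return tokens

print(generate([7, 3], 10))
```

Run it twice with the same prompt and you will almost certainly get two different sequences; that is the stochasticity the guide describes.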
Temperature controls the randomness. Low temperature (0.1) sharpens the distribution, so the model almost always picks the top token. High temperature (2.0) flattens it toward uniform, where almost any token might be selected. The sweet spot for coherent but creative text is typically 0.7 to 1.0.
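Mechanically, temperature is just a divisor applied to the logits before the softmax. A small sketch (the logit values here are made up for illustration):

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax into probabilities."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    print(t, np.round(apply_temperature(logits, t), 3))
```

At 0.1 the top token ends up with nearly all the probability mass; at 2.0 the three options are much closer to equal. Dividing by a small number exaggerates the gaps between logits, while dividing by a large number shrinks them.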
Base Models Are Not Assistants
This is where most explanations get it wrong. After pre-training, you have a base model. It's a sophisticated autocomplete engine. It does not answer questions. It does not follow instructions. It continues token sequences based on what it saw on the internet.
Give it a Wikipedia sentence and it will complete it from memory. Ask it "What is 2+2?" and it might produce a math textbook page, a quiz answer key, or a tangent into something unrelated, depending on what was statistically common in its training data.
The guide explains that the base model's knowledge lives in its parameters. For frontier models, that's around 405 billion parameters. These parameters act as a lossy compression of the internet. The guide compares it to a zip file that approximates rather than perfectly reproduces its source material.
Logicity's Take
Why This Matters for Practitioners
Understanding this pipeline clarifies several practical questions. Why do LLMs sometimes produce nonsense? Because they're sampling from probability distributions, not retrieving facts. Why do they "hallucinate"? Because they're trained to predict plausible continuations, not to verify truth.
Why does prompt engineering work? Because the model's predictions depend heavily on the token sequence you provide. A different prompt shifts the probability distribution toward different outputs.
The guide notes that the figures shown are representative of frontier models circa 2024. Exact numbers shift with every release. The scale is the point, not the precision. What matters is understanding the architecture, not memorizing specific parameter counts.
The Fine-Tuning Step
The source text cuts off before covering fine-tuning in detail, but the distinction between base models and assistants hints at what comes next. The helpful, instruction-following behavior of ChatGPT and similar products comes from additional training steps after pre-training.
This typically involves supervised fine-tuning on examples of good assistant behavior, followed by reinforcement learning from human feedback (RLHF). The base model learns to predict text. Fine-tuning teaches it to be helpful.
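As a rough illustration of what supervised fine-tuning data looks like, conversations are typically serialized into a single token stream with special delimiters. The delimiter strings below are invented for illustration; real systems use their own reserved special tokens:

```python
# Hypothetical chat template: the <|...|> delimiters are made up for
# illustration, not the actual special tokens of any real model.
def format_example(conversation):
    parts = []
    for turn in conversation:
        parts.append(f"<|{turn['role']}|>{turn['content']}<|end|>")
    return "".join(parts)

example = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]
print(format_example(example))
```

Fine-tuning then runs the same next-token prediction loop as pre-training, but on these formatted conversations, so the model learns that the statistically likely continuation after a user question is a helpful assistant answer.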
Frequently Asked Questions
How much data is used to train large language models?
Frontier models train on roughly 15 trillion tokens, filtered from about 44 terabytes of web text. This comes from billions of crawled web pages that go through aggressive quality filtering.
What is tokenization in LLMs?
Tokenization breaks text into sub-word chunks and assigns each chunk a numerical ID. GPT-4 uses 100,277 tokens built through Byte Pair Encoding. Common words are single tokens while rare words split into multiple pieces.
Why do LLMs give different answers to the same prompt?
LLMs generate text by sampling from probability distributions. They pick the next token based on probabilities, not deterministic rules. Temperature settings control how random this sampling is.
What is the difference between a base model and an assistant?
Base models are autocomplete engines trained to predict the next token. They don't follow instructions or answer questions directly. Assistant behavior comes from additional fine-tuning after pre-training.
What does temperature mean in LLM generation?
Temperature controls randomness in token selection. Low temperature (0.1) nearly always picks the highest-probability token. High temperature (2.0) makes selection nearly random. Values between 0.7 and 1.0 balance coherence with creativity.
Source: Hacker News: Best
Manaal Khan
Tech & Innovation Writer