LLM 101: Part 2 — Training: or, How We Teach Computers to Predict Words by Showing Them the Entire Internet

By Ahmed M. Adly (@RealAhmedAdly)
Previously: We learned that LLMs are very fancy autocomplete. Now let's talk about how they got that way.
Welcome back. In Part 1, we established that Large Language Models are essentially prediction machines with an absurd number of parameters. But how do those billions of tiny dials get tuned in the first place? The answer involves more compute than most of us can conceptualize and more text than anyone should probably read.
The Three-Act Structure of LLM Training
Training a modern LLM happens in stages, like a particularly expensive coming-of-age story. Let's walk through them.
Act 1: Pre-training (Where the Magic Happens)
This is the big one. Pre-training is where we take a model that knows absolutely nothing and teach it to predict text by showing it... well, most of the internet.
The dataset: Imagine scraping Common Crawl (a dataset of web pages), a bunch of books, Wikipedia, GitHub, academic papers, Reddit threads (yes, really), and whatever else you can legally get your hands on. We're talking trillions of tokens. That's your training data.
The task: Hilariously simple. Take a sequence of text, hide the next word, and ask the model to predict it. That's it. "The cat sat on the ___" → model guesses "mat" (or "keyboard" or "existential_dread" depending on what it's been reading).
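To make that concrete, here's what the prediction step boils down to. The probabilities below are invented purely for illustration; a real model assigns a probability to every token in a vocabulary of tens of thousands.

```python
# Toy illustration only: these probabilities are made up. A real model scores
# every token in its vocabulary, not just a handful of candidates.
context = "The cat sat on the"
next_token_probs = {
    "mat": 0.41,
    "floor": 0.18,
    "keyboard": 0.07,
    "existential_dread": 0.0001,
    # ...and so on, one entry per token in the vocabulary
}
prediction = max(next_token_probs, key=next_token_probs.get)
print(context, prediction)  # "The cat sat on the mat"
```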
The process (there's a minimal code sketch right after this list):
- Feed in a chunk of text
- Model makes a prediction about the next token
- Check if it's right (or close enough)
- Adjust those billions of parameters slightly to make better predictions next time
- Repeat approximately one trillion times
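Here's that loop spelled out as a minimal sketch in PyTorch. The tiny recurrent model and the random token IDs are stand-ins (a real run uses a transformer and actual text), but the objective is exactly the one described above: predict the next token, measure the miss with cross-entropy, nudge the weights.

```python
# Minimal sketch of the pre-training loop, assuming PyTorch. The tiny model
# and random "corpus" are stand-ins; the objective is the real one.
import torch
import torch.nn as nn

VOCAB_SIZE, CONTEXT, EMBED = 50_000, 128, 256

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED)
        self.body = nn.LSTM(EMBED, EMBED, batch_first=True)  # stand-in for a transformer stack
        self.head = nn.Linear(EMBED, VOCAB_SIZE)             # logits over the whole vocabulary

    def forward(self, tokens):
        hidden, _ = self.body(self.embed(tokens))
        return self.head(hidden)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # real runs: trillions of tokens, not 100 steps
    batch = torch.randint(0, VOCAB_SIZE, (8, CONTEXT + 1))  # fake token IDs
    inputs, targets = batch[:, :-1], batch[:, 1:]           # target = the *next* token at each position
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # how wrong were we, and in which direction?
    optimizer.step()  # adjust the parameters slightly
```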
This takes weeks or months on clusters of GPUs that cost more than a decent house. The electricity bill alone could fund a small country's education system. But you know what? It works.
By the end of pre-training, the model has developed an internal representation of grammar, facts, reasoning patterns, and even some coding ability — all from trying to predict the next word. It's like how you don't learn to write by studying "how to write," you learn by reading a lot and absorbing patterns. Same idea, scaled up until it gets weird.
Act 2: Supervised Fine-Tuning (Making It Useful)
Plot twist: a pre-trained model is actually kind of useless for conversation. You can prompt it with "The capital of France is" and it'll complete it with "Paris." Great. But ask it a question like "What's the capital of France?" and it might just... continue with another question. Or a random fact. It's been trained to predict plausible text, not to be helpful.
This is where supervised fine-tuning comes in.
Engineers (and increasingly, contractors) write examples of good question-answer pairs:
- Human: "Explain photosynthesis simply"
- Assistant: "Photosynthesis is how plants convert sunlight into chemical energy..."
The model gets trained on thousands of these examples, learning to act like a helpful assistant rather than a text completion engine. It's the difference between a model that continues your sentence and one that actually responds to what you're asking.
This phase is much shorter than pre-training — days instead of months — because we're not teaching it language from scratch. We're just teaching it to be polite about the language it already knows.
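To give a sense of what that data looks like under the hood, here's a minimal sketch of how one of those pairs might be serialized before training. The role markers are invented for illustration; every model family defines its own chat template, but the shape is the same.

```python
# Sketch of how an instruction-response pair might be serialized for
# supervised fine-tuning. The <|user|> / <|assistant|> markers are
# illustrative, not any particular model's real template.
sft_examples = [
    {
        "prompt": "Explain photosynthesis simply",
        "response": "Photosynthesis is how plants convert sunlight into chemical energy...",
    },
]

def to_training_text(example):
    # Wrap the pair in role markers so the model learns the turn structure,
    # not just the words.
    return (
        f"<|user|>\n{example['prompt']}\n"
        f"<|assistant|>\n{example['response']}<|end|>"
    )

for ex in sft_examples:
    print(to_training_text(ex))

# Training then reuses the same next-token objective as pre-training, usually
# with the loss computed only on the assistant's tokens.
```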
Act 3: RLHF (Reinforcement Learning from Human Feedback)
Here's where things get philosophically interesting. We've got a model that can complete text and answer questions. But how do we make it good at answering questions? How do we teach it to be helpful, harmless, and honest?
Human preference.
The setup:
- Give the model a prompt
- Let it generate several different responses
- Have humans rank those responses from best to worst
- Train the model to prefer generating responses that humans would rank higher
This is called Reinforcement Learning from Human Feedback (RLHF), and it's why ChatGPT doesn't just spew the first grammatically correct thing that comes to its neural networks. It's learned that some responses are better than others according to human judges.
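Here's a rough sketch of the reward-model step at the heart of that ranking loop, assuming PyTorch. The linear scorer and the random "response embeddings" are stand-ins; the point is the preference objective, which pushes the score of the human-preferred response above the rejected one.

```python
# Sketch of reward-model training from human preference pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(768, 1)  # stand-in: maps a response representation to a scalar score
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def preference_loss(chosen_emb, rejected_emb):
    # Bradley-Terry style objective: maximize the probability that the
    # human-chosen response scores higher than the rejected one.
    chosen_score = reward_model(chosen_emb)
    rejected_score = reward_model(rejected_emb)
    return -F.logsigmoid(chosen_score - rejected_score).mean()

# Fake embeddings standing in for two responses a human has already ranked.
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)

loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()

# The trained reward model then scores fresh generations, and a reinforcement
# learning step (commonly PPO) nudges the LLM toward higher-scoring responses.
```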
The result? Models that refuse to help you build a bomb, apologize when they're wrong (too much, honestly), and try to be balanced in their answers. They've learned human preferences, not just human language.
The catch: Human preference is subjective. What's "helpful" to one person is "verbose" to another. What's "harmless" in one culture might be offensive in another. RLHF makes models useful, but it also bakes in a particular set of values. Worth remembering.
What This Actually Looks Like at Scale
Let's put some numbers to this absurdity:
- GPT-3 (2020): 175 billion parameters, trained on 300 billion tokens, cost estimated at $5-12 million
- GPT-4 (2023): Parameter count unknown (OpenAI stopped sharing), trained on who-knows-how-much data, cost estimated at over $100 million
- Llama 2 70B (2023): 70 billion parameters, trained on 2 trillion tokens on GPU clusters Meta built in-house
These aren't grad student projects. These are industrial operations that require:
- Teams of engineers
- Clusters of thousands of GPUs
- Months of compute time
- Cooling systems that could handle a small data center
- More electrical power than most people use in a lifetime
And that's just for the base model. Once you've got one, you can fine-tune it relatively cheaply (comparatively speaking — still thousands of dollars). But pre-training from scratch? That's big tech company money.
Why Does This Process Work?
Honestly? We're not entirely sure.
We know that it works. We can measure that it works. But the mechanism by which showing a neural network trillions of tokens teaches it to reason, write code, and engage in abstract thought? That's still an active area of research.
Some researchers think the model develops internal "world models" — representations of how things work. Others think it's just very sophisticated pattern matching all the way down. The truth is probably somewhere in between, and we're still figuring it out.
What we do know: scale matters. Bigger models trained on more data consistently perform better. There's something about having 100 billion parameters instead of 10 billion that unlocks new capabilities. We call these "emergent abilities" — things the model can do that smaller models simply cannot, even though the training process is identical.
It's like how enough neurons firing in a particular way somehow creates consciousness. Or how enough ants following simple rules creates a colony that acts intelligently. Scale creates complexity creates capability. The exact mechanism? Still mysterious.
The Dirty Secret
All that training data? Scraping the internet means you're inevitably training on:
- Copyrighted content (lawsuits pending)
- Biased text (from a biased world)
- Factual errors (the internet is wrong about everything)
- Toxic content (yes, even with filters)
The model learns all of it. Then we try to steer it away from the bad stuff through fine-tuning and RLHF. Does this completely solve the problem? No. Does it make the models usable? Yes. Is it philosophically messy? Absolutely.
The Takeaway
Training an LLM is an exercise in brute force pattern recognition at a scale that's hard to conceptualize. We're not programming these things in any traditional sense. We're creating conditions for them to learn patterns from data, then hoping (and guiding) them toward useful behavior.
It's expensive, it's resource-intensive, it raises questions about copyright and bias, and it works surprisingly well despite us not fully understanding why.
In Part 3, we'll dig deeper into what "training" actually means mathematically (don't worry, I'll be gentle). In Part 4, we'll talk about fine-tuning and why you probably don't want to train your own LLM from scratch. And in Part 5, we'll cover RAG — a technique that lets you bolt your company's knowledge onto an existing model without retraining anything.
See you there.
Next: Part 3 — What Does "Training" Actually Mean? (Gradient Descent and Other Joys)