LLM 101: Part 3 — What Does "Training" Actually Mean? (Gradient Descent and Other Joys)
By Ahmed M. Adly (@RealAhmedAdly)
Previously: We hand-waved about "adjusting parameters." Time to get slightly more specific.
In Part 2, I told you that training involves showing the model text and adjusting billions of parameters to make better predictions. You nodded politely. But what does "adjusting parameters" actually mean? And how does the model know which way to adjust them?
Welcome to the part where we get a tiny bit mathematical. Don't panic. No calculus required, just concepts.
The Fundamental Problem
You've got a model with 70 billion parameters. It makes predictions. Most of them are wrong at first (randomly initialized parameters are about as useful as a random number generator). You need to make those predictions better.
But here's the thing: you can't just manually tune 70 billion dials. Even if you spent one second per parameter, that's over 2,200 years. You need an algorithm. Specifically, you need an algorithm that can figure out which direction to adjust each parameter to improve the model's predictions.
Enter: gradient descent. The workhorse of modern machine learning.
Loss Functions: How Wrong Are We?
Before you can improve, you need to measure how bad you are. This is what a loss function does.
Here's the setup:
- Model predicts the next token: "The cat sat on the ___"
- Model gives probabilities: mat (35%), floor (25%), chair (20%), keyboard (15%), hat (5%)
- Actual answer from training data: "mat"
- Loss function calculates: how far off were those probabilities from perfect?
Perfect would be: mat (100%), everything else (0%). We're not perfect. The loss function gives us a number representing how wrong we were. Lower is better. Zero would be perfect (never happens).
The most common loss function for LLMs is called "cross-entropy loss." The math doesn't matter. What matters is: it gives us a single number that says "this prediction was this wrong."
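To make that concrete, here's the "mat" example as a few lines of Python. It's a stripped-down sketch: a real model assigns probabilities to tens of thousands of tokens, not five, but the loss calculation is the same idea.

```python
import math

# Model's predicted probabilities for the next token after "The cat sat on the ___"
# (numbers from the example above; a real model scores a vocabulary of ~100,000 tokens)
probs = {"mat": 0.35, "floor": 0.25, "chair": 0.20, "keyboard": 0.15, "hat": 0.05}

# The training data says the actual next token was "mat"
target = "mat"

# Cross-entropy loss: the negative log of the probability assigned to the correct token.
# Probability 1.0 -> loss 0.0 (perfect). Lower probability -> larger loss.
loss = -math.log(probs[target])
print(f"loss = {loss:.3f}")                               # ~1.050

# Had the model been more confident in "mat", the loss would be smaller:
print(f"loss at 90% confidence = {-math.log(0.9):.3f}")   # ~0.105
```

Notice the loss only looks at the probability given to the correct token: give "mat" more probability and the number drops.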
Gradient Descent: Rolling Downhill
Imagine you're standing on a mountain in dense fog. You can't see the valley below, but you want to get there. You also have a magic stick that tells you which direction is downhill right where you're standing.
You take a small step downhill. Check the stick again. Step again. Eventually, you reach the bottom (or at least a local low point).
This is gradient descent. Except:
- The "mountain" is a 70-billion-dimensional space (don't try to visualize this)
- "Downhill" means "direction that reduces loss"
- The "magic stick" is calculus (specifically, the gradient)
The process:
- Make a prediction with current parameters
- Calculate the loss (how wrong you were)
- Calculate the gradient (which direction to adjust each parameter to reduce loss)
- Adjust all parameters slightly in that direction
- Repeat a trillion times
The gradient tells you, for each parameter, whether a small increase would push the loss up or down, and by roughly how much. Do that for all 70 billion parameters simultaneously. It's computed efficiently using something called backpropagation, which is clever calculus that I'm not going to explain because you'd hate me.
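Here's that loop in miniature, as Python you can actually run. It's a toy with one parameter and a made-up loss function, not an LLM, and the gradient is written by hand because backpropagation would be overkill for a single dial.

```python
# A toy gradient descent loop on a one-parameter "loss" function.
# Real training does the same dance across billions of parameters,
# using backpropagation to get the gradients.

def loss(w):
    # Pretend the "perfect" parameter value is 3.0
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss above, written by hand for this toy
    return 2 * (w - 3.0)

w = -5.0            # a random-ish starting parameter
learning_rate = 0.1

for step in range(50):
    g = gradient(w)              # which way is downhill, and how steep?
    w = w - learning_rate * g    # take a small step downhill
    if step % 10 == 0:
        print(f"step {step:2d}: w = {w:.4f}, loss = {loss(w):.4f}")

# w converges toward 3.0 and the loss heads toward zero
```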
Learning Rate: The Goldilocks Problem
How big should each adjustment be? This is called the learning rate, and it's critical.
Too large: You overshoot. Imagine trying to get to the bottom of the valley by taking enormous leaps. You'll just bounce from one side to the other, possibly flying right over the valley entirely. Your loss goes up, not down. Training explodes.
Too small: You get there eventually, but it takes forever. Each step is a tiny shuffle. Training takes months instead of weeks. Your AWS bill becomes a mortgage payment.
Just right: Big enough to make progress, small enough to not overshoot. Finding this sweet spot is part art, part science. It usually involves trying a bunch of values and seeing what works.
(In practice, we usually combine two tricks: a learning rate schedule that warms up and then decays over time, and an adaptive optimizer like Adam that scales each parameter's step size automatically. But the core concept stays the same.)
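To see the Goldilocks problem in action, here's the same toy quadratic loss run with three different learning rates. The values are picked purely to illustrate the failure modes.

```python
# Same toy loss as before, three different learning rates.

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2 * (w - 3.0)

for lr in (1.1, 0.001, 0.1):      # too large, too small, roughly right
    w = -5.0
    for _ in range(30):
        w -= lr * gradient(w)
    print(f"lr={lr:<6} after 30 steps: w = {w:10.2f}, loss = {loss(w):.2e}")

# lr=1.1   -> w bounces away from 3.0 and the loss blows up (overshooting)
# lr=0.001 -> w has barely moved from -5.0 (progress, but painfully slow)
# lr=0.1   -> w is essentially 3.0 (converged)
```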
Batches: Because One Example at a Time Is Too Slow
In theory, you could:
- Show the model one piece of text
- Calculate loss
- Update all parameters
- Repeat with next piece of text
In practice, this is painfully slow and noisy. Instead, we use batches.
Batching:
- Show the model 1,000 pieces of text (a "batch")
- Calculate loss for each
- Average them
- Update parameters based on the average gradient
This is faster (GPUs are good at parallel processing) and more stable (averaging smooths out noise). The batch size is another hyperparameter to tune. Too small: updates are noisy and training is slow. Too large: each update is smoother but the model may generalize worse, and you definitely need more GPU memory.
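Here's a minimal sketch of what "average the gradients over a batch" looks like in code. The model is a made-up one-parameter toy (predict y = w * x), not an LLM, but the batching logic has the same shape.

```python
import numpy as np

# Batched gradient updates on a toy model that predicts y = w * x.
# The data, model, and loss are invented for illustration.

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.5 * x + rng.normal(scale=0.1, size=10_000)   # the "true" w is 2.5, plus noise

w = 0.0
learning_rate = 0.1
batch_size = 1_000

for epoch in range(3):                              # a few passes over the data
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Gradient of the squared error (y - w*x)^2 w.r.t. w for each example,
        # averaged over the batch: one smooth update instead of 1,000 noisy ones.
        grad = np.mean(-2 * xb * (yb - w * xb))
        w -= learning_rate * grad

print(f"learned w = {w:.3f}")   # lands close to 2.5
```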
Why This Feels Like Alchemy
Here's the uncomfortable truth: we've got a bunch of hyperparameters to set:
- Learning rate
- Batch size
- How long to train
- Which optimizer to use (Adam? AdamW? Something else?)
- Learning rate schedules
- Regularization techniques
And no one really knows the "right" values for a new model. We make educated guesses based on what worked before. We run small experiments. We use rules of thumb. Sometimes we just try stuff until something works.
It's less "applied mathematics" and more "principled tinkering guided by intuition and expensive compute." Research labs develop instincts. Best practices emerge. But there's no formula that tells you exactly how to train the perfect model.
The Miracle of Backpropagation
I glossed over this, but it's worth a moment of appreciation. Backpropagation is the algorithm that efficiently computes gradients for all 70 billion parameters without having to check each one individually.
It applies the chain rule from calculus, propagating the error backward through the network layer by layer. It's mathematically elegant and computationally efficient. Without it, training deep neural networks would be impossible.
You don't need to understand how it works (frankly, most people using it don't really grok the details). But know that it's the reason we can train these models at all.
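If you want the flavor without the pain, here's the chain rule on a "network" with exactly two steps. It's a hand-rolled toy, not how any real framework spells it, but it's the same multiply-the-local-derivatives idea that backpropagation scales up to billions of parameters.

```python
import math

# A two-step "network": h = 3w, y = tanh(h), loss = (y - target)^2.
# Backpropagation is this idea applied systematically to every parameter
# in a deep network, reusing intermediate results so it stays cheap.

w = 0.5
h = w * 3.0             # "layer 1"
y = math.tanh(h)        # "layer 2"
loss = (y - 1.0) ** 2   # how far is the output from a target of 1.0?

# Backward pass: multiply local derivatives from the loss back to the parameter.
dloss_dy = 2 * (y - 1.0)              # derivative of (y - 1)^2 w.r.t. y
dy_dh = 1 - math.tanh(h) ** 2         # derivative of tanh
dh_dw = 3.0                           # derivative of 3w
dloss_dw = dloss_dy * dy_dh * dh_dw   # chain rule: multiply them together

print(f"loss = {loss:.4f}, gradient w.r.t. w = {dloss_dw:.4f}")
# The gradient is negative, so *increasing* w would decrease the loss.
```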
Overfitting: When the Model Memorizes Instead of Learns
Here's a trap: imagine studying for a test by memorizing all the questions and answers. You'll ace that test. But give you a slightly different question, and you're lost. That's overfitting.
LLMs can do this too. Train on the same data too long, and the model starts memorizing specific examples rather than learning general patterns. It'll repeat training data verbatim but fail on anything new.
We combat this with:
- Regularization: Techniques, like weight decay, that penalize overly large weights or overconfident predictions, nudging the model toward simpler solutions
- Dropout: Randomly disabling parts of the network during training to force redundancy
- Early stopping: Quit training when performance on held-out data stops improving
- More data: Harder to memorize when there are trillions of tokens
The goal is a model that generalizes — that learned the underlying patterns of language, not just the specific examples it saw.
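Early stopping is simple enough to show in full. Here's a sketch that walks a made-up validation-loss curve; in real training, those numbers would come from evaluating the model on held-out data after each pass.

```python
# Early stopping on an invented validation-loss curve: the loss improves
# for a while, then creeps back up as the model starts to memorize.
fake_val_losses = [2.9, 2.4, 2.1, 1.95, 1.90, 1.89, 1.91, 1.93, 1.96, 1.99]

best = float("inf")
patience, bad_epochs = 3, 0   # tolerate a few epochs without improvement

for epoch, val_loss in enumerate(fake_val_losses):
    if val_loss < best:
        best = val_loss
        bad_epochs = 0        # improvement: reset the counter (and save a checkpoint here)
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch}: no improvement for {patience} epochs "
                  f"(best validation loss {best})")
            break
```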
Convergence: Are We There Yet?
How do you know when to stop training? When the model is "done"?
Trick question. It's never done. Loss keeps decreasing, but with diminishing returns. You stop when:
- Loss stops improving meaningfully
- You run out of compute budget
- You run out of time
- Performance on your evaluation benchmarks plateaus
Training is an economic decision as much as a scientific one. That last 2% improvement might cost as much as the first 80%. Is it worth it? Depends on your use case and your checkbook.
The Philosophical Bit
What are these billions of parameters actually learning? We think they're building internal representations of concepts, relationships, and patterns. The early layers might learn basic features (common letter combinations, simple grammar). Deeper layers learn more abstract concepts (semantic meaning, logical reasoning, maybe even something like "understanding").
But here's the thing: we can't directly read these representations. The parameters are just numbers. We can probe the network, run experiments, visualize activations. But ultimately, the model is a black box. We know what goes in (text) and what comes out (predictions), but the middle is largely opaque.
This bothers some people. It should probably bother more people. We've built something we don't fully understand. It works. We use it. But we can't always explain why it makes the decisions it makes.
The Takeaway
Training is an optimization process. We start with random parameters and iteratively adjust them to minimize prediction error. It's mathematically principled (gradient descent is well-understood) but practically messy (hyperparameters are chosen by vibes and experimentation).
The result is a model that has compressed patterns from trillions of tokens into billions of parameters. It's lossy compression — the model doesn't memorize everything (we hope). Instead, it learns statistical patterns that let it generate plausible text.
Is it perfect? No. Is it impressive? Ridiculously so.
In Part 4, we'll talk about fine-tuning — what you do with a trained model to make it better at specific tasks. Spoiler: it's training again, but faster and cheaper. And in Part 5, we'll cover RAG, which is how you make models work with your data without training at all.
Almost there.
Next: Part 4 — Fine-Tuning: Why You Probably Don't Want to Train Your Own LLM