LLM 101: Part 4 — Fine-Tuning: Why You Probably Don't Want to Train Your Own LLM

Previously: We learned how training works mathematically. Now let's talk about what you actually do with a trained model.


You've made it to Part 4. By now you know that training an LLM from scratch costs millions of dollars and requires GPU clusters that could heat a small city. You're probably thinking: "Great, so I'll never train one of those. But what about fine-tuning? I keep hearing we should fine-tune a model for our use case."

Let me save you some time and money: you probably shouldn't.

But let's talk about when you should, because those cases do exist. Just fewer of them than the consulting firms would have you believe.

What Is Fine-Tuning, Actually?

Fine-tuning is taking an already-trained model and training it more on specific data. You're not starting from scratch (that's pre-training, which we covered in Part 2). You're taking something like Qwen or LLaMA and continuing the training process with your own dataset.

Think of it like this: Pre-training taught the model English. Fine-tuning teaches it your dialect.

The base model already knows grammar, facts, reasoning, and how to generate text. Fine-tuning adjusts those billions of parameters slightly to make it better at your specific thing. That thing could be:

  • Writing in your company's tone and style
  • Following specific formatting rules
  • Answering questions about a specialized domain
  • Performing a particular task with high accuracy
  • Refusing certain types of requests
  • Literally anything you can show it examples of

The process is the same as pre-training (gradient descent, loss functions, all that stuff from Part 3), just with your data instead of the entire internet. And it's much faster and cheaper because you're not teaching the model language from scratch; you're just nudging it in a direction.

The Types of Fine-Tuning You'll Hear About

There isn't just one kind of fine-tuning. Let's clarify what people mean when they throw the term around.

Supervised Fine-Tuning (SFT)

This is the "normal" fine-tuning most people mean. You provide input-output pairs, and the model learns to map your inputs to your outputs.

Example dataset:

Input: "Translate to SQL: show me all users who signed up last month"
Output: "SELECT * FROM users WHERE signup_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)"

Input: "Translate to SQL: count active subscriptions"
Output: "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'"

Feed the model a few hundred or thousand examples like this, and it'll get pretty good at your specific task. This is what people usually mean when they say "fine-tune."
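
If you use a managed fine-tuning service, you'll typically upload data like this as a JSONL file, one example per line. As a rough sketch (field names vary by provider; this mirrors the chat-style format OpenAI's fine-tuning API uses), the SQL examples above might look like:

{"messages": [{"role": "user", "content": "Translate to SQL: show me all users who signed up last month"}, {"role": "assistant", "content": "SELECT * FROM users WHERE signup_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)"}]}
{"messages": [{"role": "user", "content": "Translate to SQL: count active subscriptions"}, {"role": "assistant", "content": "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'"}]}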

Instruction Tuning

This is supervised fine-tuning, but specifically for making the model better at following instructions. It's what turns a base model (which just completes text) into an assistant (which responds to requests).

The big AI labs do this as part of their training process (remember Act 2 from Part 2?). You probably don't need to do it yourself unless you're building something very specific.

Parameter-Efficient Fine-Tuning (PEFT)

Here's where things get clever. Full fine-tuning updates every parameter in the model (all 70 billion of them, if you're working with a 70B model). That's expensive and slow. PEFT methods update only a small subset of parameters or add new ones.

LoRA (Low-Rank Adaptation): The most popular PEFT method. Instead of updating the original weights, you freeze them and train small low-rank "adapter" matrices that modify the model's behavior. These adapters are tiny (maybe 0.1% of the original model size). Much cheaper, much faster, and you can swap adapters in and out for different tasks.

Prompt tuning: You don't change the model at all. You just train a few "soft prompt" tokens that get prepended to inputs. The model stays frozen; only these prompt tokens are learned. Sounds weird, works surprisingly well for some tasks.

PEFT is what makes fine-tuning accessible to teams that don't have big-tech budgets. With LoRA (or its quantized variant, QLoRA), you can fine-tune even a 70B parameter model on a single high-end GPU in hours instead of days.
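
To make this concrete, here's a minimal sketch of what attaching LoRA adapters looks like with Hugging Face's peft library. The model name, rank, and target modules below are illustrative choices, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained base model (illustrative; use whatever base you actually need)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Rank-8 LoRA adapters on the attention projections; everything else stays frozen
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically reports well under 1% of parameters as trainable

From here you train as usual; only the adapter weights receive gradient updates, which is why it fits on modest hardware.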

RLHF (Again)

We covered this in Part 2, but it's worth mentioning here. Reinforcement Learning from Human Feedback is technically fine-tuning, just with a different objective. Instead of training on input-output pairs, you train on human preferences: "this response is better than that response."

Unless you're OpenAI or Anthropic, you're not doing this. It's complex, expensive, and requires a lot of human labeling. But it's how models learn to be helpful, harmless, and honest.

When Fine-Tuning Makes Sense

Alright, the moment you've been waiting for. Here's when you should actually consider fine-tuning:

1. You Need Consistent Formatting

Your use case requires output in a very specific format - JSON with particular fields, XML with specific tags, code with certain conventions. You can prompt for this, but the model sometimes gets it wrong. Fine-tuning can make it much more reliable.

Example: You need the model to always return JSON like:

{"intent": "...", "entities": [...], "confidence": 0.95}

After fine-tuning on a few thousand examples, the model will nail this format nearly every time. Prompting alone? Maybe 95% of the time.

2. You Need a Specific Style or Tone

Your company has a very particular voice, and no amount of prompting captures it perfectly. You have thousands of examples of "correct" writing in your style.

Example: A legal firm needs responses in formal legal language with specific citation formats. A gaming company needs everything in a casual, playful tone. These are hard to prompt consistently but relatively easy to fine-tune.

3. You Have a Narrow, Repeated Task

You're doing the same type of task thousands of times, and getting it right matters. The base model is pretty good, but fine-tuning can push accuracy from 85% to 95%.

Example: Classifying customer support tickets into categories. Extracting structured data from invoices. Translating technical documentation between specific domains.

4. You Need Lower Latency at Scale

This is subtle. If you're making millions of API calls, fine-tuning can let you:

  • Use a smaller model that runs faster (fine-tune a 7B model instead of using a 70B model)
  • Use shorter prompts (the behavior is baked in, so you don't need long instructions)

Both of these save money and reduce latency.

The math: If you're doing 10 million calls per month, shaving 100ms and 500 tokens per call saves real money. But you need to be at scale for this to matter.
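
Back-of-the-envelope, with a hypothetical input price of $2 per million tokens (swap in your provider's actual numbers):

calls_per_month = 10_000_000
tokens_saved_per_call = 500        # prompt tokens you no longer need to send
price_per_million_tokens = 2.00    # hypothetical price; check your provider

monthly_savings = calls_per_month * tokens_saved_per_call / 1_000_000 * price_per_million_tokens
print(f"${monthly_savings:,.0f} saved per month")  # $10,000 at these assumed numbers

Run the same numbers at 10,000 calls per month and the saving is about $10. That's the "you need to be at scale" point in a single calculation.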

5. You Have Proprietary Knowledge AND Prompting Doesn't Cut It

Notice I said "AND." Just having proprietary knowledge isn't enough reason to fine-tune. Most of the time, you should use RAG (we'll cover that in Part 5).

But if you have very specialized knowledge and RAG isn't working well (maybe your domain is so niche that the base model doesn't have good priors), fine-tuning can help the model develop better intuitions.

Example: You're in a super-niche industry with jargon that doesn't exist elsewhere. Medical subspecialties. Rare legal domains. Proprietary internal systems with made-up names.

When Fine-Tuning Doesn't Make Sense (Most of the Time)

Here's the uncomfortable truth: most people who think they need fine-tuning don't.

You Just Want the Model to Know Your Company's Information

Don't fine-tune. Use RAG.

Fine-tuning is terrible at teaching factual knowledge. The model compresses your data into parameters, and information gets lost, distorted, or hallucinated. It's trying to memorize, which we explicitly tried to prevent during training (remember overfitting from Part 3?).

RAG (Part 5) lets the model look up information instead of memorizing it. It's more accurate, more transparent, easier to update, and cheaper. Save fine-tuning for behavior, not knowledge.

You Haven't Tried Better Prompting

Seriously. Have you tried:

  • Few-shot examples in your prompt?
  • Chain-of-thought reasoning?
  • Clearer instructions?
  • Asking it to output in a specific format?
  • Using a better base model?

Modern models are shockingly good at following detailed instructions. Before spending thousands of dollars and weeks fine-tuning, spend an afternoon crafting better prompts. You'll be surprised how far this gets you.
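
For comparison, here's roughly what few-shot prompting looks like for the SQL task from earlier: no training, just examples in the prompt. The client usage and model name are illustrative:

from openai import OpenAI

client = OpenAI()

# A couple of worked examples in the prompt often gets you most of the way there
messages = [
    {"role": "system", "content": "Translate natural-language requests into SQL. Return only the SQL."},
    {"role": "user", "content": "Translate to SQL: count active subscriptions"},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'"},
    {"role": "user", "content": "Translate to SQL: show me all users who signed up last month"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)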

You Don't Have Enough Quality Training Data

Rule of thumb: you need at least a few hundred high-quality examples to fine-tune effectively. Ideally thousands. If you have 50 examples, you don't have enough. Either collect more data or use few-shot prompting instead.

And "quality" matters. Garbage data produces garbage models, just faster than training from scratch. If your training data has inconsistencies, errors, or unclear patterns, fine-tuning will learn those too.

You Can't Maintain It

Fine-tuning isn't a one-time thing. Your use case evolves. Your data changes. The base models improve (GPT-5 gets better, new models come out). You'll need to:

  • Collect new training data over time
  • Re-fine-tune when base models update
  • Evaluate and monitor model performance
  • Debug when things go wrong

If you can't commit to this ongoing maintenance, don't start. A well-prompted base model that you can easily update beats a fine-tuned model that goes stale.

The Practical Reality: Cost and Effort

Let's talk numbers so you know what you're signing up for.

Data Collection and Preparation

This is the hard part. Before you can fine-tune, you need training data:

  • For supervised fine-tuning: Hundreds to thousands of input-output pairs
  • For preference tuning: Multiple outputs per input, ranked by quality
  • All of it needs to be high quality, consistent, and representative

Time investment: Weeks to months, depending on whether you have existing data or need to create it from scratch. Creating data from scratch usually requires domain experts and is expensive.

Cost: If you're hiring contractors to create examples, expect $20-100 per high-quality example, depending on complexity. For 1,000 examples, that's $20K-$100K before you even start training.

The Fine-Tuning Itself

Using a managed service (OpenAI, Anthropic, Google, etc.):

  • Cost: $2-20 per 1M tokens of training data, plus inference costs
  • Time: Hours to a day
  • Complexity: Low - these services handle the infrastructure

Using open-source models (LLaMA, Mistral, etc.) yourself:

  • Cost: GPU rental ($1-5 per hour), or hosted notebooks like Google Colab
  • Time: Hours to days, depending on model size and your data
  • Complexity: Medium to high - you need to set up infrastructure, choose hyperparameters, debug issues

Using PEFT (LoRA) dramatically reduces both cost and time. Full fine-tuning of a 70B parameter model might take days across a cluster of expensive GPUs. With LoRA (plus quantization), it can be done in hours on a single GPU.

Evaluation and Iteration

You fine-tuned your model. Congrats! Now you need to:

  1. Test it on held-out data
  2. Find where it fails
  3. Collect more training data for those failure cases
  4. Fine-tune again
  5. Repeat until good enough

Budget another few weeks for this cycle. First tries rarely produce production-ready models.
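
Step 1 doesn't need anything fancy. A held-out set and a simple metric get you a long way. Here's a minimal sketch, assuming your task has a single correct output per input:

def exact_match_accuracy(predictions, references):
    """Fraction of held-out examples where the model output matches the reference exactly."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Toy usage: compare model outputs against answers you held back from training
predictions = ["SELECT COUNT(*) FROM subscriptions WHERE status = 'active'"]
references = ["SELECT COUNT(*) FROM subscriptions WHERE status = 'active'"]
print(f"Exact match: {exact_match_accuracy(predictions, references):.1%}")

If exact match is too strict for your task, swap in whatever check fits your format, like "does the output parse as valid JSON with the required fields?"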

The Thing Nobody Tells You About Fine-Tuning

Here it is: fine-tuning is mostly about having good data, not fancy algorithms.

The difference between a mediocre fine-tuned model and a great one usually isn't the learning rate or optimizer choice. It's data quality. Specifically:

  • Consistency: Your training data should show the model clear, consistent patterns
  • Coverage: You need examples of all the edge cases you care about
  • Balance: If 90% of your examples are one type and 10% another, the model will be bad at the 10%
  • Quality: Every example teaches the model something. Bad examples teach bad lessons

This means the most important skill for fine-tuning isn't ML engineering. It's data curation. Looking at your data, finding inconsistencies, fixing them, and knowing when you need more examples of something.

If you take away one thing from this section: spend 80% of your time on data, 20% on the actual training. Not the other way around.
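
"Finding inconsistencies" can be embarrassingly simple. One useful check, sketched below, assumes a JSONL training file with input and output fields (the file name and field names are hypothetical):

import json
from collections import defaultdict

outputs_by_input = defaultdict(set)
with open("train.jsonl") as f:  # hypothetical file of {"input": ..., "output": ...} records
    for line in f:
        example = json.loads(line)
        outputs_by_input[example["input"]].add(example["output"])

# The same input mapped to different outputs teaches the model conflicting lessons
conflicts = {inp: outs for inp, outs in outputs_by_input.items() if len(outs) > 1}
print(f"{len(conflicts)} inputs with conflicting outputs, out of {len(outputs_by_input)} unique inputs")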

Fine-Tuning vs. Prompt Engineering: A Decision Tree

Let's make this concrete. Here's how to decide:

Start here: Can you fit good examples in your prompt?

  • Yes: Use few-shot prompting. Don't fine-tune.
  • No: Continue.

Can you describe the task clearly in natural language?

  • Yes: Try prompt engineering with clear instructions. If it works, stop. If it doesn't, continue.
  • No: This is a sign fine-tuning might help. Continue.

Is your task primarily about knowledge or behavior?

  • Knowledge: Use RAG (Part 5), not fine-tuning.
  • Behavior/style/format: Continue.

Do you have 500+ high-quality, consistent examples?

  • No: Collect more data or stick with prompting.
  • Yes: Continue.

Will you do this task enough times to justify the cost?

  • No: Stick with prompting.
  • Yes: Fine-tuning might make sense. Try it.
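
If it helps, here's the same decision tree as a small function; the logic is identical, just executable:

def should_fine_tune(fits_in_prompt, describable_in_nl, prompting_works,
                     task_is_knowledge, example_count, enough_volume):
    """Mirror of the decision tree above; returns a recommendation string."""
    if fits_in_prompt:
        return "Use few-shot prompting"
    if describable_in_nl and prompting_works:
        return "Use prompt engineering"
    if task_is_knowledge:
        return "Use RAG (Part 5)"
    if example_count < 500:
        return "Collect more data or stick with prompting"
    if not enough_volume:
        return "Stick with prompting"
    return "Fine-tuning might make sense; try it"

print(should_fine_tune(False, True, False, False, 2000, True))  # Fine-tuning might make sense; try it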

The Models You Can Actually Fine-Tune

Not all models support fine-tuning. Here's what's available:

Closed-source (via API):

  • OpenAI: GPT-4o-mini, GPT-4.1-mini (GPT-4.1 fine-tuning is limited/expensive)
  • Anthropic: Claude fine-tuning is available for enterprise customers
  • Google: Gemini models (through Vertex AI)

Open-source (self-hosted):

  • LLaMA 2/3 (Meta): 7B to 70B parameters
  • Mistral: 7B parameter models
  • Qwen: Various sizes
  • Dozens of others

Open-source gives you more control and potentially lower costs at scale, but requires infrastructure and expertise. Closed-source APIs are easier but lock you into a vendor.

For most companies, start with an API. You can always switch later if you hit scale where self-hosting makes economic sense.

The Uncomfortable Truth About Base Models

Here's something that should probably be more widely acknowledged: base models keep getting better.

GPT-5 is substantially better than GPT-4.1 at following instructions, even without fine-tuning. Claude 4.5 improved on Claude 4. This trend will continue. Every few months, there's a new state-of-the-art model that's better at everything.

What does this mean for your fine-tuned model?

If you fine-tuned GPT-4.1 four months ago, you're stuck with GPT-4.1 capabilities. Meanwhile, everyone using GPT-5 with good prompts is getting better results with zero training cost. Eventually, you'll need to re-fine-tune on the new base model or accept that you're falling behind.

This isn't a reason to never fine-tune. But it's a reason to be thoughtful about the maintenance burden you're taking on. Prompts are portable across model versions. Fine-tuned models are not.

When Training from Scratch Makes Sense (Spoiler: Almost Never)

You probably noticed this section is titled "Fine-Tuning" not "Training Your Own LLM." That's because training from scratch is almost never the answer for individual companies.

Training a competitive LLM from scratch requires:

  • Compute: Millions of dollars in GPU costs
  • Data: Trillions of tokens of training data
  • Expertise: Team of ML researchers and engineers
  • Time: Months of training, months of iteration
  • Maintenance: Ongoing costs forever

Even if you have $50 million to spare, you're competing with OpenAI, Google, Anthropic, and Meta. They have better data, better researchers, and more compute. Your from-scratch model will be worse than their base models. Then you'll still need to fine-tune it anyway.

The only organizations that should train from scratch are:

  • Big tech companies (and most of them don't)
  • AI research labs (that's their job)
  • Countries or large institutions with specific security/sovereignty requirements

For everyone else: use existing base models. Fine-tune if you must. Focus on building applications that create value, not on recreating infrastructure that already exists.

The Takeaway

Fine-tuning is a tool, not a magic solution. It's really good at making models follow specific formatting, adopt particular styles, or excel at narrow tasks. It's not good at teaching knowledge, fixing fundamental model limitations, or replacing well-crafted prompts.

Before you fine-tune:

  1. Try better prompting
  2. Consider RAG for knowledge needs
  3. Make sure you have good data
  4. Calculate the real cost (time + money + maintenance)
  5. Have a plan for evaluation and monitoring

Most importantly: be honest about whether you actually need it. The unsexy truth is that most use cases are better served by a well-prompted base model, maybe with RAG, than by a mediocre fine-tuned model.

Save fine-tuning for when you've exhausted simpler options and have a clear use case that justifies the investment. Your future self (and your budget) will thank you.

Next up: Part 5, where we talk about RAG - the technique that lets you give models access to your knowledge without the cost or complexity of fine-tuning. Spoiler: it's probably what you should be doing instead.


Next: Part 5 - RAG: It's Not a Cleaning Cloth, It's How You Make LLMs Know Your Stuff