LLM Fine-Tuning in Practice
When to fine-tune vs RAG vs prompt. Full LoRA fine-tuning walkthrough on a real dataset using HuggingFace Transformers and PEFT.
Fine-tuning is not always the answer. Most production LLM applications are better served by prompt engineering or RAG. Fine-tune only when you have labelled data, a specific behaviour to change, and evidence that prompting cannot get you there.
This is the question every ML engineer at an Indian startup faces when building an LLM-powered feature: should we fine-tune a model, or can we get there with prompting and retrieval? Fine-tuning is expensive — data collection, training compute, evaluation, deployment — and mistakes are costly to undo. A fine-tuned model that learned the wrong behaviour is worse than a base model, and fixing it means collecting better data and retraining, not editing a prompt.
The answer depends on what you are trying to change. If the base model already knows how to do the task but needs domain-specific facts — use RAG. If it knows how to do the task but needs a specific output format or tone — use prompting. If it genuinely cannot do the task reliably even with perfect prompts and all context in window — then fine-tune. The bar for fine-tuning should be high.
Hiring an expert consultant vs training a new employee. A consultant (prompting) is fast, flexible, and immediately available — give them the context they need and they will do good work. A trained employee (fine-tuned model) internalises your company's way of doing things, does not need context each time, is faster at inference, but costs significant upfront investment. You hire a consultant first, hire full-time only when the work is consistent, high-volume, and the consultant approach is insufficient.
Razorpay's payment dispute classifier: prompting GPT-4 worked at 80% accuracy. Fine-tuned LLaMA-3-8B reached 94% at 10× lower cost per query. The volume justified the training investment. Volume and consistency are the two conditions that make fine-tuning worth it.
Prompt vs RAG vs fine-tune — a decision framework with real examples
Data preparation — the format that makes or breaks fine-tuning
The quality of fine-tuning data matters far more than the choice of model or hyperparameters. 500 high-quality, diverse examples consistently outperform 5,000 mediocre examples. Every example must follow the exact same chat template the base model was trained with. Mismatched templates are the most common silent failure — the model trains without error but produces garbage at inference.
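To make the format concrete, here is a minimal sketch of a training file in the messages format that SFT tooling consumes. The dispute-classification examples are invented for illustration, and `to_training_text` is a hypothetical helper showing the real `tokenizer.apply_chat_template` call — it is defined but not invoked here, so the sketch runs without downloading any tokenizer.

```python
import json

# Invented dispute-classification examples; in practice these come from
# your labelled production data.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the payment dispute into one of: fraud, duplicate, service_not_received."},
        {"role": "user", "content": "I was charged twice for the same order on 3 March."},
        {"role": "assistant", "content": "duplicate"},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify the payment dispute into one of: fraud, duplicate, service_not_received."},
        {"role": "user", "content": "I never authorised this transaction."},
        {"role": "assistant", "content": "fraud"},
    ]},
]

# One JSON object per line — the common JSONL layout for SFT datasets.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

def to_training_text(tokenizer, messages):
    """Render one example with the base model's own chat template.

    `tokenizer` is any tokenizer loaded with AutoTokenizer.from_pretrained;
    apply_chat_template inserts the model-specific special tokens so you
    never hardcode them yourself.
    """
    return tokenizer.apply_chat_template(messages, tokenize=False)
```

The point of routing everything through `apply_chat_template` is that swapping LLaMA-3 for Mistral changes the rendered text automatically — the dataset on disk stays template-agnostic.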
QLoRA fine-tuning — 4-bit quantisation + LoRA on a 7B model
QLoRA (Module 51) combines 4-bit quantisation of the frozen base model with LoRA adapters that train in fp16. This makes fine-tuning a 7B model possible on a single 16GB GPU — a Google Colab T4 or a local RTX 4090. The TRL library (from HuggingFace) provides SFTTrainer — a Trainer subclass designed for supervised fine-tuning that handles chat-template formatting, packing short sequences together, and gradient checkpointing automatically.
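A sketch of that QLoRA setup, assuming recent versions of transformers, peft, bitsandbytes, and TRL (whose SFTTrainer API has shifted across releases — here SFTConfig carries the packing and sequence-length options). The model name, output directory, and the `HYPERS` dict are illustrative; the heavy imports sit inside the function so the file loads on a machine without a GPU.

```python
# Hyperparameters from the text: rank-16 LoRA on all projection layers,
# effective batch size 16 via gradient accumulation, 3 epochs.
HYPERS = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "batch_size": 4,
    "grad_accum": 4,
    "epochs": 3,
}

def build_trainer(train_dataset):
    # Heavy imports live here so this module imports without a GPU installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    bnb = BitsAndBytesConfig(            # 4-bit NF4 quantised frozen base
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",  # any 7B chat model works here
        quantization_config=bnb,
        device_map="auto",
    )
    lora = LoraConfig(                   # fp16 adapters on all projections
        r=HYPERS["lora_rank"],
        lora_alpha=HYPERS["lora_alpha"],
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )
    args = SFTConfig(
        output_dir="qlora-out",
        per_device_train_batch_size=HYPERS["batch_size"],
        gradient_accumulation_steps=HYPERS["grad_accum"],
        num_train_epochs=HYPERS["epochs"],
        gradient_checkpointing=True,
        optim="paged_adamw_32bit",       # paged optimiser avoids OOM spikes
        packing=True,                    # pack short sequences together
        max_seq_length=1024,
        learning_rate=2e-4,
    )
    return SFTTrainer(model=model, args=args,
                      train_dataset=train_dataset, peft_config=lora)
```

With batch_size 4 and gradient_accumulation 4, each optimiser step sees 16 examples — the knob to turn first if you hit out-of-memory errors is batch size, compensating with more accumulation steps.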
Evaluating fine-tuned LLMs — beyond perplexity
Training loss and perplexity tell you the model is learning but not whether it will perform well in production. For task-specific fine-tuning, evaluate on task metrics: exact match accuracy for classification, ROUGE for summarisation, code execution rate for code generation. Always hold out a test set that the model never sees during training. Always compare against the base model and a prompting baseline — if fine-tuning does not beat prompting by a meaningful margin, the fine-tuning is not worth the cost.
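The comparison logic itself is a few lines of plain Python. The predictions below are toy stand-ins for the outputs of the fine-tuned model and the prompting baseline; the function names are illustrative, not a library API.

```python
# Minimal exact-match evaluation harness for a classification task.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the gold label."""
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

# Held-out test labels the model never saw during training.
gold = ["fraud", "duplicate", "service_not_received", "duplicate"]

# Toy outputs standing in for each system's predictions.
finetuned = ["fraud", "duplicate", "service_not_received", "fraud"]
prompt_baseline = ["fraud", "duplicate", "fraud", "fraud"]

ft_acc = exact_match_accuracy(finetuned, gold)          # 0.75
base_acc = exact_match_accuracy(prompt_baseline, gold)  # 0.5

# Ship fine-tuning only if it beats the prompting baseline by a margin.
worth_it = ft_acc - base_acc >= 0.05
```

The same harness extends to ROUGE or code-execution metrics by swapping the comparison function; the decision rule at the end is the part that matters.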
Deploying fine-tuned LLMs — serving, versioning, and monitoring
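One common serving pattern, sketched under assumptions: keep the frozen base model loaded once and attach a small LoRA adapter per released version. The registry, paths, and model name below are illustrative; `PeftModel.from_pretrained` is the real peft API for attaching an adapter to a base model.

```python
# Adapter-per-version serving: LoRA adapters are a few hundred MB at most,
# so keeping one directory per release is cheap and makes rollback trivial.
ADAPTER_REGISTRY = {
    "v1": "adapters/dispute-classifier-v1",
    "v2": "adapters/dispute-classifier-v2",
}

def resolve_adapter(version):
    """Map a deployed version tag to its adapter directory."""
    if version not in ADAPTER_REGISTRY:
        raise KeyError(f"unknown adapter version: {version}")
    return ADAPTER_REGISTRY[version]

def load_model(version):
    # Heavy imports kept inside the function so the sketch runs without a GPU.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct")
    # PeftModel.from_pretrained attaches the LoRA weights to the frozen base.
    return PeftModel.from_pretrained(base, resolve_adapter(version))
```

Rolling back a bad release then means pointing the registry at the previous adapter directory — the multi-gigabyte base model never moves.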
Every common LLM fine-tuning mistake — explained and fixed
You can fine-tune any LLM. Next: models that see and understand both images and text simultaneously.
Fine-tuning adapts a model to a specific task using labelled examples. The next frontier is multimodal models — models that jointly understand images and text. CLIP encodes images and text in a shared embedding space. LLaVA connects a vision encoder to an LLM decoder, enabling visual question answering. Module 66 covers how these architectures work and how to use them for tasks that require understanding both what is written and what is shown.
Models that see and understand images and text together. CLIP for zero-shot image classification, LLaVA for visual question answering.
🎯 Key Takeaways
- ✓Fine-tune only when prompting and RAG cannot get you there. The decision hierarchy: prompt engineering first (1 day, flexible, no training cost) → RAG for knowledge gaps (1-2 weeks) → LoRA fine-tuning for consistent behaviour change on high-volume tasks (2-4 weeks) → full fine-tuning almost never for applications. Fine-tuning must be justified by volume and a clear quality gap over prompting.
- ✓Data quality trumps data quantity. 500 high-quality, diverse, correctly-labelled examples beat 5,000 mediocre ones every time. Evaluate your data before training: read 50 random examples manually. If you find labelling inconsistencies, fix the data first. The most impactful ML work is data cleaning, not model architecture.
- ✓Chat template format must match the base model exactly. LLaMA-3, Mistral, Phi-3, and Gemma all use different special tokens. Apply the template with tokenizer.apply_chat_template() — never hardcode template strings manually. Template mismatches are the most common silent failure in LLM fine-tuning.
- ✓Use DataCollatorForCompletionOnlyLM to compute loss on assistant tokens only. Training on prompt tokens wastes compute and teaches the model the wrong thing — it should learn to generate responses, not re-generate inputs. Verify by printing token labels: prompt positions must be -100 (ignored).
- ✓QLoRA (4-bit quantisation + LoRA, rank 16, all projection layers) on a 7B model fits in 16GB VRAM with batch_size=4 and gradient_accumulation=4. Use paged_adamw_32bit optimiser, gradient_checkpointing=True, and packing=True in SFTTrainer. Training 3 epochs on 2,000 examples takes approximately 30-60 minutes on a T4.
- ✓Always compare fine-tuned model against: base model with no prompt, base model with optimised prompt, and a stronger model API (GPT-4) with optimised prompt. If GPT-4 with a good prompt beats your fine-tuned model, you have a data or training problem, not a capability gap. Ship the simpler approach until fine-tuning genuinely wins on your evaluation set.
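The label-masking check from the takeaways can be done by hand on a single batch. This pure-Python sketch mirrors what DataCollatorForCompletionOnlyLM does: every token position belonging to the prompt gets label -100 so the cross-entropy loss ignores it. The token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value HuggingFace loss functions skip

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids to labels, masking everything before the response.

    In real training, `response_start` is found by locating the assistant
    response template in the tokenised sequence — the collator does this
    search for you; here we pass the index directly.
    """
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Fake token ids: the first 5 positions are the prompt, the last 2 the response.
tokens = [101, 5, 6, 7, 200, 9, 10]
labels = mask_prompt_labels(tokens, response_start=5)
```

Printing `labels` for one real batch and eyeballing that every prompt position reads -100 is the cheapest sanity check in the whole pipeline.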