Fine-Tuning with PEFT — LoRA and Adapters
Tune less than 1% of a model's parameters and get 95% of the performance. LoRA, adapters, and prefix tuning — when and how to use each.
Full fine-tuning a 7B parameter model requires 28GB of GPU memory just to store the weights. LoRA fine-tunes the same model using 16MB of trainable parameters — on a single consumer GPU.
Module 50 showed full fine-tuning — update all 110M parameters of BERT for 3 epochs. That costs 4GB of GPU memory and 30 minutes. Acceptable for BERT. Completely impractical for LLaMA-3 (8B), Mistral (7B), or Falcon (40B). Full fine-tuning a 7B model requires storing the model weights (28GB in fp32), the gradients (another 28GB), and the optimiser states (56GB for Adam). Total: 112GB VRAM. No consumer GPU has that.
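The VRAM figures above follow from simple byte-counting (a back-of-the-envelope sketch; Adam's two fp32 moment buffers account for the 56GB):

```python
# Back-of-the-envelope VRAM budget for full fine-tuning a 7B model in fp32.
params = 7e9
bytes_per_fp32 = 4

weights_gb = params * bytes_per_fp32 / 1e9      # model weights
grads_gb   = params * bytes_per_fp32 / 1e9      # one gradient per weight
adam_gb    = 2 * params * bytes_per_fp32 / 1e9  # Adam keeps momentum + variance per weight

total_gb = weights_gb + grads_gb + adam_gb
print(weights_gb, grads_gb, adam_gb, total_gb)  # 28.0 28.0 56.0 112.0
```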
PEFT (Parameter-Efficient Fine-Tuning) solves this by updating only a tiny fraction of the model's parameters while freezing the rest. LoRA — the most popular PEFT method — adds small low-rank matrices alongside the frozen weight matrices. Only the small matrices are trained. Total trainable parameters: typically 0.1–1% of the full model. GPU memory required: a fraction of full fine-tuning. Quality: 90–95% of full fine-tuning.
A senior engineer at Razorpay knows everything about payments. You want to teach them your company's specific internal processes. You do not re-hire them and retrain them from scratch — you give them a small notebook of company-specific notes to carry alongside their existing expertise. LoRA is that notebook — small, lightweight, task-specific, sits alongside the frozen base model.
At inference time: base model knowledge + LoRA notebook = specialised expert. You can swap notebooks — same base model, different LoRA adapters for different tasks. One GPU, many specialists.
LoRA — Low-Rank Adaptation — the math in plain English
A weight matrix W in a Transformer has shape (d_out, d_in). For BERT's attention layers, d_in = d_out = 768. That is 768 × 768 = 589,824 parameters per matrix. LoRA's key insight: the change needed to adapt a pretrained model to a new task has low intrinsic rank — it lives in a much smaller subspace than the full matrix dimension.
Instead of updating W directly, LoRA adds two small matrices: A of shape (r, d_in) and B of shape (d_out, r) where r is the rank — typically 4, 8, or 16. The effective weight update is B @ A — a rank-r matrix. Training only A and B requires r × (d_in + d_out) parameters instead of d_in × d_out. With r=8, d=768: 8 × 1536 = 12,288 parameters vs 589,824. That is a 48× reduction per matrix.
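The arithmetic above can be sketched directly in NumPy. This toy forward pass uses the d=768, r=8 numbers from the text; note that B starts at zero, so the LoRA path contributes nothing until training moves it:

```python
import numpy as np

d, r = 768, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight, shape (d_out, d_in)
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init => B @ A = 0 at the start

delta_W = B @ A                      # rank-r weight update, shape (d, d)
x = rng.normal(size=(d,))
y = W @ x + delta_W @ x              # LoRA forward: frozen path + low-rank path

full_params = d * d                  # 589,824 parameters in W
lora_params = r * (d + d)            # 12,288 parameters in A and B
print(full_params // lora_params)    # 48x fewer trainable parameters
```

Because B is zero-initialised, the adapted model starts out exactly equal to the pretrained model, then diverges gradually as A and B are trained.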
HuggingFace PEFT library — LoRA in three lines of code
The PEFT library wraps any HuggingFace model with LoRA in three steps: define a LoraConfig, call get_peft_model(), then train as usual. PEFT applies LoRA to the layers named in the config, freezes everything else, and gives you a model where only the LoRA matrices require gradients.
LoRA + quantisation — fine-tuning a 7B model on a single GPU
LoRA's main value is not for BERT (110M) — you can full fine-tune BERT easily. The value is for 7B, 13B, and 70B parameter models where full fine-tuning is impossible on consumer hardware. Combine LoRA with quantisation (4-bit or 8-bit weights via bitsandbytes) and you can fine-tune a 7B model on a 16GB GPU. This is QLoRA — Quantised LoRA.
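A QLoRA setup might look like the sketch below (a configuration fragment, not a tested run: it assumes bitsandbytes, a CUDA GPU, and downloading a checkpoint — "mistralai/Mistral-7B-v0.1" is just one example of a 7B causal LM):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantisation for the frozen base weights (via bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # example checkpoint, ~3.5GB in 4-bit
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms, enables grads where needed

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # LLaMA/Mistral naming
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # 4-bit frozen base + higher-precision LoRA matrices
```

The base model sits in 4-bit the whole time; only the LoRA matrices are kept in higher precision and updated, which is what makes a 16GB GPU sufficient.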
Adapters, prefix tuning, and prompt tuning — when each is appropriate
LoRA is the most popular PEFT method but not the only one. Three other methods are widely used in production, each with different trade-offs between parameter count, training stability, and inference overhead.
- **LoRA** — Low-rank matrices added alongside frozen attention weights. Merged into the weights at inference — zero latency overhead.
- **Adapters** — Small bottleneck MLP inserted between Transformer layers. Frozen base, only the adapters train. The original method from Houlsby et al. 2019.
- **Prefix tuning** — Trainable virtual tokens (a prefix) prepended to the key and value in every attention layer. Only the prefix vectors are trained.
- **Prompt tuning** — Trainable soft tokens prepended to the INPUT only (not every layer). The simplest PEFT method — only a few thousand parameters.
Merging LoRA weights — zero inference overhead in production
During training, LoRA runs a separate forward pass through B @ A and adds it to the frozen W output. At inference this adds latency. LoRA can be merged: the weight update B @ A is computed once and added directly to W — producing a standard model with no extra computation. Merged model = full fine-tuned quality at full fine-tuned speed. The LoRA matrices can be discarded after merging.
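The equivalence is easy to check numerically. A NumPy sketch (the lora_alpha/r scaling factor is omitted for clarity) shows that folding B @ A into W gives bit-for-bit the same function with one matmul instead of two:

```python
import numpy as np

d, r = 64, 8
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # trained LoRA matrices
B = rng.normal(size=(d, r))
x = rng.normal(size=(d,))

# Unmerged inference: two paths per layer (frozen + LoRA) -> extra latency.
y_unmerged = W @ x + B @ (A @ x)

# Merged inference: fold the rank-r update into W once, then a single matmul.
W_merged = W + B @ A
y_merged = W_merged @ x

print(np.allclose(y_unmerged, y_merged))  # True
```

In the PEFT library this fold is what model.merge_and_unload() performs, returning a plain model with the adapters baked in.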
Every common PEFT mistake — explained and fixed
You can fine-tune any model efficiently. Next: give any LLM access to your own documents.
Fine-tuning teaches a model new behaviour patterns from labelled data. But what if you want the model to answer questions about documents it has never seen — your company's internal knowledge base, a legal corpus, a product catalogue? Fine-tuning cannot help here — the model still cannot access documents not in its weights. Retrieval-Augmented Generation (RAG) solves this by combining a retriever (find relevant documents from a vector database) with a generator (produce an answer grounded in those documents). Module 52 builds a complete RAG pipeline for a Razorpay knowledge base.
Vector databases, semantic search, chunking strategies, and the full RAG pipeline from document to answer.
🎯 Key Takeaways
- ✓Full fine-tuning a 7B model requires 112GB VRAM. LoRA fine-tunes the same model with 0.1–1% of parameters — fitting on a single 16GB GPU. The trade-off: 90–95% of full fine-tuning quality at 1% of the cost.
- ✓LoRA adds two small matrices A (r × d_in) and B (d_out × r) alongside each frozen weight matrix W. The effective update is B @ A — a rank-r approximation of the full weight change. B is initialised to zeros so LoRA starts identical to the pretrained model and gradually diverges.
- ✓QLoRA combines LoRA with 4-bit quantisation (bitsandbytes NF4) — the frozen base model weights are stored in 4-bit, reducing a 7B model from 28GB to 3.5GB. Only the LoRA matrices are stored in fp16. This enables 7B fine-tuning on a single consumer GPU.
- ✓PEFT library workflow: LoraConfig → get_peft_model(base_model, config) → standard Trainer. Three lines to convert any HuggingFace model to LoRA. Always call model.print_trainable_parameters() to verify the right layers are being trained.
- ✓Target modules must match your model family exactly: BERT → ["query","value"], DistilBERT → ["q_lin","v_lin"], LLaMA → ["q_proj","v_proj","k_proj","o_proj"], GPT-2 → ["c_attn"]. Use target_modules="all-linear" as a safe fallback when unsure.
- ✓Merge LoRA weights before production deployment: model.merge_and_unload() adds B @ A directly into W and discards the LoRA matrices. The merged model runs at full speed with no PEFT overhead — indistinguishable from a fully fine-tuned model at inference time.