
Fine-Tuning with PEFT — LoRA and Adapters

Tune less than 1% of a model's parameters and keep roughly 90–95% of full fine-tuning performance. LoRA, adapters, and prefix tuning: when and how to use each.

40–45 min March 2026
Section 08 · Natural Language Processing
Before any code — why PEFT exists

Full fine-tuning a 7B parameter model requires 28GB of GPU memory just to store the weights. LoRA fine-tunes the same model using 16MB of trainable parameters — on a single consumer GPU.

Module 50 showed full fine-tuning — update all 110M parameters of BERT for 3 epochs. That costs 4GB of GPU memory and 30 minutes. Acceptable for BERT. Completely impractical for LLaMA-3 (8B), Mistral (7B), or Falcon (40B). Full fine-tuning a 7B model requires storing the model weights (28GB in fp32), the gradients (another 28GB), and the optimiser states (56GB for Adam). Total: 112GB VRAM. No consumer GPU has that.
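The arithmetic above can be reproduced with a small helper. This is a rough sketch, not a profiler: it assumes gradients use the same precision as the weights and that Adam keeps two extra values per parameter, and it ignores activations. The function name is illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def full_finetune_memory_gb(n_params, bytes_per_param=4):
    """Estimate VRAM for weights, gradients and Adam states (activations excluded)."""
    weights = n_params * bytes_per_param / GB
    gradients = weights          # one gradient value per weight
    adam_states = 2 * weights    # first and second moment per weight
    total = weights + gradients + adam_states
    return {"weights": weights, "gradients": gradients,
            "adam_states": adam_states, "total": total}

estimate = full_finetune_memory_gb(7_000_000_000, bytes_per_param=4)
for part, gigabytes in estimate.items():
    print(f"{part:<12} {gigabytes:6.1f} GB")
```

Passing bytes_per_param=2 gives the fp16 variant of the same estimate.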

PEFT (Parameter-Efficient Fine-Tuning) solves this by updating only a tiny fraction of the model's parameters while freezing the rest. LoRA — the most popular PEFT method — adds small low-rank matrices alongside the frozen weight matrices. Only the small matrices are trained. Total trainable parameters: typically 0.1–1% of the full model. GPU memory required: a fraction of full fine-tuning. Quality: 90–95% of full fine-tuning.

🧠 Analogy — read this first

A senior engineer at Razorpay knows everything about payments. You want to teach them your company's specific internal processes. You do not re-hire them and retrain them from scratch — you give them a small notebook of company-specific notes to carry alongside their existing expertise. LoRA is that notebook — small, lightweight, task-specific, sits alongside the frozen base model.

At inference time: base model knowledge + LoRA notebook = specialised expert. You can swap notebooks — same base model, different LoRA adapters for different tasks. One GPU, many specialists.

🎯 Pro Tip
Install: pip install peft transformers accelerate bitsandbytes. The PEFT library from HuggingFace handles LoRA, adapters, and prefix tuning with a unified API. Three lines of code convert any model to LoRA fine-tuning.
The core idea

LoRA — Low-Rank Adaptation — the math in plain English

A weight matrix W in a Transformer has shape (d_out, d_in). For BERT's attention layers, d_in = d_out = 768. That is 768 × 768 = 589,824 parameters per matrix. LoRA's key insight: the change needed to adapt a pretrained model to a new task has low intrinsic rank — it lives in a much smaller subspace than the full matrix dimension.

Instead of updating W directly, LoRA adds two small matrices: A of shape (r, d_in) and B of shape (d_out, r) where r is the rank — typically 4, 8, or 16. The effective weight update is B @ A — a rank-r matrix. Training only A and B requires r × (d_in + d_out) parameters instead of d_in × d_out. With r=8, d=768: 8 × 1536 = 12,288 parameters vs 589,824. That is a 48× reduction per matrix.

LoRA forward pass — frozen W plus trainable low-rank update
Standard forward pass:
h = W × x ← W is frozen, 768×768 = 589k params
LoRA forward pass:
h = W × x + (B @ A) × x × (α/r)
W: frozen pretrained weight (768 × 768) — never updated
A: trainable, initialised with random normal (r × 768)
B: trainable, initialised with zeros (768 × r)
r: rank hyperparameter — typically 4, 8, or 16
α: scaling factor, commonly set to r or 2r (so α/r = 1 or 2)
B initialised to zeros → LoRA output is zero at start → identical to pretrained model
python
import torch
import torch.nn as nn
import numpy as np

# ── LoRA layer from scratch — see exactly what happens ────────────────
class LoRALinear(nn.Module):
    """
    Drop-in replacement for nn.Linear that adds LoRA adaptation.
    The original weight W is frozen. Only A and B are trained.
    """
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.linear = linear
        self.rank   = rank
        self.alpha  = alpha
        self.scale  = alpha / rank

        d_out, d_in = linear.weight.shape

        # Freeze the original weight
        for param in self.linear.parameters():
            param.requires_grad = False

        # LoRA matrices — small and trainable
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))   # zeros → no change at start

    def forward(self, x):
        # Original frozen output
        base_out  = self.linear(x)
        # LoRA update: x @ A.T @ B.T * scale
        lora_out  = (x @ self.lora_A.T @ self.lora_B.T) * self.scale
        return base_out + lora_out

# ── Demonstrate parameter counts ──────────────────────────────────────
torch.manual_seed(42)
d_model = 768
original_linear = nn.Linear(d_model, d_model)

for rank in [4, 8, 16, 32]:
    lora_linear = LoRALinear(original_linear, rank=rank)

    total    = sum(p.numel() for p in lora_linear.parameters())
    trainable = sum(p.numel() for p in lora_linear.parameters() if p.requires_grad)
    frozen   = total - trainable
    pct      = trainable / total * 100

    print(f"rank={rank:2d}: total={total:,}  frozen={frozen:,}  "
          f"trainable={trainable:,}  ({pct:.2f}%)")

# ── Verify: LoRA output equals base at initialisation ─────────────────
lora = LoRALinear(original_linear, rank=8)
x    = torch.randn(4, d_model)

base_out = original_linear(x)
lora_out = lora(x)

diff = (base_out - lora_out).abs().max().item()
print(f"\nMax diff between base and LoRA at init: {diff:.2e}  ← should be ~0")
print("B is initialised to zeros → LoRA adds nothing at the start")
print("Training gradually moves A and B to add task-specific adjustments")
Production implementation

HuggingFace PEFT library — LoRA in three lines of code

The PEFT library wraps any HuggingFace model with LoRA in three steps: define a LoraConfig, call get_peft_model(), done. PEFT automatically identifies which layers to apply LoRA to, freezes everything else, and gives you a model where only the LoRA matrices require gradients.

python
# pip install peft transformers accelerate datasets evaluate

from peft import (
    LoraConfig, get_peft_model, TaskType,
    PeftModel, prepare_model_for_kbit_training,
)
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForCausalLM, TrainingArguments, Trainer,
    DataCollatorWithPadding,
)
from datasets import Dataset
import evaluate
import numpy as np
import torch

# ── LoRA for sequence classification (BERT-style) ─────────────────────
MODEL_NAME = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3,
)

# ── Configure LoRA ────────────────────────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                          # rank — lower = fewer params, higher = more capacity
    lora_alpha=16,                # scaling = alpha/r = 2
    lora_dropout=0.1,             # dropout on LoRA layers
    target_modules=['q_lin', 'v_lin'],  # which layers to add LoRA to
    # For BERT: ['query', 'value']
    # For LLaMA: ['q_proj', 'v_proj', 'k_proj', 'o_proj']
    bias='none',                  # don't train biases
    inference_mode=False,
)

# ── Apply LoRA to model — three lines ─────────────────────────────────
peft_model = get_peft_model(base_model, lora_config)

# Print trainable parameters
peft_model.print_trainable_parameters()
# Output: trainable params: 296,451 || all params: 67,252,227 || trainable%: 0.44%

# ── Verify what is frozen and what is trainable ───────────────────────
print("\nLayer-by-layer trainability:")
for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(f"  TRAINABLE: {name:<60} {param.numel():>10,}")
    # Frozen layers are too many to print — just count them
frozen_count = sum(1 for p in peft_model.parameters() if not p.requires_grad)
print(f"  (+ {frozen_count} frozen parameter tensors)")

# ── Fine-tune with standard Trainer ───────────────────────────────────
LABELS   = ['positive', 'negative', 'neutral']
label2id = {l: i for i, l in enumerate(LABELS)}

reviews = [
    ("Excellent product quality, very happy", 0),
    ("Terrible, broke within a week", 1),
    ("Average product, nothing special", 2),
    ("Amazing value for money", 0),
    ("Worst purchase ever made", 1),
    ("Decent enough for the price", 2),
] * 40

def tokenise(examples):
    return tokenizer(examples['text'], max_length=64,
                     truncation=True, padding=False)

raw   = Dataset.from_dict({
    'text':  [r[0] for r in reviews],
    'label': [r[1] for r in reviews],
})
raw   = raw.train_test_split(test_size=0.2, seed=42)
tok   = raw.map(tokenise, batched=True, remove_columns=['text'])

acc_metric = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return acc_metric.compute(predictions=preds, references=eval_pred.label_ids)

args = TrainingArguments(
    output_dir='./lora-sentiment',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-4,   # LoRA can use higher lr than full fine-tuning
    warmup_ratio=0.1,
    evaluation_strategy='epoch',   # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    seed=42,
)

trainer = Trainer(
    model=peft_model, args=args,
    train_dataset=tok['train'],
    eval_dataset=tok['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

print("\nFine-tuning with LoRA (0.44% of parameters):")
trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")
The real use case — large language models

LoRA + quantisation — fine-tuning a 7B model on a single GPU

LoRA's main value is not for BERT (110M), which you can fully fine-tune easily. The value is for 7B, 13B, and 70B parameter models where full fine-tuning is out of reach on consumer hardware. Combine LoRA with quantisation (4-bit or 8-bit weights via bitsandbytes) and you can fine-tune a 7B model on a 16GB GPU. With 4-bit weights this is QLoRA (Quantised LoRA).

Memory comparison — full fine-tuning vs LoRA vs QLoRA for 7B model
Method                 | Model weights | Gradients | Optimiser | Total VRAM | GPU needed
Full fine-tune (fp32)  | 28GB          | 28GB      | 56GB      | ~112GB     | A100 80GB ×2
Full fine-tune (fp16)  | 14GB          | 14GB      | 28GB      | ~56GB      | A100 80GB
LoRA (fp16)            | 14GB          | 0.06GB    | 0.12GB    | ~16GB      | A100 40GB
QLoRA (4-bit)          | 3.5GB         | 0.06GB    | 0.12GB    | ~6GB       | RTX 4090 / T4

Model size: 7B parameters. Gradients and optimiser states only for LoRA matrices (≈0.5% of params). QLoRA quantises frozen weights to 4-bit — 8× memory reduction vs fp32.
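A minimal sketch of how the weight, gradient, and optimiser columns can be estimated. It assumes about 0.5% of parameters are trainable, stored at 2 bytes each, with two Adam values per trainable parameter. The helper name and defaults are illustrative. Activations and framework overhead are left out, which is why the table's totals are higher than the sum of these three parts.

```python
GB = 1e9  # decimal gigabytes

def peft_memory_gb(n_params, weight_bits, trainable_fraction=0.005, trainable_bytes=2):
    """Estimate memory for frozen weights plus LoRA gradients and Adam states."""
    weights = n_params * weight_bits / 8 / GB      # frozen base weights
    trainable = n_params * trainable_fraction      # LoRA parameters only
    gradients = trainable * trainable_bytes / GB
    adam_states = 2 * gradients                    # two moments per trainable parameter
    return {"weights": weights, "gradients": gradients, "adam_states": adam_states}

for label, bits in [("LoRA (fp16)", 16), ("QLoRA (4-bit)", 4)]:
    est = peft_memory_gb(7_000_000_000, bits)
    print(f"{label:<14} weights={est['weights']:.1f}GB  "
          f"grads={est['gradients']:.2f}GB  adam={est['adam_states']:.2f}GB")
```

With these assumptions the gradient and optimiser figures land in the same range as the table's 0.06GB and 0.12GB. The exact values depend on the rank and on which modules are targeted.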

python
# QLoRA — Fine-tuning a 7B model on a single GPU
# pip install peft transformers accelerate bitsandbytes

from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    BitsAndBytesConfig, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# ── Step 1: Load model in 4-bit quantisation (QLoRA) ─────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',         # NormalFloat4 — best quality 4-bit
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,    # nested quantisation for extra savings
)

# In production — use a real model:
# model_id = 'meta-llama/Meta-Llama-3-8B'  # gated repo: requires a HuggingFace access token
# model_id = 'mistralai/Mistral-7B-v0.3'
# model_id = 'google/gemma-7b'
model_id = 'facebook/opt-125m'   # small model for demonstration

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,    # 4-bit quantisation
    device_map='auto',                  # distribute across available GPUs
)

# ── Step 2: Prepare for training ──────────────────────────────────────
# Call this before adding LoRA: it prepares the quantised model for training
# (among other things, it enables gradient checkpointing by default)
model = prepare_model_for_kbit_training(model)

# ── Step 3: Add LoRA ──────────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Target the attention projection layers
    # For LLaMA/Mistral: ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj']
    # For OPT: ['q_proj', 'v_proj']
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ── Step 4: Training with gradient checkpointing ──────────────────────
# Gradient checkpointing trades compute for memory —
# recomputes activations during backward instead of storing them
model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    output_dir='./qlora-output',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch = 4 × 4 = 16
    learning_rate=2e-4,               # QLoRA uses higher lr
    fp16=True,
    logging_steps=10,
    optim='paged_adamw_32bit',        # paged optimizer saves memory
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    save_strategy='epoch',
)

print("QLoRA configuration:")
print(f"  Rank (r):          {lora_config.r}")
print(f"  Alpha:             {lora_config.lora_alpha}")
print(f"  Target modules:    {lora_config.target_modules}")
print(f"  Quantisation:      4-bit NF4")
print(f"  Effective batch:   {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")

# ── Step 5: Save and load LoRA adapter separately ─────────────────────
# The adapter is tiny — only the LoRA matrices (a few MB)
# model.save_pretrained('./my-lora-adapter')
# tokenizer.save_pretrained('./my-lora-adapter')

# Load later:
# base_model = AutoModelForCausalLM.from_pretrained(model_id, ...)
# model = PeftModel.from_pretrained(base_model, './my-lora-adapter')
Beyond LoRA

Adapters, prefix tuning, and prompt tuning — when each is appropriate

LoRA is the most popular PEFT method but not the only one. Three other methods are widely used in production, each with different trade-offs between parameter count, training stability, and inference overhead.

LoRA · 0.1–1% of model · inference: zero overhead (weights merged)

Low-rank matrices added alongside frozen attention weights. Merged into weights at inference: zero latency overhead.

Use when: Default choice. Best quality-to-params ratio. Works for all model types.
Limitation: Requires access to model weights.

Adapters · 1–5% of model · inference: small latency overhead (extra layers)

Small bottleneck MLP inserted between Transformer layers. Frozen base, only adapters train. Original method from Houlsby et al. 2019.

Use when: You need to swap task adapters frequently at inference time without reloading the model.
Limitation: Adds latency. Slightly more params than LoRA.

Prefix Tuning · 0.1% of model · inference: small overhead (longer key/value sequences)

Prepend trainable virtual tokens (prefix) to the key and value in every attention layer. Only the prefix vectors are trained.

Use when: Generation tasks where you want to condition the model on a task without changing weights at all.
Limitation: Less stable training than LoRA. Sensitive to prefix length.

Prompt Tuning · <0.01% of model · inference: minimal (slightly longer input)

Prepend trainable soft tokens to the INPUT only (not every layer). Simplest PEFT method, with only a few thousand parameters.

Use when: Very large models (10B+) where even LoRA is expensive. Competitive with full fine-tuning at 10B+ scale.
Limitation: Underperforms LoRA on smaller models. Slower to converge.
python
from peft import (
    LoraConfig, PromptTuningConfig, PrefixTuningConfig,
    get_peft_model, TaskType, PromptTuningInit,
)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# ── Compare PEFT methods on the same model ────────────────────────────
model_id  = 'facebook/bart-base'   # seq2seq for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_id)

def count_params(model):
    total     = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# ── LoRA ──────────────────────────────────────────────────────────────
base = AutoModelForSeq2SeqLM.from_pretrained(model_id)
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],
    bias='none',
)
lora_model = get_peft_model(base, lora_cfg)
t, tr = count_params(lora_model)
print(f"LoRA:          {tr:>10,} / {t:>10,}  ({tr/t*100:.3f}%)")

# ── Prefix Tuning ─────────────────────────────────────────────────────
base2 = AutoModelForSeq2SeqLM.from_pretrained(model_id)
prefix_cfg = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,        # 20 trainable prefix tokens
    encoder_hidden_size=768,
)
prefix_model = get_peft_model(base2, prefix_cfg)
t2, tr2 = count_params(prefix_model)
print(f"Prefix tuning: {tr2:>10,} / {t2:>10,}  ({tr2/t2*100:.3f}%)")

# ── Prompt Tuning ─────────────────────────────────────────────────────
base3 = AutoModelForSeq2SeqLM.from_pretrained(model_id)
prompt_cfg = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=8,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Summarise the following payment dispute: ",
    tokenizer_name_or_path=model_id,
)
prompt_model = get_peft_model(base3, prompt_cfg)
t3, tr3 = count_params(prompt_model)
print(f"Prompt tuning: {tr3:>10,} / {t3:>10,}  ({tr3/t3*100:.3f}%)")

print(f"\nGuideline:")
print(f"  Use LoRA for most tasks — best quality per parameter")
print(f"  Use prefix tuning for generation with frequent task switching")
print(f"  Use prompt tuning only for very large models (10B+)")
Deploying LoRA models

Merging LoRA weights — zero inference overhead in production

During training, LoRA computes the low-rank path B @ A separately and adds it to the frozen W output. Left unmerged at inference, that extra path adds a small amount of latency. LoRA can be merged: the scaled update (α/r) · B @ A is computed once and added directly to W, producing a standard model with no extra computation. The merged model gives the same outputs as the unmerged LoRA model (up to floating-point rounding) and runs at the speed of the base model. The LoRA matrices can be discarded after merging.
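To see why merging does not change the output, here is a toy check in plain Python. It is a sketch with made-up 3×3 numbers, rank 1 and α = 2, not how PEFT implements merging. It compares the two-path forward pass with a single multiplication by the merged weight.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Illustrative values only: d_in = d_out = 3, rank r = 1, alpha = 2
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]   # frozen weight
A = [[0.5, -1.0, 0.25]]                                    # shape (r, d_in)
B = [[2.0], [0.0], [-1.0]]                                 # shape (d_out, r)
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 2.0, 4.0]

# Unmerged: frozen path plus scaled low-rank path
low_rank = matvec(B, matvec(A, x))
unmerged = [b + scale * l for b, l in zip(matvec(W, x), low_rank)]

# Merged: fold scale * (B @ A) into W once, then do a single matvec
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
merged = matvec(W_merged, x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

Both paths produce the same vector, which is the property merge_and_unload() relies on.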

python
from peft import PeftModel, LoraConfig, get_peft_model, TaskType
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ── Simulate: train LoRA then merge for production ────────────────────
MODEL_NAME = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME)

# 1. Create LoRA model
base_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3,
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.0,
    target_modules=['q_lin', 'v_lin'],
    bias='none',
)
lora_model = get_peft_model(base_model, lora_config)

# 2. [Training happens here]
# trainer.train()

# 3. Save LoRA adapter (tiny — just A and B matrices)
lora_model.save_pretrained('./my-lora-adapter')
tokenizer.save_pretrained('./my-lora-adapter')

adapter_size = sum(
    p.numel() * 4  # bytes for float32
    for p in lora_model.parameters()
    if p.requires_grad
) / 1024 / 1024
print(f"LoRA adapter size: ~{adapter_size:.1f} MB")

# 4. Merge LoRA into base model — zero inference overhead
merged_model = lora_model.merge_and_unload()
# Now merged_model is a standard transformers model with no PEFT wrapper
# Same architecture and speed as a fully fine-tuned model; each targeted W now holds W + scaled B @ A

print(f"\nAfter merging:")
print(f"  Type:     {type(merged_model).__name__}  ← standard model, no PEFT overhead")
trainable = sum(p.numel() for p in merged_model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in merged_model.parameters())
print(f"  Params:   {total:,} total, {trainable:,} trainable")

# 5. Save merged model
merged_model.save_pretrained('./merged-model')
tokenizer.save_pretrained('./merged-model')

# 6. Load and serve — identical to a standard fine-tuned model
loaded = AutoModelForSequenceClassification.from_pretrained('./merged-model')
loaded.eval()

test_input = tokenizer("Payment declined please help",
                        return_tensors='pt', max_length=64, truncation=True)
with torch.no_grad():
    logits = loaded(**test_input).logits
print(f"\nInference works identically after merge: {logits.shape}")

# ── Serving multiple tasks from one base model ─────────────────────────
print("""
# Production pattern: one base model, multiple LoRA adapters
# Load base once, register each adapter under a name, switch per request

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.3', ...)

# Task 1: customer support (the first adapter wraps the base model)
model = PeftModel.from_pretrained(base, './lora-support-adapter', adapter_name='support')

# Task 2: code generation
model.load_adapter('./lora-code-adapter', adapter_name='code')

# Task 3: SQL generation
model.load_adapter('./lora-sql-adapter', adapter_name='sql')

# Activate the adapter a request needs before generating
model.set_adapter('sql')

# One GPU, three specialists: much cheaper than three separate 7B models
""")
Errors you will hit

Every common PEFT mistake — explained and fixed

ValueError: Target modules not found in model — lora_config target_modules has no match
Why it happens

The target_modules names in LoraConfig do not match the actual layer names in your model. Every model family uses different naming: BERT uses 'query', 'value'. DistilBERT uses 'q_lin', 'v_lin'. LLaMA uses 'q_proj', 'v_proj'. Mistral uses 'q_proj', 'v_proj', 'k_proj'. GPT-2 uses 'c_attn'. Using the wrong names means no layers get LoRA applied and PEFT raises this error.

Fix

Print all model layer names first: [name for name, _ in model.named_modules()]. Find the attention projection layers, which typically contain 'query', 'key', 'value', 'q_proj', 'v_proj' or similar. Use those exact strings in target_modules. Alternatively pass target_modules='all-linear' to target all linear layers automatically. This trains more parameters than a hand-picked list and needs a recent PEFT release, but it avoids guessing names.
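The lookup can be rehearsed offline before touching LoraConfig. The sketch below assumes target names are matched against the end of each dotted module path, which approximates how a list of target_modules is resolved. The helper names and the sample module names are illustrative, not PEFT API.

```python
from collections import Counter

def match_target_modules(module_names, target_modules):
    """Return the module names that a suffix-style target list would select."""
    return [
        name for name in module_names
        if any(name == t or name.endswith("." + t) for t in target_modules)
    ]

def leaf_name_counts(module_names):
    """Count the last path component of each name to reveal naming patterns."""
    return Counter(name.rsplit(".", 1)[-1] for name in module_names if name)

# Illustrative names in the style of a DistilBERT layer; with a real model use
# module_names = [name for name, _ in model.named_modules()]
module_names = [
    "transformer.layer.0.attention.q_lin",
    "transformer.layer.0.attention.k_lin",
    "transformer.layer.0.attention.v_lin",
    "transformer.layer.0.attention.out_lin",
    "transformer.layer.0.ffn.lin1",
]

good = match_target_modules(module_names, ["q_lin", "v_lin"])
bad = match_target_modules(module_names, ["query", "value"])  # BERT-style names
print("matched with q_lin/v_lin:", good)
print("matched with query/value:", bad)
print("leaf names:", dict(leaf_name_counts(module_names)))
```

An empty match list is the situation in which the ValueError above is raised, so checking it first saves a failed model wrap.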

LoRA training loss is NaN from the first step
Why it happens

Learning rate is too high for LoRA. LoRA adapters start at zero — all gradient signal flows through tiny near-zero matrices. A large learning rate causes the LoRA matrices to explode immediately. Also caused by fp16 overflow when gradient norms are large — the small LoRA matrices amplify gradient magnitudes relative to their scale.

Fix

Use lr=1e-4 to 3e-4 for LoRA (higher than full fine-tuning at 2e-5 but lower than you might expect). Add max_grad_norm=0.3 in TrainingArguments to clip gradients. If using fp16, ensure the base model is loaded in fp16 and use optim='paged_adamw_32bit' — the 32-bit Adam states prevent overflow. Use warmup_ratio=0.05 to start from near-zero lr.
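The settings from this fix can be collected into one place. This is a configuration sketch, assuming the dictionary is unpacked into transformers.TrainingArguments and that bitsandbytes is installed for the paged optimiser. The output path and the helper name are hypothetical.

```python
# Keyword arguments intended for: TrainingArguments(**stable_lora_kwargs)
stable_lora_kwargs = dict(
    output_dir="./lora-stable",     # hypothetical output path
    learning_rate=2e-4,             # inside the 1e-4 to 3e-4 range suggested above
    max_grad_norm=0.3,              # clip gradients to limit early spikes
    warmup_ratio=0.05,              # ramp the learning rate up from near zero
    optim="paged_adamw_32bit",      # 32-bit Adam states (needs bitsandbytes)
    fp16=True,
)

def lr_in_lora_range(learning_rate, low=1e-4, high=3e-4):
    """Check a learning rate against the range recommended in the text."""
    return low <= learning_rate <= high

print("learning rate in suggested range:",
      lr_in_lora_range(stable_lora_kwargs["learning_rate"]))
```

If NaNs persist with these values, lowering the learning rate further is usually the first thing to try.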

PEFT model accuracy is much lower than full fine-tuning on the same task
Why it happens

LoRA rank is too low for the task complexity, or the wrong layers are targeted. Very low rank (r=2, r=4) has very few parameters and limited capacity. Only targeting q and v projections misses important weights — for complex tasks adding k, o, and MLP layers helps significantly. Also: LoRA alpha scaling may be off.

Fix

Increase rank: try r=16 or r=32 for complex tasks. Add more target modules: for LLaMA use ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] to cover all projection and MLP layers. Set lora_alpha = 2 × r as a starting point. If accuracy is still low, consider full fine-tuning — PEFT is not always the right choice for small models.
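To get a feel for how much capacity each option adds, the sketch below counts LoRA parameters for different ranks and target lists. The layer shapes are assumptions for a LLaMA-2-7B-style decoder (hidden size 4096, MLP size 11008, 32 layers). Read the real values from model.config before relying on the numbers.

```python
# Assumed shapes (d_in, d_out) per module for a LLaMA-2-7B-style layer
HIDDEN, INTERMEDIATE, N_LAYERS = 4096, 11008, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, target_modules):
    """LoRA adds rank * (d_in + d_out) parameters per targeted matrix."""
    per_layer = sum(rank * sum(SHAPES[name]) for name in target_modules)
    return per_layer * N_LAYERS

attention_only = ["q_proj", "v_proj"]
everything = list(SHAPES)

for rank in (8, 16, 32):
    print(f"r={rank:2d}  q,v only: {lora_param_count(rank, attention_only):>12,}  "
          f"all projections + MLP: {lora_param_count(rank, everything):>12,}")
```

Under these assumptions r=8 on q_proj and v_proj gives 4,194,304 parameters, about 16MB in fp32, while r=16 on all seven modules gives 39,976,960, roughly ten times the capacity.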

RuntimeError: Expected all tensors on the same device after loading LoRA adapter
Why it happens

The base model was loaded on one device (CPU or CUDA:0) but the LoRA adapter was loaded on a different device, or device_map='auto' split the model across GPUs in a way that conflicts with the adapter loading. Also happens when you call PeftModel.from_pretrained without specifying the device consistently.

Fix

Load both base model and adapter with the same device specification: base = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto'); model = PeftModel.from_pretrained(base, adapter_path, device_map='auto'). Or load to CPU first and move: model = model.to('cuda'). After merging with merge_and_unload(), always explicitly move the merged model to the target device.
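Before moving anything, it helps to see where the parameters actually live. The helper below is a small diagnostic sketch. It relies only on the model exposing .parameters() with a .device on each tensor, as PyTorch modules do. The function name is illustrative.

```python
from collections import Counter

def device_summary(model):
    """Count parameter tensors per device for any object exposing .parameters()."""
    return Counter(str(param.device) for param in model.parameters())

def is_single_device(model):
    """True when every parameter tensor sits on the same device."""
    return len(device_summary(model)) <= 1

# Usage with a loaded PEFT model (not run here):
#   print(device_summary(model))     # e.g. entries for both 'cpu' and 'cuda:0'
#   if not is_single_device(model):
#       model = model.to('cuda')     # only for models loaded without device_map
```

More than one entry in the summary on a model loaded without device_map='auto' points to the mismatch described above.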

What comes next

You can fine-tune any model efficiently. Next: give any LLM access to your own documents.

Fine-tuning teaches a model new behaviour patterns from labelled data. But what if you want the model to answer questions about documents it has never seen — your company's internal knowledge base, a legal corpus, a product catalogue? Fine-tuning cannot help here — the model still cannot access documents not in its weights. Retrieval-Augmented Generation (RAG) solves this by combining a retriever (find relevant documents from a vector database) with a generator (produce an answer grounded in those documents). Module 52 builds a complete RAG pipeline for a Razorpay knowledge base.

Next — Module 52 · NLP
RAG — Retrieval-Augmented Generation

Vector databases, semantic search, chunking strategies, and the full RAG pipeline from document to answer.

coming soon

🎯 Key Takeaways

  • Full fine-tuning a 7B model in fp32 requires about 112GB of VRAM. LoRA trains only 0.1–1% of the parameters, and combined with 4-bit quantisation (QLoRA) the job fits on a single 16GB GPU. The trade-off: roughly 90–95% of full fine-tuning quality at a small fraction of the memory cost.
  • LoRA adds two small matrices A (r × d_in) and B (d_out × r) alongside each frozen weight matrix W. The effective update is B @ A — a rank-r approximation of the full weight change. B is initialised to zeros so LoRA starts identical to the pretrained model and gradually diverges.
  • QLoRA combines LoRA with 4-bit quantisation (bitsandbytes NF4) — the frozen base model weights are stored in 4-bit, reducing a 7B model from 28GB to 3.5GB. Only the LoRA matrices are stored in fp16. This enables 7B fine-tuning on a single consumer GPU.
  • PEFT library workflow: LoraConfig → get_peft_model(base_model, config) → standard Trainer. Three lines to convert any HuggingFace model to LoRA. Always call model.print_trainable_parameters() to verify the right layers are being trained.
  • Target modules must match your model family exactly: BERT → ["query","value"], DistilBERT → ["q_lin","v_lin"], LLaMA → ["q_proj","v_proj","k_proj","o_proj"], GPT-2 → ["c_attn"]. Use target_modules="all-linear" as a safe fallback when unsure.
  • Merge LoRA weights before production deployment: model.merge_and_unload() adds B @ A directly into W and discards the LoRA matrices. The merged model runs at full speed with no PEFT overhead — indistinguishable from a fully fine-tuned model at inference time.