Multimodal Models — CLIP, LLaVA, and Vision-Language
Models that see and understand images and text together. CLIP for zero-shot image classification, LLaVA for visual question answering.
Every model in this track so far processes one modality — text or images. Multimodal models process both simultaneously and reason about how they relate to each other.
A vision model can tell you "this image contains a saree." A language model can tell you "sarees are traditional Indian garments." Neither can answer: does this product photo match the description "a red Banarasi silk saree with gold zari border"? That requires understanding both modalities and the relationship between them. Multimodal models do exactly this.
The two dominant approaches: CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) learns a shared embedding space where semantically similar images and text are close together. It enables zero-shot image classification with any text labels — no training on those labels required. LLaVA (Large Language and Vision Assistant) connects a vision encoder to an LLM, enabling open-ended conversations about images. Ask it any question about any image and it generates a natural language answer.
Real production uses at Indian companies: Meesho uses CLIP-based retrieval to match user search queries to product images without pre-defined categories. Flipkart uses multimodal models to verify that product photos match product descriptions. Swiggy uses them to check that restaurant dish photos match their menu descriptions. Every e-commerce platform now has multimodal search — text query → image results, or image query → similar products.
Think of a bilingual dictionary — it maps words from English to Hindi and back. CLIP is a bilingual dictionary between visual language and text language. Show it an image of a saree and it gives you a vector. Show it the text "traditional Indian silk garment" and it gives you a similar vector. They are translations of the same concept into a shared numeric language. Similarity in this shared space means semantic similarity across modalities.
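The shared-space idea can be sketched with toy vectors. The numbers below are invented stand-ins, not real CLIP outputs; the point is that after L2-normalisation, a single dot product compares an image and a caption:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity after L2-normalising both vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

image_emb = np.array([0.9, 0.1, 0.3])        # pretend: encoded saree photo
caption_emb = np.array([0.8, 0.2, 0.25])     # pretend: "traditional Indian silk garment"
unrelated_emb = np.array([-0.7, 0.6, -0.2])  # pretend: "a bowl of ramen"

print(cosine_similarity(image_emb, caption_emb))    # high: same concept
print(cosine_similarity(image_emb, unrelated_emb))  # low: different concepts
```

Real CLIP embeddings live in 512 or 768 dimensions, but the comparison step is exactly this.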
The critical insight: CLIP was trained on 400 million (image, text) pairs scraped from the internet. It never needed explicit labels — the training signal came purely from the natural language captions that humans wrote alongside images. At the time, it was among the largest self-supervised multimodal datasets ever assembled.
CLIP — contrastive pretraining in a shared embedding space
CLIP has two encoders: an image encoder (Vision Transformer or ResNet) and a text encoder (Transformer). Both encoders project their inputs into the same 512- or 768-dimensional embedding space. Training uses contrastive loss: for a batch of N (image, text) pairs, the N correct pairs should be close in embedding space and the N² − N incorrect pairs should be far apart. After training, any image and any text can be compared by cosine similarity of their embeddings.
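That training objective can be sketched in a few lines. This is an illustrative NumPy version of the symmetric InfoNCE loss, not OpenAI's implementation; the temperature value is the commonly cited initialisation, and in real CLIP it is a learned parameter:

```python
import numpy as np

def clip_contrastive_loss(img: np.ndarray, txt: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over N aligned (image, text) embedding pairs.

    img, txt: (N, D) arrays; row i of img is paired with row i of txt.
    """
    # L2-normalise so dot products are cosine similarities
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature      # (N, N) similarity matrix
    labels = np.arange(len(img))              # correct pair sits on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # image→text loss over rows, text→image loss over columns
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs drive the loss toward zero; mismatched pairs drive it up, which is exactly the signal that pulls correct pairs together and pushes the N² − N incorrect pairs apart.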
What you can build with CLIP — zero-shot, retrieval, and embeddings
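Zero-shot classification reduces to one scoring step: embed the image, embed each candidate label as a descriptive phrase, and softmax over the cosine similarities. A sketch with invented stand-in embeddings (real ones would come from CLIP's two encoders):

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, label_embs: np.ndarray,
                       labels: list[str], temperature: float = 0.07):
    """Score an image embedding against text embeddings of candidate labels."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = (label_embs @ image_emb) / temperature
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Hypothetical labels; descriptive phrases work better than bare category names
labels = ["a photo of a red silk saree", "a photo of a denim jacket"]
label_embs = np.array([[0.9, 0.1], [0.1, 0.9]])  # invented text embeddings
image_emb = np.array([0.8, 0.2])                 # pretend: saree product photo
best, probs = zero_shot_classify(image_emb, label_embs, labels)
print(best)  # the saree label wins
```

No training on these labels ever happened — swapping in a new label list is just a new set of text embeddings.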
LLaVA — connecting a vision encoder to an LLM for image conversation
CLIP maps images to embeddings but cannot generate text about images — it can only score similarity. LLaVA (Liu et al., 2023) bridges this gap by connecting a visual encoder to a language model. The architecture is three components: a CLIP vision encoder that extracts image patch features, a projection MLP that maps vision features into the LLM's embedding space, and a language model (Vicuna/LLaMA in the original LLaVA; Mistral in later variants) that generates responses conditioned on both image features and text.
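The projection step is small enough to sketch directly. The dimensions below follow the description in this section (256 patch tokens, 1024-dim vision features, 4096-dim LLM space), but the weights are random stand-ins rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATCHES, VISION_DIM, LLM_DIM = 256, 1024, 4096

# Randomly initialised stand-ins for the trained projection weights
W1 = rng.normal(0, 0.02, (VISION_DIM, LLM_DIM))
b1 = np.zeros(LLM_DIM)
W2 = rng.normal(0, 0.02, (LLM_DIM, LLM_DIM))
b2 = np.zeros(LLM_DIM)

def project_vision_features(patches: np.ndarray) -> np.ndarray:
    """2-layer MLP with GELU: maps CLIP patch features into LLM token space."""
    h = patches @ W1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
    return h @ W2 + b2

patch_features = rng.normal(size=(N_PATCHES, VISION_DIM))  # from the CLIP encoder
visual_tokens = project_vision_features(patch_features)
print(visual_tokens.shape)  # (256, 4096)
```

After projection, the 256 visual tokens are simply prepended to the text token embeddings, and the LLM attends to both with no architectural change.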
Three production patterns — product search, document understanding, and quality control
CLIP vs LLaVA vs GPT-4V vs Gemini Vision — which to use
- CLIP: Image search, zero-shot classification, visual deduplication, embedding index. Cannot generate text.
- LLaVA: Document parsing, product description generation, open-ended visual QA, image captioning.
- GPT-4V: Complex visual reasoning, charts, diagrams, medical images, multi-image comparison.
- Gemini Vision: Long documents with many images, video understanding, cost-effective GPT-4V alternative.
Every common multimodal mistake — explained and fixed
You can now build with multimodal models. Next: production RAG systems that go beyond the basics.
You now understand the full generative AI landscape — GANs, VAEs, diffusion models, LLMs, fine-tuning, and multimodal models. Module 67 returns to RAG with production techniques: reranking retrieved chunks for better precision, hybrid dense-sparse search that combines semantic and keyword retrieval, and evaluation frameworks that measure RAG quality systematically. These are the techniques that separate toy RAG demos from production systems that customers actually trust.
Reranking retrieved chunks, hybrid dense-sparse search, and the patterns that separate production RAG from toy RAG.
🎯 Key Takeaways
- ✓CLIP trains two encoders — image (ViT) and text (Transformer) — to produce embeddings in a shared 512/768-dim space using contrastive loss on 400M (image, text) pairs. After training, cosine similarity between any image and text embedding measures their semantic relatedness. No task-specific training required — this is what enables zero-shot classification.
- ✓CLIP contrastive (InfoNCE) loss: for a batch of N pairs, maximise similarity for the N correct (image, text) pairs and minimise similarity for the N²−N incorrect pairs. The loss is symmetric cross-entropy along both rows (image→text) and columns (text→image) of the N×N similarity matrix. Larger batches = more negatives = stronger learning signal.
- ✓Always write descriptive text labels for CLIP, not just category names: "a photo of a red silk saree" outperforms "saree" significantly. Always L2-normalise embeddings before computing cosine similarity. Use ViT-L/14 over ViT-B/32 for better fine-grained product representations.
- ✓LLaVA connects a CLIP vision encoder → 2-layer projection MLP → LLM backbone. The projection MLP is the only new component — it maps 256 patch tokens from CLIP (1024-dim) into the LLM embedding space (4096-dim). The LLM then generates text attending to both visual tokens and text tokens simultaneously.
- ✓Production decision: CLIP for high-volume retrieval and classification (5ms, free, self-hosted), LLaVA-7B for text generation about images (1-5s, free, needs GPU), GPT-4o Vision for complex reasoning (3-10s, $0.01-0.03/image), Gemini Flash for cost-effective high-quality VQA. Never use a generative VQA model for pure retrieval — embeddings are orders of magnitude faster.
- ✓Three key production patterns: multimodal search (CLIP embeddings + FAISS index, text or image queries against indexed product catalogue), document understanding (LLaVA extracts structured data from receipts, invoices, screenshots without OCR), quality control (CLIP zero-shot scores photos against quality criteria descriptions — no labelled examples needed).
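The multimodal-search pattern above can be sketched as a brute-force index. Production systems would use FAISS over CLIP embeddings; the item IDs and vectors here are invented, and the same index serves text queries and image queries because both land in the shared space:

```python
import numpy as np

class MultimodalIndex:
    """Brute-force stand-in for a FAISS index over CLIP embeddings."""

    def __init__(self):
        self.ids: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, item_id: str, emb: np.ndarray) -> None:
        # Store L2-normalised vectors so dot product = cosine similarity
        self.ids.append(item_id)
        self.vecs.append(emb / np.linalg.norm(emb))

    def search(self, query_emb: np.ndarray, k: int = 3) -> list[str]:
        query = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.vecs) @ query
        return [self.ids[i] for i in np.argsort(sims)[::-1][:k]]

# Hypothetical catalogue entries with invented 2-dim embeddings
index = MultimodalIndex()
index.add("saree_001", np.array([1.0, 0.0]))
index.add("jacket_002", np.array([0.0, 1.0]))
index.add("saree_003", np.array([0.9, 0.1]))

# Query embedding could come from CLIP's text OR image encoder
print(index.search(np.array([1.0, 0.05]), k=2))  # ['saree_001', 'saree_003']
```

The quality-control pattern is the same machinery: score each photo against text embeddings of quality criteria instead of against a product catalogue.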