5 July, 2025
While I was working at IIMN this summer, I had an idea: to compare the original captions of social media images with what is actually happening in the images. Exploring this, I came across BLIP. It's one of the coolest papers I've read, and it ended up refreshing my entire Deep Learning course for me. It was a long afternoon that day.
In the world of artificial intelligence, the ability to jointly understand and generate visual and textual content has become an essential frontier. The emergence of vision-language models such as CLIP and ALIGN revealed that combining image and text modalities can unlock powerful capabilities in tasks like image captioning, visual question answering, and cross-modal retrieval. However, these earlier models relied heavily on enormous, noisy web-scraped datasets, raising concerns about scalability, quality control, and accessibility. To address these limitations, Salesforce Research proposed BLIP (Bootstrapping Language-Image Pre-training) as a more efficient and versatile alternative.
BLIP presents a unified vision-language pretraining framework capable of both understanding and generation. It combines three training objectives: image-text contrastive learning, image-text matching, and caption generation (language modeling). Central to its design is a multimodal mixture of encoder-decoder: a single Transformer whose text layers can act as a unimodal text encoder, an image-grounded text encoder, or an image-grounded text decoder, with most parameters shared across the three modes. Together with a bootstrapped data strategy called CapFilt, this lets BLIP learn from both curated data and filtered synthetic captions, improving performance without requiring ever larger web datasets.
BLIP has three key components: a vision encoder, a text encoder, and a text decoder. The vision encoder (a ViT) splits the image into patches and extracts patch-level embeddings. Rather than training a separate, heavyweight fusion network on top of these embeddings, BLIP inserts cross-attention layers into its text Transformer: the text tokens attend to the visual embeddings and pull in only the features relevant to the caption being processed.
The text module acts either as an encoder (BERT-like, with bidirectional attention) for understanding tasks such as image-text matching, or as a decoder (GPT-like, with causal attention) for caption generation, with most weights shared between the two modes. This allows BLIP to perform both vision-language understanding and generation tasks effectively.
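To see the generation side in action, here is a minimal captioning sketch, assuming the Hugging Face transformers library and the publicly released Salesforce/blip-image-captioning-base checkpoint; the image URL is only a placeholder.

```python
from PIL import Image
import requests
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a pretrained BLIP captioning checkpoint (ViT vision encoder + text decoder).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any image works here; this COCO URL is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Comparing a caption generated this way against a post's original caption is exactly the experiment from the intro. Under the hood, BLIP is pre-trained with three objectives: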
1. Image-Text Contrastive Loss (ITC):
Aligns image and text embeddings in a shared space, pulling matched pairs together and pushing mismatched pairs apart. The image-to-text direction is shown below; a symmetric text-to-image term is added in practice (a code sketch of all three losses follows this list):
\[ \mathcal{L}_{\text{ITC}} = -\log \left( \frac{\exp(\text{sim}(z_i, z_t)/\tau)}{\sum_j \exp(\text{sim}(z_i, z_{t_j})/\tau)} \right) \]
2. Image-Text Matching Loss (ITM):
A binary classification task on the fused image-text representation, where y = 1 marks a matched pair, y = 0 a mismatched one, and p is the predicted match probability:
\[ \mathcal{L}_{\text{ITM}} = -y \log(p) - (1 - y) \log(1 - p) \]
3. Captioning Loss:
Standard autoregressive cross-entropy over the caption tokens, conditioned on the image:
\[ \mathcal{L}_{\text{CAP}} = - \sum_t \log P(w_t \mid w_{1:t-1}, \text{image}) \]
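Here is a minimal PyTorch sketch of how these three losses could be computed, assuming the unimodal embeddings, the ITM score, and the decoder logits have already been produced by the model; the batch shapes, the temperature value, and the use of in-batch negatives are simplifications of mine.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Image-text contrastive loss (InfoNCE) with in-batch negatives, both directions."""
    image_emb = F.normalize(image_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)          # image -> text
                  + F.cross_entropy(logits.t(), targets))   # text -> image

def itm_loss(match_logit, label):
    """Image-text matching loss: binary cross-entropy on the fused pair score.
    match_logit: (B,) raw score, label: (B,) with 1 = matched, 0 = mismatched."""
    return F.binary_cross_entropy_with_logits(match_logit, label.float())

def captioning_loss(token_logits, target_ids, pad_id=0):
    """Captioning loss: cross-entropy over caption tokens; the image conditioning
    lives inside the decoder that produced token_logits."""
    return F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                           target_ids.reshape(-1),
                           ignore_index=pad_id)

# Toy tensors just to show the pieces fit together.
B, D, T, V = 4, 256, 12, 30522
total = (itc_loss(torch.randn(B, D), torch.randn(B, D))
         + itm_loss(torch.randn(B), torch.randint(0, 2, (B,)))
         + captioning_loss(torch.randn(B, T, V), torch.randint(1, V, (B, T))))
print(total.item())
```

In the full model, the contrastive loss uses the unimodal image and text features, while the matching and captioning losses run through the image-grounded (cross-attention) text layers, so the three objectives train largely shared weights.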
BLIP also generates synthetic captions for its noisy web images, filters both the original web captions and the synthetic ones with the ITM head, and reuses the surviving pairs for pre-training. This bootstrapped loop, called CapFilt in the paper, improves performance and generalization without requiring a large human-labeled dataset.
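Schematically, that bootstrapping loop could look like the sketch below. The generate_caption and itm_match_prob callables are placeholders for BLIP's captioner and ITM-based filter (the Hugging Face captioning and image-text-matching checkpoints could play these roles), and the 0.8 threshold is an arbitrary choice of mine rather than a value from the paper.

```python
from typing import Callable, List, Tuple
from PIL import Image

def bootstrap_captions(
    images: List[Image.Image],
    web_captions: List[str],
    generate_caption: Callable[[Image.Image], str],       # placeholder: BLIP's decoder (captioner)
    itm_match_prob: Callable[[Image.Image, str], float],  # placeholder: BLIP's ITM head (filter)
    threshold: float = 0.8,                               # arbitrary cut-off, not from the paper
) -> List[Tuple[Image.Image, str]]:
    """CapFilt-style bootstrapping: caption every image, then keep only the
    (image, caption) pairs - original or synthetic - that the filter accepts."""
    curated = []
    for image, web_caption in zip(images, web_captions):
        synthetic = generate_caption(image)
        for caption in (web_caption, synthetic):
            if itm_match_prob(image, caption) >= threshold:
                curated.append((image, caption))
    return curated
```

The curated pairs then feed back into pre-training, which is the bootstrapped loop described above.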
BLIP is a major step in multimodal AI. It bridges contrastive and generative learning while staying efficient and versatile. Thanks to its shared encoder-decoder design and caption bootstrapping, BLIP achieves competitive results without billions of curated samples, making it a key foundation for the vision-language systems that followed.