blip-2-vision-language
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Install
mkdir -p .claude/skills/blip-2-vision-language && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4034" && unzip -o skill.zip -d .claude/skills/blip-2-vision-language && rm skill.zipInstalls to .claude/skills/blip-2-vision-language
About this skill
BLIP-2: Vision-Language Pre-training
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
When to use BLIP-2
Use BLIP-2 when:
- Need high-quality image captioning with natural descriptions
- Building visual question answering (VQA) systems
- Require zero-shot image-text understanding without task-specific training
- Want to leverage LLM reasoning for visual tasks
- Building multimodal conversational AI
- Need image-text retrieval or matching
Key features:
- Q-Former architecture: Lightweight query transformer bridges vision and language
- Frozen backbone efficiency: No need to fine-tune large vision/language models
- Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- Zero-shot capabilities: Strong performance without task-specific training
- Efficient training: Only trains Q-Former (~188M parameters)
- State-of-the-art results: Beats larger models on VQA benchmarks
Use alternatives instead:
- LLaVA: For instruction-following multimodal chat
- InstructBLIP: For improved instruction-following (BLIP-2 successor)
- GPT-4V/Claude 3: For production multimodal chat (proprietary)
- CLIP: For simple image-text similarity without generation
- Flamingo: For few-shot visual learning
Quick start
Installation
# HuggingFace Transformers (recommended)
pip install transformers accelerate torch Pillow
# Or LAVIS library (Salesforce official)
pip install salesforce-lavis
Basic image captioning
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16,
device_map="auto"
)
# Load image
image = Image.open("photo.jpg").convert("RGB")
# Generate caption
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
Visual question answering
# Ask a question about the image
question = "What color is the car in this image?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
Using LAVIS library
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_opt",
model_type="pretrain_opt2.7b",
is_eval=True,
device=device
)
# Process image
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
# Caption
caption = model.generate({"image": image})
print(caption)
# VQA
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": question})
print(answer)
Core concepts
Architecture overview
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Q-Former │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Learned Queries (32 queries × 768 dim) │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Cross-Attention with Image Features │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼────────────────────────────┐ │
│ │ Self-Attention Layers (Transformer) │ │
│ └────────────────────────┬────────────────────────────┘ │
└───────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Frozen Vision Encoder │ Frozen LLM │
│ (ViT-G/14 from EVA-CLIP) │ (OPT or FlanT5) │
└─────────────────────────────────────────────────────────────┘
Model variants
| Model | LLM Backend | Size | Use Case |
|---|---|---|---|
blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
Q-Former components
| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
Advanced usage
Batch processing
from PIL import Image
import torch
# Load multiple images
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
"What is shown in this image?",
"Describe the scene.",
"What colors are prominent?",
"Is there a person in this image?"
]
# Process batch
inputs = processor(
images=images,
text=questions,
return_tensors="pt",
padding=True
).to("cuda", torch.float16)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
print(f"Q: {q}\nA: {a}\n")
Controlling generation
# Control generation parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=100,
min_length=20,
num_beams=5, # Beam search
no_repeat_ngram_size=2, # Avoid repetition
top_p=0.9, # Nucleus sampling
temperature=0.7, # Creativity
do_sample=True, # Enable sampling
)
# For deterministic output
generated_ids = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
do_sample=False,
)
Memory optimization
# 8-bit quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-6.7b",
quantization_config=quantization_config,
device_map="auto"
)
# 4-bit quantization (more aggressive)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-flan-t5-xxl",
quantization_config=quantization_config,
device_map="auto"
)
Image-text matching
# Using LAVIS for ITM (Image-Text Matching)
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(
name="blip2_image_text_matching",
model_type="pretrain",
is_eval=True,
device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a dog sitting on grass")
# Get matching score
itm_output = model({"image": image, "text_input": text}, match_head="itm")
itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
print(f"Match probability: {itm_scores[:, 1].item():.3f}")
Feature extraction
# Extract image features with Q-Former
from lavis.models import load_model_and_preprocess
model, vis_processors, _ = load_model_and_preprocess(
name="blip2_feature_extractor",
model_type="pretrain",
is_eval=True,
device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# Get features
features = model.extract_features({"image": image}, mode="image")
image_embeds = features.image_embeds # Shape: [1, 32, 768]
image_features = features.image_embeds_proj # Projected for matching
Common workflows
Workflow 1: Image captioning pipeline
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from pathlib import Path
class ImageCaptioner:
def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):
self.processor = Blip2Processor.from_pretrained(model_name)
self.model = Blip2ForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
def caption(self, image_path: str, prompt: str = None) -> str:
image = Image.open(image_path).convert("RGB")
if prompt:
inputs = self.processor(images=image, text=prompt, return_tensors="pt")
else:
inputs = self.processor(images=image, return_tensors="pt")
inputs = inputs.to("cuda", torch.float16)
generated_ids = self.model.generate(
**inputs,
max_new_tokens=50,
num_beams=5
)
return self.processor.decode(generated_ids[0], skip_special_tokens=True)
def caption_batch(self, image_paths: list, prompt: str = None) -> list:
images = [Image.open(p).convert("RGB") for p in image_paths]
if prompt:
inputs = self.processor(
images=images,
text=[prompt] * len(images),
return_tensors="pt",
padding=True
)
else:
inputs = self.processor(images=images, return_tensors="pt", padding=True)
inputs = inputs.to("cuda", torch.
---
*Content truncated.*
More by davila7
View all skills by davila7 →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
fastapi-templates
wshobson
Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.
Related MCP Servers
Browse all serversUnlock seamless Figma to code: streamline Figma to HTML with Framelink MCP Server for fast, accurate design-to-code work
Uno Platform — Documentation and prompts for building cross-platform .NET apps with a single codebase. Get guides, sampl
The fullstack MCP framework for developing MCP apps for ChatGPT, Claude, and building MCP servers for AI agents. Connect
Cipher empowers agents with persistent memory using vector databases and embeddings for seamless context retention and t
Unlock browser automation studio with Browserbase MCP Server. Enhance Selenium software testing and AI-driven workflows
Access shadcn/ui v4 components, blocks, and demos for rapid React UI library development. Seamless integration and sourc
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.