llava

Name: llava
Author: zechenzhangAGI

117views

7installs

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Install

mkdir -p .claude/skills/llava && curl -L -o skill.zip "https://mcp.directory/api/skills/download/203" && unzip -o skill.zip -d .claude/skills/llava && rm skill.zip

Installs to .claude/skills/llava

About this skill

LLaVA - Large Language and Vision Assistant

Open-source vision-language model for conversational image understanding.

When to use LLaVA

Use when:

Building vision-language chatbots
Visual question answering (VQA)
Image description and captioning
Multi-turn image conversations
Visual instruction following
Document understanding with images

Metrics:

23,000+ GitHub stars
GPT-4V level capabilities (targeted)
Apache 2.0 License
Multiple model sizes (7B-34B params)

Use alternatives instead:

GPT-4V: Highest quality, API-based
CLIP: Simple zero-shot classification
BLIP-2: Better for captioning only
Flamingo: Research, not open-source

Quick start

Installation

# Clone repository
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA

# Install
pip install -e .

Basic usage

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load image
image = Image.open("image.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )

response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)

Available models

Model	Parameters	VRAM	Quality
LLaVA-v1.5-7B	7B	~14 GB	Good
LLaVA-v1.5-13B	13B	~28 GB	Better
LLaVA-v1.6-34B	34B	~70 GB	Best

# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization for lower VRAM
load_4bit = True  # Reduces VRAM by ~4×

CLI usage

# Single image query
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
# Then type questions interactively

Web UI (Gradio)

# Launch Gradio interface
python -m llava.serve.gradio_web_server \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit  # Optional: reduce VRAM

# Access at http://localhost:7860

Multi-turn conversations

# Initialize conversation
conv = conv_templates["llava_v1"].copy()

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # Add previous response
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)

Common tasks

Image captioning

question = "Describe this image in detail."
response = ask(model, image, question)

Visual question answering

question = "How many people are in the image?"
response = ask(model, image, question)

Object detection (textual)

question = "List all the objects you can see in this image."
response = ask(model, image, question)

Scene understanding

question = "What is happening in this scene?"
response = ask(model, image, question)

Document understanding

question = "What is the main topic of this document?"
response = ask(model, document_image, question)

Training custom model

# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: Visual instruction tuning (150K instruction data)
bash scripts/v1_5/finetune.sh

Quantization (reduce VRAM)

# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # Reduces VRAM ~4×
)

# 8-bit quantization
load_8bit=True  # Reduces VRAM ~2×

Best practices

Start with 7B model - Good quality, manageable VRAM
Use 4-bit quantization - Reduces VRAM significantly
GPU required - CPU inference extremely slow
Clear prompts - Specific questions get better answers
Multi-turn conversations - Maintain conversation context
Temperature 0.2-0.7 - Balance creativity/consistency
max_new_tokens 512-1024 - For detailed responses
Batch processing - Process multiple images sequentially

Performance

Model	VRAM (FP16)	VRAM (4-bit)	Speed (tokens/s)
7B	~14 GB	~4 GB	~20
13B	~28 GB	~8 GB	~12
34B	~70 GB	~18 GB	~5

On A100 GPU

Benchmarks

LLaVA achieves competitive scores on:

VQAv2: 78.5%
GQA: 62.0%
MM-Vet: 35.4%
MMBench: 64.3%

Limitations

Hallucinations - May describe things not in image
Spatial reasoning - Struggles with precise locations
Small text - Difficulty reading fine print
Object counting - Imprecise for many objects
VRAM requirements - Need powerful GPU
Inference speed - Slower than CLIP

Integration with frameworks

LangChain

from langchain.llms.base import LLM

class LLaVALLM(LLM):
    def _call(self, prompt, stop=None):
        # Custom LLaVA inference
        return response

llm = LLaVALLM()

Gradio App

import gradio as gr

def chat(image, text, history):
    response = ask_llava(model, image, text)
    return response

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()

Resources

GitHub: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
Paper: https://arxiv.org/abs/2304.08485
Demo: https://llava.hliu.cc
Models: https://huggingface.co/liuhaotian
License: Apache 2.0

More by zechenzhangAGI

View all skills by zechenzhangAGI →

langchain

zechenzhangAGI

Framework for building LLM-powered applications with agents, chains, and RAG. Supports multiple providers (OpenAI, Anthropic, Google), 500+ integrations, ReAct agents, tool calling, memory management, and vector store retrieval. Use for building chatbots, question-answering systems, autonomous agents, or RAG applications. Best for rapid prototyping and production deployments.

14328

llama-cpp

zechenzhangAGI

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

50721

qdrant-vector-search

zechenzhangAGI

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

16118

unsloth

zechenzhangAGI

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

13717

llama-factory

zechenzhangAGI

Expert guidance for fine-tuning LLMs with LLaMA-Factory - WebUI no-code, 100+ models, 2/3/4/5/6/8-bit QLoRA, multimodal support

11615

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

2,6152,343

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

2,1121,621

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

3,4411,494

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

2,1961,420

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

2,3181,177

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,888941

Related MCP Servers

Browse all servers

Playwright Browser Automation

Enhance software testing with Playwright MCP: Fast, reliable browser automation, an innovative alternative to Selenium software testing tools.

28,44922 tools

Mobile Next

Mobile Next offers fast, seamless mobile automation for iOS and Android. Automate apps, extract data, and simplify mobile workflows effortlessly.

3,75219 tools

Supabase MCP Server

Connect Supabase projects to AI with Supabase MCP Server. Standardize LLM communication for secure, efficient development and data management.

2,5160 tools

Unity

Empower your Unity projects with Unity-MCP: AI-driven control, seamless integration, and advanced workflows within the Unity Editor.

1,2310 tools

Meta Ads

Optimize Facebook ad campaigns with AI-driven insights, creative analysis, and campaign control in Meta Ads Manager for Facebook.

5940 tools

IDA Pro

IDA Pro software enables programmatic access to IDA disassembler databases for automated reverse engineering and binary analysis tools.

52519 tools

Install

mkdir -p .claude/skills/llava && curl -L -o skill.zip "https://mcp.directory/api/skills/download/203" && unzip -o skill.zip -d .claude/skills/llava && rm skill.zip

Installs to .claude/skills/llava

Stats

Views

117

Installs

Author

zechenzhangAGI

6 skills published

Links

Source Code

llava

Install

About this skill

LLaVA - Large Language and Vision Assistant

When to use LLaVA

Quick start

Installation

Basic usage

Available models

CLI usage

Web UI (Gradio)

Multi-turn conversations

Common tasks

Image captioning

Visual question answering

Object detection (textual)

Scene understanding

Document understanding

Training custom model

Quantization (reduce VRAM)

Best practices

Performance

Benchmarks

Limitations

Integration with frameworks

LangChain

Gradio App

Resources

More by zechenzhangAGI

langchain

llama-cpp

qdrant-vector-search

unsloth

llama-factory

You might also like

ui-ux-pro-max

flutter-development

pdf-to-markdown

drawio-diagrams-enhanced

godot

nano-banana-pro

Related MCP Servers