Vision-Language Models (VLMs) are a powerful class of multimodal AI systems that understand both images and text, enabling capabilities such as:
- Image captioning
- Visual question answering (VQA)
- Image classification with natural language descriptions
- Text-to-image retrieval and vice versa
- Instruction following based on visual input
🧠 What Are Vision-Language Models (VLMs)?
VLMs combine computer vision models (like CNNs or Vision Transformers) with language models (like Transformers or LLMs), learning to align visual features with semantic meaning in text.
They’re typically trained on image-caption pairs (e.g., COCO, Conceptual Captions, LAION) using contrastive learning (like CLIP) or causal language modeling objectives.
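To make the contrastive objective concrete, here is a minimal CLIP-style loss sketch in PyTorch. The encoders themselves are omitted, and the random tensors at the bottom merely stand in for image and text embeddings; this illustrates the symmetric InfoNCE loss, not any particular model's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched pairs.

    image_emb, text_emb: (batch_size, dim) tensors where row i of each
    tensor comes from the same image-caption pair.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```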
🔍 Categories of VLMs
Type | Description | Examples |
---|---|---|
Dual-Encoder Models | Encode image and text separately, then compute similarity | CLIP, ALIGN, FILIP |
Fusion / Multimodal Encoder Models | Fuse image and text into a joint representation for tasks like VQA | Flamingo, BLIP, BLIP-2, Kosmos-1, Florence |
LLM-Based VLMs | Bridge a (typically frozen) vision encoder to an LLM (e.g., LLaMA, Mistral) | LLaVA, MiniGPT-4, InstructBLIP, OpenFlamingo, Qwen-VL |
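The dual-encoder row maps directly onto how CLIP is used in practice: encode the image and each candidate caption separately, then rank the captions by similarity. A small zero-shot sketch with Hugging Face Transformers and the public openai/clip-vit-base-patch32 checkpoint:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image and a few candidate text labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of two cats", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text, shape (1, num_labels)
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```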
🧪 Popular Open Source VLMs
Here’s a list of open-source and self-hostable VLMs, including their architecture type and license:
Model | Architecture | Modality | License | Notes |
---|---|---|---|---|
LLaVA | Vision Encoder + LLaMA | Image + Text | MIT | Strong performance; easy to run locally |
LLaVA-Next / LLaVA-NeXT-Video | Improved version, supports video | Image/Video + Text | MIT | Better reasoning & video understanding |
MiniGPT-4 | Vision Encoder + Vicuna/Qwen | Image + Text | Apache 2.0 | Generates high-quality captions |
InstructBLIP | BLIP-2 + instruction tuning | Image + Text | MIT | Good for detailed instructions |
OpenFlamingo | Multi-modal model with cross-attention | Image + Text | Apache 2.0 | Inspired by Flamingo |
Kosmos-1 | Multimodal sequence modeling | Image + Text | MIT | Foundation model from Microsoft |
Florence | Vision Transformer + Transformer | Image + Text | Microsoft Research License | Large-scale foundation model |
Qwen-VL | Qwen + Vision Encoder | Image + Text | Tongyi Lab License | High performance from Alibaba |
InternVL | ViT + LLM | Image + Text | Apache 2.0 | SOTA performance on many benchmarks |
Otter | Flamingo-style | Image + Text | MIT | Multimodal in-context instruction tuning |
OFv2 | Improved OpenFlamingo | Image + Text | Apache 2.0 | More capable than v1 |
🛠 How to Run an Open Source VLM Locally?
Example: Running LLaVA with llama.cpp or Hugging Face Transformers
Step 1: Download the model

```bash
# git-lfs is needed so the weights (not just LFS pointer files) are fetched
git lfs install
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
```
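If you would rather avoid git-lfs, the huggingface_hub Python package can fetch the same repository into the local cache; this is just the standard snapshot_download pattern:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo into the local Hugging Face cache
# and returns the local directory path.
local_dir = snapshot_download(repo_id="liuhaotian/llava-v1.5-7b")
print(local_dir)
```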
Step 2: Run with Hugging Face Transformers (Python)

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Use the Transformers-compatible checkpoint; the original
# liuhaotian/llava-v1.5-7b repo is in the LLaVA-codebase format.
model_id = "llava-hf/llava-1.5-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 prompt format: the <image> token marks where the image goes
prompt = "USER: <image>\nWhat is the color of the cat?\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```
Alternative: Run via Ollama (Simpler CLI Tool)
If your model is supported on Ollama, you can just run:
```bash
ollama pull llava
ollama run llava
```
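Ollama also exposes a local HTTP API (port 11434 by default), so you can query the llava model from Python. The sketch below assumes Ollama is already running and that a cat.jpg file exists next to the script:

```python
import base64
import requests

# Base64-encode the image, as Ollama's /api/generate endpoint expects
with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What is the color of the cat?",
        "images": [image_b64],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```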
💾 Hardware Considerations
Most open-source VLMs require GPU acceleration for real-time inference. Here’s what you need:
Model Size | GPU VRAM Required | Notes |
---|---|---|
LLaVA 7B | >= 12GB VRAM | Can be quantized to 4–8 GB |
LLaVA 13B | >= 24GB VRAM | Requires larger GPUs or multi-GPU |
Qwen-VL | >= 24GB | Larger models provide better accuracy |
InternVL | 12–24GB | Optimized versions available |
Otter / MiniGPT-4 | ~8–12GB | Lightweight but still powerful |
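These figures follow from a rough rule of thumb: model weights dominate memory, at about 2 bytes per parameter in fp16 and roughly 0.5 bytes per parameter with 4-bit quantization, plus overhead for the vision tower, activations, and the KV cache. A quick back-of-the-envelope check (the 1.2 overhead factor is an assumption, not a measurement):

```python
def rough_vram_gb(params_billion, bytes_per_param, overhead=1.2):
    """Very rough weight-memory estimate; ignores long-context KV caches."""
    return params_billion * 1e9 * bytes_per_param * overhead / 1024**3

print(f"7B  fp16 : ~{rough_vram_gb(7, 2):.1f} GB")    # ~16 GB
print(f"7B  4-bit: ~{rough_vram_gb(7, 0.5):.1f} GB")  # ~4 GB
print(f"13B fp16 : ~{rough_vram_gb(13, 2):.1f} GB")   # ~29 GB
```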
Tip: You can use quantized versions (e.g., GGUF) for running on smaller GPUs or even CPUs with tools like llama.cpp.
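For the Transformers route, a similar memory saving comes from bitsandbytes 4-bit loading (requires a CUDA GPU with the bitsandbytes package installed; this is the standard quantization_config pattern applied to the LLaVA checkpoint used earlier):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

# 4-bit NF4 quantization: weights take ~0.5 bytes/param instead of 2 (fp16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# The generate() call is the same as in the example above.
```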
🧬 Applications of Open Source VLMs
- Visual assistants (help users describe surroundings)
- Education: Explain diagrams, charts, equations
- Accessibility: Describe images for visually impaired users
- E-commerce: Understand product photos + generate descriptions
- Robotics: Combine perception + reasoning for complex tasks
📚 Datasets Commonly Used for Training VLMs
Dataset | Purpose | Description |
---|---|---|
COCO | Captioning, Detection | 120k images with captions |
Conceptual Captions | Scale-up pre-training | Millions of image-text pairs from web |
LAION | Image-text pairs | Billions of pairs scraped from the internet |
Visual Genome | Detailed scene understanding | Rich annotations (objects, relations, QA) |
OK-VQA, A-OKVQA | Reasoning over images | Requires outside knowledge |
ScienceQA | Science reasoning | Combines diagrams and questions |
DocVQA | Document understanding | OCR + reasoning over document images |
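At training time, most of these datasets reduce to (image, text) pairs. A minimal PyTorch Dataset over such pairs, assuming a local captions.json shaped like [{"image": "path.jpg", "caption": "..."}] (the file name and schema are illustrative, not from any specific dataset release):

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    """Yields (PIL image, caption string) pairs for VLM pre-training."""

    def __init__(self, annotations_path, transform=None):
        with open(annotations_path) as f:
            self.items = json.load(f)  # [{"image": ..., "caption": ...}, ...]
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = Image.open(item["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, item["caption"]

# dataset = ImageCaptionDataset("captions.json")
# image, caption = dataset[0]
```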
🧠 Advanced Topics
- Multimodal Prompt Tuning
- LoRA fine-tuning for domain-specific adaptation (see the sketch after this list)
- Multilingual VLMs (e.g., support for non-English languages)
- Video understanding (e.g., LLaVA-NeXT-Video)
- Agent-based workflows (e.g., VLM-controlled agents or robots)
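As an illustration of the LoRA item above, the peft library's usual pattern attaches low-rank adapters to the language model's attention projections; the target module names below are typical for LLaMA-style decoders and may need adjusting for other VLMs:

```python
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=8,            # rank of the low-rank update matrices
    lora_alpha=16,  # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the LLM
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```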
✅ Learn More
- Hugging Face Models Hub (for VLMs): https://huggingface.co/models?pipeline_tag=image-to-text
- LLaVA GitHub Repo: https://github.com/haotian-liu/LLaVA
- OpenFlamingo Repository: https://github.com/mlfoundations/open_flamingo
- InternVL GitHub: https://github.com/OpenGVLab/InternVL
- Ollama (supports some VLMs): https://ollama.ai