Vision-Language Models (VLMs) are a powerful class of multimodal AI systems that understand both images and text, enabling capabilities such as:
- Image captioning
- Visual question answering (VQA)
- Image classification with natural language descriptions
- Text-to-image retrieval and vice versa
- Instruction following based on visual input
🧠 What Are Vision-Language Models (VLMs)?
VLMs combine computer vision models (like CNNs or Vision Transformers) with language models (like Transformers or LLMs), learning to align visual features with semantic meaning in text.
They’re typically trained on image-caption pairs (e.g., COCO, Conceptual Captions, LAION) using contrastive learning (like CLIP) or causal language modeling objectives.
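To make the contrastive objective concrete, here is a minimal CLIP-style loss sketch in PyTorch. The encoders themselves are omitted, and the random tensors at the bottom merely stand in for image and text embeddings; this illustrates the symmetric InfoNCE loss, not any particular model's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched pairs.

    image_emb, text_emb: (batch_size, dim) tensors where row i of each
    tensor comes from the same image-caption pair.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```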
🔍 Categories of VLMs
Type | Description | Examples |
---|---|---|
Dual-Encoder Models | Encode image and text separately, then compute similarity | CLIP, ALIGN, FILIP |
Fusion / Multimodal Encoder Models | Fuse image and text into a joint representation for tasks like VQA | Flamingo, BLIP, BLIP-2, Kosmos-1, Florence |
LLM-Based VLMs | Bridge a (typically frozen) vision encoder to an LLM (e.g., LLaMA, Mistral) | LLaVA, MiniGPT-4, InstructBLIP, OpenFlamingo, Qwen-VL |
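The dual-encoder row maps directly onto how CLIP is used in practice: encode the image and each candidate caption separately, then rank the captions by similarity. A small zero-shot sketch with Hugging Face Transformers and the public openai/clip-vit-base-patch32 checkpoint:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image and a few candidate text labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of two cats", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text, shape (1, num_labels)
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```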
🧪 Popular Open Source VLMs
Here’s a list of open-source and self-hostable VLMs, including their architecture type and license:
Model | Architecture | Modality | License | Notes |
---|---|---|---|---|
LLaVA | Vision Encoder + LLaMA | Image + Text | MIT | Strong performance; easy to run locally |
LLaVA-Next / LLaVA-NeXT-Video | Improved version, supports video | Image/Video + Text | MIT | Better reasoning & video understanding |
MiniGPT-4 | Vision Encoder + Vicuna/Qwen | Image + Text | Apache 2.0 | Generates high-quality captions |
InstructBLIP | BLIP-2 + instruction tuning | Image + Text | MIT | Good for detailed instructions |
OpenFlamingo | Multi-modal model with cross-attention | Image + Text | Apache 2.0 | Inspired by Flamingo |
Kosmos-1 | Multimodal sequence modeling | Image + Text | MIT | Foundation model from Microsoft |
Florence | Vision Transformer + Transformer | Image + Text | Microsoft Research License | Large-scale foundation model |
Qwen-VL | Qwen + Vision Encoder | Image + Text | Tongyi Lab License | High performance from Alibaba |
InternVL | ViT + LLM | Image + Text | Apache 2.0 | SOTA performance on many benchmarks |
Otter | Flamingo-style | Image + Text | MIT | Multimodal in-context instruction tuning |
OFv2 | Improved OpenFlamingo | Image + Text | Apache 2.0 | More capable than v1 |
🛠 How to Run an Open Source VLM Locally?
Example: Running LLaVA with llama.cpp or Hugging Face Transformers
Step 1: Download the model

```bash
# git-lfs is needed so the weights (not just LFS pointer files) are fetched
git lfs install
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
```
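If you would rather avoid git-lfs, the huggingface_hub Python package can fetch the same repository into the local cache; this is just the standard snapshot_download pattern:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo into the local Hugging Face cache
# and returns the local directory path.
local_dir = snapshot_download(repo_id="liuhaotian/llava-v1.5-7b")
print(local_dir)
```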
Step 2: Run with Hugging Face Transformers (Python)

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Use the Transformers-compatible checkpoint; the original
# liuhaotian/llava-v1.5-7b repo is in the LLaVA-codebase format.
model_id = "llava-hf/llava-1.5-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 prompt format: the <image> token marks where the image goes
prompt = "USER: <image>\nWhat is the color of the cat?\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```
Alternative: Run via Ollama (Simpler CLI Tool)
If your model is supported on Ollama, you can just run:
```bash
ollama pull llava
ollama run llava
```
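Ollama also exposes a local HTTP API (port 11434 by default), so you can query the llava model from Python. The sketch below assumes Ollama is already running and that a cat.jpg file exists next to the script:

```python
import base64
import requests

# Base64-encode the image, as Ollama's /api/generate endpoint expects
with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What is the color of the cat?",
        "images": [image_b64],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```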
💾 Hardware Considerations
Most open-source VLMs require GPU acceleration for real-time inference. Here’s what you need:
Model Size | GPU VRAM Required | Notes |
---|---|---|
LLaVA 7B | >= 12GB VRAM | Can be quantized to 4–8 GB |
LLaVA 13B | >= 24GB VRAM | Requires larger GPUs or multi-GPU |
Qwen-VL | >= 24GB | Larger models provide better accuracy |
InternVL | 12–24GB | Optimized versions available |
Otter / MiniGPT-4 | ~8–12GB | Lightweight but still powerful |
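These figures follow from a rough rule of thumb: model weights dominate memory, at about 2 bytes per parameter in fp16 and roughly 0.5 bytes per parameter with 4-bit quantization, plus overhead for the vision tower, activations, and the KV cache. A quick back-of-the-envelope check (the 1.2 overhead factor is an assumption, not a measurement):

```python
def rough_vram_gb(params_billion, bytes_per_param, overhead=1.2):
    """Very rough weight-memory estimate; ignores long-context KV caches."""
    return params_billion * 1e9 * bytes_per_param * overhead / 1024**3

print(f"7B  fp16 : ~{rough_vram_gb(7, 2):.1f} GB")    # ~16 GB
print(f"7B  4-bit: ~{rough_vram_gb(7, 0.5):.1f} GB")  # ~4 GB
print(f"13B fp16 : ~{rough_vram_gb(13, 2):.1f} GB")   # ~29 GB
```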
Tip: You can use quantized versions (e.g., GGUF) for running on smaller GPUs or even CPUs with tools like llama.cpp.
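For the Transformers route, a similar memory saving comes from bitsandbytes 4-bit loading (requires a CUDA GPU with the bitsandbytes package installed; this is the standard quantization_config pattern applied to the LLaVA checkpoint used earlier):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

# 4-bit NF4 quantization: weights take ~0.5 bytes/param instead of 2 (fp16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# The generate() call is the same as in the example above.
```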
🧬 Applications of Open Source VLMs
- Visual assistants (help users describe surroundings)
- Education: Explain diagrams, charts, equations
- Accessibility: Describe images for visually impaired users
- E-commerce: Understand product photos + generate descriptions
- Robotics: Combine perception + reasoning for complex tasks
📚 Datasets Commonly Used for Training VLMs
Dataset | Purpose | Description |
---|---|---|
COCO | Captioning, Detection | 120k images with captions |
Conceptual Captions | Scale-up pre-training | Millions of image-text pairs from web |
LAION | Image-text pairs | Billions of pairs scraped from the internet |
Visual Genome | Detailed scene understanding | Rich annotations (objects, relations, QA) |
OK-VQA, A-OKVQA | Reasoning over images | Requires outside knowledge |
ScienceQA | Science reasoning | Combines diagrams and questions |
DocVQA | Document understanding | OCR + reasoning over document images |
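At training time, most of these datasets reduce to (image, text) pairs. A minimal PyTorch Dataset over such pairs, assuming a local captions.json shaped like [{"image": "path.jpg", "caption": "..."}] (the file name and schema are illustrative, not from any specific dataset release):

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class ImageCaptionDataset(Dataset):
    """Yields (PIL image, caption string) pairs for VLM pre-training."""

    def __init__(self, annotations_path, transform=None):
        with open(annotations_path) as f:
            self.items = json.load(f)  # [{"image": ..., "caption": ...}, ...]
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image = Image.open(item["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, item["caption"]

# dataset = ImageCaptionDataset("captions.json")
# image, caption = dataset[0]
```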
🧠 Advanced Topics
- Multimodal Prompt Tuning
- LoRA fine-tuning for domain-specific adaptation (see the sketch after this list)
- Multilingual VLMs (e.g., support for non-English languages)
- Video understanding (e.g., LLaVA-NeXT-Video)
- Agent-based workflows (e.g., VLM-controlled agents or robots)
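As an illustration of the LoRA item above, the peft library's usual pattern attaches low-rank adapters to the language model's attention projections; the target module names below are typical for LLaMA-style decoders and may need adjusting for other VLMs:

```python
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=8,            # rank of the low-rank update matrices
    lora_alpha=16,  # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the LLM
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```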
✅ Learn More
- Hugging Face Models Hub (for VLMs): https://huggingface.co/models?pipeline_tag=image-to-text
- LLaVA GitHub Repo: https://github.com/haotian-liu/LLaVA
- OpenFlamingo Repository: https://github.com/mlfoundations/open_flamingo
- InternVL GitHub: https://github.com/OpenGVLab/InternVL
- Ollama (supports some VLMs): https://ollama.ai