Concept #134 · Medium · extended-ai-concepts

What is vision language support in AI?

#gen-ai #llm

Answer

Vision Language Support in AI

Vision language support (often described as multimodal AI) is the ability of an AI model to process and understand visual inputs (images, screenshots, video frames) together with text. Models with this ability are called Vision Language Models (VLMs).

What It Enables

| Capability | Example |
|---|---|
| Image understanding | "What's in this photo?" |
| OCR / text extraction | Read text from screenshots |
| Code screenshot analysis | "Debug this error screenshot" |
| Chart/diagram reading | "Summarize this graph" |
| Visual Q&A | "How many people are in this image?" |
| Image comparison | "What changed between these two screenshots?" |
| Document analysis | Process PDFs with images |
| UI/UX review | "What's wrong with this interface design?" |

Models with Vision Support

| Model | Provider | Vision Capability |
|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Images, PDFs, screenshots |
| GPT-4o | OpenAI | Images, screenshots |
| Gemini 1.5 Pro | Google | Images, video, PDFs |
| LLaVA | Open source | Images |
| CogVLM | Zhipu AI | Images |
| Qwen-VL | Alibaba | Images |
| Moondream | Open source | Lightweight image understanding |

Using Vision with Claude API

python
from anthropic import Anthropic
import base64

client = Anthropic()

def analyze_image_file(image_path: str, question: str) -> str:
    # Read and encode image
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

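    # Determine media type from the file extension; reject unknown types
    # instead of silently labelling them as JPEG
    ext = image_path.lower().rsplit(".", 1)[-1]
    media_types = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                   "png": "image/png", "gif": "image/gif",
                   "webp": "image/webp"}
    if ext not in media_types:
        raise ValueError(f"Unsupported image type: {image_path}")
    media_type = media_types[ext]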

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": question
                }
            ]
        }]
    )
    return response.content[0].text

# Examples
result = analyze_image_file("error_screenshot.png",
    "What error is shown and how do I fix it?")

result = analyze_image_file("architecture_diagram.png",
    "Explain this system architecture")

result = analyze_image_file("chart.png",
    "What are the key trends shown in this chart?")

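The same message format can carry more than one image, which is what the image comparison capability relies on. The sketch below only builds the `content` list for such a request and makes no API call. The helper names `image_block` and `build_comparison_content` are hypothetical, and the "Image 1:" style labels are an assumed convention for helping the model tell the images apart, not a requirement of the API.

```python
import base64
from pathlib import Path

# Extension-to-media-type map, covering the same formats as the example above.
MEDIA_TYPES = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
               ".png": "image/png", ".gif": "image/gif",
               ".webp": "image/webp"}


def image_block(image_path: str) -> dict:
    """Build one base64 image content block for a local file (hypothetical helper)."""
    path = Path(image_path)
    media_type = MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image type: {path.suffix!r}")
    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    return {"type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data}}


def build_comparison_content(image_paths: list[str], question: str) -> list[dict]:
    """Label each image with a short text block, then append the question."""
    content = []
    for index, image_path in enumerate(image_paths, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(image_block(image_path))
    content.append({"type": "text", "text": question})
    return content
```

The returned list can be passed as the `content` of a single user message in `client.messages.create`, in place of the one-image list used above.
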
Using Vision with OpenAI

python
from openai import OpenAI
import base64

client = OpenAI()

def analyze_image_url(image_url: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.choices[0].message.content

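# Also works with base64 (sent as a data URL)
def analyze_image_base64(image_path: str, question: str) -> str:
    # The data URL should name the file's actual media type
    ext = image_path.lower().rsplit(".", 1)[-1]
    media_types = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                   "png": "image/png", "gif": "image/gif",
                   "webp": "image/webp"}
    if ext not in media_types:
        raise ValueError(f"Unsupported image type: {image_path}")

    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{media_types[ext]};base64,{b64}"}},
            {"type": "text", "text": question}
        ]}]
    )
    return response.choices[0].message.content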
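
The `image_url` part also accepts an optional `detail` field (`"low"`, `"high"`, or `"auto"`) that trades image fidelity against token cost. The sketch below only assembles the `messages` payload and sends nothing. The helper names `image_part` and `build_messages` are hypothetical, and the example URL is a placeholder.

```python
ALLOWED_DETAIL = {"low", "high", "auto"}


def image_part(url: str, detail: str = "auto") -> dict:
    """Build an image_url content part; url may be an https URL or a data URL."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def build_messages(url: str, question: str, detail: str = "auto") -> list[dict]:
    """Assemble a single-turn user message with one image and one question."""
    return [{"role": "user",
             "content": [image_part(url, detail),
                         {"type": "text", "text": question}]}]


# Placeholder URL; "low" asks for a cheaper, lower-resolution pass.
messages = build_messages("https://example.com/chart.png",
                          "What is the overall trend?", detail="low")
```

The resulting `messages` list can be passed to `client.chat.completions.create` exactly as in the functions above. A low-detail pass is often enough for coarse questions, while reading small text in a screenshot usually calls for `"high"`.
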

Architecture: How VLMs Work

text
Image Input
    ↓
Vision Encoder (e.g., CLIP ViT)
    ↓ image patches as embeddings
Cross-modal Fusion / Projection layer
    ↓ image tokens + text tokens
Transformer (LLM)
    ↓
Text Output

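The key step is converting an image into "visual tokens": the vision encoder splits the image into patches and embeds each one, and the projection layer maps those embeddings into the same embedding space as the LLM's text tokens, so the transformer can process both in a single sequence.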
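
As a rough illustration of the pipeline above, the toy sketch below turns a tiny grayscale image into patch vectors, projects them to the model's embedding width, and concatenates them with text-token embeddings. It is a simplification under stated assumptions: a single random linear projection stands in for both the vision encoder and the projection layer (a real encoder such as a ViT applies attention layers), the text embeddings are random stand-ins, and every name and size here is hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vectors, weights):
    """Linear projection: output[j] is the dot product of a vector with weights[j]."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weights]
            for vec in vectors]


rng = random.Random(0)
D_MODEL = 8   # assumed LLM embedding width (toy value)
PATCH = 4     # assumed patch size in pixels (toy value)

# Toy 8x8 grayscale image, giving 2 x 2 = 4 patches of 16 values each.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
patches = patchify(image, PATCH)

# Random weights stand in for the learned encoder and projection layer.
weights = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
visual_tokens = project(patches, weights)

# Random stand-ins for the embeddings of three text tokens.
text_tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

# The LLM sees one sequence: visual tokens followed by text tokens.
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 8
```

The point of the sketch is the shapes: after projection, each image patch has the same width as a text-token embedding, which is what lets the transformer treat both kinds of token uniformly.
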

Use Cases in Production

  • Customer support: Analyze product photos that customers upload
  • Code review: Diagnose screenshots of code or error messages
  • Document processing: Extract data from invoices, receipts, and medical forms
  • Quality control: Inspect product images for defects
  • Accessibility: Generate image descriptions for screen readers
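
For the document processing case, a common pattern is to ask the model for a JSON object and validate the reply before using it. The sketch below shows one way to do that without calling any API. The field names and both helpers (`build_extraction_prompt`, `parse_extraction`) are hypothetical, and the sample reply is hand-written for illustration, not real model output.

```python
import json

# Hypothetical fields for an invoice; adjust to the document type at hand.
INVOICE_FIELDS = ["vendor", "invoice_number", "total"]


def build_extraction_prompt(fields: list[str]) -> str:
    """Build a question asking the model to reply with one JSON object."""
    names = ", ".join(fields)
    return ("Extract the following fields from the document image and reply "
            f"with a single JSON object with exactly these keys: {names}. "
            "Use null for any field that is not visible.")


def parse_extraction(reply: str, fields: list[str]) -> dict:
    """Parse the reply, tolerating extra text around the JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: data[name] for name in fields}


# Hand-written stand-in for a model reply.
sample_reply = 'Here is the result:\n{"vendor": "Acme", "invoice_number": "INV-1", "total": 12.5}'
record = parse_extraction(sample_reply, INVOICE_FIELDS)
```

The prompt from `build_extraction_prompt` can be passed as the `question` argument of either image-analysis function shown earlier, and the returned text fed to `parse_extraction`. Validating the keys up front keeps malformed replies from reaching downstream systems.
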