Image RAG

Upload images to your knowledge base, have them captioned by AI, and retrieve them via semantic search when users ask relevant questions.

Visual Knowledge Base

Image RAG extends your knowledge base beyond text. Upload product photos, diagrams, screenshots, or any other visual content, and ChatRAG will caption it and retrieve it when contextually relevant.

How It Works

1. Upload

Upload images via the Config UI, with an optional context description.

2. Caption

GPT-4o Vision analyzes the image and generates a detailed description.

3. Embed

The caption is embedded as a vector for semantic search.

4. Display

The image appears in chat when a query matches its caption.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      IMAGE RAG DATA FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  UPLOAD PHASE                                                    │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────┐ │
│  │ Config UI    │───▶│ /api/upload      │───▶│ GPT-4o Vision  │ │
│  │ Image Upload │    │ (process image)  │    │ (caption)      │ │
│  └──────────────┘    └──────────────────┘    └────────────────┘ │
│                                                      │           │
│                                                      ▼           │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ document_chunks table                                       │ │
│  │   - content: AI-generated caption                          │ │
│  │   - embedding: vector from caption                         │ │
│  │   - metadata.image_url: Supabase storage URL               │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  RETRIEVAL PHASE                                                 │
│  ┌──────────────┐    ┌──────────────────┐    ┌────────────────┐ │
│  │ User Query   │───▶│ match_documents  │───▶│ Stream to UI   │ │
│  │ "Show me X"  │    │ (vector search)  │    │ (source_images)│ │
│  └──────────────┘    └──────────────────┘    └────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
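The upload phase in the diagram can be sketched as a small pipeline. This is a minimal illustration, not the code in src/lib/document-processor.ts: the ImageChunkRow shape only mirrors the three fields named in the diagram, and ImageRagDeps, buildImageChunk, and processImageForRag are hypothetical names. The captioning and embedding calls are injected so the sketch does not assume any particular SDK.

```typescript
// Hypothetical row shape, mirroring the document_chunks fields in the diagram.
interface ImageChunkRow {
  content: string;                 // AI-generated caption
  embedding: number[];             // vector computed from the caption
  metadata: { image_url: string }; // storage URL of the uploaded image
}

// Hypothetical dependencies; real implementations would call a vision model
// and an embedding model.
interface ImageRagDeps {
  captionImage(imageUrl: string, context?: string): Promise<string>;
  embedText(text: string): Promise<number[]>;
}

// Assemble the row that would be stored for one image.
function buildImageChunk(
  caption: string,
  embedding: number[],
  imageUrl: string
): ImageChunkRow {
  const trimmed = caption.trim();
  if (trimmed === "") throw new Error("caption must not be empty");
  if (embedding.length === 0) throw new Error("embedding must not be empty");
  return { content: trimmed, embedding, metadata: { image_url: imageUrl } };
}

// Caption first, then embed the caption (not the image pixels).
async function processImageForRag(
  deps: ImageRagDeps,
  imageUrl: string,
  context?: string
): Promise<ImageChunkRow> {
  const caption = await deps.captionImage(imageUrl, context);
  const embedding = await deps.embedText(caption);
  return buildImageChunk(caption, embedding, imageUrl);
}
```

Because the embedding is computed from the caption, retrieval quality depends on how well the caption describes the image, which is why the optional context description matters.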

Uploading Images

Via Config Dashboard

1. Run npm run config and open http://localhost:3333
2. Navigate to the "Image RAG Uploads" tab
3. Click "Choose File" and select your image (PNG, JPG, WebP supported)
4. (Optional) Add a context description to improve retrieval accuracy
5. Click "Upload for RAG"

Context Descriptions

Adding context helps the AI generate better captions and improves retrieval:

Good examples:
  "Company logo used in marketing materials"
  "Product screenshot showing the dashboard"
  "Team photo from Q4 2024 retreat"
  "Architecture diagram of the payment system"

Bad examples:
  "image1.png"
  "screenshot"
  (blank)

Pro Tip: Batch Uploads

You can upload multiple images, but each is processed sequentially. Give each image a unique context description for best retrieval accuracy.

Retrieval Behavior

Semantic Matching

Images are retrieved based on semantic similarity between the user's query and the AI-generated caption. The query "Show me the company logo" will match an image captioned "A blue and white logo showing the ChatRAG brand identity."
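To make the matching step concrete, here is a minimal sketch of similarity ranking. It assumes cosine similarity between the query embedding and each caption embedding, which is a common choice but is not confirmed here as what match_documents computes. The 0.25 default is the image similarity threshold mentioned under Troubleshooting; cosineSimilarity, ImageCandidate, and rankImageMatches are hypothetical names.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector matches nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical candidate shape: one entry per captioned image.
interface ImageCandidate {
  id: string;
  embedding: number[];
}

// Keep candidates at or above the threshold, best match first.
function rankImageMatches(
  queryEmbedding: number[],
  candidates: ImageCandidate[],
  threshold: number = 0.25
): { id: string; similarity: number }[] {
  return candidates
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((m) => m.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity);
}
```

With a threshold this low, loosely related captions can still pass the filter, so distinct captions do more to separate images than the threshold does.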

Display Position

Retrieved images appear above the AI's text response, so they are immediately visible. Images are left-aligned: a single image is shown at a larger size, while multiple images are arranged in a grid.

Images Are NOT Cited

Unlike text documents, images are not cited in the AI's response sources. They are displayed visually but filtered from the LLM context to prevent the AI from citing filenames like "logo.png" in its response.
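The filtering described above can be sketched as a split over the retrieved chunks: chunks that carry metadata.image_url go to the UI as source images, and everything else goes to the LLM as context. This is an illustrative sketch only; RetrievedChunk and splitRetrievedChunks are hypothetical names, and the actual logic in src/app/api/chat/route.ts may differ.

```typescript
// Hypothetical shape of a chunk returned by vector search.
interface RetrievedChunk {
  content: string;
  metadata: { image_url?: string };
}

interface SplitResult {
  contextChunks: RetrievedChunk[]; // passed to the LLM and eligible for citation
  sourceImages: string[];          // displayed in chat, never shown to the LLM
}

function splitRetrievedChunks(chunks: RetrievedChunk[]): SplitResult {
  const contextChunks: RetrievedChunk[] = [];
  const sourceImages: string[] = [];
  for (const chunk of chunks) {
    const url = chunk.metadata.image_url;
    if (typeof url === "string" && url !== "") {
      // Image chunk: keep the URL once, drop the caption from the LLM context.
      if (sourceImages.indexOf(url) === -1) sourceImages.push(url);
    } else {
      contextChunks.push(chunk);
    }
  }
  return { contextChunks, sourceImages };
}
```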

Voice Integration

Image RAG also works with the Voice Agent. Users can speak queries such as:

"Show me the ChatRAG logo"

Retrieves and displays the logo image

"What does the dashboard look like?"

Shows relevant dashboard screenshots

Voice queries are automatically rewritten for better RAG retrieval. "Show me its pricing" becomes "Show me ChatRAG's pricing page" for improved accuracy.

Managing Uploaded Images

View Uploaded Images

The Config UI shows a visual grid of all uploaded images with hover previews. Images are displayed in a 2-row horizontal scrolling layout.

Delete Images

Hover over any image to see the delete button. Clicking delete removes:

The document record from the documents table
Associated chunks from the document_chunks table
The file from Supabase Storage

Technical Requirements

Requirement	Details
OpenAI API Key	Required for GPT-4o Vision captioning
Supabase Storage	`chat-images` bucket must exist
Database Schema	`document_chunks` needs `metadata` JSONB column
Supported Formats	PNG, JPG, JPEG, WebP, GIF
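A pre-upload check against the Supported Formats row might look like the following sketch. It validates by file extension only, which is an assumption made for illustration; the upload route may instead inspect the MIME type or file contents. SUPPORTED_EXTENSIONS and isSupportedImage are hypothetical names.

```typescript
// Extensions taken from the Supported Formats row above.
const SUPPORTED_EXTENSIONS: readonly string[] = ["png", "jpg", "jpeg", "webp", "gif"];

// True when the filename ends in one of the supported extensions
// (case-insensitive). Names with no extension, or nothing before the
// dot, are rejected.
function isSupportedImage(filename: string): boolean {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0 || dot === filename.length - 1) return false;
  const extension = filename.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.indexOf(extension) !== -1;
}
```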

Troubleshooting

Image retrieved but not displayed

Check browser console for [Stream] Received source_images event
Verify metadata.image_url exists in the chunk record
Ensure the image URL is publicly accessible
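The last two checks can be scripted. The sketch below is a hypothetical helper, not part of ChatRAG: diagnoseImageChunk inspects a chunk record for a usable metadata.image_url, and isPubliclyAccessible probes the URL through an injected request function so that it assumes nothing about the runtime.

```typescript
// Hypothetical, loosely typed view of a document_chunks record.
interface ChunkRecord {
  metadata?: { image_url?: unknown } | null;
}

// Returns a list of problems; an empty list means the record looks usable.
function diagnoseImageChunk(chunk: ChunkRecord): string[] {
  const problems: string[] = [];
  const url = chunk.metadata?.image_url;
  if (typeof url !== "string" || url.trim() === "") {
    problems.push("metadata.image_url is missing or empty");
    return problems;
  }
  if (!/^https?:\/\//i.test(url)) {
    problems.push("metadata.image_url is not an absolute http(s) URL");
  }
  return problems;
}

// Minimal request signature so the caller decides how the probe is made.
type HeadRequest = (url: string) => Promise<{ ok: boolean }>;

// True when an unauthenticated request to the URL succeeds.
async function isPubliclyAccessible(url: string, head: HeadRequest): Promise<boolean> {
  try {
    const response = await head(url);
    return response.ok;
  } catch (_err) {
    return false; // network errors count as "not accessible"
  }
}
```

In a runtime that provides the standard fetch API, the probe could be passed as `(u) => fetch(u, { method: "HEAD" })`. Run it from a context without your Supabase session so that a private bucket does not pass the check by accident.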

Wrong image retrieved

Add more specific context descriptions when uploading
Use distinct keywords in captions (e.g., "logo" vs "team photo")
The image similarity threshold is 0.25, which is permissive: images with low similarity scores can still pass it and may be the wrong match

Image appears during stream but disappears

This indicates a frontend issue with message preservation
Check that source_images is preserved in [Manual onFinish] logic
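One way this symptom arises is when the finished message replaces the streamed one and the replacement lacks source_images. The sketch below shows the kind of merge that avoids it. It is an assumption about the shape of the fix, not the actual onFinish code; ChatMessage and preserveSourceImages are hypothetical names.

```typescript
// Hypothetical message shape; only source_images comes from this page.
interface ChatMessage {
  id: string;
  content: string;
  source_images?: string[];
}

// When the finished message replaces the streamed one, carry over any images
// that arrived during streaming unless the finished message has its own.
function preserveSourceImages(streamed: ChatMessage, finished: ChatMessage): ChatMessage {
  const hasOwnImages =
    finished.source_images !== undefined && finished.source_images.length > 0;
  const images = hasOwnImages ? finished.source_images : streamed.source_images;
  return images !== undefined ? { ...finished, source_images: images } : finished;
}
```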

Upload succeeds but image doesn't appear in grid

Wait a few seconds - processing happens in background
Click the "Refresh" button in the Config UI
Check server logs for [Upload] Image RAG processing completed

Key Implementation Files

File	Purpose
`src/lib/document-processor.ts`	processImageDocument() - Vision captioning & embedding
`src/app/api/upload/route.ts`	Triggers RAG processing for uploaded images
`src/app/api/chat/route.ts`	Injects source_images into response stream
`src/components/ui/source-images-grid.tsx`	Renders retrieved images in chat
`scripts/config-ui/index.html`	Image upload UI in Config Dashboard

Image RAG Features

AI Vision Captioning: GPT-4o generates searchable descriptions
Semantic Search: Find images by meaning, not just keywords
Voice Compatible: Works with Voice Agent queries
Left-Aligned Display: Images appear above text responses
Chat History: Images persist in saved conversations
