Document Processing

Upload and manage documents that power ChatRAG's retrieval-augmented generation system.

Two Ways to Manage Documents

In-App Dashboard: For end users at http://localhost:3000
Config UI: For admin workflows at http://localhost:3333

Document Management Methods

In-App Document Dashboard

Recommended for end users

Access at http://localhost:3000
User-friendly upload interface
Row Level Security (users see only their own documents)
Requires authentication

Configuration:

NEXT_PUBLIC_HIDE_DOCUMENT_DASHBOARD=false
NEXT_PUBLIC_READ_ONLY_DOCUMENTS_ENABLED=false

Config UI Admin Tools

For admin-oriented workflows

Access at http://localhost:3333
Bulk operations and reprocessing
Advanced configuration controls
Requires SUPABASE_SERVICE_ROLE_KEY

Access via:

npm run config

Document Upload Process

What happens when you upload a document:

File Upload

Document uploaded to Supabase Storage with secure access policies

LlamaCloud Parsing

LlamaCloud extracts text, tables, images, and metadata from the document

Intelligent Chunking

Content is split into semantic chunks (default: 2500 characters with a 992-character overlap)

Embedding Generation

OpenAI generates 1536-dimensional embeddings for each chunk

Database Storage

Chunks stored in document_chunks table with HNSW vector index

Ready for Retrieval

Once processing completes, the document is available for semantic search and RAG
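
As a rough illustration of what semantic search over stored embeddings involves, the sketch below ranks chunks by cosine similarity to a query vector. It is a simplification under stated assumptions: it uses toy 3-dimensional vectors instead of 1536-dimensional ones, and a brute-force scan instead of an HNSW index (which returns approximate nearest neighbors). The Chunk type and the rankChunks function are hypothetical names, not ChatRAG's API.

```typescript
// Illustrative only: brute-force cosine-similarity ranking over toy vectors.
// Chunk and rankChunks are hypothetical names, not part of ChatRAG.
interface Chunk {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function rankChunks(query: number[], chunks: Chunk[], topK: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, topK)
    .map((entry) => entry.chunk);
}

const toyChunks: Chunk[] = [
  { id: "a", embedding: [1, 0, 0] },
  { id: "b", embedding: [0, 1, 0] },
  { id: "c", embedding: [0.9, 0.1, 0] },
];
const topChunks = rankChunks([1, 0, 0], toyChunks, 2);
console.log(topChunks.map((c) => c.id)); // [ 'a', 'c' ]
```

In a real deployment the vector index performs this ranking inside the database; the sketch only shows the scoring idea.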

Supported Document Formats

PDF

Portable Document Format with OCR support

DOCX

Microsoft Word documents

TXT

Plain text files

HTML

Web pages and HTML documents

RTF

Rich Text Format

EPUB

E-book format

LlamaCloud Configuration

Configure document parsing behavior through environment variables:

Basic Configuration

LLAMA_CLOUD_API_KEY=llx-...
LLAMACLOUD_PARSING_MODE=balanced  # or "fast" or "premium"
LLAMACLOUD_CHUNK_STRATEGY=sentence
LLAMACLOUD_CHUNK_SIZE=2500
LLAMACLOUD_CHUNK_OVERLAP=992
LLAMACLOUD_MULTIMODAL_PARSING=true
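
To make the interaction between LLAMACLOUD_CHUNK_SIZE and LLAMACLOUD_CHUNK_OVERLAP concrete, here is a minimal sketch that splits text into fixed-size character windows with overlap. This is an assumption-laden simplification: the actual chunker is described as semantic (sentence-based), so real chunk boundaries will differ. The chunkWithOverlap function is a hypothetical name.

```typescript
// Illustrative only: fixed-size character windows with overlap.
// The real chunker is described as semantic; this sketch only shows
// how the size and overlap settings interact.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) {
    throw new Error("overlap must be >= 0 and smaller than size");
  }
  const stride = size - overlap; // 2500 - 992 = 1508 with the defaults
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "x".repeat(10000);
const pieces = chunkWithOverlap(sample, 2500, 992);
console.log(pieces.length); // 6
```

With the default values, each window starts 1508 characters after the previous one, so a 10,000-character document yields six windows under this scheme. A larger overlap repeats more text across neighboring chunks and therefore produces more chunks (and more embeddings) per document.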

Advanced Parsing

LLAMACLOUD_PARSE_MODE=parse_page_with_agent
LLAMACLOUD_PARSE_MODEL=anthropic-sonnet-4.0
LLAMACLOUD_HIGH_RES_OCR=true
LLAMACLOUD_ADAPTIVE_LONG_TABLE=true
LLAMACLOUD_OUTLINED_TABLE_EXTRACTION=true
LLAMACLOUD_OUTPUT_TABLES_AS_HTML=true

Parsing Modes

Fast: Quick processing, good for simple documents
Balanced: Recommended for most use cases
Premium: Maximum accuracy, slower processing
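
Since a mistyped mode or an overlap larger than the chunk size is easy to miss, it can help to validate these settings at startup. The following is a sketch only: ParsingConfig and readParsingConfig are hypothetical names, and falling back to "balanced" when the variable is unset is an assumption based on it being the recommended mode, not a documented default.

```typescript
// Illustrative only: validate parsing-related settings before use.
// ParsingConfig and readParsingConfig are hypothetical names.
type ParsingMode = "fast" | "balanced" | "premium";

interface ParsingConfig {
  mode: ParsingMode;
  chunkSize: number;
  chunkOverlap: number;
}

const PARSING_MODES: readonly ParsingMode[] = ["fast", "balanced", "premium"];

function readParsingConfig(env: Record<string, string | undefined>): ParsingConfig {
  // Assumption: use "balanced" when the variable is not set.
  const rawMode = env["LLAMACLOUD_PARSING_MODE"] ?? "balanced";
  const mode = PARSING_MODES.find((m) => m === rawMode);
  if (mode === undefined) {
    throw new Error(`Unknown LLAMACLOUD_PARSING_MODE: ${rawMode}`);
  }
  const chunkSize = Number(env["LLAMACLOUD_CHUNK_SIZE"] ?? "2500");
  const chunkOverlap = Number(env["LLAMACLOUD_CHUNK_OVERLAP"] ?? "992");
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("LLAMACLOUD_CHUNK_SIZE must be a positive integer");
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("LLAMACLOUD_CHUNK_OVERLAP must be between 0 and chunk size - 1");
  }
  return { mode, chunkSize, chunkOverlap };
}
```

In a Node.js process this could be called as readParsingConfig(process.env); passing the environment in as a parameter keeps the function easy to test.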

Admin Features

Admin Access Control

Designate admin users who can manage documents for all users:

Open Config UI → Admin section
Enter user's email address
Email must match existing Supabase user
Requires SUPABASE_SERVICE_ROLE_KEY

Document Reprocessing

Rebuild document index with updated settings:

node scripts/rag/reprocess-documents.js

Useful after changing chunking settings or upgrading embedding models

Read-Only Mode

Prevent users from uploading documents (admin-only dataset):

NEXT_PUBLIC_READ_ONLY_DOCUMENTS_ENABLED=true
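
One way to picture how this flag combines with NEXT_PUBLIC_HIDE_DOCUMENT_DASHBOARD is as a small function from environment values to UI capabilities. The mapping below is an assumption made for illustration; ChatRAG's actual logic may differ, and DocumentUiState, isEnabled, and documentUiState are hypothetical names.

```typescript
// Illustrative only: how the two flags might map to UI capabilities.
// The mapping is an assumption; ChatRAG's actual logic may differ.
interface DocumentUiState {
  showDashboard: boolean;
  canUpload: boolean;
}

function isEnabled(value: string | undefined): boolean {
  // Only the literal string "true" (case-insensitive) turns a flag on.
  return value !== undefined && value.trim().toLowerCase() === "true";
}

function documentUiState(env: Record<string, string | undefined>): DocumentUiState {
  const hidden = isEnabled(env["NEXT_PUBLIC_HIDE_DOCUMENT_DASHBOARD"]);
  const readOnly = isEnabled(env["NEXT_PUBLIC_READ_ONLY_DOCUMENTS_ENABLED"]);
  return {
    showDashboard: !hidden,
    // Uploading needs a visible dashboard and a writable dataset.
    canUpload: !hidden && !readOnly,
  };
}
```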

Verification Steps

Verify your document processing is working correctly:

Upload a Test Document

Choose a PDF or DOCX with known content you can query

Wait for Processing

Status will change from "Processing" to "Completed"

Ask About Document Content

Query a specific fact from your uploaded document

Verify AI Response

The AI should reference the uploaded content in its response

If the AI doesn't use the document content, verify that your system prompt includes the {{context}} placeholder
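
To show why a missing {{context}} placeholder leads to answers that ignore uploaded documents, here is a minimal sketch of placeholder substitution. It assumes the retrieved chunks are joined and substituted into the system prompt as plain text; buildSystemPrompt is a hypothetical helper, not ChatRAG's implementation.

```typescript
// Illustrative only: fill a {{context}} placeholder with retrieved chunks.
// buildSystemPrompt is a hypothetical helper, not ChatRAG's implementation.
const CONTEXT_PLACEHOLDER = "{{context}}";

function buildSystemPrompt(template: string, chunks: string[]): string {
  if (!template.includes(CONTEXT_PLACEHOLDER)) {
    // Without the placeholder there is nowhere to put retrieved text.
    throw new Error("System prompt is missing the {{context}} placeholder");
  }
  const context = chunks.join("\n\n");
  // split/join replaces every occurrence without regex escaping concerns.
  return template.split(CONTEXT_PLACEHOLDER).join(context);
}

const prompt = buildSystemPrompt(
  "Answer using only this context:\n{{context}}",
  ["Invoices are due in 30 days.", "Late fees are 2% per month."],
);
console.log(prompt);
```

If the template has no placeholder, the retrieved chunks never reach the model, which matches the symptom described above.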

Storage & Security

Storage Buckets

Documents are stored in Supabase Storage, and buckets are created automatically:

• Secure file storage
• Automatic cleanup on deletion
• CDN delivery for fast access

Row Level Security (RLS)

Multi-tenant isolation ensures users only see their documents:

• User-based access control
• Automatic policy enforcement
• Admin override capability

Document Processing Pipeline

ChatRAG's document processing includes:

Document Processor: LlamaCloud integration (15KB)
Semantic Chunker: Intelligent splitting (18KB)
Upload Utils: File handling and validation
Storage Integration: Supabase Storage with RLS
Database Tables: documents, document_chunks with HNSW indexes
