Image-centric RAG augments (or replaces) text-only retrieval by indexing image embeddings directly. Instead of captioning images first (and losing detail), we embed images (e.g., CLIP) and run vector similarity search to fetch the most relevant visuals for a text or image query. LlamaIndex’s MultiModalVectorStoreIndex can store CLIP/VoyageAI embeddings in MongoDB Atlas, so a plain text query retrieves semantically similar images (and/or their captions) from one vector store—often more accurate than caption-only pipelines (OpenAI Cookbook; LlamaIndex → Mongo).
Atlas Vector Search is built-in (no extra fee for the feature), and even the Free Tier supports vector indexing—making image RAG cost-friendly for startups (Mongo forum; Mongo pricing).
Architecture at a glance
S3 (images) → Lambda (embeddings/captions) → MongoDB Atlas (vectors + metadata) → LlamaIndex (retriever) → LLM/UI
- Storage: Amazon S3 holds raw images (≈$0.023/GB-mo for Standard) and triggers processing on upload (S3 pricing guide).
- Compute: An embedding service (Lambda, SageMaker, or a small GPU container) generates vectors (CLIP, VoyageAI) and optional captions (BLIP → then text-embeddings). Lambda pricing is $0.20/million requests + $0.00001667/GB-s (AWS Lambda pricing).
- Index: MongoDB Atlas with a Vector Search index on embedding. LlamaIndex’s MongoDBAtlasVectorSearch adapter wires it up (LlamaIndex Mongo).
- Query: A user’s text or image query is embedded in the same space; LlamaIndex retrieves Top-K vectors (with optional metadata filters) and returns images + captions to the app/LLM (LlamaIndex multimodal example; OpenAI Cookbook).
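The compute step in this flow can be sketched as a minimal event handler. This is illustrative only: `embed_image` and the document store are stubbed stand-ins for a real CLIP/VoyageAI call and the Atlas collection, and all names here are hypothetical.

```python
# Sketch of the S3-event -> embed -> index step. embed_image() and DOCS are
# stubs; in production they would call CLIP/VoyageAI and MongoDB Atlas.

def embed_image(image_bytes: bytes) -> list[float]:
    # Stand-in for a CLIP/VoyageAI call; returns a fixed-dimension vector.
    return [0.0] * 512

DOCS: dict[str, dict] = {}  # stand-in for the Atlas "images" collection

def handle_upload(s3_key: str, image_bytes: bytes, meta: dict) -> dict:
    """Triggered per uploaded image: embed it and upsert vector + metadata."""
    doc = {
        "_id": s3_key,
        "s3_key": s3_key,
        "embedding": embed_image(image_bytes),
        "meta": meta,
    }
    # In production: coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    DOCS[s3_key] = doc
    return doc

doc = handle_upload("catalog/img_1.jpg", b"...", {"category": "car"})
print(len(doc["embedding"]))  # 512
```

The point of the sketch: one handler invocation per upload produces one vector document, which is what keeps the Lambda cost model per-image.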
Choosing embedding models (CLIP, VoyageAI, BLIP)
- Direct image embeddings: CLIP (e.g., ViT-B/32 via PyTorch) or VoyageAI multimodal map images and text into a shared vector space—perfect for text→image and image→image search (OpenAI Cookbook; LlamaIndex multimodal example).
- Caption-then-embed: If you need captions, run BLIP/BLIP-2 to generate one, then embed with a text model (e.g., OpenAI text-embedding). This is flexible, but tends to be lossier than CLIP-style direct image embeddings for nearest-neighbor retrieval (OpenAI Cookbook).
Implementation tip: Keep one canonical dimension (e.g., 512 or 768) across the corpus; don’t mix vector sizes in the same index.
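That tip can be enforced with a small guard at write time. A minimal sketch, assuming the 512-dimension example above; it rejects wrong-sized vectors and L2-normalizes before indexing:

```python
import math

DIM = 512  # one canonical dimension for the whole corpus

def prepare_embedding(vec: list[float], dim: int = DIM) -> list[float]:
    """Reject wrong-sized vectors and L2-normalize before indexing."""
    if len(vec) != dim:
        raise ValueError(f"expected {dim}-d vector, got {len(vec)}-d")
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in vec]

v = prepare_embedding([3.0, 4.0] + [0.0] * 510)
print(v[0], v[1])  # 0.6 0.8
```

With unit-length vectors, cosine similarity and dot-product rank identically, so either Atlas metric gives the same ordering.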
Ingestion pipeline (step-by-step)
- Upload to S3 (with metadata)
Store the image and record metadata (filename, tags, EXIF/GPS). S3 events will kick off embedding. Costs are tiny: 100 GB ≈ $2.30/mo, 1 TB ≈ $23/mo (S3 pricing guide).
- Embedding extraction (Lambda or endpoint)
- S3 event → Lambda pulls the image, calls CLIP/VoyageAI (local PyTorch, SageMaker endpoint, or Bedrock-hosted).
- Optional: run BLIP to create a caption and a text embedding for hybrid search.
Cost sanity check: 3 M images @ 120 ms each, 1.5 GB memory → ~540k GB-s. With 400k GB-s free + 1M free requests, net ≈ $2.33 (compute) + $0.40 (2M billable requests) ≈ $2.73 total (AWS Lambda pricing).
- Index in MongoDB Atlas (vectors + metadata)
Configure Vector Search on embedding (cosine or dot-product). Then store documents like:

{
  "_id": "img_123",
  "s3_key": "catalog/2025/08/23/img_123.jpg",
  "embedding": [/* d floats */],
  "caption": "vintage red coupe on city street",
  "meta": {"brand": "Acme", "category": "car", "uploadedAt": "2025-08-23T15:12:00Z"}
}
- Vector Search is included; you pay for the cluster (e.g., Shared/Free, or M20 ≈ $0.08/hr ≈ $60/mo) (Mongo pricing; forum).
- Build the LlamaIndex
Use MongoDBAtlasVectorSearch in the StorageContext, then build a MultiModalVectorStoreIndex from your image docs (LlamaIndex Mongo; multimodal example).
Minimal setup (illustrative)
Create Atlas vector index & build LlamaIndex
# pip install llama-index llama-index-vector-stores-mongodb pymongo
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient

MONGO_URI = "mongodb+srv://..."
client = MongoClient(MONGO_URI)

# 1) Configure Atlas Vector Search (one-time, in the Atlas UI or via code)
# Example (conceptual): dimensions=512, cosine similarity
# vector_store.create_vector_search_index(path="embedding", dimensions=512, similarity="cosine")

# 2) Wire the Mongo vector store into LlamaIndex
vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=client,
    db_name="image_search",
    collection_name="images",
    vector_index_name="embedding_index",
)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)

# 3) Embeddings already live in Mongo (written by the Lambda step), so build
#    the index directly on top of the vector store; if you embed here instead,
#    pass your image docs to VectorStoreIndex.from_documents(..., storage_context=storage_ctx)
index = VectorStoreIndex.from_vector_store(vector_store)

# 4) Query (text -> image)
retriever = index.as_retriever(similarity_top_k=6)
results = retriever.retrieve("red vintage cars at night")
for node in results:
    print(node.metadata.get("s3_key"), node.score)
(Exact helpers vary by version; align with the current LlamaIndex API and your Atlas index settings.)
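For reference, the one-time index in step 1 corresponds to a JSON definition along these lines in the Atlas UI or Admin API (shape per Atlas Vector Search docs; the filter field is optional and only needed if you want metadata pre-filtering in hybrid queries):

```json
{
  "fields": [
    { "type": "vector", "path": "embedding", "numDimensions": 512, "similarity": "cosine" },
    { "type": "filter", "path": "meta.category" }
  ]
}
```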
Docs: LlamaIndex Mongo adapter/API (link); multimodal example (link).
Hybrid retrieval: vector + filters
Blend semantic and structured search in one call:
- Vector: nearest neighbors in embedding.
- Filters: Mongo fields, e.g., { "meta.category": "car", "meta.uploadedAt": { "$gte": ... } }.
- LlamaIndex supports metadata filters + Top-K vector retrieval, e.g., “red sneakers” AND brand=Acme.
This yields precise results without over-fetching and keeps your index compact.
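Conceptually the combination is “filter first, then rank by similarity.” A brute-force stand-in for what Atlas does server-side, on hypothetical data (cosine on normalized vectors reduces to a dot product):

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def hybrid_search(docs: list[dict], query_vec: list[float],
                  category: str, top_k: int = 2) -> list[dict]:
    """Apply the metadata filter, then rank survivors by vector similarity."""
    candidates = [d for d in docs if d["meta"]["category"] == category]
    candidates.sort(key=lambda d: dot(d["embedding"], query_vec), reverse=True)
    return candidates[:top_k]

docs = [
    {"s3_key": "a.jpg", "embedding": [1.0, 0.0], "meta": {"category": "car"}},
    {"s3_key": "b.jpg", "embedding": [0.6, 0.8], "meta": {"category": "car"}},
    {"s3_key": "c.jpg", "embedding": [1.0, 0.0], "meta": {"category": "shoe"}},
]
hits = hybrid_search(docs, [1.0, 0.0], category="car")
print([h["s3_key"] for h in hits])  # ['a.jpg', 'b.jpg']
```

Filtering before ranking is why the index stays compact: non-matching documents never compete for the Top-K slots.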
Serving results
- Keep originals in S3; front with CloudFront for global, low-latency delivery.
- Return signed URLs to clients, or pipe results into an LLM for multimodal chat (“show and describe the top-3 images”).
- For “find similar to this image,” embed the query image client-side and hit the same vector store.
Cost & sizing cheat-sheet
- Atlas Vector Search: the feature is free; you pay for the cluster (Free/Shared tiers cost little to nothing; M20 ≈ $60/mo). Plenty for 10^5–10^6 images if vectors are small (forum; pricing).
- S3: $0.023/GB-mo (Standard). 50 GB ≈ $1.15/mo; 1 TB ≈ $23/mo (S3 guide).
- Lambda embedding jobs: essentially dollars-scale for millions of images, thanks to free-tiers + per-use pricing (Lambda pricing).
- Throughput: Use S3 events + Lambda concurrency for bursts; fall back to SageMaker or a small GPU service for heavy models/batching.
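The Lambda figure above (and the sanity check in the ingestion section) is plain arithmetic, assuming on-demand pricing and the standard always-free tier:

```python
# Reproduce the ~$2.73 estimate: 3M images, 120 ms each, 1.5 GB memory.
images = 3_000_000
seconds_each = 0.120
memory_gb = 1.5

gb_seconds = images * seconds_each * memory_gb        # 540,000 GB-s
billable_gb_s = max(0, gb_seconds - 400_000)          # minus 400k free GB-s
compute_cost = billable_gb_s * 0.0000166667           # ~$2.33

billable_requests = max(0, images - 1_000_000)        # minus 1M free requests
request_cost = billable_requests / 1_000_000 * 0.20   # $0.40

total = compute_cost + request_cost
print(round(total, 2))  # 2.73
```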
Best practices
- Canonicalize embeddings: uniform dims & metric (cosine vs dot).
- Normalize vectors: improves search stability.
- Store captions + EXIF: hybrid queries (“red coats” + city=Paris).
- Chunk big batches: throttle to respect Atlas write limits; use bulk writes.
- Version your models: keep embedding_v in docs; reindex selectively on upgrades.
- Test metrics: A/B CLIP vs BLIP-caption+text-embed on your data; CLIP often wins for pure image similarity (OpenAI Cookbook).
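Model versioning can be as simple as a version stamp per document plus a query for stale ones. A sketch, reusing the embedding_v field name suggested above:

```python
CURRENT_MODEL_V = 2

def stale_docs(docs: list[dict]) -> list[str]:
    """Return ids of documents embedded with an older model version."""
    return [d["_id"] for d in docs if d.get("embedding_v", 0) < CURRENT_MODEL_V]

docs = [
    {"_id": "img_1", "embedding_v": 2},
    {"_id": "img_2", "embedding_v": 1},
    {"_id": "img_3"},  # pre-versioning document, also stale
]
print(stale_docs(docs))  # ['img_2', 'img_3']
# The Mongo equivalent needs an $exists clause to catch missing fields:
#   {"$or": [{"embedding_v": {"$lt": CURRENT_MODEL_V}},
#            {"embedding_v": {"$exists": False}}]}
```

Re-embedding only the stale set keeps upgrade cost proportional to what actually changed.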
TL;DR
- Image-first RAG with CLIP/VoyageAI embeddings in MongoDB Atlas Vector Search improves accuracy over caption-only pipelines (OpenAI Cookbook).
- AWS + LlamaIndex gives a tiny, pay-as-you-go stack: S3 → Lambda → Atlas; LlamaIndex handles multimodal retrieval & filters (LlamaIndex Mongo; multimodal example).
- Costs stay low: Atlas small cluster (~$60/mo), S3 pennies/GB, Lambda dollars for millions of embeddings (Mongo pricing; S3; Lambda).
- Result: fast, accurate visual search + hybrid filters that make your images as searchable and actionable as text.
URL Index
- Multimodal RAG with CLIP (image search) — OpenAI Cookbook
  https://cookbook.openai.com/examples/custom_image_embedding_search
- Atlas Vector Search paid or free? — MongoDB Forum
  https://www.mongodb.com/community/forums/t/is-vector-search-feature-paid-or-free/267191
- LlamaIndex → MongoDB Atlas Vector Search (API)
  https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/mongodb/
- MongoDB Pricing (Shared/Dedicated incl. M20)
  https://www.mongodb.com/pricing
- S3 Pricing (guide/estimates)
  https://www.cloudzero.com/blog/s3-pricing/
- AWS Lambda Pricing (GB-s & free-tier)
  https://aws.amazon.com/lambda/pricing/
- LlamaIndex multimodal (VoyageAI + Mongo) example
  https://docs.llamaindex.ai/en/stable/examples/multi_modal/llamaindex_mongodb_voyageai_multimodal/