
Designing a Multi-Model Inference Routing System for Vision + LLM Workloads

Shawn Wilborne
August 27, 2025
6 min read

Modern AI apps often need to route requests across multiple models, handing images to vision models (e.g., YOLOv8) and text to LLMs (e.g., GPT or Claude). A solid routing layer pairs an API gateway with an orchestrator (e.g., Amazon API Gateway + AWS Step Functions) to dispatch each request to the best-fit backend based on cost, latency, or accuracy. It can also use LLM-assisted routing, where a classifier LLM decides which model to call (multi-LLM routing strategies). Think of it as an “OpenRouter-style” hub for mixed vision + LLM workloads that picks the right model for each request.

Figure: API Gateway fronts clients and invokes AWS Step Functions, which routes to backends (Lambda, ECS containers on Fargate or GPU-backed EC2, or SageMaker endpoints) and logs usage to a DB (Step Functions orchestration; multi-model inference reference).

Architecture Overview

Expose an Inference API via Amazon API Gateway, then trigger a Step Functions state machine that implements routing logic (Choice states or LLM-assisted routing for dynamic policies) (routing patterns & tiers). Once a model is chosen, Step Functions invokes the appropriate backend:

  • AWS Lambda for lightweight/stateless inference (calling OpenAI/Anthropic APIs, or small CPU vision). It’s pay-per-use and cheap (e.g., $0.20 per 1M requests + GB-seconds) (Lambda pricing).
  • Amazon ECS (Fargate) or a SageMaker endpoint for heavy models. YOLOv8 can run in a CPU container on Fargate (Fargate does not offer GPUs, so use ECS on GPU-backed EC2 instances if the container needs one) or as a SageMaker endpoint (e.g., ml.g4dn.xlarge ~$0.74/hr) (SageMaker instance pricing explainer; YOLOv8 on SageMaker guide). Use a Map state to fan out to multiple endpoints in parallel and then aggregate the results (multi-model orchestration).
  • External LLM APIs (OpenAI, Anthropic) when policy dictates; a Step Functions workflow can make these outbound calls as part of the flow, typically through a Lambda task or the HTTP Task integration.
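The Choice-based routing described above can be sketched as an Amazon States Language definition built in Python. This is a minimal sketch under stated assumptions, not the author's actual state machine: the `modality` input field and the three Lambda function names are hypothetical, and every backend is wrapped in a Lambda task for simplicity.

```python
import json

# Hypothetical Lambda function names; replace with your own resources.
VISION_FN = "vision-inference"      # e.g., wraps a YOLOv8 endpoint call
LLM_FN = "llm-inference"            # e.g., wraps an OpenAI/Anthropic API call
FALLBACK_FN = "default-inference"   # handles anything the rules do not match


def lambda_task(function_name: str) -> dict:
    """Return a terminal Task state that invokes a Lambda with the full input."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "OutputPath": "$.Payload",
        "End": True,
    }


def build_router_definition() -> dict:
    """Build a state machine that routes on a hypothetical `modality` field."""
    return {
        "Comment": "Route each request to a vision or LLM backend",
        "StartAt": "RouteByModality",
        "States": {
            "RouteByModality": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.modality", "StringEquals": "image",
                     "Next": "VisionInference"},
                    {"Variable": "$.modality", "StringEquals": "text",
                     "Next": "LlmInference"},
                ],
                "Default": "FallbackInference",
            },
            "VisionInference": lambda_task(VISION_FN),
            "LlmInference": lambda_task(LLM_FN),
            "FallbackInference": lambda_task(FALLBACK_FN),
        },
    }


# Serialized form, ready to pass to whatever tool deploys the state machine.
definition_json = json.dumps(build_router_definition(), indent=2)
```

Keeping the rules in the state machine rather than in application code means a policy change is a definition update, not a redeploy of every client.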

Key points: routing logic lives in Step Functions (Choice states or LLM-assisted) (routing options & trade-offs); parallel inference uses a Map state (or a Parallel state when the set of branches is fixed) to run, say, YOLOv8 and CLIP/LLM in tandem (reference pattern); elasticity comes from Lambda/Fargate/SageMaker autoscaling; and monitoring comes from logging each step to a database for analytics and billing (end-to-end Step Functions orchestration).
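As a rough illustration of the fan-out and aggregate step, the sketch below builds a Map state that invokes one Lambda per entry in a `targets` array and then merges the per-model results. The input shape (`targets`, `functionName`, `payload`), the `AggregateResults` state name, and the result fields (`model`, `output`) are all assumptions made for this example.

```python
def build_fanout_state(max_concurrency: int = 2) -> dict:
    """Map state: run one model invocation per item in $.targets."""
    return {
        "Type": "Map",
        "ItemsPath": "$.targets",
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "InvokeModel",
            "States": {
                "InvokeModel": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    # Each item names the (hypothetical) Lambda to call.
                    "Parameters": {
                        "FunctionName.$": "$.functionName",
                        "Payload.$": "$.payload",
                    },
                    "OutputPath": "$.Payload",
                    "End": True,
                }
            },
        },
        # Collected outputs land here as a list, one entry per target.
        "ResultPath": "$.modelResults",
        "Next": "AggregateResults",
    }


def aggregate(results: list[dict]) -> dict:
    """Merge per-model results into one dict keyed by model name."""
    return {r["model"]: r["output"] for r in results}


# Example of what an aggregator step might receive and return.
merged = aggregate([
    {"model": "yolov8", "output": ["dog", "bicycle"]},
    {"model": "clip", "output": "a dog next to a bicycle"},
])
```

A Parallel state is the closer fit when the branches differ structurally (for example, a vision branch with preprocessing and an LLM branch without); Map is convenient when every target can be invoked the same way.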

Practical Routing Scenarios

  • Low-cost Batch Labeling: Route bulk image annotation/summarization to cheaper models (e.g., GPT-3.5 or a distilled vision model) and use a Step Functions Map state for parallelism. Here throughput matters more than single-request latency (orchestration approach).
  • Real-time UI / Low Latency: For chat and live dashboards, pick GPU-backed or optimized models (e.g., warm SageMaker endpoint for YOLOv8) and lower-latency LLMs (e.g., Claude Instant or GPT-3.5 Turbo) (routing strategies).
  • High-Accuracy Tasks: Legal/medical summarization goes to top-tier LLMs (e.g., GPT-4/Claude 2) and may include RAG beforehand; accept higher cost/latency for quality (OpenAI pricing).
  • Multi-Tenancy / SaaS Tiers: Route the Basic tier to smaller/faster LLMs and the Pro tier to premium/custom models, following the tiered pattern in AWS gen-AI guidance (tiered routing).
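The four scenarios above can be collapsed into a small policy function. The sketch below is one possible encoding, not a prescribed policy: the request fields, the backend labels, and the precedence order (accuracy, then batch, then real-time, then tier) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    modality: str                # "image" or "text"
    tier: str = "basic"          # "basic" or "pro"
    realtime: bool = False       # interactive UI traffic
    high_accuracy: bool = False  # e.g., legal/medical summarization
    batch: bool = False          # bulk labeling jobs


def choose_backend(req: Request) -> str:
    """Return a hypothetical backend label for the request."""
    if req.modality == "image":
        # Warm GPU endpoint for live traffic, cheaper containers otherwise.
        return "yolov8-sagemaker-warm" if req.realtime else "yolov8-fargate-batch"
    if req.high_accuracy:
        return "gpt-4"               # quality over cost and latency
    if req.batch:
        return "gpt-3.5-turbo"       # throughput over per-request latency
    if req.realtime:
        return "claude-instant"      # lower-latency LLM
    # Fall back to the tenant's plan.
    return "gpt-4" if req.tier == "pro" else "gpt-3.5-turbo"
```

For example, `choose_backend(Request(modality="text", tier="pro"))` returns `"gpt-4"`, while the same request on the basic tier returns `"gpt-3.5-turbo"`.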

Implementation Highlights

MongoDB Logging & Billing

Log every invocation for analytics and chargeback: user/tenant, model, tokens or payload size, latency, and cost. Keep a rates table for LLM tokens (e.g., GPT-4o at $5 per 1M input / $20 per 1M output tokens; treat these figures as illustrative and confirm them against the current price list, since rates change often) and for AWS compute, so you can compute costUSD per call (OpenAI pricing). MongoDB Atlas offers a free tier and low-cost paid clusters suitable for usage logs and dashboards (Atlas pricing).
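A minimal sketch of the per-call cost calculation and log record might look like this. The rates are the illustrative GPT-4o figures quoted above, and every field name other than `costUSD` is an assumption rather than a fixed schema.

```python
from datetime import datetime, timezone

# USD per 1M tokens. Illustrative figures from the text; verify before billing.
RATES_PER_1M = {
    "gpt-4o": {"input": 5.00, "output": 20.00},
}


def llm_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the token cost of one call from the rates table."""
    rate = RATES_PER_1M[model]
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    return round(cost, 6)


def build_log_record(tenant_id: str, model: str, input_tokens: int,
                     output_tokens: int, latency_ms: int) -> dict:
    """Assemble one usage-log document (hypothetical schema)."""
    return {
        "tenantId": tenant_id,
        "model": model,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "latencyMs": latency_ms,
        "costUSD": llm_cost_usd(model, input_tokens, output_tokens),
        "timestamp": datetime.now(timezone.utc),
    }


# 1,200 input tokens + 300 output tokens:
# 1200 * 5 / 1e6 + 300 * 20 / 1e6 = 0.006 + 0.006 = 0.012 USD
record = build_log_record("tenant-123", "gpt-4o", 1200, 300, 850)
```

With PyMongo, a record like this could be written with `collection.insert_one(record)`, assuming a `collection` handle for your usage-log collection.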

Monitoring Dashboard (React)

Provide live query volume, latency percentiles by backend, cost breakdown, model usage, and recent activity. Pull data from MongoDB via an internal API. This gives product, finance, and ops a shared view for routing policy tweaks (e.g., downgrading low-value traffic to cheaper models).
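The latency-percentile panel can be fed by a small aggregation over the logged records. The sketch below computes nearest-rank percentiles per backend in plain Python; the `backend` and `latencyMs` field names are assumptions, and in production the same grouping could instead be pushed down into a database query.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_backend(records: list[dict], percentiles=(50, 95)) -> dict:
    """Group latencies by backend and report the requested percentiles."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["backend"]].append(rec["latencyMs"])
    return {
        backend: {f"p{p}": percentile(latencies, p) for p in percentiles}
        for backend, latencies in grouped.items()
    }


sample = [
    {"backend": "lambda", "latencyMs": 100},
    {"backend": "lambda", "latencyMs": 400},
    {"backend": "lambda", "latencyMs": 200},
    {"backend": "lambda", "latencyMs": 300},
    {"backend": "sagemaker", "latencyMs": 40},
]
stats = latency_by_backend(sample)
```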

Pricing & Cost–Performance Trade-offs

  • OpenAI LLMs: e.g., GPT-4o at $5/1M input and $20/1M output tokens (illustrative; confirm current rates); use GPT-3.5 for cost-sensitive paths (pricing).
  • Lambda: $0.20 per 1M requests plus GB-seconds of compute; a good fit for glue code and light inference (Lambda pricing).
  • Fargate (ECS): per-second vCPU/GB billing; run containers only when needed (Fargate pricing).
  • SageMaker: always-on endpoints (e.g., ml.g4dn.xlarge ~$0.74/hr) for low-latency GPU inference; higher fixed cost but best UX for real-time vision (pricing explainer).
  • MongoDB Atlas: free tier for dev; small dedicated clusters scale with traffic (Atlas pricing).
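To make the fixed-versus-variable trade-off in the list above concrete, the sketch below estimates how many Lambda requests per month cost the same as one always-on endpoint. The request price and the endpoint hourly rate come from the text; the GB-second rate, the 730-hour month, and the 1 GB / 0.5 s workload are assumptions to verify against current pricing. It compares cost only: a CPU Lambda and a GPU endpoint are not equivalent in latency or model capacity.

```python
LAMBDA_REQUEST_USD = 0.20 / 1_000_000   # per request, from the text
LAMBDA_GB_SECOND_USD = 0.0000166667     # assumption: x86 rate, verify per region
ENDPOINT_HOURLY_USD = 0.74              # ml.g4dn.xlarge, from the text
HOURS_PER_MONTH = 730                   # assumption: average month length


def lambda_cost_per_request(memory_gb: float, duration_s: float) -> float:
    """Request fee plus compute (memory x duration) for one invocation."""
    return LAMBDA_REQUEST_USD + memory_gb * duration_s * LAMBDA_GB_SECOND_USD


def endpoint_monthly_cost(hourly_usd: float = ENDPOINT_HOURLY_USD,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Fixed cost of keeping one endpoint instance running all month."""
    return hourly_usd * hours


def break_even_requests(memory_gb: float, duration_s: float) -> float:
    """Monthly request volume at which Lambda spend equals the endpoint's."""
    return endpoint_monthly_cost() / lambda_cost_per_request(memory_gb, duration_s)


# Hypothetical workload: 1 GB of memory, 0.5 s per request.
monthly_break_even = break_even_requests(memory_gb=1.0, duration_s=0.5)
```

Under these assumptions the break-even lands in the tens of millions of requests per month, which is why the always-on endpoint is usually justified by latency requirements rather than by cost alone.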

Startup vs. Enterprise: Start lean with serverless + cheaper models; as traffic and SLA demands grow, add GPU endpoints for low latency and route premium tasks to top-tier LLMs. The Step Functions router lets you evolve policy without rewriting apps (routing strategy playbook).

URL Index

  1. Multi-LLM routing strategies on AWS
    https://aws.amazon.com/blogs/machine-learning/multi-llm-routing-strategies-for-generative-ai-applications-on-aws/
  2. Step Functions: orchestrate custom DL HPO/training/inference
    https://aws.amazon.com/blogs/machine-learning/orchestrate-custom-deep-learning-hpo-training-and-inference-using-aws-step-functions/
  3. Multi-Model Inference Workflow Orchestration (reference architecture)
    https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/multi-model-inference-workflow-orchestration-ra.pdf?did=wp_card&trk=wp_card
  4. AWS Lambda Pricing
    https://aws.amazon.com/lambda/pricing/
  5. SageMaker pricing explainer (g4dn.xlarge example)
    https://saturncloud.io/sagemaker-pricing/
  6. OpenAI API Pricing
    https://openai.com/api/pricing/
  7. AWS Fargate Pricing
    https://aws.amazon.com/fargate/pricing/
  8. MongoDB Atlas on AWS — Pricing
    https://www.mongodb.com/products/platform/atlas-cloud-providers/aws/pricing
  9. Hosting YOLOv8 on Amazon SageMaker Endpoints (how-to)
    https://aws.amazon.com/blogs/machine-learning/hosting-yolov8-pytorch-model-on-amazon-sagemaker-endpoints/
