Published

10 Feb 2024

Words

Jane Doe

Designing a Multi-Model Inference Routing System for Vision + LLM Workloads

Modern AI apps often need to route requests across multiple models, handing images to vision models (e.g., YOLOv8) and text to LLMs (e.g., GPT or Claude). A solid routing layer pairs an API gateway with an orchestrator (e.g., Amazon API Gateway + AWS Step Functions) and dispatches each request to the best backend for its cost, latency, or accuracy target. It can also use LLM-assisted routing, where a classifier LLM decides which model to call (multi-LLM routing strategies). Think of it as an “OpenRouter-style” hub for mixed vision + LLM workloads that picks the right model at the right time.

Figure: API Gateway fronts clients and invokes AWS Step Functions, which routes to backends (Lambda, ECS containers, or SageMaker endpoints) and logs usage to a DB (Step Functions orchestration; multi-model inference reference).

Architecture Overview

Expose an Inference API via Amazon API Gateway, then trigger a Step Functions state machine that implements routing logic (Choice states or LLM-assisted routing for dynamic policies) (routing patterns & tiers). Once a model is chosen, Step Functions invokes the appropriate backend:

AWS Lambda for lightweight, stateless inference (calling OpenAI/Anthropic APIs or running small CPU vision models). It is pay-per-use and cheap (e.g., $0.20 per 1M requests plus GB-seconds of compute) (Lambda pricing).
Amazon ECS or a SageMaker endpoint for heavy models. YOLOv8 can run in a container on ECS (Fargate for CPU inference; GPU tasks need EC2-backed capacity, because Fargate does not offer GPUs) or as a SageMaker endpoint (e.g., ml.g4dn.xlarge at ~$0.74/hr) (SageMaker instance pricing explainer; YOLOv8 on SageMaker guide). Use a Parallel state to run different models side by side, or a Map state to fan out over a batch of items, then aggregate the results (multi-model orchestration).
External LLM APIs (OpenAI, Anthropic) when policy dictates; Step Functions tasks can make outbound calls as part of the flow.
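As a rough illustration of the Choice-based routing described above, the sketch below writes a small Amazon States Language definition as a Python dict and checks that every transition points at a defined state. The state names, the `$.inputType` field, and the Lambda function names (`vision-inference-fn`, `llm-proxy-fn`) are hypothetical placeholders, not part of any real deployment.

```python
import json

# Illustrative Amazon States Language definition, written as a Python dict.
# State names, function names, and the $.inputType field are hypothetical.
STATE_MACHINE = {
    "Comment": "Route an inference request by input type (sketch)",
    "StartAt": "RouteByInputType",
    "States": {
        "RouteByInputType": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.inputType", "StringEquals": "image", "Next": "RunVisionModel"},
                {"Variable": "$.inputType", "StringEquals": "text", "Next": "CallLlm"},
            ],
            "Default": "RejectRequest",
        },
        "RunVisionModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "vision-inference-fn", "Payload.$": "$"},
            "End": True,
        },
        "CallLlm": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "llm-proxy-fn", "Payload.$": "$"},
            "End": True,
        },
        "RejectRequest": {
            "Type": "Fail",
            "Error": "UnsupportedInputType",
            "Cause": "inputType must be 'image' or 'text'",
        },
    },
}


def dangling_targets(definition: dict) -> set:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return targets - set(states)


if __name__ == "__main__":
    assert not dangling_targets(STATE_MACHINE)
    print(json.dumps(STATE_MACHINE, indent=2))
```

The printed JSON could serve as a starting point for a state machine definition. A real workflow would likely add Retry/Catch blocks and a logging step after each Task.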

Key points: routing logic lives in Step Functions (Choice or LLM-assisted) (routing options & trade-offs); parallel inference runs, say, YOLOv8 and CLIP/LLM in tandem via Parallel or Map states (reference pattern); elasticity comes from Lambda, ECS, and SageMaker autoscaling; and monitoring comes from logging each step to a database for analytics and billing (end-to-end Step Functions orchestration).

Practical Routing Scenarios

Low-cost Batch Labeling: Route bulk image annotation/summarization to cheaper models (e.g., GPT-3.5 or a distilled vision model) and use a Step Functions Map state for parallelism. Throughput matters more than single-request latency here (orchestration approach).
Real-time UI / Low Latency: For chat and live dashboards, pick GPU-backed or optimized models (e.g., a warm SageMaker endpoint for YOLOv8) and lower-latency LLMs (e.g., Claude Instant or GPT-3.5 Turbo) (routing strategies).
High-Accuracy Tasks: Send legal/medical summarization to top-tier LLMs (e.g., GPT-4/Claude 2), optionally with a RAG step beforehand; accept higher cost and latency in exchange for quality (OpenAI pricing).
Multi-Tenancy / SaaS Tiers: Route the Basic tier to smaller, faster LLMs and the Pro tier to premium or custom models, following the example pattern in AWS gen-AI guidance (tiered routing).
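The four scenarios above can be folded into one small policy function. The sketch below is one possible encoding, not a prescribed one: the `Request` fields, the backend labels, and the order of precedence (accuracy first, then latency, then tier) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tenant_tier: str     # "basic" or "pro" (hypothetical tier names)
    input_type: str      # "image" or "text"
    realtime: bool       # True for chat and live dashboards
    high_accuracy: bool  # True for legal/medical style tasks


def choose_backend(req: Request) -> str:
    """Return a hypothetical backend label for the request."""
    if req.input_type == "image":
        # Real-time vision goes to a warm GPU endpoint, bulk work to a batch path.
        return "vision-gpu-endpoint" if req.realtime else "vision-batch"
    if req.input_type != "text":
        raise ValueError(f"unsupported input type: {req.input_type!r}")
    if req.high_accuracy:
        return "llm-top-tier"     # accept higher cost and latency for quality
    if req.realtime:
        return "llm-low-latency"  # faster, smaller model for interactive use
    if req.tenant_tier == "pro":
        return "llm-premium"      # tiered routing for paying tenants
    return "llm-economy"          # default: cheapest acceptable model


if __name__ == "__main__":
    print(choose_backend(Request("basic", "text", realtime=False, high_accuracy=False)))
```

Keeping the policy in one pure function makes it easy to unit-test and to mirror later as Choice rules in the state machine.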

Implementation Highlights

Step Functions: Define Choice and Map states plus Retry/Catch error handling; each Task calls Lambda or SageMaker/ECS. Update status and cost logs at each step (Step Functions orchestration).
Model Backends:
- Lambda to call OpenAI/Claude or run small CPU vision models (Lambda pricing).
- ECS for heavy vision; Fargate bills per vCPU-second and GB-second but does not offer GPUs, so GPU containers need ECS on EC2 GPU instances (Fargate pricing).
- SageMaker Endpoint for managed GPU inference (e.g., YOLOv8 on ml.g4dn.xlarge) (YOLOv8 on SageMaker; SageMaker costs overview).
Parallel Inference: Use a Parallel state to run different models concurrently (e.g., vision + LLM), or a Map state to fan out over a batch of items, and then join the results (reference architecture).
Error Handling: Use Catch/Retry in Step Functions; on failure, log errorCause and return a fallback (orchestration pattern).
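To make the fan-out and retry-then-fallback behaviour concrete, here is a plain-Python analogue that runs locally with the standard library. It is not Step Functions itself. The helper names (`call_with_retry`, `fan_out`) and the result shape are assumptions, with `errorCause` borrowed from the text above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def call_with_retry(fn: Callable[[Any], Any], payload: Any,
                    attempts: int = 3, fallback: Any = None) -> dict:
    """Call fn(payload) up to `attempts` times, then return the fallback."""
    last_error = None
    for _ in range(attempts):
        try:
            return {"ok": True, "result": fn(payload), "errorCause": None}
        except Exception as exc:  # broad on purpose: mirrors a catch-all handler
            last_error = exc
    return {"ok": False, "result": fallback, "errorCause": repr(last_error)}


def fan_out(backends: Dict[str, Callable[[Any], Any]], payload: Any,
            attempts: int = 3) -> Dict[str, dict]:
    """Run every backend on the same payload concurrently and join the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {
            name: pool.submit(call_with_retry, fn, payload, attempts)
            for name, fn in backends.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    def fake_vision(payload):  # stand-in for a YOLOv8 call
        return {"boxes": 2}

    def fake_llm(payload):     # stand-in for an LLM call that always fails
        raise TimeoutError("upstream timeout")

    print(fan_out({"vision": fake_vision, "llm": fake_llm}, {"image": "cat.jpg"}))
```

In the managed workflow, the per-branch retry would correspond to a Retry block and the fallback to a Catch transition. This sketch only shows the control flow.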

MongoDB Logging & Billing

Log every invocation for analytics and chargeback (user/tenant, model, tokens/size, latency, cost). Keep a rates table for LLM tokens (e.g., GPT-4o at $5 per 1M input and $20 per 1M output tokens; provider rates change, so refresh the table from the pricing page) and for AWS compute, so you can compute costUSD per call (OpenAI pricing). MongoDB Atlas offers a free tier and low-cost paid clusters suitable for usage logs and dashboards (Atlas pricing).
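A minimal sketch of the per-call cost calculation follows, using the example token rates quoted above. The rates table layout, the function names, and every field in the log record except `costUSD` are illustrative assumptions rather than a fixed schema.

```python
from datetime import datetime, timezone

# Example rates in USD per 1M tokens, taken from the figures quoted in the text.
# Treat them as placeholders and refresh them from the provider's pricing page.
RATES_USD_PER_1M_TOKENS = {
    "gpt-4o": {"input": 5.00, "output": 20.00},
}


def llm_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call given token counts and the rates table."""
    rate = RATES_USD_PER_1M_TOKENS[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


def build_log_record(tenant_id: str, model: str, input_tokens: int,
                     output_tokens: int, latency_ms: float) -> dict:
    """Assemble one usage-log document (field names are hypothetical)."""
    return {
        "tenantId": tenant_id,
        "model": model,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "latencyMs": latency_ms,
        "costUSD": round(llm_cost_usd(model, input_tokens, output_tokens), 6),
        "createdAt": datetime.now(timezone.utc),
    }


if __name__ == "__main__":
    print(build_log_record("tenant-42", "gpt-4o", 1000, 500, 840.0))
```

With these example rates, a call with 1,000 input and 500 output tokens costs (1,000 × 5 + 500 × 20) / 1,000,000 = $0.015. The resulting dict could then be written to a MongoDB collection, for instance with pymongo's `insert_one`.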

Monitoring Dashboard (React)

The dashboard shows live query volume, latency percentiles by backend, cost breakdown, model usage, and recent activity, pulling its data from MongoDB via an internal API. This gives product, finance, and ops a shared view for tuning routing policy (e.g., downgrading low-value traffic to cheaper models).
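The internal API behind such a dashboard might aggregate log records roughly as sketched below. This assumes each record carries `backend`, `latencyMs`, and `costUSD` fields (hypothetical names) and uses the nearest-rank method for percentiles. A production version would more likely push the aggregation into the database.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_by_backend(records: Iterable[dict]) -> Dict[str, dict]:
    """Group usage logs by backend and compute count, p50/p95 latency, and cost."""
    latencies = defaultdict(list)
    costs = defaultdict(float)
    for rec in records:
        latencies[rec["backend"]].append(rec["latencyMs"])
        costs[rec["backend"]] += rec["costUSD"]
    return {
        backend: {
            "count": len(values),
            "p50Ms": percentile(values, 50),
            "p95Ms": percentile(values, 95),
            "costUSD": round(costs[backend], 6),
        }
        for backend, values in latencies.items()
    }


if __name__ == "__main__":
    sample = [
        {"backend": "lambda", "latencyMs": 120, "costUSD": 0.002},
        {"backend": "lambda", "latencyMs": 95, "costUSD": 0.002},
        {"backend": "sagemaker", "latencyMs": 35, "costUSD": 0.0004},
    ]
    print(summarize_by_backend(sample))
```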

Pricing & Cost–Performance Trade-offs

OpenAI LLMs: e.g., GPT-4o at $5 per 1M input and $20 per 1M output tokens; use GPT-3.5 for cost-sensitive paths (pricing).
Lambda: $0.20 per 1M requests plus GB-seconds of compute; great for glue code and light inference (Lambda pricing).
Fargate (ECS): per-second vCPU/GB billing; run containers only when needed (Fargate pricing).
SageMaker: always-on endpoints (e.g., ml.g4dn.xlarge ~$0.74/hr) for low-latency GPU inference; higher fixed cost but best UX for real-time vision (pricing explainer).
MongoDB Atlas: free tier for dev; small dedicated clusters scale with traffic (Atlas pricing).

Startup vs. Enterprise: Start lean with serverless + cheaper models; as traffic and SLA demands grow, add GPU endpoints for low latency and route premium tasks to top-tier LLMs. The Step Functions router lets you evolve policy without rewriting apps (routing strategy playbook).
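One way to decide when to add an always-on endpoint is a rough break-even estimate against pay-per-use Lambda. The sketch below uses the $0.74/hr and $0.20 per 1M request figures from the text. The 730-hour month, the Lambda GB-second rate, and the 2 GB / 1 s workload profile are assumptions to replace with your own numbers. It compares cost only, not latency or GPU capability.

```python
HOURS_PER_MONTH = 730                      # assumed average month length
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000  # request charge quoted in the text
LAMBDA_USD_PER_GB_SECOND = 0.0000166667    # assumed rate; check the pricing page


def endpoint_monthly_usd(hourly_usd: float) -> float:
    """Fixed monthly cost of one always-on endpoint instance."""
    return hourly_usd * HOURS_PER_MONTH


def lambda_request_usd(memory_gb: float, duration_s: float) -> float:
    """Cost of one Lambda invocation: request charge plus compute charge."""
    return LAMBDA_USD_PER_REQUEST + memory_gb * duration_s * LAMBDA_USD_PER_GB_SECOND


def break_even_requests(hourly_usd: float, memory_gb: float, duration_s: float) -> float:
    """Monthly request volume at which Lambda spend equals the endpoint's fixed cost."""
    return endpoint_monthly_usd(hourly_usd) / lambda_request_usd(memory_gb, duration_s)


if __name__ == "__main__":
    monthly = endpoint_monthly_usd(0.74)
    volume = break_even_requests(0.74, memory_gb=2.0, duration_s=1.0)
    print(f"endpoint: ${monthly:.2f}/month, break-even near {volume:,.0f} requests/month")
```

Under these assumptions the endpoint costs about $540 per month, which works out to roughly 16 million Lambda requests of that size. Below that volume the serverless path is cheaper on compute alone.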

URL Index

Multi-LLM routing strategies on AWS
https://aws.amazon.com/blogs/machine-learning/multi-llm-routing-strategies-for-generative-ai-applications-on-aws/
Step Functions: orchestrate custom DL HPO/training/inference
https://aws.amazon.com/blogs/machine-learning/orchestrate-custom-deep-learning-hpo-training-and-inference-using-aws-step-functions/
Multi-Model Inference Workflow Orchestration (reference architecture)
https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/multi-model-inference-workflow-orchestration-ra.pdf?did=wp_card&trk=wp_card
AWS Lambda Pricing
https://aws.amazon.com/lambda/pricing/
SageMaker pricing explainer (g4dn.xlarge example)
https://saturncloud.io/sagemaker-pricing/
OpenAI API Pricing
https://openai.com/api/pricing/
AWS Fargate Pricing
https://aws.amazon.com/fargate/pricing/
MongoDB Atlas on AWS — Pricing
https://www.mongodb.com/products/platform/atlas-cloud-providers/aws/pricing
Hosting YOLOv8 on Amazon SageMaker Endpoints (how-to)
https://aws.amazon.com/blogs/machine-learning/hosting-yolov8-pytorch-model-on-amazon-sagemaker-endpoints/
