AWS Kinesis Video Streams and Real-Time Computer Vision Architecture (What to Build First)
Key Takeaways
- Real-time CV reliability depends on ingest, buffering, and event routing.
- Start with measurable event definitions and latency targets.
- Use Kinesis Video Streams for ingest and containers for inference.
- Store events and evidence with enough metadata to audit and improve.
Real-time computer vision projects often fail for non-ML reasons: stream reliability, buffering, latency, and alert routing. This post outlines a practical AWS architecture for analyzing camera streams and producing reliable events.
Start with the outcome, not the model
Define:
- What event matters (person detected, loitering, PPE compliance)
- Alert channel (webhook, SMS, incident system)
- Latency target (p95)
- False positive tolerance
Without these, tuning is endless.
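To make those answers concrete before any modeling work, it helps to write them down as a small config the pipeline can enforce. A minimal sketch follows; the event name, fields, and thresholds are illustrative assumptions, not a required schema.

```python
# Illustrative event definition; field names and thresholds are assumptions
# to adapt, not a required schema.
EVENT_DEFINITIONS = {
    "ppe_violation": {
        "description": "Person detected without a hard hat in a marked zone",
        "classes": ["person", "no_hard_hat"],   # model classes that must co-occur
        "min_confidence": 0.80,                 # below this, do not alert
        "min_duration_s": 3,                    # must persist to suppress flicker
        "latency_target_p95_ms": 2000,          # camera-to-alert budget
        "alert_channel": "webhook",             # webhook | sms | incident_system
        "max_false_positives_per_day": 5,       # agreed tolerance for reviewers
    },
}
```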
Core AWS building blocks
Video ingest
Kinesis Video Streams receives the camera feeds (RTSP via a gateway or the producer SDK), buffers fragments, and exposes them for time-indexed retrieval by consumers.
Compute
Common options:
- ECS on Fargate for containerized inference
- EC2 with GPUs for heavier models
- EKS when you already run Kubernetes
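For the Fargate route, the inference worker is just a container image plus a task definition. The sketch below registers a minimal CPU task with boto3; the image URI, role ARN, sizes, and log settings are placeholders, and GPU-backed models would instead run on EC2 or EKS since Fargate tasks do not attach GPUs.

```python
import boto3

ecs = boto3.client("ecs")

# Minimal Fargate task definition for a CPU inference worker
# (placeholder names, sizes, and ARNs).
ecs.register_task_definition(
    family="cv-inference-worker",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",      # 1 vCPU
    memory="2048",   # 2 GB
    executionRoleArn="arn:aws:iam::123456789012:role/cvTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "inference",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/cv-inference:latest",
            "essential": True,
            "environment": [{"name": "STREAM_NAME", "value": "front-gate-camera"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/cv-inference",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "inference",
                },
            },
        }
    ],
)
```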
Event routing
Once an event is produced, route it with EventBridge, SNS, or SQS to the alert channel you defined up front (webhook, SMS, incident system), keeping delivery and retries out of the inference path.
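As a minimal sketch, assuming a custom EventBridge bus named cv-events, the inference worker can publish each detection and let EventBridge rules fan out to the chosen channel; the source string and event shape are assumptions.

```python
import json
import boto3

events = boto3.client("events")

def publish_detection(event: dict, bus_name: str = "cv-events") -> None:
    """Send one detection event to a custom EventBridge bus (assumed name)."""
    events.put_events(
        Entries=[
            {
                "EventBusName": bus_name,
                "Source": "cv.pipeline",            # assumed source string
                "DetailType": event["event_type"],  # e.g. "ppe_violation"
                "Detail": json.dumps(event),
            }
        ]
    )

publish_detection(
    {
        "event_type": "ppe_violation",
        "camera_id": "front-gate-01",
        "confidence": 0.91,
        "evidence_s3_key": "evidence/front-gate-01/2024-05-01T12-00-03.jpg",
    }
)
```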
Model layer options
- Managed: Amazon Rekognition for common detection use cases
- Custom: YOLO family models, fine-tuned on your data
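For the managed path, one Rekognition call per sampled frame is often enough to validate an event definition before training anything custom. A small sketch, assuming JPEG frame bytes and a person-detection use case:

```python
import boto3

rekognition = boto3.client("rekognition")

def detect_people(jpeg_bytes: bytes, min_confidence: float = 80.0) -> list[dict]:
    """Run Rekognition label detection on one frame and keep person detections."""
    response = rekognition.detect_labels(
        Image={"Bytes": jpeg_bytes},
        MaxLabels=25,
        MinConfidence=min_confidence,
    )
    # "Person" labels carry per-instance bounding boxes in the Instances field.
    return [
        label for label in response["Labels"]
        if label["Name"] == "Person" and label.get("Instances")
    ]
```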
A minimal production pipeline
- Ingest RTSP or camera feed into Kinesis Video Streams
- Consumer service pulls fragments
- Decode frames using FFmpeg
- Run inference on sampled frames
- Apply post-processing rules, tracking, and smoothing
- Emit event with confidence and evidence snapshot
- Persist events and review artifacts
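A compressed sketch of the first four steps, assuming the stream already receives fragments from a producer: resolve the GET_MEDIA endpoint, read MKV fragments from Kinesis Video Streams, pipe them through FFmpeg into raw frames, and run inference on every Nth frame. The stream name, resolution, sampling rate, and run_inference hook are placeholders.

```python
import subprocess
import threading
import boto3

STREAM_NAME = "front-gate-camera"   # placeholder stream name
WIDTH, HEIGHT = 1280, 720           # assumed decode resolution
SAMPLE_EVERY_N = 10                 # analyze roughly 1 in 10 decoded frames

def run_inference(frame_bytes: bytes) -> None:
    """Placeholder for the model call (Rekognition or a custom detector)."""

# 1. Resolve the endpoint that serves GetMedia for this stream.
kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="GET_MEDIA"
)["DataEndpoint"]
media = boto3.client("kinesis-video-media", endpoint_url=endpoint)

# 2. Start reading MKV fragments from "now".
payload = media.get_media(
    StreamName=STREAM_NAME,
    StartSelector={"StartSelectorType": "NOW"},
)["Payload"]

# 3. FFmpeg decodes the MKV container from stdin into raw BGR frames on stdout.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", "pipe:0",
     "-f", "rawvideo", "-pix_fmt", "bgr24", "-s", f"{WIDTH}x{HEIGHT}", "pipe:1"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def feed_decoder() -> None:
    """Copy fragment bytes from Kinesis Video Streams into FFmpeg."""
    while True:
        chunk = payload.read(16 * 1024)
        if not chunk:
            break
        ffmpeg.stdin.write(chunk)
    ffmpeg.stdin.close()

threading.Thread(target=feed_decoder, daemon=True).start()

# 4. Read decoded frames and run inference on every Nth one.
frame_size = WIDTH * HEIGHT * 3
frame_index = 0
while True:
    frame = ffmpeg.stdout.read(frame_size)
    if len(frame) < frame_size:
        break
    frame_index += 1
    if frame_index % SAMPLE_EVERY_N == 0:
        run_inference(frame)
```

A production consumer would also handle reconnects and carry fragment timestamps forward so ingest lag can be measured.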
Storing events and evidence in MongoDB
MongoDB works well for event records and review workflows:
- camera_events: event type, timestamps, confidence, references
- evidence: S3 pointers to snapshots or clips
- alert_deliveries: webhook attempts and retries
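A minimal PyMongo sketch of those collections, writing one event, its evidence pointer, and a delivery record; the connection string, database name, and field names are illustrative assumptions.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
db = client["cv_events"]

# Index the common review query: latest events per camera.
db.camera_events.create_index([("camera_id", 1), ("detected_at", DESCENDING)])

evidence = {
    "s3_bucket": "cv-evidence-bucket",   # placeholder bucket
    "s3_key": "front-gate-01/2024-05-01T12-00-03.jpg",
    "kind": "snapshot",
}
evidence_id = db.evidence.insert_one(evidence).inserted_id

event = {
    "event_type": "ppe_violation",
    "camera_id": "front-gate-01",
    "detected_at": datetime.now(timezone.utc),
    "confidence": 0.91,
    "evidence_id": evidence_id,
    "review_status": "pending",
}
event_id = db.camera_events.insert_one(event).inserted_id

# Track each webhook attempt so retries and failures stay auditable.
db.alert_deliveries.insert_one({
    "event_id": event_id,
    "channel": "webhook",
    "attempt": 1,
    "status": "delivered",
    "delivered_at": datetime.now(timezone.utc),
})
```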
Observability and reliability
Do not ship without:
- Metrics: ingest lag, frames processed per second, inference latency
- Logs with correlation IDs
- Replay capability for incident review
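A sketch of the metrics and logging side, assuming CloudWatch custom metrics and JSON logs: each processed frame publishes latency and throughput data points and logs a correlation ID that later ties the event, evidence, and alert delivery together. The namespace, dimensions, and field names are assumptions.

```python
import json
import logging
import time
import uuid

import boto3

cloudwatch = boto3.client("cloudwatch")
logger = logging.getLogger("cv-pipeline")

def process_frame(camera_id: str, frame_bytes: bytes) -> None:
    correlation_id = str(uuid.uuid4())   # ties logs, events, and alerts together
    started = time.monotonic()

    # ... decode frame_bytes and run inference here ...

    inference_ms = (time.monotonic() - started) * 1000.0

    cloudwatch.put_metric_data(
        Namespace="CVPipeline",          # assumed namespace
        MetricData=[
            {
                "MetricName": "InferenceLatency",
                "Value": inference_ms,
                "Unit": "Milliseconds",
                "Dimensions": [{"Name": "CameraId", "Value": camera_id}],
            },
            {
                "MetricName": "FramesProcessed",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "CameraId", "Value": camera_id}],
            },
        ],
    )

    logger.info(json.dumps({
        "correlation_id": correlation_id,
        "camera_id": camera_id,
        "inference_ms": round(inference_ms, 1),
    }))
```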
FAQs
Q: Do we need to analyze every frame?
Usually no. Sampling plus tracking can meet most outcomes while controlling cost.
Q: When should we use Rekognition vs a custom model?
Rekognition is a great baseline for common detections. Custom models win when your environment is specific, you need higher accuracy, or you need custom classes.
Q: How do we handle multiple cameras and scaling?
Use a queue or partitioning strategy, and scale inference workers horizontally. Keep per-camera state separate.
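To make "per-camera state" concrete, here is an illustrative debouncer that emits an event only after a few consecutive positive samples for a given camera; the threshold is an assumption to tune against your false positive tolerance.

```python
from collections import defaultdict

class PerCameraDebouncer:
    """Emit an event only after N consecutive positive samples for a camera."""

    def __init__(self, required_consecutive: int = 3):
        self.required = required_consecutive
        self.streaks = defaultdict(int)   # camera_id -> current positive streak

    def observe(self, camera_id: str, positive: bool) -> bool:
        """Return True when an event should be emitted for this camera."""
        if not positive:
            self.streaks[camera_id] = 0
            return False
        self.streaks[camera_id] += 1
        # Fire once, when the streak first reaches the threshold.
        return self.streaks[camera_id] == self.required

debouncer = PerCameraDebouncer(required_consecutive=3)
if debouncer.observe("front-gate-01", positive=True):
    pass  # emit the event (see Event routing above)
```

Partitioning cameras across workers, for example by routing each camera's work through a queue owned by one worker group, keeps that state local and lets inference scale horizontally.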