AWS Kinesis Video Streams and Real-Time Computer Vision Architecture (What to Build First)
Key Takeaways
- Real-time CV reliability depends on ingest, buffering, and event routing.
- Start with measurable event definitions and latency targets.
- Use Kinesis Video Streams for ingest and containers for inference.
- Store events and evidence with enough metadata to audit and improve.
Real-time computer vision projects often fail for non-ML reasons: stream reliability, buffering, latency, and alert routing. This post outlines a practical AWS architecture for analyzing camera streams and producing reliable events.
Start with the outcome, not the model
Define:
- What event matters (person detected, loitering, PPE compliance)
- Alert channel (webhook, SMS, incident system)
- Latency target (p95)
- False positive tolerance
Without these, tuning is endless.
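To make those answers concrete before any modeling work, it helps to write them down as a small config the pipeline can enforce. A minimal sketch follows; the event name, fields, and thresholds are illustrative assumptions, not a required schema.

```python
# Illustrative event definition; field names and thresholds are assumptions
# to adapt, not a required schema.
EVENT_DEFINITIONS = {
    "ppe_violation": {
        "description": "Person detected without a hard hat in a marked zone",
        "classes": ["person", "no_hard_hat"],   # model classes that must co-occur
        "min_confidence": 0.80,                 # below this, do not alert
        "min_duration_s": 3,                    # must persist to suppress flicker
        "latency_target_p95_ms": 2000,          # camera-to-alert budget
        "alert_channel": "webhook",             # webhook | sms | incident_system
        "max_false_positives_per_day": 5,       # agreed tolerance for reviewers
    },
}
```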
Core AWS building blocks
Video ingest
Kinesis Video Streams receives the camera feeds (RTSP via a gateway or the producer SDK), buffers fragments, and exposes them for time-indexed retrieval by consumers.
Compute
Common options:
- ECS on Fargate for containerized inference
- EC2 with GPUs for heavier models
- EKS when you already run Kubernetes
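For the Fargate route, the inference worker is just a container image plus a task definition. The sketch below registers a minimal CPU task with boto3; the image URI, role ARN, sizes, and log settings are placeholders, and GPU-backed models would instead run on EC2 or EKS since Fargate tasks do not attach GPUs.

```python
import boto3

ecs = boto3.client("ecs")

# Minimal Fargate task definition for a CPU inference worker
# (placeholder names, sizes, and ARNs).
ecs.register_task_definition(
    family="cv-inference-worker",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",      # 1 vCPU
    memory="2048",   # 2 GB
    executionRoleArn="arn:aws:iam::123456789012:role/cvTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "inference",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/cv-inference:latest",
            "essential": True,
            "environment": [{"name": "STREAM_NAME", "value": "front-gate-camera"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/cv-inference",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "inference",
                },
            },
        }
    ],
)
```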
Event routing
Once an event is produced, route it with EventBridge, SNS, or SQS to the alert channel you defined up front (webhook, SMS, incident system), keeping delivery and retries out of the inference path.
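As a minimal sketch, assuming a custom EventBridge bus named cv-events, the inference worker can publish each detection and let EventBridge rules fan out to the chosen channel; the source string and event shape are assumptions.

```python
import json
import boto3

events = boto3.client("events")

def publish_detection(event: dict, bus_name: str = "cv-events") -> None:
    """Send one detection event to a custom EventBridge bus (assumed name)."""
    events.put_events(
        Entries=[
            {
                "EventBusName": bus_name,
                "Source": "cv.pipeline",            # assumed source string
                "DetailType": event["event_type"],  # e.g. "ppe_violation"
                "Detail": json.dumps(event),
            }
        ]
    )

publish_detection(
    {
        "event_type": "ppe_violation",
        "camera_id": "front-gate-01",
        "confidence": 0.91,
        "evidence_s3_key": "evidence/front-gate-01/2024-05-01T12-00-03.jpg",
    }
)
```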
Model layer options
- Managed: Amazon Rekognition for common detection use cases
- Custom: YOLO family models, fine-tuned on your data
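For the managed path, one Rekognition call per sampled frame is often enough to validate an event definition before training anything custom. A small sketch, assuming JPEG frame bytes and a person-detection use case:

```python
import boto3

rekognition = boto3.client("rekognition")

def detect_people(jpeg_bytes: bytes, min_confidence: float = 80.0) -> list[dict]:
    """Run Rekognition label detection on one frame and keep person detections."""
    response = rekognition.detect_labels(
        Image={"Bytes": jpeg_bytes},
        MaxLabels=25,
        MinConfidence=min_confidence,
    )
    # "Person" labels carry per-instance bounding boxes in the Instances field.
    return [
        label for label in response["Labels"]
        if label["Name"] == "Person" and label.get("Instances")
    ]
```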
A minimal production pipeline
- Ingest RTSP or camera feed into Kinesis Video Streams
- Consumer service pulls fragments
- Decode frames using FFmpeg
- Run inference on sampled frames
- Apply post-processing rules, tracking, and smoothing
- Emit event with confidence and evidence snapshot
- Persist events and review artifacts
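A compressed sketch of the first four steps, assuming the stream already receives fragments from a producer: resolve the GET_MEDIA endpoint, read MKV fragments from Kinesis Video Streams, pipe them through FFmpeg into raw frames, and run inference on every Nth frame. The stream name, resolution, sampling rate, and run_inference hook are placeholders.

```python
import subprocess
import threading
import boto3

STREAM_NAME = "front-gate-camera"   # placeholder stream name
WIDTH, HEIGHT = 1280, 720           # assumed decode resolution
SAMPLE_EVERY_N = 10                 # analyze roughly 1 in 10 decoded frames

def run_inference(frame_bytes: bytes) -> None:
    """Placeholder for the model call (Rekognition or a custom detector)."""

# 1. Resolve the endpoint that serves GetMedia for this stream.
kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="GET_MEDIA"
)["DataEndpoint"]
media = boto3.client("kinesis-video-media", endpoint_url=endpoint)

# 2. Start reading MKV fragments from "now".
payload = media.get_media(
    StreamName=STREAM_NAME,
    StartSelector={"StartSelectorType": "NOW"},
)["Payload"]

# 3. FFmpeg decodes the MKV container from stdin into raw BGR frames on stdout.
ffmpeg = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", "pipe:0",
     "-f", "rawvideo", "-pix_fmt", "bgr24", "-s", f"{WIDTH}x{HEIGHT}", "pipe:1"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

def feed_decoder() -> None:
    """Copy fragment bytes from Kinesis Video Streams into FFmpeg."""
    while True:
        chunk = payload.read(16 * 1024)
        if not chunk:
            break
        ffmpeg.stdin.write(chunk)
    ffmpeg.stdin.close()

threading.Thread(target=feed_decoder, daemon=True).start()

# 4. Read decoded frames and run inference on every Nth one.
frame_size = WIDTH * HEIGHT * 3
frame_index = 0
while True:
    frame = ffmpeg.stdout.read(frame_size)
    if len(frame) < frame_size:
        break
    frame_index += 1
    if frame_index % SAMPLE_EVERY_N == 0:
        run_inference(frame)
```

A production consumer would also handle reconnects and carry fragment timestamps forward so ingest lag can be measured.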
Storing events and evidence in MongoDB
MongoDB works well for event records and review workflows:
- camera_events: event type, timestamps, confidence, references
- evidence: S3 pointers to snapshots or clips
- alert_deliveries: webhook attempts and retries
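A minimal PyMongo sketch of those collections, writing one event, its evidence pointer, and a delivery record; the connection string, database name, and field names are illustrative assumptions.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
db = client["cv_events"]

# Index the common review query: latest events per camera.
db.camera_events.create_index([("camera_id", 1), ("detected_at", DESCENDING)])

evidence = {
    "s3_bucket": "cv-evidence-bucket",   # placeholder bucket
    "s3_key": "front-gate-01/2024-05-01T12-00-03.jpg",
    "kind": "snapshot",
}
evidence_id = db.evidence.insert_one(evidence).inserted_id

event = {
    "event_type": "ppe_violation",
    "camera_id": "front-gate-01",
    "detected_at": datetime.now(timezone.utc),
    "confidence": 0.91,
    "evidence_id": evidence_id,
    "review_status": "pending",
}
event_id = db.camera_events.insert_one(event).inserted_id

# Track each webhook attempt so retries and failures stay auditable.
db.alert_deliveries.insert_one({
    "event_id": event_id,
    "channel": "webhook",
    "attempt": 1,
    "status": "delivered",
    "delivered_at": datetime.now(timezone.utc),
})
```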
Observability and reliability
Do not ship without:
- Metrics: ingest lag, frames processed per second, inference latency
- Logs with correlation IDs
- Replay capability for incident review
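A sketch of the metrics and logging side, assuming CloudWatch custom metrics and JSON logs: each processed frame publishes latency and throughput data points and logs a correlation ID that later ties the event, evidence, and alert delivery together. The namespace, dimensions, and field names are assumptions.

```python
import json
import logging
import time
import uuid

import boto3

cloudwatch = boto3.client("cloudwatch")
logger = logging.getLogger("cv-pipeline")

def process_frame(camera_id: str, frame_bytes: bytes) -> None:
    correlation_id = str(uuid.uuid4())   # ties logs, events, and alerts together
    started = time.monotonic()

    # ... decode frame_bytes and run inference here ...

    inference_ms = (time.monotonic() - started) * 1000.0

    cloudwatch.put_metric_data(
        Namespace="CVPipeline",          # assumed namespace
        MetricData=[
            {
                "MetricName": "InferenceLatency",
                "Value": inference_ms,
                "Unit": "Milliseconds",
                "Dimensions": [{"Name": "CameraId", "Value": camera_id}],
            },
            {
                "MetricName": "FramesProcessed",
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "CameraId", "Value": camera_id}],
            },
        ],
    )

    logger.info(json.dumps({
        "correlation_id": correlation_id,
        "camera_id": camera_id,
        "inference_ms": round(inference_ms, 1),
    }))
```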
FAQs
Q: Do we need to analyze every frame?
Usually no. Sampling plus tracking can meet most outcomes while controlling cost.
Q: When should we use Rekognition vs a custom model?
Rekognition is a great baseline for common detections. Custom models win when your environment is specific, you need higher accuracy, or you need custom classes.
Q: How do we handle multiple cameras and scaling?
Use a queue or partitioning strategy, and scale inference workers horizontally. Keep per-camera state separate.
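To make "per-camera state" concrete, here is an illustrative debouncer that emits an event only after a few consecutive positive samples for a given camera; the threshold is an assumption to tune against your false positive tolerance.

```python
from collections import defaultdict

class PerCameraDebouncer:
    """Emit an event only after N consecutive positive samples for a camera."""

    def __init__(self, required_consecutive: int = 3):
        self.required = required_consecutive
        self.streaks = defaultdict(int)   # camera_id -> current positive streak

    def observe(self, camera_id: str, positive: bool) -> bool:
        """Return True when an event should be emitted for this camera."""
        if not positive:
            self.streaks[camera_id] = 0
            return False
        self.streaks[camera_id] += 1
        # Fire once, when the streak first reaches the threshold.
        return self.streaks[camera_id] == self.required

debouncer = PerCameraDebouncer(required_consecutive=3)
if debouncer.observe("front-gate-01", positive=True):
    pass  # emit the event (see Event routing above)
```

Partitioning cameras across workers, for example by routing each camera's work through a queue owned by one worker group, keeps that state local and lets inference scale horizontally.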