On-device ML lets modern iOS apps analyze and organize photos without sending images to a server. Apple’s own Photos app “uses a number of machine learning algorithms, running privately on-device,” to power features like People and Memories (private knowledge graphs of people/places/things) (Apple ML Research). Keeping inference local means images never leave the device—great for latency, offline use, and privacy/GDPR risk reduction (Fritz: on-device benefits; Apple ML Research). By contrast, cloud-only processing, as in the 2019 FaceApp surge, drew public concern precisely because users’ faces were sent to remote servers (Fritz: FaceApp discussion).
Core idea: train/distill heavy models off-device, ship a compact Core ML model to iOS, compute embeddings locally, and do similarity search & clustering on-device. Optionally sync embeddings/labels (not raw photos) to the cloud for cross-device personalization.
Why on-device?
- Privacy by default: images remain on the phone; only optional features/embeddings may sync. Less GDPR/PII exposure, fewer breach vectors (Fritz; Apple ML Research).
- Latency & offline: Apple Neural Engine accelerates inference with near-instant response and no round-trip delays (Fritz).
- Personalization: build a private, on-device knowledge graph of people/places/things that powers naming, clustering, deduping, and “Memories” (Apple ML Research).
System at a glance
- Teacher model (CLIP/ViT) → Distilled student (MobileNet/Tiny ViT) → Convert to Core ML → On-device embeddings (Vision/Core ML) → Local index & clustering (SQLite/Core Data) → Optional cloud sync (embeddings/labels only).
Distillation: CLIP teacher → tiny student
Compress the teacher’s representational power into a small model that runs great on iPhones. Knowledge distillation trains the student to mimic teacher outputs (logits/embeddings), “compressing and accelerating” without big accuracy loss (Distillation explainer).
- Teacher: CLIP image encoder (e.g., ViT) is powerful but large (≈350 MB FP32 typical for original CLIP artifacts) (PicCollage).
- Student: MobileNet/Tiny-ViT sized for Core ML/ANE. Teams have reported ~7× compression (350 MB → 48 MB FP32, 24 MB FP16) with “negligible” search accuracy loss after Core ML conversion (PicCollage). Apple’s MobileCLIP family shows similar size/quality tradeoffs (e.g., largest variant ≈173 MB) (MobileCLIP overview).
- Training is offline: distill on GPUs in the cloud; only ship the student to devices—no on-device training assumed (distillation workflow).
Cost sanity check: an AWS p3.2xlarge (V100) is ≈$3.06/hr on-demand (spot ≈$0.97/hr). A 5–10 hr distillation job runs roughly $15–$30 on-demand; less on spot (Vantage: p3.2xlarge).
PyTorch → Core ML (and quantize)
Use coremltools to convert PyTorch directly to .mlmodel (TorchScript tracing/scripting), then apply FP16 or even 8-bit post-training quantization to cut size/latency (coremltools: PyTorch conversion). If you hit unsupported ops, ONNX can be a fallback—but Apple notes direct PyTorch conversion is preferred (ONNX notes).
Extract embeddings on iOS (Vision/Core ML)
Two practical options:
- Your distilled Core ML model via VNCoreMLRequest to get a 512/768-d embedding per photo (CLIP-style).
- Vision feature prints: VNGenerateImageFeaturePrintRequest yields normalized 768-d vectors (iOS 17), comparable with Euclidean (≈cosine) distance. In practice, near-duplicate thresholds around ~0.4–0.6 (normalized distance) work well; tune per dataset (Vision feature prints write-up). Both options are sketched after this list.
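A minimal Swift sketch of both options, assuming the distilled student ships as a bundled Core ML model with a single multi-array embedding output (the model URL, output shape, and crop option are assumptions that depend on your export):

```swift
import CoreGraphics
import CoreML
import Vision

// Option A: embedding from the distilled Core ML student.
// The multi-array output (512/768 floats) depends on how you exported the model.
func distilledEmbedding(for image: CGImage, modelURL: URL) throws -> [Float] {
    let vnModel = try VNCoreMLModel(for: MLModel(contentsOf: modelURL))
    let request = VNCoreMLRequest(model: vnModel)
    request.imageCropAndScaleOption = .centerCrop
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    guard let observation = request.results?.first as? VNCoreMLFeatureValueObservation,
          let array = observation.featureValue.multiArrayValue else { return [] }
    return (0..<array.count).map { Float(truncating: array[$0]) }
}

// Option B: Apple's built-in feature print (768-d, normalized on iOS 17).
func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// Euclidean distance between two feature prints (≈ cosine on normalized vectors);
// the ~0.4–0.6 near-duplicate threshold above is only a starting point to tune.
func distance(_ a: VNFeaturePrintObservation, _ b: VNFeaturePrintObservation) throws -> Float {
    var d: Float = 0
    try a.computeDistance(&d, to: b)
    return d
}
```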
Clustering & deduping:
- Persist embeddings locally (Core Data/SQLite).
- For small libraries, brute-force NN (cosine/Euclidean) is fine; for larger sets, use product quantization or HNSW (client-side) as needed. A greedy clustering sketch follows this list.
- Apple’s Photos research uses agglomerative clustering on on-device face/body embeddings for People albums—privacy-preserving and effective (Apple ML Research).
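A minimal Swift sketch of the thresholded approach: cosine similarity plus a greedy single-pass merge. This is a simplification of full agglomerative clustering (not Apple's exact method), and the 0.9 similarity threshold is an assumption to tune per model and library:

```swift
import Foundation

// Cosine similarity between two embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / ((na * nb).squareRoot() + 1e-9)
}

// Greedy clustering: assign each photo to the first existing cluster whose
// representative is within the threshold, otherwise start a new cluster.
func clusterEmbeddings(_ embeddings: [(id: String, vector: [Float])],
                       similarityThreshold: Float = 0.9) -> [[String]] {
    var clusters: [(representative: [Float], members: [String])] = []
    for item in embeddings {
        if let idx = clusters.firstIndex(where: {
            cosineSimilarity($0.representative, item.vector) >= similarityThreshold
        }) {
            clusters[idx].members.append(item.id)
        } else {
            clusters.append((item.vector, [item.id]))
        }
    }
    return clusters.map { $0.members }
}
```

Run it over the locally persisted embeddings; clusters whose members are also close in capture time are good duplicate/burst candidates.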
Personalization scenarios (all on-device)
- Duplicate & burst pruning: thresholded NN + lightweight agglomerative clusters to collapse near-duplicates.
- Smart albums: cluster by scene/subject; combine embeddings with EXIF/time/location for “Trip to SF 2024.”
- Semantic search: with a text encoder (distilled CLIP text tower), compare text embeddings to image embeddings for queries like “corgi in the snow” (ranking sketch below). The same vector search idea powers server demos too (MongoDB tutorial examples).
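A small ranking sketch for the semantic-search case, assuming a distilled text tower (run separately through Core ML) produces a query embedding in the same space as the image embeddings; Accelerate's vDSP handles the cosine math:

```swift
import Accelerate

// Rank stored photo embeddings against a text-query embedding and
// return the top-K photo identifiers.
func rankPhotos(queryEmbedding: [Float],
                photoEmbeddings: [(id: String, vector: [Float])],
                topK: Int = 20) -> [String] {
    func cosine(_ a: [Float], _ b: [Float]) -> Float {
        let dot = vDSP.dot(a, b)
        let norm = (vDSP.sumOfSquares(a) * vDSP.sumOfSquares(b)).squareRoot() + 1e-9
        return dot / norm
    }
    return photoEmbeddings
        .map { (id: $0.id, score: cosine($0.vector, queryEmbedding)) }
        .sorted { $0.score > $1.score }
        .prefix(topK)
        .map { $0.id }
}
```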
Optional cloud sync (hybrid)
- On-device only: maximum privacy—no images or vectors leave the phone (Fritz).
- Cloud-optional: sync embeddings/labels only (encrypted) for cross-device search. MongoDB Atlas supports Vector Search: store {"_id": photoId, "embedding": [...]} documents and query via $vectorSearch (Atlas examples); a sync-record sketch follows this list.
- Embedding sizes are tiny (768 FP32 floats ≈ 3 KB each), so even 100k photos is only ≈ 300 MB of vectors.
- Remember: embeddings can leak semantics if compromised; treat as sensitive (encrypt at rest/in transit).
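A minimal Swift sketch of a sync record mirroring the Atlas document shape above; the labels field and type name are assumptions:

```swift
import Foundation

// Sync record mirroring {"_id": photoId, "embedding": [...]}.
// Only vectors/labels leave the device, never pixels; encrypt in
// transit and at rest, since embeddings can leak semantics.
struct EmbeddingRecord: Codable {
    let _id: String          // stable local photo identifier
    let embedding: [Float]   // 512/768-d vector from the distilled model
    let labels: [String]     // optional cluster/user labels, e.g. "dog"
}

// Size check from the text: 768 floats * 4 bytes ≈ 3 KB per photo,
// so 100,000 photos ≈ 300 MB of vectors server-side.
func payload(for record: EmbeddingRecord) throws -> Data {
    try JSONEncoder().encode(record)
}
```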
iOS implementation checklist
- Model: distill, export .mlmodel, FP16 if quality holds.
- Runtime: use Vision for VNCoreMLRequest/VNGenerateImageFeaturePrintRequest; batch over the Photos library with background tasks (indexing sketch after this list).
- Index: Core Data/SQLite with schema: {photoId, ts, exif, embedding, clusters}.
- Clustering: start with thresholded NN + agglomerative; add HNSW if you need faster queries.
- UX: privacy notice + toggles, progress UI for first-run indexing, “review duplicates” surfaces.
- Power: schedule heavy work on charge/Wi-Fi; incremental updates via PhotoKit change events.
- Testing: calibrate distance thresholds per device/photo domain; A/B FP16 vs FP32.
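A Swift sketch of the batching and power items from the checklist, assuming a caller-supplied indexPhoto routine (hypothetical) that runs the embedding request and persists the vector; the background-task identifier is a placeholder that must also be registered in Info.plist:

```swift
import Foundation
import Photos
import BackgroundTasks

// First-run indexing: walk the library in small batches so work can be
// paused and resumed; `indexPhoto` stands in for "embed + persist".
func indexLibrary(batchSize: Int = 50, indexPhoto: @escaping (PHAsset) -> Void) {
    let options = PHFetchOptions()
    options.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
    let assets = PHAsset.fetchAssets(with: .image, options: options)

    var batch: [PHAsset] = []
    assets.enumerateObjects { asset, _, _ in
        batch.append(asset)
        if batch.count == batchSize {
            batch.forEach(indexPhoto)   // embed + persist this batch
            batch.removeAll()
        }
    }
    batch.forEach(indexPhoto)           // flush the final partial batch
}

// Defer heavy work to charge time with a processing task request.
func scheduleIndexing() {
    let request = BGProcessingTaskRequest(identifier: "com.example.photo-indexing")
    request.requiresExternalPower = true        // run while charging
    request.requiresNetworkConnectivity = false // inference is fully local
    try? BGTaskScheduler.shared.submit(request)
}
```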
Architecture (one slide view)
Cloud (offline): Pretrain/Distill CLIP → export student → Core ML convert/quantize → deliver .mlmodel.
Device: iOS app (Swift) → Core ML & Vision infer → store embeddings locally → NN search & clustering → personalization UI.
Optional: Encrypted sync of embeddings/labels to MongoDB Atlas vector index for cross-device search.
tl;dr
- Privacy-first wins: keep photos and inference on-device; build a private knowledge graph (Apple does this in Photos) (Apple ML Research).
- Distill big → small: CLIP-class teacher → compact Core ML student; FP16 cuts size with minimal loss (PicCollage).
- Use Vision/Core ML: 512/768-d embeddings; cluster & dedupe with simple thresholds; agglomerative works well for People-style grouping (Vision feature prints; Apple ML Research).
- Cloud optional: if you must sync, upload embeddings/labels only and treat them as sensitive; Atlas Vector Search can power cross-device queries (MongoDB tutorial).
- Cost: one-time GPU training is the big line item; on-device inference is effectively free at runtime (p3.2xlarge pricing).
URL Index
- Apple Photos research (on-device People/knowledge graph): https://machinelearning.apple.com/research/recognizing-people-photos
- On-device ML benefits, latency, FaceApp discussion: https://fritz.ai/on-device-ml-benefits/
- Vision feature prints & similarity thresholds (768-d): https://medium.com/@MWM.io/apples-vision-framework-exploring-advanced-image-similarity-techniques-f7bb7d008763
- Knowledge distillation explainer: https://medium.com/@nminhquang380/knowledge-distillation-explained-model-compression-49517b039429
- CLIP distillation & Core ML conversion results (48 MB / 24 MB FP16): https://tech.pic-collage.com/distillation-of-clip-model-and-other-experiments-f8394b7321ce?gi=04aec5c36161
- Apple MobileCLIP sizes/overview: https://blog.jacobstechtavern.com/p/offline-ai-clip
- coremltools: converting from PyTorch: https://apple.github.io/coremltools/docs-guides/source/convert-pytorch.html
- coremltools: ONNX conversion notes: https://coremltools.readme.io/v4.0/docs/onnx-conversion
- Build an image search engine with CLIP embeddings (Atlas Vector Search): https://www.mongodb.com/developer/products/atlas/multi-modal-image-vector-search/
- AWS p3.2xlarge pricing (training cost sanity check): https://instances.vantage.sh/aws/ec2/p3.2xlarge