Production ML Case Study
Production Multistage Multimodal Recommender on Amazon Elastic Kubernetes Service (EKS)
End-to-end recommender system with candidate generation, ranking, reranking, filtering, feature caching, and Triton serving on Kubernetes.
System
- Two-Tower candidate generation, DLRM ranking, diversity reranking, and seen-item filtering.
- Cold-start handling with feature masking, context-aware recommendations, CLIP embeddings, and Sentence-BERT embeddings.
- Serving stack with Amazon EKS, NVIDIA Triton, Feast, FAISS, Kubeflow, and Valkey-backed Bloom filters.
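The seen-item filtering listed above relies on Bloom filters, which answer "has this user already seen this item?" in constant space per user, with occasional false positives and no false negatives. The sketch below is a minimal in-process illustration, not the project's implementation. The production system keeps its filters in Valkey, and the class name `BloomFilter`, the helper `filter_seen`, and the sizing parameters are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing formulas for the bit count m and hash count k.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item_id: int):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(str(item_id).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item_id: int) -> None:
        for pos in self._positions(item_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item_id: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item_id))


def filter_seen(candidates, seen: BloomFilter):
    """Drop candidates the filter reports as seen, preserving candidate order."""
    return [c for c in candidates if c not in seen]


# Hypothetical interaction history for one user.
seen = BloomFilter(capacity=1000)
for item_id in (3, 7, 42):
    seen.add(item_id)

# Items 3, 7, and 42 are always removed; a false positive could
# occasionally remove an unseen item as well.
unseen = filter_seen([1, 3, 5, 7, 42, 99], seen)
```

The trade-off is that a false positive hides an item the user has never seen, which is usually acceptable for a recommender, whereas a false negative (showing a seen item again) cannot occur.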
Impact
- 99.7% item feature lookup latency reduction
- 54% end-to-end latency reduction
- 310% throughput improvement
Interactive Demo
The demo exposes the serving system as a recommender UI with controls for user ID, device type, time of day, and Top-K results, making the personalization inputs and scored outputs visible without opening the code.
Triton Serving Ensemble
A single client request flows through a Triton ensemble: context preprocessing, Feast-backed user lookup, NVTabular transforms, Two-Tower retrieval with FAISS, Bloom-filter seen-item removal, item feature lookup, DLRM ranking, and final softmax sampling.
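The final softmax sampling step can be illustrated with a short sketch. The source does not say how the sampling is done, so this example assumes one common approach: draw Top-K items without replacement, with each draw weighted by the softmax of the ranking scores. The function names, the `temperature` parameter, and the example scores are hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of raw scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(item_ids, scores, k, temperature=1.0, rng=None):
    """Draw k distinct items, each draw weighted by softmax(score / temperature)."""
    rng = rng or random.Random()
    pool = list(zip(item_ids, scores))
    picked = []
    for _ in range(min(k, len(pool))):
        # Recompute probabilities over the items that are still available.
        probs = softmax([s for _, s in pool], temperature)
        idx = rng.choices(range(len(pool)), weights=probs, k=1)[0]
        picked.append(pool.pop(idx)[0])
    return picked


# Hypothetical ranking scores for four candidate items.
recommendations = sample_top_k(
    ["a", "b", "c", "d"], [2.0, 0.5, 1.0, 0.1], k=3, rng=random.Random(0)
)
```

A lower temperature makes the result approach a plain sort by score, while a higher temperature adds exploration, so repeated requests from the same user need not return an identical list.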
Feature Caching Optimization
Profiling showed that per-request item feature lookups caused hundreds of Redis/Valkey round trips through Feast. Loading item features into an in-memory NumPy cache at model initialization removed those round trips from the request path and cut lookup latency from 195 ms to 0.5 ms, a 99.7% reduction.
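A minimal sketch of such a cache follows, assuming item features fit in memory as one dense array. The class `ItemFeatureCache`, its method names, and the zero-vector fallback for unknown items are assumptions made for illustration, and the one-time load from Feast is replaced here by hard-coded arrays. In a Triton Python-backend model, the cache would typically be built once in `initialize` and read in `execute`.

```python
import numpy as np


class ItemFeatureCache:
    """Dense in-memory item feature table, built once and read per request."""

    def __init__(self, item_ids: np.ndarray, features: np.ndarray) -> None:
        if item_ids.shape[0] != features.shape[0]:
            raise ValueError("item_ids and features must have the same number of rows")
        # Row 0 is reserved as an all-zero fallback for unknown items.
        self._table = np.zeros((features.shape[0] + 1, features.shape[1]), dtype=features.dtype)
        self._table[1:] = features
        # Sort the ids once so that each lookup can use binary search.
        order = np.argsort(item_ids)
        self._sorted_ids = item_ids[order]
        self._sorted_rows = order + 1

    def lookup(self, query_ids: np.ndarray) -> np.ndarray:
        """Return one feature row per query id in a single vectorized gather."""
        pos = np.searchsorted(self._sorted_ids, query_ids)
        pos = np.clip(pos, 0, len(self._sorted_ids) - 1)
        found = self._sorted_ids[pos] == query_ids
        rows = np.where(found, self._sorted_rows[pos], 0)
        return self._table[rows]


# Hypothetical item catalog; in practice this would be loaded once from the feature store.
item_ids = np.array([101, 205, 330])
features = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]], dtype=np.float32)
cache = ItemFeatureCache(item_ids, features)

# One batched lookup replaces one network round trip per candidate item.
batch = cache.lookup(np.array([330, 999, 101]))  # id 999 is unknown, so it gets zeros
```

The design trades freshness for speed: features are as current as the last model load, so this suits item attributes that change slowly and would need a refresh mechanism for anything that updates often.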
Training, Deployment, and Monitoring
The MLOps flow keeps training and serving on Amazon EKS: Kubeflow prepares data and models, artifacts are persisted to Amazon EFS, NVIDIA Triton Inference Server serves the 14-model ensemble, and Prometheus/Grafana track utilization, throughput, and latency for capacity planning.
Tools: Amazon EKS, NVIDIA Merlin, NVIDIA Triton, Feast, FAISS, Kubeflow, Redis/Valkey, CLIP, Sentence-BERT

Keras
TensorFlow
PyTorch
FastAI
PyMC
Django
Flask
Python
SQL
Shell Scripting
NumPy
Pandas
OpenCV
scikit-learn
matplotlib
AWS
GCP
Azure
Heroku
Git