Mustapha Unubi Momoh | Machine Learning Engineer

Hi, I'm Mustapha Unubi Momoh.

Machine Learning Engineer

Master's in Systems Design Engineering from the University of Waterloo, Canada.

About

Machine Learning Engineer

I build production machine learning systems (e.g., recommender systems), model serving stacks, and machine learning infrastructure on cloud and Kubernetes-based systems.

Core Focus

Recommender systems and Personalization
Model serving and Inference infrastructure
Training pipelines, Evaluation, and Monitoring
Applied GenAI and Document Intelligence

Featured Projects

Production ML Case Study

Production Multistage Multimodal Recommender on Amazon Elastic Kubernetes Service (EKS)

Built and deployed an end-to-end recommender system with candidate generation, ranking, reranking, filtering, feature caching, and Triton serving on Kubernetes.

TDS Article | Medium Article | Demo | Code

Multistage multimodal recommender system serving pipeline

Why This Design

The target use case is an ecommerce homepage recommender that serves both registered users and anonymous visitors. Recommendations need to account for request context such as device type, time of day, and day of week, while still producing reasonable cold-start results for users with little or no history.

The system also has to scale to large product catalogs. Scoring millions of items on every request is impractical, so the architecture uses a multistage design: a lightweight retrieval stage quickly narrows the candidate set, then a heavier ranking stage scores the smaller pool.
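The retrieve-then-rank split can be illustrated with a minimal sketch. It is not the deployed code: brute-force dot products stand in for the ANN index, and `retrieve`, `rank`, `recommend`, and `score_fn` are hypothetical names chosen for illustration.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, item_vecs, k):
    """Cheap stage: keep the k items whose embedding best matches the query.

    item_vecs maps item_id -> embedding. A real system would use an ANN
    index here instead of scanning every item.
    """
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(query_vec, item_vecs[i]))


def rank(candidates, score_fn, top_n):
    """Expensive stage: score only the retrieved candidates."""
    scored = sorted(((score_fn(i), i) for i in candidates), reverse=True)
    return [item_id for _, item_id in scored[:top_n]]


def recommend(query_vec, item_vecs, score_fn, k=100, top_n=10):
    """Retrieve k candidates, then return the top_n by ranker score."""
    return rank(retrieve(query_vec, item_vecs, k), score_fn, top_n)
```

The point of the split is that `score_fn`, the heavy model, runs `k` times per request rather than once per catalog item.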

To keep the system current without rebuilding the full retrieval stack every day, I separated the workflow into an initial Kubeflow pipeline and an incremental fine-tuning pipeline. The initial pipeline builds preprocessing workflows, trains models from scratch, creates the ANN index, and deploys Triton. The incremental pipeline updates the query tower and ranker with new interactions while keeping item embeddings fixed.

System summary

Two-Tower candidate generation, Seen-items filtering, DLRM ranking, score-based Diversity reranking.
A feature masking technique allows the Two-Tower to learn dedicated "starting" embeddings for unknown users.
Context-aware ranking based on device type and timestamp adapts recommendations to the current request context.
Real-time user feature updates allow the system to adapt to changing user intent in near real time.
Multimodal learning with CLIP-image and Sentence-BERT embeddings improves item cold-start performance and overall recommendation quality.
Bloom filters are used to exclude items that have already been seen by the user.
Infrastructure is autoscaled with Kubernetes Horizontal Pod Autoscaler (HPA) and Karpenter.
The serving stack includes Amazon EKS, NVIDIA Triton Inference Server, TensorFlow, Feast feature stores (offline store backed by S3 and Athena, online store backed by Valkey/Redis), an ANN index (FAISS), Kubeflow Pipelines, and Valkey-backed Bloom filters.
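As an illustration of the seen-items filter listed above, here is a minimal in-process Bloom filter. The deployed system keeps its filters in Valkey; the class below, its default sizes, and the `filter_seen` helper are assumptions made only for this sketch.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def filter_seen(candidates, seen_filter):
    """Drop candidates the filter reports as already seen."""
    return [item for item in candidates if item not in seen_filter]
```

A Bloom filter never misses an item that was added, so a seen item is always removed. The trade-off is that an unseen item is occasionally dropped as a false positive, which is usually acceptable for recommendations.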

Online Feature Updates

Recommendation requests and feature updates run through separate Lambda functions in the same VPC as ElastiCache and the EKS node subnets hosting Triton. Triton is exposed through an internal AWS Network Load Balancer, while DynamoDB, S3, and SQS are reached privately through VPC endpoints.

Online feature update and recommendation request flow with Lambda, internal Network Load Balancer, Triton on EKS, ElastiCache, DynamoDB, S3, SQS, and VPC endpoints

Interactive Demo

The demo shows recommendations adapting to changing user preferences in near real time, using the online feature update path above to refresh user features as new interactions are generated. Click the image to view the demo.

YouTube thumbnail for the interactive multistage recommender system demo with user, context, Top-K controls, scores, and recommendation cards

Triton Serving Ensemble

A single client request flows through a Triton ensemble including context preprocessing, Feast-backed user lookup, NVTabular transforms, Two-Tower retrieval with FAISS, Bloom-filter seen-item removal, item feature lookup, DLRM ranking, and final softmax sampling.

Triton serving graph for multistage recommender with context, retrieval, filtering, item features, ranking, and response stages
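The final softmax sampling step can be sketched as follows. The exact scheme used in the ensemble is not described here, so this assumes sampling without replacement from a softmax over ranker scores; the `temperature` parameter and function names are hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_sample(items, scores, k, temperature=1.0, rng=None):
    """Draw k distinct items, favoring higher scores without always
    returning the same top-k list."""
    rng = rng or random.Random()
    pool = list(zip(items, scores))
    chosen = []
    for _ in range(min(k, len(pool))):
        probs = softmax([s for _, s in pool], temperature)
        idx = rng.choices(range(len(pool)), weights=probs, k=1)[0]
        chosen.append(pool.pop(idx)[0])
    return chosen
```

A lower temperature concentrates probability on the highest-scoring items, and a higher one spreads it out, trading relevance for variety.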

Feature Caching Optimization

Loading item features into an in-memory NumPy cache at model initialization reduced lookup latency from 195 ms to 0.5 ms, cut end-to-end latency by 54%, and improved throughput by 310%.
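The caching idea is to pay the load cost once at initialization and serve every later lookup from memory. The sketch below uses a plain dictionary instead of NumPy to stay dependency-free; `ItemFeatureCache` and the `load_all_features` callable are hypothetical stand-ins for the real model code and feature source.

```python
class ItemFeatureCache:
    """Load every item's features once, then answer lookups from memory."""

    def __init__(self, load_all_features):
        # load_all_features is a hypothetical callable that returns
        # {item_id: feature_row}. It is called exactly once, at init.
        self._rows = dict(load_all_features())

    def lookup(self, item_ids, default=None):
        """Return feature rows in request order; unknown ids get default."""
        return [self._rows.get(item_id, default) for item_id in item_ids]

    def __len__(self):
        return len(self._rows)
```

This works when the item feature table fits in memory and changes rarely. Any change to item features would need a reload, for example on the next model deployment.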

Training, Deployment, and Monitoring

The MLOps flow keeps training and serving on Amazon EKS: Kubeflow prepares data and models, artifacts are persisted to Amazon EFS, NVIDIA Triton Inference Server serves the 14-model ensemble, and Prometheus/Grafana track utilization, throughput, and latency for capacity planning.

MLOps architecture for multistage recommender on Amazon EKS with Kubeflow, Triton, Prometheus, Grafana, GPU nodes, and CPU nodes

Initial Pipeline Run

The initial run builds the system from scratch. This includes the data preprocessing, feature store setup, Two-Tower and Deep Learning Recommendation Model training, ANN index setup, and Triton Server deployment.

Initial training pipeline for multistage recommender showing full data preparation, feature engineering, model training, artifact generation, and serving preparation

Incremental Pipeline Run

The incremental run updates the system without rebuilding everything from scratch. The Two-Tower model is fine-tuned with the candidate encoder frozen, so only the query tower is updated. The ranker is fine-tuned with all layers trainable. Training uses recent data and some historical data for stability. The fine-tuned models are deployed to the server and Triton picks them up.

Incremental update pipeline for multistage recommender showing partial refresh of data, features, embeddings, indexes, and serving artifacts
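Freezing the candidate encoder means gradient steps change only the query tower, so existing item embeddings stay valid. The sketch below shows this with a one-layer linear query tower and a logistic loss. The real towers are TensorFlow models; the function names, learning rate, and tower shape are assumptions.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sgd_step_query_tower(W, user_feats, item_emb, label, lr=0.1):
    """One logistic-loss SGD step that updates only the query tower W.

    item_emb comes from the frozen candidate encoder and is never modified.
    Returns the predicted click probability before the update.
    """
    query_emb = matvec(W, user_feats)
    logit = sum(q * e for q, e in zip(query_emb, item_emb))
    pred = 1.0 / (1.0 + math.exp(-logit))
    err = pred - label
    # dLoss/dW[i][j] = err * item_emb[i] * user_feats[j]
    for i, row in enumerate(W):
        for j in range(len(row)):
            row[j] -= lr * err * item_emb[i] * user_feats[j]
    return pred
```

Because the item side does not move, the item embeddings already in the ANN index remain comparable with the updated query embeddings.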

For the full implementation details, architecture decisions, and deployment notes, see the TDS article, Medium article, demo, or source code linked above.

Tools: Amazon EKS, NVIDIA Merlin, NVIDIA Triton, Feast, FAISS, Kubeflow, Redis/Valkey, CLIP, Sentence-BERT

ML Infrastructure Case Study

Recommender System with Continuous Retraining on Amazon Elastic Kubernetes Service (EKS)

Built and deployed a recommender system for Ads ranking on Amazon EKS. It includes a monitoring component that triggers incremental retraining when model performance drifts below a defined threshold.

Medium Article Code

DCN-based recommender system architecture with continuous retraining on Amazon EKS

System summary

Trains a Deep and Cross Network model to predict the click probability for an Ad, using the user and item (Ad) features. The training data is a subset of the Criteo 1TB click logs dataset.
The monitoring component triggers incremental fine-tuning when model performance (based on AUC-ROC) drifts below a defined threshold.
One Kubeflow pipeline orchestrates both the full and incremental training runs. Incremental training is either triggered by the monitoring component or scheduled periodically.
Infrastructure is autoscaled with Kubernetes Horizontal Pod Autoscaler (HPA), Karpenter, or Cluster Autoscaler.
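The monitoring trigger described above can be sketched as an AUC-ROC check on recent labeled traffic. This is a simplified stand-in for the actual monitoring component: the pairwise AUC below is quadratic in the number of samples, and the 0.75 threshold is a hypothetical value.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_retrain(labels, scores, threshold=0.75):
    """True when AUC-ROC on recent traffic has fallen below the threshold."""
    return auc_roc(labels, scores) < threshold
```

For example, `should_retrain([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1], threshold=0.8)` returns `True`, because only three of the four positive-negative pairs are ordered correctly (AUC 0.75).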

Autoscaling Strategies

Triton autoscaling uses a custom queue latency metric. When average request queue time exceeds the 200 ms target, HPA schedules additional Triton replicas. The project includes two node-scaling paths for pending GPU workloads. In one variant, Karpenter launches GPU nodes directly, and in the other, Cluster Autoscaler increases the desired capacity of a GPU Auto Scaling Group and lets AWS Auto Scaling provision the nodes.

Kubernetes HPA with Cluster Autoscaler for recommender serving on Amazon EKS — HPA with Cluster Autoscaler

Kubernetes HPA with Karpenter for recommender serving on Amazon EKS — HPA with Karpenter
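The replica decision follows the scaling rule in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1. The sketch below applies that rule to the 200 ms queue-time target. The replica bounds are hypothetical, and the 0.1 tolerance reflects the documented default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA scaling formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two replicas averaging 300 ms of queue time against a 200 ms target
print(desired_replicas(2, 300, 200))  # 3
```

If the new replicas cannot be scheduled because no GPU node has capacity, they stay pending, which is the signal Karpenter or Cluster Autoscaler acts on.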

Full implementation details and source code are linked at the top of this case study.

Tools: Amazon EKS, NVIDIA Merlin, HugeCTR, Kubernetes, model monitoring, autoscaling

Experience

Pixite Inc.

Machine Learning Engineer (Recommender systems)

Designed and proposed recommender-system architecture options on AWS and GCP, evaluating tradeoffs in training speed, inference latency, delivery timelines, and operating costs across data ingestion, model training, and inference.
Collaborated with the product team to define data and ranking requirements for personalized search and recommendation features for the Pigment app.
Collaborated with engineering to train recommendation models for the Pigment app, enabling homepage content personalization for millions of users.
Led discussions around recommendation request/response caching to optimize performance, including evaluating trade-offs between different cache types.
Tools: Recommendation algorithms, Vertex AI, and Cloud Functions

November 2024 – July 2025 | United States, Remote (Contract)

EveryRate

Data Scientist (OCR, ETL, and Automation)

Designed and deployed an ETL pipeline to extract mortgage rates from structured documents using Azure AI Document Intelligence, Azure Functions, and Blob triggers.
Benchmarked OCR pipeline tools, including Amazon Textract, Google Document AI, Azure AI Document Intelligence, and vision-language models for tabular data extraction.
Automated document processing with blob-triggered functions and upserted extracted mortgage-rate data into PostgreSQL for application use.
Tools: Azure AI Document Intelligence, Azure Functions, Blob Storage, PostgreSQL, Amazon Textract, Google Document AI

May 2024 – December 2024 | Vancouver, Remote

Contracts

Machine Learning and Generative AI Engineer (Contracts)

Several Companies including Stealth startups and Upwork clients

Trained, packaged, and deployed deep learning models for spoofing verification for credit card and spend management companies while at OKRFI (Stealth startup).
Worked with the VP of Engineering to set up API gateways and collaborated on API specifications and technical reports detailing benchmarking results while at OKRFI (Stealth startup).
Led the Data Science team in pitches to two corporate credit card and spend management companies with positive feedback while at OKRFI (Stealth startup).
Worked as a Generative AI consultant on the development of a copilot for a visual programming language with a client on Upwork.
Worked on a POC of an AI shopping assistant, similar to Shopify’s shop.app, to improve product discovery for an e-commerce platform.
Worked on a beauty-retail Generative AI POC using PaLM 2, Stable Diffusion, and Vertex AI.
Worked on a project on the causal understanding of REM sleep, deep sleep, and sleep latency using TabNet, SHAP, and PyMC.
Tools: Python, AWS Lambda, SageMaker Endpoint, TensorFlow Serving, SQS, Docker, API Gateway, AWS Bedrock, Large Language Models (Titan), text embeddings, Vector DBs, Amazon Kendra, Streamlit, AWS EC2, Stable Diffusion, GCP, Vertex AI, AI agents, Knowledge graphs, Amazon Neptune, Neo4j, Entity extraction, Intent recognition, Explainable AI with SHAP, Bayesian Causal Inference, and Machine Learning

April 2023 – Present | United States (NYC) | Canada (Remote)

Selected Projects and Hackathons

Click a card (or the ⋮ menu icon in its corner) to flip it and reveal more details, source code, demos, and live links.

                
Hybrid Search and Autocomplete on OpenSearch
                  Hybrid lexical and semantic search on OpenSearch with RRF ranking, category-diverse autocomplete, and a Dockerized local setup.
                
Accomplishments
Tools: OpenSearch, Flask, Docker
Hybrid lexical and semantic retrieval.
OpenSearch model deployment plus ingest and search pipelines.
RRF-based ranking for hybrid search results.
Autocomplete with category-diverse suggestions.
Dockerized local OpenSearch cluster setup.
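Reciprocal rank fusion (RRF) merges the lexical and semantic result lists using ranks only, so the two scoring scales never need to be normalized against each other. The project itself relies on OpenSearch search pipelines for this. The standalone function below is an illustrative sketch, and `k=60` is a commonly used constant rather than a value confirmed for this project.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists with RRF: score(d) = sum of 1 / (k + rank of d).

    rankings is a list of lists of document ids, best first.
    Ties are broken by document id for a deterministic order.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]
semantic = ["b", "c", "d"]
print(rrf_fuse([lexical, semantic]))  # ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") rise above documents that rank highly in only one.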

2024 NVIDIA AI Hackathon: AI-Assisted Documentation Review and Update
                  AI-assisted documentation review and update application using an AWQ-quantized 13B Llama 2 model and TensorRT-LLM
                
Accomplishments
Tools: Llama 2, NVIDIA TensorRT, TensorRT-LLM, Quantization, Streamlit, Docker, NVIDIA RTX 4090
Launch the app and log in with your Atlassian Confluence credentials.
Your documentation and articles in the Confluence space are automatically downloaded and indexed.
Chat with the documentation, or
create new content by providing a title, editing the generated content, and publishing it.

Retrieval Augmented Generation with AWS Bedrock, Kendra, and Amazon Titan 
                  Retrieval Augmented Generation with AWS Bedrock, Kendra, and Amazon Titan for content and slides generation
                
Accomplishments
Tools: AWS Bedrock, AWS Kendra, EC2, Amazon Titan model, Prompt Engineering, Amazon S3
Clone the repo.
Launch the application and create long-form articles or short PowerPoint slides.

Medical Decision Support
                  Machine Learning Clinical Decision Support System Proof of Concept with LIME and Decision Trees.
                
Accomplishments
engineered features such as speech speed, average characters, average nouns, and sentiment from interview recordings and transcripts
trained a decision tree classifier for detecting the likelihood of depression
used Local Interpretable Model-agnostic Explanations (LIME) to produce local feature contributions and visualizations for interpretable ML
app can generate and display prediction probabilities, decision trees, LIME plots, and Feature importance on the interface
users can generate a short medical report with their assessment

Interactive Text Label Explorer
                  An Interactive Dashboard for Text Label Exploration.
                
Accomplishments
                  The preprocessing steps include:
                  creating word embeddings.
projecting the embedding vectors onto a 2D plane using dimensionality reduction techniques (17 of them are used in the project).
Topic modeling to produce clusters based on topics.

                  The dashboard allows users to interactively explore the data and labels in different panels including:
                  label-based groupings view
topic-based groupings view
top sentences view
top words and word cloud view

                  Based on findings from the explorations, the user can select data for review directly from the scatterplots. The selected data can be downloaded by clicking on a button. Please watch the demo video for more details.
                
Microsoft Responsible AI Hackathon - Deep Learning-Assisted Diagnosis of Primary Open-Angle Glaucoma
                  Deep Learning-Assisted Diagnosis of Primary Open-Angle Glaucoma
                
Accomplishments
the solution uses a fine-tuned ResNet50 model and an Azure Custom Vision classifier to analyze fundus images for glaucomatous changes
it uses techniques such as smart tagging to identify the optic disc region, which supports calculation of the cup-to-disc ratio
the datasets used for training and testing include the Retina Fundus Images for Glaucoma Analysis (RIGA) and Drishti datasets

Multivariate Regression and Explainable AI with SHAP
                  Multivariate Regression and Explainable AI with SHAP: exploring factors affecting sleep latency, REM sleep, deep sleep, and number of awakenings.
                
Accomplishments
developed regression models capable of predicting variables such as awake time, REM sleep time, deep sleep time, sleep latency, and number of awakenings
used SHAP and sensitivity analysis to explain the models' predictions
the models used in this project include Support Vector Machines, XGBoost, and TabNet Regressor

Animated Node-link and Adjacency Matrix Transition
                  Animated Node-link and Adjacency Matrix Transition using the Les Miserables dataset
                
Accomplishments
An implementation of an animated transition between a force-directed graph and an adjacency matrix.
Users can hover over a node to enlarge it and highlight its direct connections. This also displays the character name and description.
Click and drag nodes to reposition them. Other nodes will repel and move accordingly.
After initiating 'Start Transition', interactions are limited to hover details, because overlapping elements disable other interactions such as link highlighting and node dragging.
Users can toggle between the node-link and adjacency matrix views
If you re-order nodes in the matrix view, ensure you allow the reordering process to complete before switching back to the node view.
Overlapping names in the matrix view can be resolved by completing the reordering process.

Skills

Production ML and Recommender Systems

Retrieval, Ranking, Reranking, Two-Tower models, Transformer-based sequence encoders, Session-based recommendations, DLRM, DCN, Context-aware recommendations, Cold-start handling, ANN search with FAISS, Feature stores, Bloom filters, MRR, NDCG, Precision@K

ML Serving and MLOps

NVIDIA Triton Inference Server, NVIDIA Merlin, HugeCTR, NVTabular, Kubeflow Pipelines, Docker, Kubernetes, Amazon EKS, Prometheus, Grafana, HPA, Karpenter, Cluster Autoscaler

Cloud and Data Infrastructure

AWS Lambda, Amazon S3, Amazon Athena, DynamoDB, SQS, Amazon EFS, ElastiCache / Valkey, SageMaker, API Gateway, Vertex AI, Cloud Functions, Azure Functions, Blob Storage, PostgreSQL

GenAI, OCR, and Applied AI

RAG, AWS Bedrock, Amazon Titan, Amazon Kendra, Prompt engineering, Stable Diffusion, Azure AI Document Intelligence, Amazon Textract, Google Document AI, Vision-language models, OpenCV, SHAP

Languages, Frameworks, and Analysis

Python, SQL, R, Bash, TensorFlow, PyTorch, Keras, scikit-learn, NumPy, Pandas, PyMC, matplotlib, Git

Education

University of Waterloo

Ontario, Canada

Degree: Master of Applied Science in Systems Design Engineering

Thesis: Remote Medical Diagnosis in Virtual Reality: A Mixed-methods approach to understanding Patients and Physicians’ Perceptions through Thematic Analysis and Regression Discontinuity Design.

Relevant Coursework:

Selected Topics in Communication and Information Systems: Advanced Topics in Pattern Recognition (SYDE 770)
- Comparative Analysis: Real-World Weighted Cross-Entropy Loss Functions Across Various Activation Functions
InfoViz for AI Explainability (CS 889)
- Interactive Dashboard for Text Label Exploration
Data Structure in Health Informatics (CS 792)
- Depression Detection System with Decision Trees and LIME
Time Series Analysis (SYDE 631)
- Understanding Impact of Greenhouse Gas Emissions on Global Warming with Structural Time Series

Contact

mustaphaunubi@gmail.com

github.com/MustaphaU

linkedin.com/in/mustaphaunubi