Framework · 10 case studies · senior rounds

ML System Design Interview Guide

The round that decides mid & senior offers — and the one with the least free material. A repeatable framework plus 10 worked case studies: recommendations, ranking, fraud, search, RAG, churn, ETA, moderation, and forecasting.

The 8-step framework (use it for any prompt)

1

Clarify the problem

Pin down the business goal, the user, scale (QPS, users, items), latency budget, and what 'success' means before touching models. Restate it back to the interviewer.
2

Frame as an ML task

Map the business goal to a concrete ML formulation — classification, ranking, regression, retrieval, generation — and define the label. The hardest part is often where the label comes from.
3

Pick the metric

Separate the offline metric (AUC, NDCG, RMSE, recall@k) from the online/business metric (CTR, revenue, retention) and name the gap between them. Always propose an A/B test.
4

Data & labels

Where does training data come from? Implicit (clicks) vs explicit (ratings) labels, class imbalance, label delay, and feedback loops / position bias.
5

Features

User, item, context, and cross features. Discuss freshness (batch vs real-time), an embedding/feature store, and avoiding train-serve skew.
6

Model

Start simple (logistic regression / GBT baseline), then justify going deeper (two-tower, DLRM, transformer). State the latency/accuracy trade-off explicitly.
7

Serving & scale

Candidate generation vs ranking, ANN retrieval, caching, batching, and a light-then-heavy multi-stage funnel (cheap models prune first, expensive models score the shortlist). Cover cold start and fallbacks.
8

Monitoring & iteration

Drift detection, guardrail metrics, online evaluation, retraining cadence, and failure modes. End by listing what you'd build next.
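
One way to make "drift detection" concrete is the Population Stability Index (PSI) between a training-time sample of a feature and a live sample. The sketch below is a minimal version with equal-width bins; the function name, the bin count, and the 0.25 alert level (a commonly quoted rule of thumb, not something this guide prescribes) are assumptions.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]
live = [x + 0.5 for x in train]      # simulated shift
drifted = psi(train, live) > 0.25    # assumed alert threshold
```

In an interview, pair a statistic like this with a plan: which features to watch, how often, and what triggers a retrain.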

Worked case studies

🎬

Recommendation System

Design the 'recommended for you' feed for a video/e-commerce platform with 100M+ items.

Frame it

Two-stage funnel: candidate generation (retrieve ~1,000 from the 100M+ catalog) then ranking (score the 1,000).
Frame ranking as predicting P(engagement) — watch, click, purchase.

Data & labels

Implicit feedback: clicks, watch-time, purchases as positives; impressions-without-action as negatives.
Beware position bias — top items get more clicks regardless of relevance.

Model

Retrieval: two-tower (user tower + item tower) with embeddings + approximate nearest neighbor (ANN / FAISS).
Ranking: gradient-boosted trees or a DLRM/DCN deep model on rich features.

Serving & scale

Precompute item embeddings offline; compute user embedding online.
ANN index for sub-50ms retrieval; cache popular results; fallback to trending for cold-start users.
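
The serving path above can be sketched in a few lines. This is a brute-force stand-in: a real system would replace the exhaustive scan with an ANN index, and the names `retrieve`, `item_vecs`, and `trending` are hypothetical.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, item_vecs, k, trending):
    """Return top-k item ids by dot product, or trending for cold-start users."""
    if user_vec is None:  # cold start: no user embedding yet
        return trending[:k]
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Item embeddings are precomputed offline; the user embedding arrives per request.
item_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(retrieve([0.9, 0.1], item_vecs, k=2, trending=["c", "a"]))  # ['a', 'b']
```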

Metrics

Offline: recall@k, NDCG. Online: CTR, watch-time, retention via A/B test.
Add diversity / freshness guardrails so the feed doesn't collapse to one genre.
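
For reference, the two offline metrics named above can be computed as follows. This is a minimal sketch using the linear-gain form of DCG; the function names are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, gains, k):
    """gains maps item -> graded relevance; missing items count as 0."""
    dcg = sum(gains.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@k suits the retrieval stage (did the candidate set contain the good items?), while NDCG suits the ranking stage (were they near the top?).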

Pitfalls

Feedback loops reinforce popular items; inject exploration (epsilon-greedy / bandits).
Filter-bubble and stale-embedding drift; retrain regularly.
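
A minimal sketch of epsilon-greedy exploration at the slate level, assuming a hypothetical `explore_pool` of fresh or under-exposed items. Each slot takes the next ranked item, except that with probability `epsilon` it takes a random exploration candidate instead.

```python
import random

def epsilon_greedy_slate(ranked, explore_pool, slots, epsilon, rng=random):
    """Fill slots from the ranked list, exploring with probability epsilon."""
    ranked = list(ranked)
    # Exclude items that could already appear via the ranked list.
    pool = [c for c in explore_pool if c not in ranked[:slots]]
    slate = []
    for _ in range(slots):
        if pool and rng.random() < epsilon:
            slate.append(pool.pop(rng.randrange(len(pool))))
        elif ranked:
            slate.append(ranked.pop(0))
    return slate

slate = epsilon_greedy_slate(["a", "b", "c"], ["x", "y"], slots=3, epsilon=0.1)
```

Logging which slots were exploratory matters as much as the exploration itself, since those impressions give less biased training data.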

📰

News Feed / Timeline Ranking

Rank posts in a social feed to maximize meaningful engagement.

Frame it

Multi-objective ranking: predict P(like), P(comment), P(share), P(hide) and combine with a weighted value model.
Score = w1·P(like) + w2·P(comment) + w3·P(share) − w4·P(hide).
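
As a worked example of the value model, the sketch below ranks two posts. The weights and probabilities are invented for illustration; in practice the weights are tuned through online experiments.

```python
# Hypothetical weights: a hide is penalized far more than a like is rewarded.
WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 6.0, "hide": -20.0}

def value_score(probs, weights=WEIGHTS):
    """Weighted sum of predicted event probabilities for one post."""
    return sum(weights[event] * probs[event] for event in weights)

posts = {
    "p1": {"like": 0.30, "comment": 0.02, "share": 0.01, "hide": 0.001},
    "p2": {"like": 0.35, "comment": 0.01, "share": 0.01, "hide": 0.02},
}
ranked = sorted(posts, key=lambda p: value_score(posts[p]), reverse=True)
print(ranked)  # ['p1', 'p2']
```

Here p2 has the higher P(like) but still ranks lower, because its hide probability costs more than the extra likes are worth.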

Features

User-author affinity, post recency, content type, historical engagement, social graph signals.
Real-time features (last session activity) via a feature store.

Model

Multi-task deep network: a shared bottom with per-objective heads, or MMoE (multi-gate mixture-of-experts) when objectives conflict.
Calibrated probabilities so the weighted combination is meaningful.

Metrics

Online: time-spent, meaningful interactions, day-N retention.
Guardrails: integrity (downrank misinformation), reported-content rate.

Pitfalls

Optimizing pure engagement can amplify clickbait/outrage — needs integrity objectives.
Cold-start for new users and new posts.

🎯

Ad Click-Through-Rate (CTR) Prediction

Predict the probability a user clicks an ad to drive the ad auction.

Frame it

Binary classification: P(click | user, ad, context). Expected value = bid × pCTR drives ranking.
Massive sparse categorical features (billions of feature crosses).

Model

Logistic regression with hashing as a baseline; then Wide & Deep, DeepFM, or DCN for feature interactions.
Embeddings for high-cardinality IDs (user, ad, advertiser).

Data

Severe class imbalance (clicks are rare, roughly 1-5% of impressions and often lower); use negative downsampling, then recalibrate predictions.
Delayed feedback — a click may arrive minutes later.
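
Negative downsampling inflates predicted probabilities, so scores need a correction before they reach the auction. If negatives are kept at rate w, the odds in the training set are the true odds divided by w, which gives the standard correction below. The function and parameter names are illustrative.

```python
def recalibrate(p_sampled, neg_keep_rate):
    """Map a probability learned on downsampled data back to the true scale.

    neg_keep_rate is the fraction of negatives kept during training (0 < w <= 1).
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / neg_keep_rate)

# True CTR 2%, negatives kept at 10%: the model sees about 16.9% positives.
p_seen = 0.02 / (0.02 + 0.98 * 0.1)
p_true = recalibrate(p_seen, neg_keep_rate=0.1)  # back to roughly 0.02
```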

Metrics

Offline: AUC, log-loss, calibration (predicted vs actual CTR).
Online: revenue, click yield, advertiser ROI.

Pitfalls

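Calibration matters as much as ranking accuracy: bid × pCTR sets both auction order and pricing, so a miscalibrated model misprices ads even when its AUC is high.
Position bias: train with position as a feature, then serve with position fixed to a constant value for every candidate.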

🛡️

Fraud / Anomaly Detection

Detect fraudulent transactions in real time for a payments platform.

Frame it

Binary classification with extreme imbalance (fraud is often around 0.1% of transactions or less) and a hard real-time latency budget.
Cost-sensitive: a missed fraud costs far more than a false alarm.

Data & labels

Labels are delayed and noisy (chargebacks arrive weeks later).
Velocity / aggregate features: # transactions last hour, distance from last location, amount vs user baseline.
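
A velocity feature such as "# transactions last hour" can be kept per user with a sliding window. This is a minimal in-memory sketch; the class name is hypothetical, and a production system would keep this state in a streaming job or feature store.

```python
from collections import deque

class VelocityCounter:
    """Counts one user's transactions in a trailing time window."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.timestamps = deque()

    def add_and_count(self, ts):
        """Return how many prior events fall in the window, then record ts."""
        while self.timestamps and self.timestamps[0] <= ts - self.window:
            self.timestamps.popleft()  # evict events older than the window
        count = len(self.timestamps)
        self.timestamps.append(ts)
        return count

counter = VelocityCounter(window_seconds=3600)
counts = [counter.add_and_count(ts) for ts in (0, 100, 3650)]  # [0, 1, 1]
```

Counting only prior events keeps the feature identical at training and serving time, since the current transaction is never part of its own history.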

Model

Gradient-boosted trees for tabular features; graph features to catch fraud rings.
Unsupervised anomaly detection (isolation forest / autoencoder) to catch novel patterns.

Serving

Real-time scoring < 100ms inline with the transaction; rules engine for hard blocks + ML score for the gray zone.
Human review queue for borderline cases generates new labels.
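
The rules-plus-score routing can be sketched as a single decision function. The thresholds, the `country` field, and the blocked-country rule are hypothetical placeholders for whatever hard rules and cut-offs the business chooses.

```python
def decide(txn, fraud_score, block_at=0.9, review_at=0.5,
           blocked_countries=frozenset()):
    """Route a transaction to block / review / approve."""
    if txn["country"] in blocked_countries:  # hard rule, no model needed
        return "block"
    if fraud_score >= block_at:
        return "block"
    if fraud_score >= review_at:
        return "review"  # the reviewer's decision becomes a new label
    return "approve"

print(decide({"country": "US"}, fraud_score=0.62))  # review
```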

Metrics

Precision/recall at a fixed alert budget, dollars saved, false-positive rate (customer friction).
Use PR-AUC, not ROC-AUC, under heavy imbalance.
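
"Precision/recall at a fixed alert budget" means: if reviewers can handle only N alerts, flag the N highest-scoring transactions and measure on those. A minimal sketch, with an illustrative function name:

```python
def precision_recall_at_budget(scores, labels, budget):
    """Flag the `budget` highest-scoring cases; labels are 1 for fraud, 0 otherwise."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:budget]
    true_pos = sum(label for _, label in flagged)
    total_pos = sum(labels)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(precision_recall_at_budget(scores, labels, budget=2))  # (0.5, 0.5)
```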

Pitfalls

Adversaries adapt — concept drift is constant; retrain frequently.
Feedback loop: blocked transactions never get a fraud label.

🔎

Search Ranking

Build search ranking for a marketplace / document corpus.

Frame it

Retrieval then learning-to-rank. Query understanding + candidate retrieval + re-ranking.
Frame re-ranking as predicting relevance (pointwise / pairwise / listwise).

Retrieval

Lexical (BM25 / inverted index) + semantic (embedding + ANN): hybrid retrieval usually beats either alone, because lexical catches exact terms and IDs while semantic catches paraphrases.
Query expansion and spell correction up front.
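
The guide does not say how to merge lexical and semantic results. One common option, shown here as an assumption rather than the guide's method, is reciprocal rank fusion (RRF), which needs only the two ranked lists and no score normalization. The constant k=60 is the value usually quoted for RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['d1', 'd3', 'd2', 'd4']
```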

Model

LambdaMART (GBT) is still a strong LTR baseline; cross-encoder transformer for top-k re-rank.
Features: text match, popularity, freshness, personalization, click history.

Metrics

Offline: NDCG, MRR, recall@k from human-judged or click-derived labels.
Online: click-through, successful-session rate, time-to-success.

Pitfalls

Click labels are biased toward what was shown and ranked high.
Cross-encoders are accurate but slow — only re-rank a small top-k.

🤖

LLM-Powered RAG Assistant

Design a retrieval-augmented chatbot over a company's documents.

Frame it

Not a from-scratch model — orchestrate retrieval + a foundation LLM with grounding.
Goal: accurate, cited answers with low hallucination.

Pipeline

Ingest → chunk → embed → vector DB. At query time: embed query → retrieve top-k chunks → stuff into prompt → generate.
Add re-ranking of retrieved chunks and a citation step.

Key choices

Chunk size/overlap, embedding model, hybrid (keyword+vector) retrieval, context-window budget.
Guardrails: refuse when retrieval confidence is low; cite sources.
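
Chunk size and overlap are easiest to reason about with a concrete splitter. The sketch below chunks by whitespace-separated words for simplicity; real pipelines usually count tokens and respect sentence or section boundaries, and the default sizes here are arbitrary.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into word chunks of `size`, each sharing `overlap` words
    with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last chunk reached the end
            break
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

Larger overlap reduces the chance that an answer straddles a boundary, at the cost of a bigger index and more duplicated context in the prompt.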

Metrics

Retrieval: recall@k, MRR. Generation: faithfulness/groundedness, answer relevance (LLM-as-judge + human eval).
Online: thumbs-up rate, deflection rate, escalation rate.

Pitfalls

Hallucination when retrieval misses; stale index; prompt injection from documents.
Latency and cost of long contexts — cache embeddings and frequent answers.

📉

Churn Prediction

Predict which subscribers will cancel in the next 30 days.

Frame it

Binary classification over a prediction window; define churn precisely (no activity 30d? cancellation event?).
Output a risk score that triggers a retention intervention.

Features

Engagement trend (declining usage is the strongest signal), tenure, support tickets, billing events, RFM.
Build features as of a cutoff date to avoid label leakage.
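
"As of a cutoff date" means features see only events before the cutoff, while the label looks only at the window after it. A minimal sketch for one user, with hypothetical event kinds and feature names:

```python
from datetime import date, timedelta

def build_example(events, cancel_date, cutoff, window_days=30):
    """events: list of (date, kind). Features use events strictly before
    cutoff; the label covers [cutoff, cutoff + window_days)."""
    history = [(d, kind) for d, kind in events if d < cutoff]
    recent_start = cutoff - timedelta(days=window_days)
    features = {
        "sessions_last_30d": sum(
            1 for d, k in history if k == "session" and d >= recent_start
        ),
        "tickets_total": sum(1 for _, k in history if k == "ticket"),
    }
    horizon = cutoff + timedelta(days=window_days)
    label = int(cancel_date is not None and cutoff <= cancel_date < horizon)
    return features, label

events = [
    (date(2024, 1, 5), "session"),
    (date(2024, 1, 20), "ticket"),
    (date(2024, 2, 3), "session"),  # after the cutoff, so ignored by features
]
features, label = build_example(events, date(2024, 2, 10), cutoff=date(2024, 2, 1))
```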

Model

Gradient-boosted trees on tabular features; survival analysis if you care about time-to-churn.
Calibrate scores so retention budget targets the right users.

Metrics

Offline: PR-AUC, lift@decile. Business: retained revenue from interventions (measured via holdout).
Uplift modeling: target users the intervention actually changes, not just high-risk users.
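
Lift at the top decile compares the churn rate among the 10% highest-scored users with the overall churn rate. A minimal sketch, with an illustrative function name:

```python
def lift_at_top_decile(scores, labels):
    """Churn rate in the top 10% by score, divided by the overall churn rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate if base_rate else 0.0

scores = [20 - i for i in range(20)]   # already sorted, highest first
labels = [1, 1, 0, 0, 1, 1] + [0] * 14  # 4 churners out of 20
print(lift_at_top_decile(scores, labels))  # 5.0
```

A lift of 5 means the retention team reaches five times as many churners per contact as it would by picking users at random.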

Pitfalls

Leakage from features recorded after the churn decision.
Acting on the score changes future data (intervention effect).

🚗

ETA / Delivery-Time Prediction

Predict arrival time for a ride-hailing or food-delivery app.

Frame it

Regression on travel time; often predict a distribution (quantiles), not a point, to set expectations.
Decompose: route time + pickup/handoff time + queueing.

Features

Distance, historical segment speeds, time-of-day, weather, traffic, driver/restaurant state.
Real-time signals (current congestion) via streaming features.

Model

GBT baseline; graph neural nets / road-segment models for routing-aware estimates.
Quantile loss for under-promise/over-deliver behavior.

Metrics

MAE / quantile loss; % of arrivals within the promised window.
Asymmetric cost: being late is worse than being early.
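
Quantile (pinball) loss is what encodes the asymmetry. With q = 0.9, predicting too short an ETA (the user waits longer than promised) costs nine times as much as predicting too long by the same amount. A minimal sketch:

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile q in (0, 1)."""
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        diff = y - yhat
        # Under-prediction (diff > 0) is weighted q, over-prediction 1 - q.
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# Actual trip took 30 minutes.
print(pinball_loss([30], [25], q=0.9))  # 4.5  (promised 25, arrived late)
print(pinball_loss([30], [35], q=0.9))  # 0.5  (promised 35, arrived early)
```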

Pitfalls

Self-fulfilling: shown ETA changes user/driver behavior.
Long-tail events (accidents, surges) dominate user dissatisfaction.

🧹

Content Moderation

Detect harmful images/text (spam, NSFW, hate) at upload scale.

Frame it

Multi-label classification across policy categories; very high recall on severe categories.
Multi-stage: cheap filter → ML model → human review for the uncertain band.

Model

Fine-tuned vision/text transformers; multimodal for memes (image+text).
Hash-matching (PhotoDNA-style) for known-bad content before the model.

Data

Severe imbalance and shifting adversarial content; active learning on borderline cases.
Human-labeled golden set per policy with clear guidelines.

Metrics

Recall on harmful content (don't miss), precision (don't over-remove), appeal-overturn rate.
Per-category thresholds tuned to harm severity.
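
One way to tune per-category thresholds is to pick, for each policy category, the highest score cut-off that still meets a target recall on the human-labeled golden set. The sketch below assumes that approach; the function name and the example targets are hypothetical.

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold such that flagging score >= threshold
    reaches at least target_recall on the labeled set."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    needed = math.ceil(target_recall * len(positives))
    return positives[max(needed, 1) - 1]

scores = [0.95, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 1]
severe = threshold_for_recall(scores, labels, target_recall=0.75)  # 0.4
mild = threshold_for_recall(scores, labels, target_recall=0.5)     # 0.8
```

A severe category gets a high recall target and therefore a lower threshold, accepting more false positives that human review then filters.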

Pitfalls

Adversarial evasion (obfuscated text, cropped images); context matters (satire vs hate).
Reviewer well-being and consistency.

📈

Demand Forecasting

Forecast daily demand per product/store for inventory planning.

Frame it

Time-series regression at scale (thousands of series); forecast over a fixed horizon (for example the next 7-28 days) with uncertainty intervals.
Hierarchical: forecasts should reconcile across product/store/region levels.

Features

Lags, rolling stats, seasonality (weekly/yearly), holidays, promotions, price, weather.
Avoid leakage — only use information available at forecast time.

Model

Baselines: seasonal naive, ETS/ARIMA. Scale: gradient-boosted trees (LightGBM) on engineered features, or global deep models (DeepAR / Temporal Fusion Transformer).
Quantile forecasts for safety stock.

Metrics

MAPE / WAPE / pinball loss (prefer WAPE when series contain zeros, where MAPE is undefined); bias (consistent over/under-forecasting hurts inventory).
Backtest with rolling-origin evaluation, never random splits.
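
Rolling-origin evaluation moves the train/test cutoff forward through time and always trains on the past. The sketch below backtests a seasonal-naive baseline with WAPE; the function names and the weekly season are assumptions for illustration.

```python
def seasonal_naive(history, horizon, season=7):
    """Forecast by repeating the last full season of history."""
    return [history[len(history) - season + (h % season)] for h in range(horizon)]

def wape(actual, forecast):
    """Weighted absolute percentage error: sum|error| / sum|actual|."""
    denom = sum(abs(a) for a in actual)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / denom if denom else 0.0

def rolling_origin_backtest(series, horizon, n_folds, season=7):
    """Evaluate on the last n_folds windows, training only on earlier data."""
    errors = []
    for fold in range(n_folds, 0, -1):
        cutoff = len(series) - fold * horizon
        train, test = series[:cutoff], series[cutoff:cutoff + horizon]
        errors.append(wape(test, seasonal_naive(train, horizon, season)))
    return errors

series = [1, 2, 3, 4, 5, 6, 7] * 4  # perfectly weekly toy demand
print(rolling_origin_backtest(series, horizon=7, n_folds=2))  # [0.0, 0.0]
```

Any candidate model should beat this baseline on the same folds before it earns a place in the design.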

Pitfalls

Intermittent/zero-heavy demand; cold-start for new products.
Promotions and stockouts distort the historical signal.
