Framework · 10 case studies · senior rounds

ML System Design Interview Guide

The round that decides mid-level and senior offers, and the one with the least free material. A repeatable framework plus 10 worked case studies: recommendations, ranking, fraud, search, RAG, churn, ETA, moderation, and forecasting.

The 8-step framework (use it for any prompt)

  1. Clarify the problem

    Pin down the business goal, the user, scale (QPS, users, items), latency budget, and what 'success' means before touching models. Restate it back to the interviewer.

  2. Frame as an ML task

    Map the business goal to a concrete ML formulation — classification, ranking, regression, retrieval, generation — and define the label. The hardest part is often where the label comes from.

  3. Pick the metric

    Separate the offline metric (AUC, NDCG, RMSE, recall@k) from the online/business metric (CTR, revenue, retention) and name the gap between them. Always propose an A/B test.

  4. Data & labels

    Where does training data come from? Implicit (clicks) vs explicit (ratings) labels, class imbalance, label delay, and feedback loops / position bias.

  5. Features

    User, item, context, and cross features. Discuss freshness (batch vs real-time), an embedding/feature store, and avoiding train-serve skew.

  6. Model

    Start simple (logistic regression / GBT baseline), then justify going deeper (two-tower, DLRM, transformer). State the latency/accuracy trade-off explicitly.

  7. Serving & scale

    Candidate generation vs ranking, ANN retrieval, caching, batching, and a light-then-heavy multi-stage funnel (cheap models prune candidates before expensive ones score the survivors). Cover cold start and fallbacks.

  8. Monitoring & iteration

    Drift detection, guardrail metrics, online evaluation, retraining cadence, and failure modes. End by listing what you'd build next.
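
One way to make the drift-detection item in step 8 concrete is the population stability index (PSI), which compares a feature's binned distribution at training time with its distribution at serving time. The sketch below is illustrative only: the function name `psi`, the 10 equal-width bins, and the synthetic Gaussian data are assumptions, and the alerting threshold you pick (a common rule of thumb is to investigate above roughly 0.2) depends on the feature.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so serving values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # training-time sample
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # serving sample, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # serving sample, mean shifted

psi_same = psi(train, same)        # close to zero
psi_shifted = psi(train, shifted)  # clearly larger
```

In an interview it is usually enough to say which features and which model outputs you would track this way and what action an alert triggers (investigate, retrain, or roll back).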

Worked case studies

🎬

Recommendation System

Design the 'recommended for you' feed for a video/e-commerce platform with 100M+ items.

Frame it

  • Two-stage funnel: candidate generation (retrieve ~1000 from millions) then ranking (score the 1000).
  • Frame ranking as predicting P(engagement) — watch, click, purchase.

Data & labels

  • Implicit feedback: clicks, watch-time, purchases as positives; impressions-without-action as negatives.
  • Beware position bias — top items get more clicks regardless of relevance.

Model

  • Retrieval: two-tower model (user tower + item tower) producing embeddings, served with approximate nearest neighbor (ANN) search, e.g. FAISS.
  • Ranking: gradient-boosted trees or a DLRM/DCN deep model on rich features.

Serving & scale

  • Precompute item embeddings offline; compute user embedding online.
  • ANN index for sub-50ms retrieval; cache popular results; fallback to trending for cold-start users.
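
The serving path above (precomputed item embeddings, an online user embedding, retrieval, ranking, and a trending fallback) can be sketched in a few lines. This is a toy under stated assumptions: exact dot-product search stands in for an ANN index, random vectors stand in for trained embeddings, and `toy_ranker`, `recommend`, and the `trending` list are hypothetical names introduced for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, item_vecs, k):
    """Stage 1: top-k items by embedding dot product (exact search here; ANN in production)."""
    ordered = sorted(item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]), reverse=True)
    return ordered[:k]

def rank(candidates, score_fn, n):
    """Stage 2: run the expensive ranking model on the shortlist only."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]

def recommend(user_vec, item_vecs, score_fn, trending, k=100, n=10):
    if user_vec is None:  # cold-start user: no embedding yet, fall back to trending
        return trending[:n]
    return rank(retrieve(user_vec, item_vecs, k), score_fn, n)

rng = random.Random(0)
# Stand-ins for item embeddings precomputed offline.
item_vecs = {f"item{i}": [rng.uniform(-1, 1) for _ in range(8)] for i in range(1000)}
# Stand-in for a user embedding computed at request time.
user_vec = [rng.uniform(-1, 1) for _ in range(8)]
trending = ["item1", "item2", "item3"]

def toy_ranker(item_id):
    """Hypothetical placeholder for a GBT or deep ranking model."""
    return dot(user_vec, item_vecs[item_id])

feed = recommend(user_vec, item_vecs, toy_ranker, trending, k=100, n=10)
cold = recommend(None, item_vecs, toy_ranker, trending, n=2)
```

The point of the split is cost: the ranker only ever sees `k` candidates, so its per-item latency can be far higher than anything applied to the full catalog.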

Metrics

  • Offline: recall@k, NDCG. Online: CTR, watch-time, retention via A/B test.
  • Add diversity / freshness guardrails so the feed doesn't collapse to one genre.
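
For the offline metrics named above, a minimal sketch of recall@k and NDCG@k follows. The function names and the toy item ids are assumptions; NDCG here uses the linear-gain form with a log2 position discount, which is one of several conventions in use.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """gains maps item -> graded relevance; items missing from gains count as 0."""
    dcg = sum(gains.get(item, 0.0) / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "e"}           # "e" was never retrieved
r2 = recall_at_k(ranked, relevant, 2)  # only "a" is in the top 2
r4 = recall_at_k(ranked, relevant, 4)  # "a" and "d" are in the top 4

gains = {"a": 3.0, "c": 1.0}
perfect = ndcg_at_k(["a", "c", "b"], gains, 3)    # ideal ordering
imperfect = ndcg_at_k(["a", "b", "c"], gains, 3)  # "c" pushed down one slot
```

Recall@k is the natural metric for the retrieval stage (did the good items make the shortlist?), while NDCG is position-aware and fits the ranking stage.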

Pitfalls

  • Feedback loops reinforce popular items; inject exploration (epsilon-greedy / bandits).
  • Filter-bubble and stale-embedding drift; retrain regularly.
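
The exploration idea in the first pitfall can be sketched as a per-slot epsilon-greedy swap. This is a simplified illustration, not a full bandit: `epsilon_greedy_slate` and the exploration pool are hypothetical, and a production system would also log which slots were explored so that training can account for it.

```python
import random

def epsilon_greedy_slate(ranked, explore_pool, epsilon, rng):
    """With probability epsilon per slot, show a random item from the exploration pool
    instead of the ranker's choice."""
    pool = [item for item in explore_pool if item not in ranked]
    slate = []
    for item in ranked:
        if pool and rng.random() < epsilon:
            # Explore: remove the chosen item from the pool so it cannot repeat.
            slate.append(pool.pop(rng.randrange(len(pool))))
        else:
            # Exploit: keep the ranker's item in this slot.
            slate.append(item)
    return slate

rng = random.Random(0)
ranked = ["popular1", "popular2", "popular3", "popular4"]
fresh = ["new1", "new2", "new3", "new4", "new5"]
slate = epsilon_greedy_slate(ranked, fresh, epsilon=0.1, rng=rng)
```

A small epsilon trades a little short-term engagement for labels on items the ranker would otherwise never show.
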
📰

News Feed / Timeline Ranking

Rank posts in a social feed to maximize meaningful engagement.

Frame it

  • Multi-objective ranking: predict P(like), P(comment), P(share), P(hide) and combine with a weighted value model.
  • Score = w1·like + w2·comment + w3·share − w4·hide.
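
The weighted value model above is simple enough to write out directly. The weights and predicted probabilities below are made-up numbers for illustration; in practice the weights are tuned through online experiments.

```python
# Hypothetical weights: how much each predicted action is worth to the product.
WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 6.0, "hide": 20.0}

def value_score(p, weights=WEIGHTS):
    """Score = w1*P(like) + w2*P(comment) + w3*P(share) - w4*P(hide)."""
    return (weights["like"] * p["like"]
            + weights["comment"] * p["comment"]
            + weights["share"] * p["share"]
            - weights["hide"] * p["hide"])

# Made-up per-post predictions from the multi-objective model.
posts = {
    "friend_photo": {"like": 0.30, "comment": 0.02, "share": 0.01, "hide": 0.001},
    "clickbait": {"like": 0.35, "comment": 0.01, "share": 0.005, "hide": 0.05},
}
ranked = sorted(posts, key=lambda post_id: value_score(posts[post_id]), reverse=True)
```

Here the clickbait post has the higher P(like) but ranks last, because its hide probability carries a large negative weight.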

Features

  • User-author affinity, post recency, content type, historical engagement, social graph signals.
  • Real-time features (last session activity) via a feature store.

Model

  • Multi-task deep network: shared-bottom layers with per-objective heads, or MMoE (shared experts with a per-task gate) when objectives conflict.
  • Calibrated probabilities so the weighted combination is meaningful.

Metrics

  • Online: time-spent, meaningful interactions, day-N retention.
  • Guardrails: integrity (downrank misinformation), reported-content rate.

Pitfalls

  • Optimizing pure engagement can amplify clickbait/outrage — needs integrity objectives.
  • Cold-start for new users and new posts.
🎯

Ad Click-Through-Rate (CTR) Prediction

Predict the probability a user clicks an ad to drive the ad auction.

Frame it

  • Binary classification: P(click | user, ad, context). Expected value = bid × pCTR drives ranking.
  • Massive sparse categorical features (billions of feature crosses).

Model

  • Logistic regression with hashing as a baseline; then Wide & Deep, DeepFM, or DCN for feature interactions.
  • Embeddings for high-cardinality IDs (user, ad, advertiser).

Data

  • Severe class imbalance (clicks are rare ~1-5%); use negative downsampling + calibration.
  • Delayed feedback: a click may arrive minutes after the impression, so freshly logged impressions can be mislabeled as negatives.
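
Negative downsampling inflates predicted probabilities, which is why it is paired with calibration. If negatives are kept with probability w, the odds seen in training are the true odds divided by w, so a prediction p' on the sampled data maps back to p = p' / (p' + (1 - p') / w). A small sketch, with `recalibrate` and the example rates as assumptions:

```python
def recalibrate(p_sampled, keep_rate):
    """Map a probability learned on negative-downsampled data back to the full-traffic scale.

    keep_rate is the fraction of negatives retained in training (positives are all kept).
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)

true_ctr = 0.02   # assumed true click rate
keep_rate = 0.1   # keep 10% of non-clicks

# Click rate the model sees after downsampling negatives.
sampled_ctr = true_ctr / (true_ctr + (1.0 - true_ctr) * keep_rate)
recovered = recalibrate(sampled_ctr, keep_rate)
```

With these numbers the sampled data shows a click rate of about 17%, and the correction recovers the original 2%.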

Metrics

  • Offline: AUC, log-loss, calibration (predicted vs actual CTR).
  • Online: revenue, click yield, advertiser ROI.

Pitfalls

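  • Calibration matters, not just ranking accuracy: pCTR is multiplied by the bid in the auction, so a miscalibrated model misprices ads even when its ordering is right.
  • Position bias: train with display position as a feature, then serve with position fixed to one default value so scores are comparable across ads.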
🛡️

Fraud / Anomaly Detection

Detect fraudulent transactions in real time for a payments platform.

Frame it

  • Binary classification with extreme imbalance (fraud << 0.1%) and a hard real-time latency budget.
  • Cost-sensitive: a missed fraud costs far more than a false alarm.

Data & labels

  • Labels are delayed and noisy (chargebacks arrive weeks later).
  • Velocity / aggregate features: # transactions last hour, distance from last location, amount vs user baseline.
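
The velocity and baseline features above can be sketched with a sliding window per user. This assumes events arrive in timestamp order and fit in memory; `VelocityCounter` and `amount_ratio` are hypothetical names, and a real system would typically back this with a streaming feature store.

```python
from collections import deque

class VelocityCounter:
    """Counts each user's transactions inside a sliding time window."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = {}  # user_id -> deque of timestamps in seconds

    def observe(self, user_id, ts):
        """Return the count in the window before this transaction, then record it."""
        q = self.events.setdefault(user_id, deque())
        # Evict timestamps that fell out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        count_before = len(q)
        q.append(ts)
        return count_before

def amount_ratio(amount, baseline_mean):
    """Transaction amount relative to the user's historical average."""
    return amount / baseline_mean if baseline_mean > 0 else 0.0

counter = VelocityCounter(window_seconds=3600)
counts = [counter.observe("u1", ts) for ts in (0, 600, 1200, 4000)]
ratio = amount_ratio(900.0, 45.0)  # 20x the user's usual spend
```

Returning the count before recording the current transaction keeps the feature free of the event it is scoring.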

Model

  • Gradient-boosted trees for tabular features; graph features to catch fraud rings.
  • Unsupervised anomaly detection (isolation forest / autoencoder) to catch novel patterns.

Serving

  • Real-time scoring < 100ms inline with the transaction; rules engine for hard blocks + ML score for the gray zone.
  • Human review queue for borderline cases generates new labels.

Metrics

  • Precision/recall at a fixed alert budget, dollars saved, false-positive rate (customer friction).
  • Use PR-AUC, not ROC-AUC, under heavy imbalance.
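
"Precision/recall at a fixed alert budget" means: the review team can handle N alerts, so alert on the N highest scores and measure what that buys. A minimal sketch with made-up scores and labels:

```python
def precision_recall_at_budget(scores, labels, budget):
    """Alert on the `budget` highest-scoring transactions and measure the result."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    alerted = order[:budget]
    true_pos = sum(labels[i] for i in alerted)
    total_fraud = sum(labels)
    precision = true_pos / len(alerted) if alerted else 0.0
    recall = true_pos / total_fraud if total_fraud else 0.0
    return precision, recall

# Made-up model scores and fraud labels (1 = fraud).
scores = [0.95, 0.10, 0.80, 0.30, 0.60, 0.05, 0.20, 0.70]
labels = [1, 0, 0, 0, 1, 0, 1, 0]

p3, r3 = precision_recall_at_budget(scores, labels, budget=3)
p4, r4 = precision_recall_at_budget(scores, labels, budget=4)
```

Sweeping the budget traces out the precision-recall curve, which is why PR-AUC summarizes this setting better than ROC-AUC.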

Pitfalls

  • Adversaries adapt — concept drift is constant; retrain frequently.
  • Feedback loop: blocked transactions never get a fraud label.
🤖

LLM-Powered RAG Assistant

Design a retrieval-augmented chatbot over a company's documents.

Frame it

  • Not a from-scratch model — orchestrate retrieval + a foundation LLM with grounding.
  • Goal: accurate, cited answers with low hallucination.

Pipeline

  • Ingest → chunk → embed → vector DB. At query time: embed query → retrieve top-k chunks → stuff into prompt → generate.
  • Add re-ranking of retrieved chunks and a citation step.

Key choices

  • Chunk size/overlap, embedding model, hybrid (keyword+vector) retrieval, context-window budget.
  • Guardrails: refuse when retrieval confidence is low; cite sources.
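
The pipeline and the low-confidence guardrail can be sketched end to end without any external service. Everything here is a toy stand-in: `embed` is a bag-of-words counter rather than a real embedding model, the in-memory list replaces a vector DB, the `min_score` threshold is arbitrary, and the function stops at building the prompt instead of calling an LLM.

```python
import math
import re
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words (requires size > overlap)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs):
    """docs: {doc_id: text} -> list of (doc_id, chunk_text, vector)."""
    return [(doc_id, c, embed(c)) for doc_id, text in docs.items() for c in chunk(text)]

def answer_prompt(query, index, k=2, min_score=0.2):
    q = embed(query)
    hits = sorted(((cosine(q, vec), doc_id, text) for doc_id, text, vec in index), reverse=True)[:k]
    if not hits or hits[0][0] < min_score:
        return None  # guardrail: refuse rather than answer without grounding
    context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in hits)
    return f"Answer using only the sources below and cite them by id.\n{context}\n\nQuestion: {query}"

docs = {
    "vacation-policy": "Employees accrue 20 days of paid vacation per year. "
                       "Unused vacation days carry over up to a maximum of 5 days.",
    "expense-policy": "Submit expense reports within 30 days. "
                      "Meals are reimbursed up to 50 dollars per day while traveling.",
}
index = build_index(docs)
prompt = answer_prompt("How many vacation days do employees accrue per year?", index)
refused = answer_prompt("What is the wifi password?", index)  # nothing relevant indexed
```

Tagging each chunk with its document id in the prompt is what makes the citation step possible; a re-ranker would slot in between retrieval and prompt assembly.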

Metrics

  • Retrieval: recall@k, MRR. Generation: faithfulness/groundedness, answer relevance (LLM-as-judge + human eval).
  • Online: thumbs-up rate, deflection rate, escalation rate.

Pitfalls

  • Hallucination when retrieval misses; stale index; prompt injection from documents.
  • Latency and cost of long contexts — cache embeddings and frequent answers.
📉

Churn Prediction

Predict which subscribers will cancel in the next 30 days.

Frame it

  • Binary classification over a prediction window; define churn precisely (no activity 30d? cancellation event?).
  • Output a risk score that triggers a retention intervention.

Features

  • Engagement trend (declining usage is often the strongest signal), tenure, support tickets, billing events, RFM (recency, frequency, monetary value).
  • Build features as of a cutoff date to avoid label leakage.
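
Building features "as of a cutoff date" means every feature reads only events before the cutoff, and the label reads only the window after it. A minimal sketch, assuming a flat list of session events and a dict of cancellation dates (both hypothetical data shapes):

```python
from datetime import date, timedelta

def build_example(events, cancellations, user_id, cutoff, window_days=30):
    """Features use only events before `cutoff`; the label covers the window starting at `cutoff`."""
    history = [e for e in events if e["user"] == user_id and e["day"] < cutoff]
    d30, d60 = cutoff - timedelta(days=30), cutoff - timedelta(days=60)
    recent = [e for e in history if e["day"] >= d30]
    prior = [e for e in history if d60 <= e["day"] < d30]
    features = {
        "sessions_last_30d": len(recent),
        "sessions_prior_30d": len(prior),
        "usage_trend": len(recent) - len(prior),  # negative means declining engagement
    }
    cancel_day = cancellations.get(user_id)
    label = int(cancel_day is not None
                and cutoff <= cancel_day < cutoff + timedelta(days=window_days))
    return features, label

events = [
    {"user": "u1", "day": date(2024, 1, 5)},
    {"user": "u1", "day": date(2024, 1, 10)},
    {"user": "u1", "day": date(2024, 1, 20)},
    {"user": "u1", "day": date(2024, 2, 10)},
    {"user": "u1", "day": date(2024, 3, 5)},   # after the cutoff: must not leak into features
]
cancellations = {"u1": date(2024, 3, 15)}
features, label = build_example(events, cancellations, "u1", cutoff=date(2024, 3, 1))
```

Generating training rows at several historical cutoffs with this function gives the model the same view of the world it will have at prediction time.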

Model

  • Gradient-boosted trees on tabular features; survival analysis if you care about time-to-churn.
  • Calibrate scores so retention budget targets the right users.

Metrics

  • Offline: PR-AUC, lift@decile. Business: retained revenue from interventions (measured via holdout).
  • Uplift modeling: target users the intervention actually changes, not just high-risk users.
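
Lift at the top decile answers a practical question: if retention budget only covers the riskiest 10% of users, how much more churn does that group contain than a random 10% would? A small sketch with synthetic scores and labels:

```python
def lift_at_top_decile(scores, labels):
    """Churn rate among the top 10% of scores divided by the overall churn rate."""
    n_top = max(1, len(scores) // 10)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_rate = sum(labels[i] for i in order[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate if base_rate else 0.0

# Synthetic example: 20 users, 4 churners, 2 of them in the top-scoring decile.
scores = [i / 20 for i in range(20)]
labels = [0] * 20
for churner in (2, 5, 18, 19):
    labels[churner] = 1

lift = lift_at_top_decile(scores, labels)  # top decile churns at 100% vs 20% overall
```

Lift measures risk ranking only. It says nothing about whether the intervention changes the outcome, which is the gap uplift modeling addresses.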

Pitfalls

  • Leakage from features recorded after the churn decision.
  • Acting on the score changes future data (intervention effect).
🚗

ETA / Delivery-Time Prediction

Predict arrival time for a ride-hailing or food-delivery app.

Frame it

  • Regression on travel time; often predict a distribution (quantiles), not a point, to set expectations.
  • Decompose: route time + pickup/handoff time + queueing.

Features

  • Distance, historical segment speeds, time-of-day, weather, traffic, driver/restaurant state.
  • Real-time signals (current congestion) via streaming features.

Model

  • GBT baseline; graph neural nets / road-segment models for routing-aware estimates.
  • Quantile loss for under-promise/over-deliver behavior.
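
Quantile (pinball) loss is what produces the under-promise behavior: at a high quantile level, predicting too early is penalized much more than predicting too late. A minimal sketch, with the 30-minute trip as a made-up example:

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs q per unit; over-prediction costs 1 - q.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

actual_minutes = [30.0]
too_optimistic = pinball_loss(actual_minutes, [25.0], q=0.9)  # promised 25, took 30
too_cautious = pinball_loss(actual_minutes, [35.0], q=0.9)    # promised 35, took 30
```

At q = 0.9 the optimistic miss costs nine times the cautious one for the same 5-minute error, so a model trained on this loss learns to quote times that hold about 90% of the time.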

Metrics

  • MAE / quantile loss; % of arrivals within the promised window.
  • Asymmetric cost: being late is worse than being early.

Pitfalls

  • Self-fulfilling: shown ETA changes user/driver behavior.
  • Long-tail events (accidents, surges) dominate user dissatisfaction.
🧹

Content Moderation

Detect harmful images/text (spam, NSFW, hate) at upload scale.

Frame it

  • Multi-label classification across policy categories; very high recall on severe categories.
  • Multi-stage: cheap filter → ML model → human review for the uncertain band.

Model

  • Fine-tuned vision/text transformers; multimodal for memes (image+text).
  • Hash-matching (PhotoDNA-style) for known-bad content before the model.

Data

  • Severe imbalance and shifting adversarial content; active learning on borderline cases.
  • Human-labeled golden set per policy with clear guidelines.

Metrics

  • Recall on harmful content (don't miss), precision (don't over-remove), appeal-overturn rate.
  • Per-category thresholds tuned to harm severity.
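
Per-category thresholds and the human-review band can be expressed as a small routing function. The categories come from the prompt above, but the threshold values are invented for illustration; in practice they are tuned per policy against the golden set.

```python
# Hypothetical thresholds per category: (send_to_review_at, auto_remove_at).
THRESHOLDS = {
    "spam": (0.70, 0.95),
    "nsfw": (0.50, 0.90),
    "hate": (0.30, 0.85),  # lower review threshold: favor recall on severe harm
}

def route(scores, thresholds=THRESHOLDS):
    """Return the strictest action triggered by any category score."""
    action = "allow"
    for category, score in scores.items():
        review_at, remove_at = thresholds[category]
        if score >= remove_at:
            return "remove"          # confident violation: act automatically
        if score >= review_at:
            action = "human_review"  # uncertain band: queue for a reviewer
    return action

decisions = [
    route({"spam": 0.10, "nsfw": 0.20, "hate": 0.10}),
    route({"spam": 0.10, "nsfw": 0.20, "hate": 0.40}),
    route({"spam": 0.97, "nsfw": 0.05, "hate": 0.40}),
]
```

The width of the review band is the lever that trades reviewer workload against automated mistakes, and appeal-overturn rate is the feedback signal for moving it.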

Pitfalls

  • Adversarial evasion (obfuscated text, cropped images); context matters (satire vs hate).
  • Reviewer well-being and consistency.
📈

Demand Forecasting

Forecast daily demand per product/store for inventory planning.

Frame it

  • Time-series regression at scale (thousands of series); forecast over a fixed horizon and report uncertainty, not just a point estimate.
  • Hierarchical: forecasts should reconcile across product/store/region levels.

Features

  • Lags, rolling stats, seasonality (weekly/yearly), holidays, promotions, price, weather.
  • Avoid leakage — only use information available at forecast time.
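
A leakage-safe way to build lag and rolling features is to index everything strictly before the forecast position. The sketch below assumes a plain list of daily values with no gaps; `lag_features` and the choice of lags are illustrative.

```python
def lag_features(series, t, lags=(1, 7), window=7):
    """Features for forecasting position t, using only values strictly before index t."""
    if t < max(max(lags), window):
        raise ValueError("not enough history before t")
    feats = {f"lag_{k}": series[t - k] for k in lags}
    past = series[t - window:t]  # slice ends at t, so series[t] itself is never read
    feats[f"rolling_mean_{window}"] = sum(past) / window
    feats["day_of_week"] = t % 7  # assumes index 0 falls on the first day of the week
    return feats

# Two weeks of made-up daily demand.
series = [10, 12, 11, 13, 15, 20, 22,
          11, 13, 12, 14, 16, 21, 23]
feats = lag_features(series, t=14)  # features for the first unseen day
```

A quick sanity check for leakage is to append or change values at index t and later, then confirm the features do not move.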

Model

  • Baselines: seasonal naive, ETS/ARIMA. Scale: gradient-boosted trees (LightGBM) on engineered features, or global deep models (DeepAR / Temporal Fusion Transformer).
  • Quantile forecasts for safety stock.

Metrics

  • MAPE / WAPE / pinball loss; bias (consistent over/under-forecasting hurts inventory).
  • Backtest with rolling-origin evaluation, never random splits.
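
Rolling-origin evaluation, WAPE, and bias fit together as follows. This sketch uses the seasonal-naive baseline as the forecaster and a synthetic weekly series with an upward trend; all function names are illustrative, and `wape` assumes the actuals in each fold are not all zero.

```python
def seasonal_naive(history, horizon, season=7):
    """Baseline: repeat the last full season of observations."""
    last = history[-season:]
    return [last[h % season] for h in range(horizon)]

def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error over total demand."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def bias(actual, forecast):
    """Mean forecast error; positive means over-forecasting on average."""
    return sum(f - a for a, f in zip(actual, forecast)) / len(actual)

def rolling_origin(series, forecaster, first_origin, horizon, step):
    """Forecast at successive cutoffs; each fold sees only data before its cutoff."""
    results = []
    for origin in range(first_origin, len(series) - horizon + 1, step):
        history = series[:origin]
        actual = series[origin:origin + horizon]
        forecast = forecaster(history, horizon)
        results.append((origin, wape(actual, forecast), bias(actual, forecast)))
    return results

# Synthetic demand: a weekly pattern that grows by 2 units each week.
pattern = [10, 12, 11, 13, 15, 20, 22]
series = [v + 2 * week for week in range(4) for v in pattern]

folds = rolling_origin(series, seasonal_naive, first_origin=14, horizon=7, step=7)
```

On this series the baseline under-forecasts by 2 units in every fold, which shows up as a consistently negative bias even though WAPE looks modest. That is the pattern to flag for inventory planning.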

Pitfalls

  • Intermittent/zero-heavy demand; cold-start for new products.
  • Promotions and stockouts distort the historical signal.