19 algorithms · Layman + in-depth · Animated visuals · Interview Q&A

Machine Learning Algorithms — Explained Simply

Every major ML algorithm in plain English first, then the real depth — with an animated visual, a worked example, the questions interviewers actually ask, and a project idea to prove you can build it. Grouped by Supervised, Unsupervised, Reinforcement, and Deep Learning.

🎯

Supervised Learning Learn from labelled examples

You give the model inputs AND the correct answers (labels). It learns the mapping so it can predict the answer for new, unseen inputs. Two flavours: regression (predict a number) and classification (predict a category).

📈

Linear Regression

Regression

Draw the best straight line through your data.

🟢 In simple words

Imagine plotting house size against price on a graph. Linear regression finds the single straight line that best passes through all the dots — so for any new house size you can read its likely price off the line.

🔬 How it actually works

It assumes output = w·x + b. Training finds the weights (w) and bias (b) that minimise the mean squared error — the average squared gap between predicted and actual values — usually via gradient descent or a closed-form solution. With many inputs the 'line' becomes a flat plane in higher dimensions.

💡 Real example

Predicting a flat's rent from area, number of bedrooms, and distance to the metro. Each feature gets a weight; their weighted sum is the predicted rent.

🎤 Interview Q&A20 questions
What is linear regression?

A supervised algorithm that models a continuous target as a weighted linear combination of input features plus a bias term: y = w·x + b.

What loss function does it minimise?

Mean Squared Error (MSE) — the average of squared differences between predicted and actual values.

Why square the errors instead of taking absolute values?

Squaring penalises large errors more, is differentiable everywhere (easy gradients), and yields a convex loss with a unique closed-form solution.

What is the closed-form (normal equation) solution?

w = (XᵀX)⁻¹Xᵀy. It solves for weights directly without iteration, but inverting XᵀX is O(n³) and fails if features are collinear.

When would you use gradient descent over the normal equation?

When the feature count is large (inversion too costly) or the data doesn't fit in memory — gradient descent scales better.

What are the key assumptions?

Linearity, independence of errors, homoscedasticity (constant error variance), normally distributed residuals, and little multicollinearity.

What is multicollinearity and why is it a problem?

When features are highly correlated, making coefficient estimates unstable and hard to interpret. Detect it with the Variance Inflation Factor (VIF).

What does R² measure?

The proportion of variance in the target explained by the model, from 0 to 1 (can be negative for a bad model).

Why use adjusted R² instead of R²?

R² never decreases when you add features; adjusted R² penalises extra features, so it only rises if the new feature genuinely helps.

What is the difference between Ridge and Lasso regression?

Ridge adds an L2 penalty (shrinks weights toward zero); Lasso adds an L1 penalty that can drive weights exactly to zero, performing feature selection.

What does the L1 penalty do that L2 doesn't?

L1 produces sparse solutions — it zeros out unimportant features — whereas L2 only shrinks them.

What is the bias-variance trade-off here?

A simple linear model has high bias, low variance; adding polynomial terms lowers bias but raises variance and overfitting risk.

How do you handle non-linear relationships?

Add polynomial or interaction features, transform variables (log, sqrt), or switch to a non-linear model.

Why is feature scaling sometimes needed?

Not for the closed-form solution, but gradient descent converges faster and regularisation penalties are fairer when features are on the same scale.

What is heteroscedasticity and how do you detect it?

Non-constant error variance across the range of predictions. Spot it with a residuals-vs-fitted plot showing a fan/cone shape.

When is MAE preferred over MSE?

When outliers shouldn't dominate the loss — MAE treats all errors linearly, so it's more robust to outliers.

How do you interpret a coefficient?

It's the expected change in the target for a one-unit increase in that feature, holding all others constant.

Can linear regression be used for classification?

Not well — it outputs unbounded values and is sensitive to outliers; logistic regression is the right tool for classification.

What happens if XᵀX is not invertible?

It means perfect multicollinearity or more features than samples; use regularisation (Ridge) or the pseudo-inverse instead.

How do you detect and handle outliers?

Use residual plots, Cook's distance, or leverage scores; then cap, remove, or switch to a robust regression / MAE loss.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

House / rent price predictor. Train a regression model and explain each feature's weight; ship a small predict form.

📂 Dataset: Kaggle 'House Prices — Advanced Regression Techniques'

🚦

Logistic Regression

Classification

Linear regression bent into a yes/no probability.

🟢 In simple words

Despite the name, it answers yes/no questions — like 'will this email be spam?'. It squashes a straight-line score through an S-shaped curve so the output is a probability between 0 and 1, then picks a cut-off (usually 0.5).

🔬 How it actually works

It computes w·x + b, then passes it through the sigmoid function to get a probability. Training minimises log-loss (cross-entropy). The decision boundary is still linear; the sigmoid only maps scores to probabilities.

💡 Real example

Predicting whether a customer will churn (yes/no) from their usage, tenure, and complaints — the model outputs a churn probability you can rank customers by.

🎤 Interview Q&A20 questions
What is logistic regression?

A linear classifier that models the probability of a class by passing a linear score through the sigmoid (logistic) function.

Why is it called 'regression' if it classifies?

It regresses the log-odds (logit) of the outcome linearly on the features; the probability mapping then enables classification.

What does the sigmoid function do?

It squashes any real number into (0, 1), turning the linear score into a probability.

What loss function is used?

Log-loss (binary cross-entropy), which is convex and penalises confident wrong predictions heavily.

Why not use MSE for logistic regression?

MSE with the sigmoid is non-convex (many local minima) and produces weak gradients; log-loss is convex and trains reliably.

What are odds and log-odds?

Odds = p/(1-p); log-odds (logit) is its natural log. Logistic regression is linear in the log-odds.

How do you interpret a coefficient?

exp(coefficient) is the multiplicative change in odds for a one-unit increase in that feature.

How do you extend it to more than two classes?

Use softmax (multinomial logistic regression) or one-vs-rest, training one binary classifier per class.

What is the decision boundary shape?

Linear in feature space — a line/plane — unless you add polynomial or interaction features.

How do you handle class imbalance?

Use class weights, resampling (SMOTE/undersampling), adjust the decision threshold, and evaluate with precision/recall or PR-AUC rather than accuracy.

What does the decision threshold control?

The probability cut-off for the positive class; lowering it raises recall at the cost of precision, and vice versa.

What is regularisation in logistic regression?

An L1 or L2 penalty on the weights to prevent overfitting; C in scikit-learn is the inverse of regularisation strength.

What is the ROC curve and AUC?

ROC plots true-positive vs. false-positive rate across thresholds; AUC is the area under it — the probability a random positive outranks a random negative.

When is PR-AUC better than ROC-AUC?

On highly imbalanced data, where ROC-AUC can look optimistic and precision-recall better reflects positive-class performance.

Does logistic regression need feature scaling?

It helps gradient-based solvers converge and makes regularisation fair, though it's not strictly required for correctness.

What is the difference between precision and recall?

Precision = of predicted positives, how many are correct; recall = of actual positives, how many were caught.

What assumptions does it make?

Linearity between features and log-odds, independent observations, little multicollinearity, and (ideally) large sample size.

How does it handle perfectly separable data?

Weights diverge to infinity (the likelihood keeps improving); regularisation prevents this.

Why is it a popular baseline?

It's fast, interpretable, outputs calibrated probabilities, and is hard to beat on linearly separable or text problems.

What is the F1 score and when do you use it?

The harmonic mean of precision and recall; use it when you need a single metric balancing both, especially under imbalance.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Churn or spam classifier. Build a binary classifier, tune the threshold, and report precision/recall + ROC-AUC.

📂 Dataset: Telco Customer Churn (Kaggle) or SMS Spam Collection

👥

K-Nearest Neighbors (KNN)

Classification

You are the average of your closest neighbours.

🟢 In simple words

To classify a new point, look at the K closest known points and take a vote. If 4 of your 5 nearest neighbours are 'cats', you're probably a cat too. There's no real 'training' — it just memorises the data.

🔬 How it actually works

For a query point, compute distance (usually Euclidean) to every training point, take the K nearest, and predict the majority class (or average value for regression). K is the key hyperparameter: small K = noisy, large K = oversmoothed.

💡 Real example

Recommending a movie rating by finding the K users most similar to you and averaging what they rated it.

🎤 Interview Q&A20 questions
What is K-Nearest Neighbors?

A non-parametric algorithm that predicts a point's label from the majority vote (or average) of its K closest training points.

Why is KNN called a 'lazy learner'?

It does no real training — it just stores the data and defers all computation to prediction time.

How do you choose K?

By cross-validation; small K is noisy/overfits, large K oversmooths. Odd K avoids ties in binary classification.

Why is feature scaling essential for KNN?

Distances are dominated by large-range features; without scaling, a feature in thousands swamps one in fractions.

What distance metrics can KNN use?

Euclidean (most common), Manhattan, Minkowski, cosine, or Hamming for categorical data.

What is the curse of dimensionality for KNN?

In high dimensions all points become roughly equidistant, so 'nearest' loses meaning and accuracy degrades.

What is the time complexity of prediction?

O(n·d) per query for brute force (n samples, d features) — slow on large datasets.

How can you speed up KNN search?

Use KD-trees or Ball-trees for low dimensions, or approximate nearest-neighbour libraries (FAISS, Annoy) for high dimensions.

Can KNN do regression?

Yes — it averages (or distance-weights) the target values of the K nearest neighbours.

What is distance-weighted KNN?

Closer neighbours get larger votes (weight ∝ 1/distance), reducing sensitivity to the exact choice of K.

How does KNN handle categorical features?

Encode them (one-hot) and use a suitable metric like Hamming, since Euclidean distance is meaningless on raw categories.

Is KNN sensitive to outliers?

Yes for small K; a single noisy neighbour can flip the prediction. Larger K or distance weighting mitigates it.

What is the bias-variance behaviour vs. K?

Small K = low bias, high variance; large K = high bias, low variance.

How does KNN handle imbalanced classes?

Poorly — the majority class dominates votes. Use distance weighting, resampling, or class-balanced voting.

Does KNN have a training phase at all?

Only storing the data (and optionally building a search index); there are no learned parameters.

What are the main pros of KNN?

Simple, no training, naturally handles multi-class, and adapts to complex decision boundaries.

What are the main cons?

Slow at prediction, memory-heavy, needs scaling, and suffers in high dimensions.

How do you handle missing values for KNN?

Impute them first (KNN itself is even used as an imputer), since distance can't be computed with gaps.

What happens with K = 1?

The model perfectly fits training data (zero training error) but is highly sensitive to noise — classic overfitting.

What is KNN imputation?

Filling a missing value using the average/most-common value among the sample's nearest neighbours.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Handwritten digit recognizer. Classify digits with KNN; visualise misclassified examples and tune K.

📂 Dataset: MNIST or sklearn 'digits'

📨

Naive Bayes

Classification

Bayes' theorem with a bold independence shortcut.

🟢 In simple words

It asks: 'given these words, what's the probability this is spam?' It multiplies the probability of each clue separately — naively assuming the clues don't affect each other. That shortcut is often wrong but works surprisingly well, especially for text.

🔬 How it actually works

Uses Bayes' theorem: P(class | features) ∝ P(class) × ∏ P(featureᵢ | class). The 'naive' part is assuming features are conditionally independent. Variants: Multinomial (word counts), Bernoulli (presence), Gaussian (continuous features).

💡 Real example

Classic email spam filtering — each word contributes a likelihood, and the product decides spam vs. not-spam in milliseconds.

🎤 Interview Q&A20 questions
What is Naive Bayes?

A probabilistic classifier applying Bayes' theorem with the 'naive' assumption that features are conditionally independent given the class.

State Bayes' theorem.

P(class|features) = P(features|class)·P(class) / P(features); Naive Bayes picks the class maximising the numerator.

Why is it called 'naive'?

Because it assumes all features are independent given the class — usually false, but it still works surprisingly well.

What is Laplace (additive) smoothing?

Adding a small count (e.g. 1) to every feature-class tally so an unseen feature doesn't make the whole probability zero.

What's the zero-frequency problem?

If a feature value never appeared with a class in training, its probability is 0, zeroing the entire product — fixed by smoothing.

What are the main variants?

Multinomial (counts, e.g. word frequencies), Bernoulli (binary presence/absence), and Gaussian (continuous features).

Which variant suits text classification?

Multinomial NB with word counts or TF-IDF; Bernoulli works when only word presence matters.

Why is Naive Bayes fast?

Training is a single pass counting frequencies; prediction is just multiplying probabilities — both linear in data size.

Why compute with log-probabilities?

Multiplying many small probabilities underflows to zero; summing their logs is numerically stable.

Does it need a lot of data?

No — it performs well even on small datasets because it estimates few, simple parameters.

What does Gaussian Naive Bayes assume?

That each continuous feature is normally distributed within each class, estimated by its per-class mean and variance.

Is Naive Bayes generative or discriminative?

Generative — it models P(features|class) and the class prior, unlike discriminative logistic regression.

When does the independence assumption hurt most?

When features are strongly correlated, causing over-confident (poorly calibrated) probability estimates.

Are its probability outputs well-calibrated?

Often not — the independence assumption pushes probabilities toward 0 or 1, though the ranking/classification can still be good.

How does it handle continuous and categorical features together?

Use the matching likelihood per feature type (Gaussian for continuous, multinomial/Bernoulli for categorical) and combine.

Naive Bayes vs. logistic regression?

NB (generative) trains faster and needs less data; logistic regression (discriminative) usually wins with more data and correlated features.

What are classic use cases?

Spam filtering, sentiment analysis, document/topic classification — high-dimensional text problems.

What is the prior P(class)?

The base rate of each class, typically estimated as its frequency in the training set.

How do you handle continuous features without assuming Gaussian?

Discretise them into bins, or use kernel density estimation for the per-class likelihood.

Why does it work despite a false assumption?

Classification only needs the correct class to have the highest score, not exact probabilities — so independence errors often cancel out.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

News topic / sentiment classifier. Build a text classifier with TF-IDF + Multinomial NB; compare against logistic regression.

📂 Dataset: 20 Newsgroups or IMDB reviews

🧲

Support Vector Machine (SVM)

Classification

Find the widest possible street between two classes.

🟢 In simple words

It draws the boundary that leaves the biggest gap between the two groups — like a road with the classes on either side. The closest points (the 'support vectors') hold the boundary in place. A 'kernel' trick lets it curve the boundary for messy data.

🔬 How it actually works

Maximises the margin between classes, defined only by the support vectors. The kernel trick (RBF, polynomial) implicitly maps data to higher dimensions to make non-linearly-separable data separable. C controls margin-width vs. misclassification trade-off.

💡 Real example

Classifying images as cat vs. dog from extracted features, or text sentiment, where a clear maximum-margin boundary generalises well.

🎤 Interview Q&A20 questions
What is a Support Vector Machine?

A classifier that finds the hyperplane maximising the margin — the distance to the nearest points of each class.

What are support vectors?

The training points lying on or inside the margin; they alone define the boundary — other points don't matter.

What is the margin?

The gap between the decision boundary and the closest points; SVM maximises it for better generalisation.

What is the kernel trick?

Computing dot products in a high-dimensional space implicitly, letting SVM build non-linear boundaries without explicit transformation.

Name common kernels.

Linear, polynomial, RBF (Gaussian), and sigmoid; RBF is the popular default for non-linear data.

What does the C hyperparameter control?

The trade-off between a wide margin and misclassifications; large C = low tolerance for errors (risk overfitting), small C = wider, softer margin.

What does gamma control in the RBF kernel?

The reach of a single training point; high gamma = tight, wiggly boundaries (overfit), low gamma = smooth, broad boundaries.

Hard-margin vs. soft-margin SVM?

Hard-margin allows no misclassification (only for separable data); soft-margin permits some via slack variables, controlled by C.

What are slack variables?

Per-point penalties that allow some points to violate the margin, enabling SVM to handle non-separable data.

Why is feature scaling important for SVM?

Distance- and dot-product-based kernels are dominated by large-scale features, so standardising is essential.

Is SVM affected by class imbalance?

Yes; use class weights (class_weight='balanced') so the minority class isn't ignored.

How does SVM handle multi-class problems?

It's inherently binary, so it uses one-vs-one or one-vs-rest schemes to combine multiple binary SVMs.

What is the hinge loss?

max(0, 1 - y·f(x)) — it penalises points inside the margin or misclassified, and is the loss SVM minimises.

Why does SVM scale poorly to huge datasets?

Training is roughly O(n²)–O(n³) in samples; use LinearSVC or SGD-based approximations for large data.

What is SVR (Support Vector Regression)?

The regression variant fitting a tube of width epsilon around the data, penalising only points outside it.

What problem does the dual formulation solve?

It expresses the optimisation in terms of dot products, enabling the kernel trick and depending only on support vectors.

When would you pick a linear kernel?

For high-dimensional, sparse data like text, where data is often already linearly separable and linear SVM is fast.

How do you tune C and gamma?

Grid or randomised search with cross-validation, often over a log scale.

Pros and cons of SVM?

Pros: strong in high dimensions, memory-efficient (uses support vectors), flexible kernels. Cons: slow on big data, sensitive to scaling and C/gamma, no direct probabilities.

How does SVM produce probabilities?

Not natively — it fits a logistic calibration (Platt scaling) on the decision scores via cross-validation.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Image / sentiment classifier with SVM. Train an RBF-kernel SVM; visualise the decision boundary and tune C and gamma.

📂 Dataset: sklearn 'digits' or a small image feature set

🌳

Decision Tree

Tree-based

A flowchart of yes/no questions.

🟢 In simple words

Like the game 'twenty questions'. It asks the most informative yes/no question first ('income > 50k?'), splits the data, then keeps asking until each group is pure enough to make a prediction. Easy to read top to bottom.

🔬 How it actually works

At each node it picks the split that best reduces impurity — Gini or entropy (information gain) for classification, variance for regression. It splits recursively until a stopping rule (max depth, min samples). Prone to overfitting, so it's pruned or depth-limited.

💡 Real example

A loan-approval rule: 'income > 50k → credit score > 700 → approve'. Stakeholders love it because every decision is explainable.

🎤 Interview Q&A20 questions
What is a decision tree?

A model that splits data with a sequence of feature-threshold questions, forming a tree whose leaves give predictions.

How does it choose splits for classification?

By the split that most reduces impurity — measured by Gini impurity or entropy (information gain).

What is Gini impurity?

The probability of misclassifying a random sample if labelled by the node's class distribution; 0 means pure.

What is entropy and information gain?

Entropy measures node disorder; information gain is the entropy reduction from a split — the tree maximises it.

Gini vs. entropy — practical difference?

They usually give similar trees; Gini is slightly faster (no logs), entropy can be marginally more balanced.

How does a tree split a continuous feature?

It sorts values and tests thresholds between them, choosing the cut-off that best reduces impurity.

How do regression trees choose splits?

By minimising variance (or MSE) of the target within the resulting child nodes.

Why do decision trees overfit?

Unconstrained, they grow until every leaf is pure — memorising noise. They're low-bias, high-variance.

How do you prevent overfitting?

Limit max_depth, min_samples_split/leaf, max_leaf_nodes, or post-prune with cost-complexity pruning.

What is pruning?

Removing branches that add little predictive value to simplify the tree and improve generalisation.

What is cost-complexity (ccp) pruning?

Pruning that penalises tree size via a parameter alpha, trading training fit against complexity.

Do decision trees need feature scaling?

No — splits are based on thresholds per feature, so monotonic scaling doesn't change them.

How does a tree handle missing values?

Via surrogate splits or by sending missing samples down a default branch; some implementations learn the best direction.

How is feature importance computed?

By the total impurity reduction each feature contributes across all its splits, normalised to sum to 1.

Are decision trees interpretable?

Yes — you can read the path of yes/no rules from root to leaf, which stakeholders value.

What is a greedy split and its limitation?

The tree picks the locally best split at each node, which may not yield the globally optimal tree.

Can a single tree capture feature interactions?

Yes — successive splits on different features naturally model interactions along a path.

How do trees handle categorical features?

By partitioning categories into subsets; scikit-learn requires encoding, while some libraries handle them natively.

Why are trees high-variance?

Small changes in the data can produce very different splits and tree structures — motivating ensembles.

When would you prefer a single tree over an ensemble?

When interpretability and an explainable decision path matter more than raw accuracy.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Explainable loan / admission decision. Train a depth-limited tree and visualise the rules; discuss fairness and overfitting.

📂 Dataset: UCI 'Adult Income' or a loan dataset

🌲

Random Forest

Ensemble (bagging)

Ask many random trees, then take a vote.

🟢 In simple words

One decision tree is opinionated and easily wrong. A random forest grows hundreds of slightly different trees (each on a random slice of data and features) and averages their votes — a crowd is wiser than one expert.

🔬 How it actually works

Bagging: train many trees on bootstrap samples, and at each split consider only a random subset of features (decorrelating the trees). Classification = majority vote; regression = average. Reduces variance dramatically with little extra bias.

💡 Real example

A go-to baseline for tabular data — fraud detection, credit scoring, churn — robust, needs little tuning, and gives feature-importance scores.

🎤 Interview Q&A20 questions
What is a random forest?

An ensemble of decision trees trained on bootstrap samples with random feature subsets, whose predictions are aggregated by vote or average.

What is bagging?

Bootstrap Aggregating — training models on random resamples (with replacement) and averaging them to reduce variance.

Why select a random subset of features at each split?

It decorrelates the trees so they don't all rely on the same dominant feature, which lowers ensemble variance.

How does a forest reduce overfitting vs. one tree?

Averaging many decorrelated high-variance trees cancels their individual errors, slashing variance with little added bias.

What is out-of-bag (OOB) error?

Each tree is tested on the ~37% of samples it didn't train on; averaging gives a built-in validation estimate without a holdout set.

Why ~37% out-of-bag?

With bootstrap sampling, the chance a given sample is never picked in n draws tends to 1/e ≈ 0.368.

How does it compute feature importance?

By mean impurity decrease across trees, or more reliably by permutation importance (drop in accuracy when a feature is shuffled).

What are the key hyperparameters?

n_estimators, max_features per split, max_depth, min_samples_leaf, and bootstrap on/off.

Does adding more trees cause overfitting?

No — more trees only stabilise the estimate (diminishing returns); they don't increase overfitting, just compute.

Random forest vs. gradient boosting?

Forest trains independent trees in parallel (reduces variance); boosting trains trees sequentially on errors (reduces bias) and usually scores higher but tunes harder.

Does a random forest need feature scaling?

No — like single trees, it's invariant to monotonic feature scaling.

Can it overfit at all?

Mildly, with very deep trees on noisy data, but far less than a single tree thanks to averaging.

How does it handle missing or noisy data?

It's robust to noise; missing values need imputation in scikit-learn, though averaging tolerates some outliers well.

Why is it a strong default for tabular data?

Good accuracy out of the box, little tuning, handles mixed feature types, and gives importances and OOB estimates.

What does max_features control?

How many features each split may consider; lower values increase tree diversity and reduce correlation.

Can random forests do regression?

Yes — each tree predicts a number and the forest averages them.

Is permutation importance better than impurity importance?

Usually yes — impurity importance is biased toward high-cardinality features; permutation importance measures real predictive impact.

How does it parallelise?

Trees are independent, so they train across CPU cores simultaneously (n_jobs=-1).

What are the main drawbacks?

Larger memory/inference cost than one tree, less interpretable, and typically a notch below tuned boosting on accuracy.

How do you get probability estimates?

Average the class proportions in the leaves across all trees (predict_proba).

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Tabular baseline + feature importance. Beat a single tree with a forest; report OOB score and rank feature importances.

📂 Dataset: Any tabular Kaggle set (e.g. Titanic, churn)

🚀

Gradient Boosting (XGBoost / LightGBM)

Ensemble (boosting)

Each new tree fixes the last one's mistakes.

🟢 In simple words

Instead of a crowd voting at once, trees are added one at a time — every new tree focuses on the errors the previous ones made. Slowly the team of weak learners becomes a very strong predictor. This wins most tabular Kaggle competitions.

🔬 How it actually works

Sequentially fits each new tree to the residual errors (negative gradient of the loss) of the current ensemble, scaled by a learning rate. XGBoost/LightGBM add regularisation, clever split-finding, and speed. Powerful but easier to overfit than bagging.

💡 Real example

Click-through-rate prediction, credit risk, ranking — anywhere you have structured/tabular data and want top accuracy.

🎤 Interview Q&A20 questions
What is gradient boosting?

An ensemble that adds weak learners (shallow trees) sequentially, each fitting the residual errors of the current model.

How is boosting different from bagging?

Bagging trains independent models in parallel to cut variance; boosting trains models sequentially on prior errors to cut bias.

Why 'gradient' boosting?

Each new tree fits the negative gradient of the loss w.r.t. current predictions — generalising boosting to any differentiable loss.

What does the learning rate do?

It shrinks each tree's contribution; smaller rates need more trees but generalise better (it trades off with n_estimators).

Why are weak learners (shallow trees) used?

Each only needs to improve slightly; many small corrections combine into a strong learner without overfitting fast.

How does it overfit, and how do you prevent it?

Too many trees or too-high learning rate overfit; use early stopping, lower learning rate, subsampling, and regularisation.

What is early stopping?

Halting training when validation performance stops improving for a set number of rounds, preventing overfitting.

What is stochastic gradient boosting?

Fitting each tree on a random subsample of rows (and sometimes columns), adding randomness that improves generalisation.

What does XGBoost add over plain gradient boosting?

L1/L2 regularisation, second-order (Newton) optimisation, sparsity-aware splits, parallelised tree building, and built-in CV.

XGBoost vs. LightGBM?

LightGBM grows trees leaf-wise (faster, great on large data) and uses histogram binning; XGBoost grows level-wise and is often steadier on small data.

What is leaf-wise vs. level-wise growth?

Level-wise expands all nodes at a depth (balanced); leaf-wise expands the highest-loss leaf (deeper, more accurate, more overfit-prone).

How does it handle missing values?

Modern implementations learn a default direction at each split for missing data, requiring no imputation.

What are the most important hyperparameters?

learning_rate, n_estimators, max_depth, subsample, colsample_bytree, and regularisation (lambda/alpha).

How do learning rate and n_estimators interact?

Inversely — halving the learning rate roughly requires doubling the trees for similar fit.

Does boosting need feature scaling?

No — it's tree-based and invariant to monotonic feature scaling.

Why does boosting dominate tabular competitions?

It captures complex feature interactions, handles mixed types, and squeezes out top accuracy with proper tuning.

What loss functions can it optimise?

Any differentiable loss — squared error for regression, log-loss for classification, ranking losses, custom objectives.

How do you control regularisation in XGBoost?

Via lambda (L2), alpha (L1), gamma (min loss reduction to split), max_depth, and min_child_weight.

Is boosting parallelisable?

Trees are sequential, but the split-finding within each tree is parallelised across features/cores.

What are the drawbacks vs. random forest?

More hyperparameters, sensitive to tuning, slower to train sequentially, and easier to overfit if unchecked.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Kaggle-style tabular winner. Train XGBoost with early stopping and cross-validation; compare to Random Forest.

📂 Dataset: Any tabular competition set

🔍

Unsupervised Learning Find structure with no labels

No answers given — the model finds hidden patterns on its own: grouping similar things (clustering) or compressing many features into a few (dimensionality reduction).

🎨

K-Means Clustering

Clustering

Sort points into K groups around K centres.

🟢 In simple words

Tell it 'find 3 groups' and it drops 3 markers, assigns each point to its nearest marker, moves each marker to the centre of its group, and repeats until the groups stop changing. Great for customer segments.

🔬 How it actually works

Pick K centroids, assign each point to the nearest centroid, recompute centroids as the mean of their members, repeat until convergence (minimising within-cluster sum of squares). Sensitive to initialisation (use k-means++) and to the choice of K.

💡 Real example

Segmenting customers into groups (budget, premium, churning) for targeted marketing — no labels needed, the structure emerges.

🎤 Interview Q&A20 questions
What is K-Means clustering?

An unsupervised algorithm that partitions data into K clusters by minimising the distance of points to their cluster centroid.

How does the K-Means algorithm work?

Initialise K centroids, assign each point to the nearest one, recompute centroids as the mean of their members, and repeat until assignments stabilise.

What objective does it minimise?

Within-cluster sum of squares (inertia) — the total squared distance from points to their centroids.

How do you choose K?

The elbow method (inertia vs. K), silhouette score, or gap statistic; plus domain knowledge.

What is the elbow method?

Plot inertia against K and pick the 'elbow' where adding clusters stops sharply reducing inertia.

What is the silhouette score?

A measure from -1 to 1 of how well each point fits its cluster vs. the nearest other cluster; higher is better.

Why is initialisation important?

Random starts can converge to poor local optima; k-means++ spreads initial centroids to give better, more consistent results.

What does k-means++ do?

Chooses initial centroids probabilistically far apart, speeding convergence and improving final cluster quality.

Why is feature scaling important?

K-Means uses Euclidean distance, so unscaled large-range features dominate the clustering.

What are K-Means' key assumptions?

Clusters are roughly spherical, similar in size and density, and well separated.

When does K-Means fail?

On elongated, non-convex, varying-density, or differently-sized clusters — DBSCAN or GMM handle those better.

Is K-Means guaranteed to find the global optimum?

No — it converges to a local optimum, so it's run multiple times (n_init) and the best is kept.

What is the time complexity?

O(n·K·d·i) for n points, K clusters, d dimensions, i iterations — efficient and scalable.

K-Means vs. K-Medoids?

K-Medoids uses actual data points as centres and is more robust to outliers, but is slower than mean-based K-Means.

How is K-Means related to Gaussian Mixture Models?

K-Means is a hard-assignment special case of GMM with equal spherical covariances; GMM gives soft, probabilistic memberships.

How do outliers affect K-Means?

They pull centroids toward themselves because the mean is sensitive to extreme values.

What is mini-batch K-Means?

A variant updating centroids on small random batches, trading a little accuracy for much faster training on big data.

Can K-Means cluster categorical data?

Not directly (mean is undefined); use K-Modes for categorical or K-Prototypes for mixed data.

How do you interpret the resulting clusters?

Profile each cluster by its centroid feature values and size to give it a business meaning (e.g. 'high-value, low-frequency').

What are common applications?

Customer segmentation, image colour quantisation, document grouping, and anomaly pre-processing.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Customer segmentation. Cluster customers, pick K with the elbow + silhouette, and profile each segment.

📂 Dataset: Mall Customer Segmentation (Kaggle)

🪜

Hierarchical Clustering

Clustering

Build a family tree of your data.

🟢 In simple words

Start with every point as its own group, then repeatedly merge the two closest groups until everything is one big group. The result is a tree (dendrogram) you can 'cut' at any height to get however many clusters you want — no need to pick K up front.

🔬 How it actually works

Agglomerative (bottom-up) merges nearest clusters using a linkage rule (single, complete, average, Ward). The dendrogram records merge distances; cutting it at a chosen height yields the clusters. O(n²) memory limits it to smaller datasets.

💡 Real example

Grouping genes with similar expression, or organising documents into a topic hierarchy you can explore at different granularities.

🎤 Interview Q&A20 questions
What is hierarchical clustering?

A method that builds a tree (dendrogram) of nested clusters, either merging bottom-up or splitting top-down.

Agglomerative vs. divisive?

Agglomerative (bottom-up) starts with each point as a cluster and merges; divisive (top-down) starts with one cluster and splits. Agglomerative is far more common.

What is a dendrogram?

A tree diagram showing the order and distance of merges; cutting it at a height yields a chosen number of clusters.

How do you choose the number of clusters?

Cut the dendrogram where the vertical merge distances are largest (the biggest jump), or use a target count.

What is linkage?

The rule for measuring distance between clusters — single, complete, average, or Ward.

What does single linkage do?

Uses the minimum distance between clusters; can chain points into long, straggly clusters.

What does complete linkage do?

Uses the maximum distance between clusters; tends to produce compact, equal-diameter clusters.

What is average linkage?

Uses the mean pairwise distance between clusters — a compromise between single and complete.

What is Ward's linkage?

Merges the pair that minimises the increase in total within-cluster variance; produces balanced, compact clusters.

Does it require choosing K up front?

No — it builds the full hierarchy, and you decide the cluster count afterward by cutting the dendrogram.

What is the time and space complexity?

Typically O(n²) memory and O(n²–n³) time, limiting it to small/medium datasets.

Hierarchical vs. K-Means?

Hierarchical needs no K up front, gives a full hierarchy, and finds non-spherical shapes, but scales far worse than K-Means.

Is feature scaling needed?

Yes — like K-Means it depends on distances, so features should be standardised.

Can it find non-spherical clusters?

Yes, especially with single linkage, unlike K-Means which assumes spherical clusters.

How sensitive is it to noise and outliers?

Sensitive — outliers can form their own branches or distort merges, especially with single linkage.

Is the result deterministic?

Yes — given the data, metric, and linkage, the dendrogram is fixed (no random initialisation).

What distance metrics can be used?

Euclidean, Manhattan, cosine, correlation, etc.; Ward's linkage requires Euclidean.

What is the cophenetic correlation?

A measure of how faithfully the dendrogram preserves the original pairwise distances; higher means a better representation.

Where is hierarchical clustering commonly used?

Gene-expression analysis, taxonomy building, document organisation, and any setting wanting an interpretable hierarchy.

Can it be combined with K-Means?

Yes — hierarchical clustering on a sample can suggest K, which then seeds a scalable K-Means run.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Document / gene dendrogram. Cluster items and plot a dendrogram; cut it at two heights and compare groupings.

📂 Dataset: Any small feature dataset (e.g. iris, country indicators)

🌌

DBSCAN

Clustering

Clusters by density — and flags the outliers.

🟢 In simple words

It groups points that are packed closely together and labels lonely points in sparse regions as 'noise'. Unlike K-Means, you don't say how many clusters, and it can find odd, non-round shapes.

🔬 How it actually works

Two parameters: eps (neighbourhood radius) and minPts. Points with enough neighbours are 'core' points; reachable points form a cluster; points reachable from nobody are noise. Finds arbitrary shapes and is robust to outliers, but struggles with varying density.

💡 Real example

Detecting anomalies/fraud (the noise points) or finding crowd hotspots from GPS pings of arbitrary shape.

🎤 Interview Q&A20 questions
What is DBSCAN?

A density-based clustering algorithm that groups densely packed points and labels sparse points as noise.

What are its two parameters?

eps (neighbourhood radius) and minPts (minimum points within eps to form a dense region).

What is a core point?

A point with at least minPts neighbours within eps — it can grow a cluster.

What is a border point?

A point within eps of a core point but with fewer than minPts neighbours itself — it joins a cluster but can't extend it.

What is a noise point?

A point that is neither core nor reachable from any core point; DBSCAN labels it as an outlier.

How is DBSCAN better than K-Means?

It finds arbitrary cluster shapes, doesn't need K specified, and explicitly detects outliers.

Does DBSCAN need the number of clusters?

No — the number emerges from the data density given eps and minPts.

How do you choose eps?

Plot the sorted distance to each point's k-th nearest neighbour (k-distance graph) and pick eps at the 'knee'.

How do you choose minPts?

A common heuristic is minPts ≥ dimensions + 1, often 2×dimensions; larger values give denser, more robust clusters.

What is DBSCAN's main weakness?

It struggles when clusters have very different densities, since a single eps can't fit all.

What algorithm fixes varying density?

HDBSCAN, which builds a hierarchy over densities and selects clusters across multiple scales.

Is feature scaling needed?

Yes — eps is a distance, so unscaled features distort the neighbourhood definition.

What is the time complexity?

O(n log n) with a spatial index (KD/Ball tree), or O(n²) in the worst case.

Is DBSCAN deterministic?

Mostly — only the cluster assignment of border points reachable from multiple clusters can depend on processing order.

How does it perform in high dimensions?

Poorly — distances concentrate (curse of dimensionality), making a meaningful eps hard to choose.

How does DBSCAN handle outliers?

Natively — it leaves low-density points unassigned as noise, which is useful for anomaly detection.

What does density-reachable mean?

A point is density-reachable if there's a chain of core points connecting it within eps to a starting core point.

Can DBSCAN cluster data with no clear clusters?

If density is uniform it may label everything one cluster or all noise — it needs density contrast to work.

DBSCAN vs. hierarchical clustering?

DBSCAN scales better, finds noise, and needs no K; hierarchical gives a full interpretable tree but is O(n²).

Typical applications of DBSCAN?

Spatial/GPS hotspot detection, anomaly and fraud detection, and clustering data with irregular shapes.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Anomaly / hotspot detection. Cluster spatial data with DBSCAN, tune eps via a k-distance plot, and surface noise as anomalies.

📂 Dataset: GPS check-ins or a 2D synthetic 'moons' set

📐

Principal Component Analysis (PCA)

Dimensionality reduction

Squash many features into a few, keeping the signal.

🟢 In simple words

If you have 100 columns, many say the same thing. PCA finds a few new 'super-axes' along which the data spreads out the most, and projects everything onto them — keeping the important variation while throwing away redundancy. Great for plotting and speeding up models.

🔬 How it actually works

Computes the directions (eigenvectors of the covariance matrix) of maximum variance, ordered by how much variance they explain. Keep the top components to reduce dimensions. Components are uncorrelated; requires standardised features. It's linear and unsupervised.

💡 Real example

Reducing hundreds of sensor readings to 2–3 components to visualise, de-noise, or feed a faster downstream model.

🎤 Interview Q&A20 questions
What is PCA?

Principal Component Analysis — a linear technique that projects data onto new orthogonal axes capturing maximum variance, reducing dimensionality.

What is a principal component?

A direction (linear combination of features) along which the data varies most; components are ordered by variance explained.

How are principal components computed?

As the eigenvectors of the data's covariance matrix (or via SVD), with eigenvalues giving the variance each explains.

Why must you standardise features first?

PCA maximises variance, so unscaled large-variance features would dominate the components regardless of importance.

How do you choose the number of components?

Keep enough to reach a target cumulative explained variance (e.g. 95%), or use a scree-plot elbow.

What is explained variance ratio?

The fraction of total variance captured by each component, used to decide how many to retain.

Are principal components correlated?

No — they're orthogonal and therefore uncorrelated by construction.

Is PCA supervised or unsupervised?

Unsupervised — it ignores labels and only considers feature variance.

What is a scree plot?

A plot of explained variance per component; the 'elbow' suggests how many components to keep.

What are the main uses of PCA?

Visualisation, noise reduction, decorrelating features, and speeding up downstream models by cutting dimensions.

PCA vs. feature selection?

Feature selection keeps a subset of original features; PCA creates new combined features, losing direct interpretability.

What is the link between PCA and SVD?

PCA is typically computed via Singular Value Decomposition of the centred data, which is more numerically stable than eigendecomposition.

Does PCA capture non-linear structure?

No — it's linear; use Kernel PCA, t-SNE, UMAP, or autoencoders for non-linear manifolds.

What is Kernel PCA?

PCA applied in a kernel-induced high-dimensional space, enabling non-linear dimensionality reduction.

Can PCA be used to remove noise?

Yes — reconstructing data from the top components discards low-variance directions that often correspond to noise.

Why must data be centred (mean-subtracted)?

Components describe variance around the mean; without centring the first component would just point toward the mean.

Are principal components interpretable?

Sometimes — loadings show feature contributions — but combined axes are generally harder to explain than raw features.

Does PCA always improve model performance?

No — it can discard label-relevant low-variance signal; it mainly helps with speed, collinearity, and visualisation.

What is the reconstruction error?

The difference between original data and its reconstruction from kept components — it grows as you drop more components.

PCA vs. LDA?

PCA is unsupervised, maximising variance; LDA is supervised, maximising class separation.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Compress + visualise high-dim data. Reduce a many-feature dataset to 2D for plotting and speed up a classifier with PCA.

📂 Dataset: MNIST or a wine/cancer feature set

🗺️

t-SNE / UMAP

Dimensionality reduction

Make high-dimensional data visible in 2D.

🟢 In simple words

A tool purely for seeing your data. It places thousands of high-dimensional points onto a 2D map so that things that were close stay close — revealing clusters your eyes can actually spot. Use it to look, not to feed models.

🔬 How it actually works

t-SNE models pairwise similarities and minimises the divergence between high-dim and low-dim distributions, preserving local neighbourhoods. UMAP is faster and preserves more global structure. Both are for visualisation; distances/cluster sizes on the map aren't literally meaningful.

💡 Real example

Plotting word or image embeddings to see semantic clusters — e.g. all 'sports' articles landing together.

🎤 Interview Q&A20 questions
What is t-SNE?

t-distributed Stochastic Neighbor Embedding — a non-linear technique that maps high-dimensional data to 2D/3D for visualisation, preserving local neighbourhoods.

Is t-SNE used for preprocessing or visualisation?

Visualisation only — its output coordinates aren't meaningful features and shouldn't feed a downstream model.

What does the perplexity parameter control?

Roughly the effective number of neighbours each point considers; typical values are 5–50 and it noticeably changes the plot.

Why can cluster sizes in a t-SNE plot mislead?

t-SNE expands dense regions and contracts sparse ones, so cluster sizes and inter-cluster distances aren't faithful.

Why does t-SNE use a t-distribution in the low-dim space?

Its heavy tails prevent crowding, letting moderately distant points spread out and revealing clearer clusters.

t-SNE vs. PCA?

PCA is linear, fast, and deterministic (good for preprocessing); t-SNE is non-linear, slower, stochastic, and purely for visualising local structure.

t-SNE vs. UMAP?

UMAP is faster, scales better, and preserves more global structure, while t-SNE often shows cleaner local clusters.

Is t-SNE deterministic?

No — it depends on random initialisation, so different runs/seeds can produce different layouts.

What is its computational complexity?

Naively O(n²); Barnes-Hut t-SNE reduces it to about O(n log n) for larger datasets.

Should you run PCA before t-SNE?

Often yes — reducing to ~50 dimensions with PCA first speeds t-SNE and reduces noise.

Does global distance in a t-SNE map mean anything?

No — only local neighbourhoods are reliable; the gaps between far-apart clusters are not meaningful.

What does t-SNE optimise?

It minimises the KL divergence between pairwise-similarity distributions in the high- and low-dimensional spaces.

Can you embed new points with a trained t-SNE?

Not naturally — t-SNE has no simple transform for unseen points; UMAP supports this better.

Why does t-SNE sometimes show false clusters?

With a bad perplexity or too few iterations it can fragment uniform data into apparent clusters — always vary parameters.

What is early exaggeration?

An initial phase that inflates high-dim similarities so clusters form tight, well-separated groups before fine-tuning.

How many dimensions does t-SNE usually output?

Two or three, since its purpose is human-viewable visualisation.

When would you prefer UMAP over t-SNE?

For large datasets, when you need speed, the ability to transform new data, or better-preserved global structure.

What are common applications?

Visualising word/image embeddings, single-cell genomics, and inspecting neural-network feature spaces.

Does t-SNE preserve density?

No — it normalises local density, so you can't read relative cluster density from the plot.

What preprocessing does t-SNE need?

Scaling/standardisation and often a PCA pre-reduction; the metric should suit the data (e.g. cosine for embeddings).

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Embedding explorer. Project embeddings to 2D with t-SNE/UMAP and colour by label to reveal structure.

📂 Dataset: MNIST or sentence embeddings

🕹️

Reinforcement Learning Learn by trial, reward, and error

An agent takes actions in an environment and gets rewards or penalties. Over many tries it learns a strategy (policy) that maximises long-term reward — how game-playing and robotics AI is trained.

🗺️

Q-Learning

Value-based

Learn the value of each move by trying them.

🟢 In simple words

Drop an agent in a maze. Every move earns reward or penalty. It keeps a table scoring 'how good is this action from this spot?' and updates it after each try. Over many runs the table tells it the best path to the goal.

🔬 How it actually works

Maintains a Q-table Q(state, action). After each step it updates: Q ← Q + α[reward + γ·maxₐ Q(next) − Q]. γ discounts future reward; an epsilon-greedy policy balances exploring new actions vs. exploiting known-good ones. Converges for small, discrete worlds.

💡 Real example

Teaching an agent to navigate a grid-world maze, or simple game strategy, where states and actions are few enough to tabulate.

🎤 Interview Q&A20 questions
What is Q-learning?

A model-free, off-policy reinforcement-learning algorithm that learns the value of taking an action in a state (the Q-value) to derive an optimal policy.

What is a Q-value?

Q(s, a) is the expected cumulative discounted reward from taking action a in state s and acting optimally thereafter.

State the Q-learning update rule.

Q(s,a) ← Q(s,a) + α[r + γ·maxₐ′Q(s′,a′) − Q(s,a)], moving the estimate toward the observed reward plus discounted best next value.

What is the learning rate α?

How much each update shifts the Q-value toward the new estimate; high α learns fast but unstably, low α is slow but smooth.

What is the discount factor γ?

How much future rewards are valued; near 0 is myopic, near 1 is far-sighted.

What is the exploration-exploitation trade-off?

Balancing trying new actions to discover reward (explore) against using known good actions (exploit).

What is epsilon-greedy exploration?

With probability ε pick a random action, otherwise the current best; ε is usually decayed over time.

Why is Q-learning 'off-policy'?

It learns the optimal policy's value (via max over next actions) regardless of the exploratory policy actually generating the data.

What is the Bellman equation here?

It expresses the optimal Q-value recursively: Q*(s,a) = E[r + γ·maxₐ′Q*(s′,a′)].

What is a model-free method?

One that learns directly from experience without modelling the environment's transition or reward dynamics.

Why doesn't a Q-table scale?

It needs an entry per state-action pair, which explodes for large or continuous state spaces — motivating function approximation (DQN).

What is a policy?

A mapping from states to actions; in Q-learning it's derived by taking the highest-Q action in each state.

What is a reward in RL?

A scalar feedback signal the environment returns after each action, which the agent learns to maximise cumulatively.

What is the difference between reward and return?

Reward is immediate; return is the (discounted) sum of future rewards the agent actually tries to maximise.

Q-learning vs. SARSA?

SARSA is on-policy (uses the action actually taken next); Q-learning is off-policy (uses the max next action). SARSA is more conservative.

Does Q-learning converge?

Yes, to the optimal Q-values for finite MDPs given sufficient exploration and a suitably decaying learning rate.

How is an episode defined?

A complete sequence of states, actions, and rewards from a start state to a terminal state.

What is a Markov Decision Process (MDP)?

The RL framework of states, actions, transition probabilities, rewards, and a discount factor, with the Markov (memoryless) property.

What does the Markov property assume?

That the next state and reward depend only on the current state and action, not the full history.

How do you encourage early exploration then later exploitation?

Start with high ε and decay it over episodes so the agent explores first and exploits its learned policy later.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Maze / FrozenLake solver. Train tabular Q-learning to solve a grid-world; plot reward-per-episode learning curve.

📂 Dataset: OpenAI Gymnasium 'FrozenLake' / 'Taxi'

🎮

Deep Q-Network (DQN)

Value-based + deep

Q-learning where a neural net replaces the table.

🟢 In simple words

When the world is too big for a table (like a video game screen), swap the table for a neural network that estimates move-values from raw input. This is how DeepMind's AI learned to play Atari games from pixels.

🔬 How it actually works

A neural network approximates Q(state, action). Stabilised by experience replay (training on random past transitions) and a separate target network. Takes raw state (e.g. pixels) and outputs a Q-value per action.

💡 Real example

Learning to play Atari/Pong directly from screen pixels, or control problems with large continuous-ish state spaces.

🎤 Interview Q&A20 questions
What is a Deep Q-Network (DQN)?

Q-learning where a neural network approximates the Q-function, enabling RL on large or high-dimensional (e.g. pixel) state spaces.

Why replace the Q-table with a network?

Tables can't cover huge/continuous state spaces; a network generalises Q-values across similar states.

What is experience replay?

Storing transitions in a buffer and training on random minibatches, which breaks correlation between consecutive samples and improves stability.

What is the target network?

A periodically-updated copy of the Q-network used to compute targets, preventing the moving-target instability of bootstrapping off itself.

Why does naive deep Q-learning diverge?

Correlated samples plus a target that shifts with the same weights create feedback loops; replay and a target network fix this.

What is the 'deadly triad'?

The instability risk when combining function approximation, bootstrapping, and off-policy learning together.

What loss does DQN minimise?

The squared temporal-difference error between predicted Q and the target r + γ·maxₐ′Q_target(s′,a′).

What is overestimation bias in DQN?

The max operator systematically over-estimates Q-values; Double DQN reduces it by decoupling action selection from evaluation.

What is Double DQN?

A variant that selects the best next action with the online network but evaluates it with the target network, curbing overestimation.

What is Dueling DQN?

An architecture splitting Q into a state-value stream and an action-advantage stream, improving learning where actions matter little.

What is prioritised experience replay?

Sampling high-error transitions more often so the agent learns faster from its most informative experiences.

How does DQN handle the input for Atari games?

It stacks several recent frames (for motion) and processes them with convolutional layers to estimate Q-values per action.

Can DQN handle continuous action spaces?

No — it needs a discrete action set for the max; continuous control uses DDPG, TD3, or SAC instead.

What was the significance of DeepMind's DQN?

It learned to play many Atari games from raw pixels at human level using one architecture — a landmark in deep RL.

How often is the target network updated?

Either copied every fixed number of steps (hard update) or blended slowly each step (soft update, Polyak averaging).

Why use a replay buffer of limited size?

To keep memory bounded and favour recent, more relevant experience while still decorrelating samples.

What hyperparameters matter most in DQN?

Learning rate, discount γ, replay buffer size, batch size, target-update frequency, and the ε-decay schedule.

DQN vs. tabular Q-learning?

DQN generalises across states via a network and needs replay/target tricks; tabular stores exact values and only suits small discrete worlds.

Is DQN on-policy or off-policy?

Off-policy — it learns from replayed transitions generated by older policies.

What are signs DQN training is unstable?

Diverging or oscillating Q-values and collapsing episode reward; remedies include lower learning rate, Double DQN, and reward clipping.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

CartPole / Atari agent. Train a DQN with Stable-Baselines3 and plot the reward curve as it learns to balance/play.

📂 Dataset: Gymnasium 'CartPole' then an Atari env

🧭

Policy Gradient (REINFORCE / PPO)

Policy-based

Directly learn what to do, not what's valuable.

🟢 In simple words

Instead of scoring every move, the agent learns a policy that outputs an action directly, then nudges that policy to make good outcomes more likely. PPO (a modern version) trains robots and even fine-tunes chatbots with human feedback (RLHF).

🔬 How it actually works

Parameterises the policy πθ(action|state) and ascends the gradient of expected reward. REINFORCE uses full-episode returns (high variance, reduced with a baseline); PPO clips updates so each step doesn't change the policy too drastically. Handles continuous action spaces well.

💡 Real example

Robot locomotion, continuous control, and RLHF for aligning language models — anywhere actions are continuous or the policy is best learned directly.

🎤 Interview Q&A20 questions
What are policy gradient methods?

RL algorithms that directly learn a parameterised policy by ascending the gradient of expected reward, rather than learning value functions.

Policy-based vs. value-based methods?

Value-based (e.g. DQN) learn action values then act greedily; policy-based learn the action distribution directly, handling continuous actions and stochastic policies.

What is the REINFORCE algorithm?

A Monte-Carlo policy gradient that scales the gradient of log-probability of actions by the episode's return.

Why does REINFORCE have high variance?

It uses full-episode returns as noisy estimates of action quality, making gradient updates unstable.

What is a baseline and why use it?

A reference value (often the state value) subtracted from the return to reduce gradient variance without adding bias.

What is the advantage function?

A(s,a) = Q(s,a) − V(s): how much better an action is than the state's average — used to focus updates.

What is an actor-critic method?

A hybrid where an actor learns the policy and a critic learns a value function to provide low-variance feedback (the advantage).

What is PPO?

Proximal Policy Optimization — a stable, popular policy-gradient method that clips updates so the policy doesn't change too much per step.

What problem does PPO's clipping solve?

It prevents destructively large policy updates that can collapse performance, keeping each step within a trust region.

What is TRPO?

Trust Region Policy Optimization — constrains each update by a KL-divergence limit; PPO is a simpler, cheaper approximation of the same idea.

Can policy gradients handle continuous actions?

Yes — the policy can output parameters of a continuous distribution (e.g. a Gaussian's mean and variance).

What is a stochastic policy and why is it useful?

A policy outputting action probabilities; it enables exploration and can be optimal in partially observable or adversarial settings.

On-policy vs. off-policy for policy gradients?

Vanilla policy gradients and PPO are on-policy (need fresh data each update); off-policy variants (e.g. SAC) reuse a replay buffer.

What is entropy regularisation?

Adding the policy's entropy to the objective to encourage exploration and prevent premature convergence to a deterministic policy.

What is A2C/A3C?

Advantage Actor-Critic methods; A3C runs many parallel workers asynchronously, A2C is its synchronous variant.

What is the role of the critic?

To estimate value/advantage so the actor's updates have lower variance than raw returns.

How is RLHF related to policy gradients?

Fine-tuning language models with human feedback typically uses PPO to optimise a reward model of human preferences.

What is the credit assignment problem?

Determining which past actions deserve credit/blame for a delayed reward — central and hard in RL.

Why are policy gradients sample-inefficient?

On-policy methods discard data after each update and rely on high-variance return estimates, needing many environment interactions.

When choose policy gradient over DQN?

For continuous or high-dimensional action spaces, when you need a stochastic policy, or for stable training via PPO.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Continuous-control agent. Train PPO on a continuous-control task and compare learning stability vs. DQN.

📂 Dataset: Gymnasium 'LunarLander' / 'BipedalWalker'

🧠

Deep Learning Neural networks that learn features

Stacked layers of artificial 'neurons' that learn their own features from raw data — powering image, speech, and language models. Usually trained supervised, but powerful enough to deserve their own family.

🧠

Neural Network (MLP)

Feedforward

Layers of neurons that learn their own features.

🟢 In simple words

Loosely inspired by the brain: layers of simple units pass signals forward, each connection has a weight that strengthens or weakens it. By tweaking weights through many examples, the network learns complex patterns no straight line could capture.

🔬 How it actually works

Inputs flow through hidden layers; each neuron computes a weighted sum + non-linear activation (ReLU). Backpropagation computes the gradient of the loss w.r.t. every weight, and gradient descent updates them. Non-linear activations let it model arbitrarily complex functions.

💡 Real example

Predicting outcomes from rich tabular or signal data where relationships are non-linear and interacting — a flexible function approximator.

🎤 Interview Q&A20 questions
What is a neural network (MLP)?

A model of layered interconnected neurons that transforms inputs through weighted sums and non-linear activations to learn complex functions.

What is an activation function and why is it needed?

A non-linear function applied to each neuron's output; without it, stacked layers collapse into a single linear transform.

Name common activation functions.

ReLU (default for hidden layers), sigmoid, tanh, Leaky ReLU, GELU, and softmax for output probabilities.

Why is ReLU popular?

It's cheap, avoids saturating gradients for positive inputs, and speeds up training versus sigmoid/tanh.

What is backpropagation?

The algorithm that computes loss gradients w.r.t. every weight via the chain rule, propagating error backward through the layers.

What is gradient descent?

An optimisation that updates weights in the negative gradient direction to minimise the loss.

What is the vanishing gradient problem?

Gradients shrink toward zero in deep nets (with saturating activations), stalling early-layer learning; ReLU, residual connections, and normalisation help.

What is the exploding gradient problem?

Gradients grow uncontrollably, destabilising training; mitigated by gradient clipping and careful initialisation.

What is the role of the learning rate?

It scales weight updates; too high diverges, too low trains slowly — schedules and adaptive optimisers help.

What is Adam?

An adaptive optimiser combining momentum and per-parameter learning-rate scaling; a robust default for deep nets.

What is dropout?

Randomly zeroing a fraction of neurons during training to prevent co-adaptation and reduce overfitting.

What is batch normalisation?

Normalising layer inputs per minibatch to stabilise and speed training and allow higher learning rates.

What is an epoch, batch, and iteration?

An epoch is one full pass over the data; a batch is a subset processed at once; an iteration is one weight update on a batch.

Why use mini-batch gradient descent?

It balances the stable gradients of full-batch with the speed and noise (regularising effect) of stochastic updates.

How do you prevent overfitting in neural nets?

Dropout, L2 weight decay, early stopping, data augmentation, and more training data.

What is weight initialisation and why does it matter?

Setting starting weights (e.g. He or Xavier) to keep signal/gradient variance stable and avoid vanishing/exploding gradients.

What loss functions are used?

Cross-entropy for classification, MSE/MAE for regression, with softmax outputs for multi-class.

What is the universal approximation theorem?

A network with one sufficiently wide hidden layer can approximate any continuous function — though depth is usually more efficient.

Why go deeper instead of wider?

Depth learns hierarchical, compositional features more parameter-efficiently than a single huge layer.

What is transfer learning?

Reusing a model pretrained on a large dataset and fine-tuning it on your smaller task, saving data and compute.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

From-scratch MLP. Build a small neural net (NumPy or PyTorch) and visualise the loss curve as it trains.

📂 Dataset: MNIST or a tabular classification set

🖼️

Convolutional Neural Network (CNN)

Computer vision

Neural nets that 'see' by sliding filters over images.

🟢 In simple words

Instead of looking at every pixel independently, a CNN slides small filters across the image to detect edges, then shapes, then objects — building understanding layer by layer, the way your visual system does. Powers face unlock, medical imaging, and self-driving vision.

🔬 How it actually works

Convolutional layers apply learnable filters (sharing weights across the image) to produce feature maps; pooling layers downsample for translation-invariance; stacked layers learn increasingly abstract features, ending in dense layers for the prediction.

💡 Real example

Classifying skin lesions as benign/malignant, detecting objects in photos, or reading handwriting — any grid-structured data.

🎤 Interview Q&A20 questions
What is a Convolutional Neural Network (CNN)?

A neural network for grid-like data (images) that uses convolutional filters to learn spatial features hierarchically.

What is a convolution operation?

Sliding a small learnable filter across the input and computing dot products to produce a feature map highlighting a pattern.

Why share weights across the image?

A pattern (e.g. an edge) can appear anywhere, so reusing the same filter gives translation invariance and far fewer parameters.

What is a feature map?

The output of applying one filter across the input — a map of where that learned feature is detected.

What is pooling and why use it?

Downsampling (e.g. max pooling) that shrinks feature maps, adds translation invariance, and cuts computation.

Max pooling vs. average pooling?

Max pooling keeps the strongest activation (sharp features); average pooling smooths — max is more common in practice.

What is stride?

How many pixels the filter moves each step; a larger stride downsamples the output more.

What is padding?

Adding border pixels (often zeros) so filters can cover edges and control output size ('same' vs. 'valid').

What is the receptive field?

The region of the input that influences a given output unit; it grows with depth, letting deep layers see larger context.

Why do CNNs beat MLPs on images?

They exploit spatial locality and weight sharing, using far fewer parameters while capturing translation-invariant features.

What do early vs. deep layers learn?

Early layers detect edges/textures; deeper layers combine them into shapes, parts, and whole objects.

What is a 1×1 convolution?

A convolution that mixes channels at each pixel — used to reduce/expand channel depth cheaply (e.g. in Inception).

What is transfer learning in vision?

Fine-tuning a CNN pretrained on ImageNet (e.g. ResNet) for your task, which works well with limited data.

What is data augmentation?

Generating varied training images (flips, crops, rotations, colour jitter) to improve generalisation and reduce overfitting.

What problem do residual connections (ResNet) solve?

They add skip connections so gradients flow through very deep networks, fixing degradation and vanishing gradients.

What is batch normalisation's role in CNNs?

It stabilises and accelerates training by normalising activations, enabling deeper networks and higher learning rates.

How does a CNN end in a classification output?

Feature maps are flattened or globally pooled, then passed through dense layers and a softmax over classes.

What is global average pooling?

Averaging each feature map to a single value, replacing large dense layers and reducing overfitting.

What CNN architectures should you know?

LeNet, AlexNet, VGG, GoogLeNet/Inception, ResNet, and EfficientNet.

Beyond classification, what do CNNs do?

Object detection (YOLO, Faster R-CNN), segmentation (U-Net), and feature extraction for many vision tasks.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Image classifier with transfer learning. Fine-tune a pretrained CNN (ResNet) on a custom image set; report accuracy + a confusion matrix.

📂 Dataset: CIFAR-10 or a small custom image set

🔁

RNN / LSTM (& Transformers)

Sequences

Networks with memory for sequences.

🟢 In simple words

For data with order — text, speech, time series — these networks read one step at a time while carrying a 'memory' of what came before, so context isn't lost. Transformers (behind ChatGPT) later replaced them by reading the whole sequence at once with 'attention'.

🔬 How it actually works

An RNN feeds its hidden state forward across time steps. LSTMs/GRUs add gates to retain long-range information and fight vanishing gradients. Transformers drop recurrence entirely, using self-attention to weigh every position against every other in parallel — the basis of modern LLMs.

💡 Real example

Next-word prediction, sentiment over a review, demand forecasting, and — via Transformers — the large language models powering today's AI.

🎤 Interview Q&A20 questions
What is a Recurrent Neural Network (RNN)?

A network for sequences that maintains a hidden state passed from step to step, letting earlier inputs influence later outputs.

What kind of data suits RNNs?

Ordered/sequential data — text, speech, time series, and any data with temporal dependencies.

Why do vanilla RNNs struggle with long sequences?

Repeated multiplication during backprop-through-time causes vanishing or exploding gradients, so long-range dependencies are lost.

What is an LSTM?

Long Short-Term Memory — an RNN with a cell state and gates that learn what to keep, forget, and output, preserving long-range information.

What are the three LSTM gates?

Forget gate (what to drop), input gate (what new info to store), and output gate (what to emit from the cell).

What is the cell state in an LSTM?

A protected memory channel that carries information across many steps with minimal interference, easing gradient flow.

What is a GRU?

Gated Recurrent Unit — a simpler LSTM variant with reset and update gates and no separate cell state; faster, often comparable.

LSTM vs. GRU?

GRU has fewer parameters and trains faster; LSTM can be more expressive on complex tasks — choice is empirical.

What is backpropagation through time (BPTT)?

Unrolling the RNN across time steps and applying backpropagation over the whole sequence to update shared weights.

How are exploding gradients handled in RNNs?

Gradient clipping caps the gradient norm, preventing destabilising updates.

What is a bidirectional RNN?

One that processes the sequence forward and backward, giving each step context from both past and future (for non-streaming tasks).

What is the attention mechanism?

A method letting the model weigh and draw on all input positions when producing each output, instead of a single fixed-size state.

What is a Transformer?

A sequence architecture built entirely on self-attention (no recurrence), enabling parallel training and long-range modelling — the basis of modern LLMs.

Why did Transformers largely replace RNNs?

They parallelise across the sequence (RNNs are sequential), capture long-range dependencies better, and scale to huge data.

What is self-attention?

Each token computes a weighted combination of all tokens via query-key-value scores, capturing relationships regardless of distance.

What is positional encoding?

Information added to Transformer inputs to convey token order, since attention itself is order-agnostic.

What is teacher forcing?

Training a sequence model by feeding the true previous token (not its own prediction) as the next input, speeding convergence.

What are sequence-to-sequence models?

Encoder-decoder architectures mapping an input sequence to an output sequence — used for translation and summarisation.

What tasks do RNNs/LSTMs handle?

Language modelling, sentiment analysis, machine translation, speech recognition, and time-series forecasting.

What are the trade-offs of RNNs vs. Transformers?

RNNs use less memory for short/streaming inputs and constant per-step cost; Transformers parallelise and model long context but scale quadratically with sequence length.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Text generator or forecaster. Train an LSTM for next-word/char generation or forecasting, then compare with a small Transformer.

📂 Dataset: A text corpus or a time-series set