Linear Regression
RegressionDraw the best straight line through your data.
🟢 In simple words
Imagine plotting house size against price on a graph. Linear regression finds the single straight line that best passes through all the dots — so for any new house size you can read its likely price off the line.
🔬 How it actually works
It assumes output = w·x + b. Training finds the weights (w) and bias (b) that minimise the mean squared error — the average squared gap between predicted and actual values — usually via gradient descent or a closed-form solution. With many inputs the 'line' becomes a flat plane in higher dimensions.
💡 Real example
Predicting a flat's rent from area, number of bedrooms, and distance to the metro. Each feature gets a weight; their weighted sum is the predicted rent.
🎤 Interview Q&A20 questions
What is linear regression?
A supervised algorithm that models a continuous target as a weighted linear combination of input features plus a bias term: y = w·x + b.
What loss function does it minimise?
Mean Squared Error (MSE) — the average of squared differences between predicted and actual values.
Why square the errors instead of taking absolute values?
Squaring penalises large errors more, is differentiable everywhere (easy gradients), and yields a convex loss with a unique closed-form solution.
What is the closed-form (normal equation) solution?
w = (XᵀX)⁻¹Xᵀy. It solves for weights directly without iteration, but inverting XᵀX is O(n³) and fails if features are collinear.
When would you use gradient descent over the normal equation?
When the feature count is large (inversion too costly) or the data doesn't fit in memory — gradient descent scales better.
What are the key assumptions?
Linearity, independence of errors, homoscedasticity (constant error variance), normally distributed residuals, and little multicollinearity.
What is multicollinearity and why is it a problem?
When features are highly correlated, making coefficient estimates unstable and hard to interpret. Detect it with the Variance Inflation Factor (VIF).
What does R² measure?
The proportion of variance in the target explained by the model, from 0 to 1 (can be negative for a bad model).
Why use adjusted R² instead of R²?
R² never decreases when you add features; adjusted R² penalises extra features, so it only rises if the new feature genuinely helps.
What is the difference between Ridge and Lasso regression?
Ridge adds an L2 penalty (shrinks weights toward zero); Lasso adds an L1 penalty that can drive weights exactly to zero, performing feature selection.
What does the L1 penalty do that L2 doesn't?
L1 produces sparse solutions — it zeros out unimportant features — whereas L2 only shrinks them.
What is the bias-variance trade-off here?
A simple linear model has high bias, low variance; adding polynomial terms lowers bias but raises variance and overfitting risk.
How do you handle non-linear relationships?
Add polynomial or interaction features, transform variables (log, sqrt), or switch to a non-linear model.
Why is feature scaling sometimes needed?
Not for the closed-form solution, but gradient descent converges faster and regularisation penalties are fairer when features are on the same scale.
What is heteroscedasticity and how do you detect it?
Non-constant error variance across the range of predictions. Spot it with a residuals-vs-fitted plot showing a fan/cone shape.
When is MAE preferred over MSE?
When outliers shouldn't dominate the loss — MAE treats all errors linearly, so it's more robust to outliers.
How do you interpret a coefficient?
It's the expected change in the target for a one-unit increase in that feature, holding all others constant.
Can linear regression be used for classification?
Not well — it outputs unbounded values and is sensitive to outliers; logistic regression is the right tool for classification.
What happens if XᵀX is not invertible?
It means perfect multicollinearity or more features than samples; use regularisation (Ridge) or the pseudo-inverse instead.
How do you detect and handle outliers?
Use residual plots, Cook's distance, or leverage scores; then cap, remove, or switch to a robust regression / MAE loss.
🛠 Project idea & 📚 resources
🛠 Build it — project idea
House / rent price predictor. Train a regression model and explain each feature's weight; ship a small predict form.
📂 Dataset: Kaggle 'House Prices — Advanced Regression Techniques'