20 topics · Layman + in-depth · Animated charts · Interview Q&A

Data Analytics — Explained Simply

Every core data-analytics concept in plain English first, then the real depth — with an animated visual, a worked example, 20 interview questions with answers, and a project idea to prove it. Grouped from statistics foundations through experimentation, business analytics, forecasting, and tooling.

📊

Statistics & Probability

The maths every analyst leans on

The foundation: summarising data, understanding distributions, and reasoning about uncertainty and sampling before you make any claim.

🔢

Descriptive Statistics

Foundations

Summarise a dataset in a few numbers.

🟢 In simple words

Before any fancy analysis, you describe what you have: what's the typical value (mean/median), how spread out is it (range, standard deviation), and what's the shape? It's the 'getting to know your data' step.

🔬 How it actually works

Measures of central tendency (mean, median, mode) capture the centre; measures of spread (variance, standard deviation, IQR, range) capture variability; and shape is described by skewness and kurtosis. The five-number summary (min, Q1, median, Q3, max) powers the box plot.
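
As a minimal sketch of the summary described above, the following uses Python's standard `statistics` module on a small, made-up list of order values (the numbers are illustrative only). Note that `statistics.quantiles` uses its default "exclusive" method here; other tools may place the quartiles slightly differently.

```python
import statistics

# Hypothetical order values in rupees; two large orders create right skew.
orders = [300, 450, 500, 600, 650, 700, 800, 900, 2500, 4600]

mean = statistics.mean(orders)
median = statistics.median(orders)
stdev = statistics.stdev(orders)                 # sample SD (divides by n - 1)
q1, q2, q3 = statistics.quantiles(orders, n=4)   # quartile cut points
iqr = q3 - q1                                    # spread of the middle 50%

# Five-number summary: the numbers a box plot is drawn from.
five_number = (min(orders), q1, median, q3, max(orders))

print(f"mean={mean:.0f}  median={median:.0f}  sd={stdev:.0f}  iqr={iqr:.1f}")
print("five-number summary:", five_number)
```

On this list the mean (1,200) sits far above the median (675), which is the mean-above-median pattern you expect from right-skewed data.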

💡 Real example

Reporting that average order value is ₹1,200 but the median is ₹650 instantly tells you a few big orders are pulling the mean up — a skewed distribution.

🎤 Interview Q&A · 20 questions
What is descriptive statistics?

Methods that summarise and describe the main features of a dataset — its centre, spread, and shape — without making inferences beyond the data.

Descriptive vs. inferential statistics?

Descriptive summarises the data you have; inferential uses a sample to draw conclusions about a larger population.

What are the measures of central tendency?

Mean (average), median (middle value), and mode (most frequent value).

When is the median better than the mean?

For skewed data or with outliers — the median resists extreme values, while the mean gets pulled toward them.

What is variance?

The average squared deviation from the mean — a measure of how spread out the values are.

What is standard deviation?

The square root of variance, expressing spread in the same units as the data, which makes it more interpretable.

What is the interquartile range (IQR)?

Q3 − Q1, the spread of the middle 50% of the data; it's robust to outliers.

What is the five-number summary?

Minimum, Q1, median, Q3, and maximum — the basis of a box plot.

What is skewness?

A measure of asymmetry; right-skew has a long right tail (mean > median), left-skew the opposite.

What is kurtosis?

A measure of how heavy the tails are — high kurtosis means more extreme outliers than a normal distribution.

How do you detect outliers statistically?

Common rules: beyond 1.5×IQR from the quartiles, or more than ~3 standard deviations from the mean (z-score).

What is a percentile?

The value below which a given percentage of observations fall — the 90th percentile is exceeded by only 10% of values.

Population vs. sample statistics?

Population covers everyone; a sample is a subset. Sample variance divides by n−1 (Bessel's correction) to stay unbiased.

What is the coefficient of variation?

Standard deviation divided by the mean — a unitless measure for comparing variability across different scales.

Why can the mean be misleading?

It's sensitive to outliers and skew; a single huge value can make a 'typical' figure unrepresentative.

What is a box plot used for?

Visualising the five-number summary and outliers, and comparing distributions across groups at a glance.

Nominal vs. ordinal vs. interval vs. ratio data?

Nominal = unordered categories, ordinal = ordered categories, interval = ordered with equal gaps but no true zero, ratio = interval with a true zero.

What does a histogram show that summary stats don't?

The full shape of the distribution — modality, skew, and gaps that a mean and standard deviation alone hide.

What is the range and its weakness?

Max − min; it's simple but driven entirely by the two most extreme values, so it's unstable.

What is Simpson's paradox?

A trend that appears in groups but reverses when the groups are combined — a warning to always segment before concluding.
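
The two outlier rules mentioned in the Q&A above (1.5×IQR from the quartiles, and roughly 3 standard deviations from the mean) can be sketched as follows. The function names and the sample data are hypothetical, chosen only for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z-score| exceeds the threshold."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sd > threshold]

# Hypothetical order values with two unusually large orders.
orders = [300, 450, 500, 600, 650, 700, 800, 900, 2500, 4600]
print("IQR rule:    ", iqr_outliers(orders))
print("z-score rule:", zscore_outliers(orders))
```

On this list the IQR rule flags 4600 while the z-score rule flags nothing: with only ten points, the large order inflates the standard deviation enough to hide itself. That is one reason the quartile-based rule is often preferred for small or skewed samples.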

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Automated dataset profiler. Build a script that ingests any CSV and outputs central tendency, spread, missing-value rates, and distribution plots per column.

📂 Dataset: Any Kaggle CSV (e.g. retail sales)

🔔

Probability Distributions

Foundations

The shapes that data tends to follow.

🟢 In simple words

Many real-world quantities follow predictable shapes — heights cluster around an average (the bell curve), rare events follow others. Knowing the shape lets you say how likely a value is.

🔬 How it actually works

A distribution maps values to probabilities. Key ones: Normal (symmetric bell, defined by mean and standard deviation), Binomial (counts of successes), Poisson (rare event counts), Uniform, and Exponential (wait times). The Normal underpins many tests because sample means are approximately Normal for large samples (the Central Limit Theorem); the 68-95-99.7 rule is the quick way to read it.
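
To make the distributions above concrete, here is a small sketch using only Python's standard library. It checks the 68-95-99.7 rule with `statistics.NormalDist` and writes out the Binomial, Poisson, and Exponential formulas by hand; the helper names are illustrative, not from any particular library.

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Share of a normal distribution within 1, 2, and 3 SDs of the mean.
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {within:.4f}")

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def exponential_cdf(t, rate):
    """P(waiting time <= t) when events arrive at the given rate."""
    return 1 - math.exp(-rate * t)

print(binomial_pmf(2, 4, 0.5))      # two heads in four fair coin flips
print(poisson_pmf(0, 3.0))          # no events when three are expected
print(exponential_cdf(1.0, 0.5))    # wait of at most 1 unit at rate 0.5
```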

💡 Real example

Customer support call durations often follow an exponential distribution; daily website visits may follow a Poisson — picking the right one improves modelling and alerts.

🎤 Interview Q&A · 20 questions
What is a probability distribution?

A function describing how probabilities are spread across the possible values of a random variable.

Discrete vs. continuous distributions?

Discrete take countable values (use a PMF); continuous take any value in a range (use a PDF where probability is area under the curve).

What is the normal distribution?

A symmetric bell-shaped distribution defined by its mean and standard deviation; it appears widely thanks to the CLT.

What is the 68-95-99.7 rule?

In a normal distribution, ~68% of data lies within 1 SD of the mean, ~95% within 2 SD, and ~99.7% within 3 SD.

What is the binomial distribution?

The distribution of the number of successes in n independent yes/no trials with fixed success probability p.

What is the Poisson distribution?

Models the count of rare, independent events in a fixed interval, parameterised by an average rate λ.

What is the uniform distribution?

Every value in a range is equally likely — a flat distribution.

What is the exponential distribution?

Models the time between events in a Poisson process, e.g. wait times; it's memoryless.

What is a PDF?

Probability Density Function — for continuous variables, its area over an interval gives the probability of falling in that interval.

What is a CDF?

Cumulative Distribution Function — the probability that the variable is less than or equal to a given value.

What is the standard normal distribution?

A normal distribution with mean 0 and standard deviation 1; any normal can be converted to it via z-scores.

What is a z-score?

How many standard deviations a value is from the mean: (x − μ)/σ; it standardises values for comparison.

What does the Central Limit Theorem say about distributions?

Sums/averages of many independent variables tend toward a normal distribution regardless of the original shape.

When would you use a log-normal distribution?

For positive, right-skewed quantities like incomes or response times, whose logarithm is normally distributed.

What is the expected value?

The long-run average of a random variable — the probability-weighted mean of its possible values.

What are the parameters of a normal distribution?

The mean (location of the centre) and standard deviation (spread).

How do you check if data is normally distributed?

A Q-Q plot, a histogram, or formal tests like Shapiro-Wilk or Kolmogorov-Smirnov.

What is a Bernoulli distribution?

A single yes/no trial with success probability p — the building block of the binomial distribution.

Why do so many tests assume normality?

The CLT makes sample means approximately normal, so normal-based tests are valid for reasonably large samples.

What is a heavy-tailed distribution and why does it matter?

One with more extreme values than normal; it matters because rare large events (risk, fraud) are far more likely than a normal assumption suggests.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Distribution fitter. Fit candidate distributions to a real metric, compare with goodness-of-fit, and visualise the best match.

📂 Dataset: Web traffic logs or call-centre durations

🎲

Sampling & the Central Limit Theorem

Foundations

Why a sample can speak for the whole.

🟢 In simple words

You can't survey everyone, so you take a sample. The magic is that averages of samples form a bell curve around the true value — even if the data itself isn't bell-shaped — so a good sample lets you estimate the whole population.

🔬 How it actually works

Random sampling avoids bias. The Central Limit Theorem says the distribution of sample means approaches Normal as sample size grows, with standard error = σ/√n. This is what makes confidence intervals and most tests valid.
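
A quick simulation can illustrate the claim above. This sketch draws repeated samples from a skewed Exponential(1) population (an assumption chosen for illustration; its standard deviation is 1) and compares the observed spread of the sample means with the predicted standard error σ/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_means(n, draws=2000, rate=1.0):
    """Means of `draws` samples, each of size n, from Exponential(rate)."""
    return [
        statistics.fmean([random.expovariate(rate) for _ in range(n)])
        for _ in range(draws)
    ]

sigma = 1.0  # SD of an Exponential(1) population
for n in (5, 30, 200):
    means = sample_means(n)
    observed_se = statistics.stdev(means)   # spread of the sample means
    predicted_se = sigma / n ** 0.5         # sigma / sqrt(n)
    print(f"n={n:>3}  observed SE={observed_se:.3f}  predicted={predicted_se:.3f}")
```

Plotting a histogram of `means` for each n would show the shape moving from skewed toward a bell curve as n grows, while the two standard-error columns stay close to each other.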

💡 Real example

Estimating average national income from a 2,000-person sample: the CLT lets you put a margin of error on the estimate without surveying millions.

🎤 Interview Q&A · 20 questions
What is sampling?

Selecting a subset of a population to estimate something about the whole, since measuring everyone is usually impractical.

What is the Central Limit Theorem?

The distribution of sample means approaches a normal distribution as sample size grows, regardless of the population's shape.

Why is the CLT so important?

It justifies using normal-based confidence intervals and tests on sample means even when the underlying data isn't normal.

What is sampling bias?

When the sample isn't representative of the population, systematically skewing results — e.g. surveying only website visitors.

What is simple random sampling?

Every member of the population has an equal chance of being selected, which minimises selection bias.

What is stratified sampling?

Dividing the population into subgroups (strata) and sampling within each to ensure representation of key groups.

What is cluster sampling?

Splitting the population into clusters, randomly selecting whole clusters, and sampling within them — cheaper for dispersed populations.

What is standard error?

The standard deviation of a sample statistic (e.g. the mean); for the mean it's σ/√n and shrinks as n grows.

Standard deviation vs. standard error?

SD measures spread of the data; SE measures the precision of an estimate (like the sample mean).

How does sample size affect the estimate?

Larger samples reduce standard error, giving more precise estimates and narrower confidence intervals (precision scales with √n).

What is survivorship bias?

Drawing conclusions only from cases that 'survived' a process, ignoring those that dropped out — distorting the picture.

What is convenience sampling and its risk?

Sampling whoever is easiest to reach; it's fast but highly prone to bias and rarely representative.

Does the population need to be normal for the CLT?

No, that's the point: the sample mean becomes approximately normal even from skewed populations, given a large enough sample size.

What sample size makes the CLT 'kick in'?

A rule of thumb is n ≥ 30, but heavily skewed populations need larger samples.

What is non-response bias?

When those who don't respond differ systematically from those who do, biasing survey results.

What is sampling with vs. without replacement?

With replacement, a unit can be picked more than once; without replacement, each is chosen at most once (typical for surveys).

What is a sampling distribution?

The distribution of a statistic (e.g. the mean) over all possible samples of a given size.

How do you reduce sampling error?

Increase sample size and use proper randomisation; note this addresses random error, not systematic bias.

What is selection bias?

Systematic error from how the sample is chosen, so the sample doesn't reflect the population.

Why randomise sample selection?

Randomisation removes systematic selection patterns, making the sample representative and inference valid.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

CLT simulator. Repeatedly sample from a skewed distribution, plot the sampling distribution of the mean, and watch it converge to Normal as n grows.

📂 Dataset: Synthetic / any skewed real metric

📏

Confidence Intervals

Estimation

A range, not a single guess.

🟢 In simple words

Instead of saying 'average is 50', you say 'we're 95% confident it's between 47 and 53'. It honestly communicates the uncertainty in an estimate from a sample.

🔬 How it actually works

A confidence interval is estimate ± margin of error, where margin = critical value × standard error. A 95% CI means that if you repeated the sampling many times, ~95% of such intervals would contain the true parameter. Wider data spread or smaller samples give wider intervals.
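
The formula above (estimate ± critical value × standard error) can be sketched in a few lines. This version uses the normal critical value from `statistics.NormalDist`, which is an assumption that suits reasonably large samples; for small samples a t critical value would give a somewhat wider interval. The data is made up.

```python
import statistics
from statistics import NormalDist

def mean_ci(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / n ** 0.5            # standard error
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = crit * se                                  # margin of error
    return mean - margin, mean + margin

# Hypothetical scores: 70 values spread evenly between 47 and 53.
scores = [50 + (i % 7) - 3 for i in range(70)]

low, high = mean_ci(scores)
print(f"95% CI: {low:.2f} to {high:.2f}")
low99, high99 = mean_ci(scores, confidence=0.99)
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

Asking for 99% confidence on the same data produces a wider interval, which is the precision-for-assurance trade-off.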

💡 Real example

An A/B test reporting 'conversion lift of +3% (95% CI: +0.5% to +5.5%)' tells stakeholders the effect is likely real because the interval excludes zero.

🎤 Interview Q&A · 20 questions
What is a confidence interval?

A range of plausible values for a population parameter, estimated from a sample, with a stated confidence level like 95%.

What does '95% confidence' actually mean?

If you repeated the sampling many times, about 95% of the constructed intervals would contain the true parameter. It does not mean this one particular interval has a 95% chance of containing it.

How is a confidence interval calculated?

Estimate ± (critical value × standard error); for a 95% CI on a mean it's roughly mean ± 1.96 × SE.

What is the margin of error?

The half-width of the interval — the critical value times the standard error.

What widens a confidence interval?

Higher confidence level, greater data variability, and smaller sample size.

How does sample size affect the interval?

Larger samples shrink the standard error, producing a narrower, more precise interval.

90% vs. 95% vs. 99% CI — trade-off?

Higher confidence means a wider interval; you trade precision for the assurance of capturing the true value.

Why use a t-distribution instead of normal?

When the population SD is unknown and the sample is small; the t-distribution's heavier tails account for the extra uncertainty.

How do confidence intervals relate to hypothesis tests?

If a 95% CI for a difference excludes zero, the result is significant at the 5% level — they're two views of the same inference.

Can you say there's a 95% probability the parameter is in this interval?

Not in frequentist terms — the parameter is fixed; the interval is random. That probabilistic phrasing belongs to Bayesian credible intervals.

What is a credible interval?

The Bayesian analogue — a range that contains the parameter with a stated probability given the prior and data.

What is the critical value?

The multiplier (e.g. 1.96 for 95% normal) from the relevant distribution that sets the interval's width for a confidence level.

How do you build a CI for a proportion?

p̂ ± z × √(p̂(1−p̂)/n), or better methods (Wilson) for small samples or extreme proportions.

Why report a CI instead of just a point estimate?

It communicates the uncertainty around the estimate, preventing false precision.

What happens to the interval if data is very noisy?

High variability increases the standard error, widening the interval and signalling a less precise estimate.

Does a wider interval mean a worse analysis?

No — it honestly reflects more uncertainty (small sample or high variance); a misleadingly narrow interval is worse.

What assumptions underlie a standard CI for the mean?

Roughly: random sampling, independent observations, and approximate normality of the sampling distribution (via CLT).

How do you halve the margin of error?

You need about four times the sample size, since precision improves with √n.

What is bootstrapping for confidence intervals?

Resampling the data with replacement many times to empirically build the sampling distribution and read off interval bounds — no distribution assumption needed.

Can two overlapping CIs still differ significantly?

Yes — overlapping intervals don't guarantee non-significance; you should test the difference directly.
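
Two points from the Q&A above lend themselves to a short sketch: the interval for a proportion (the simple normal-approximation form and the Wilson alternative), and the "four times the sample to halve the margin" rule. The function names and counts are hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci(successes, n, confidence=0.95):
    """Simple interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval, better behaved for small n or extreme p_hat."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Zero successes in 20 trials: the simple interval collapses to a point.
print("Wald:  ", wald_ci(0, 20))
print("Wilson:", wilson_ci(0, 20))

# Same observed rate (10%) with four times the sample size.
margin_small = (wald_ci(40, 400)[1] - wald_ci(40, 400)[0]) / 2
margin_large = (wald_ci(160, 1600)[1] - wald_ci(160, 1600)[0]) / 2
print("margin ratio:", margin_small / margin_large)
```

With zero successes the simple interval has zero width, which falsely suggests certainty, while the Wilson interval still allows for a non-zero true rate.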

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Estimate with error bars. Compute and visualise 95% CIs for a key metric across segments, and explain which differences are statistically meaningful.

📂 Dataset: Survey or product-metric data

🔄

Bayesian Thinking

Inference

Update your belief as evidence arrives.

🟢 In simple words

Start with a prior belief, see some data, and update to a sharper belief (the posterior). It's how you reason when you have prior knowledge — like a doctor updating a diagnosis after each test result.

🔬 How it actually works

Bayes' theorem: posterior ∝ likelihood × prior. As data accumulates, the posterior concentrates around the truth. Bayesian A/B testing reports the probability that B beats A directly, which many find more intuitive than p-values.
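
A minimal sketch of Bayesian A/B testing, assuming a flat Beta(1, 1) prior and a binomial likelihood so that the posterior is also a Beta distribution. The conversion counts are hypothetical, and P(B beats A) is estimated by Monte Carlo sampling with the standard `random` module.

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

def posterior_params(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta(prior_a, prior_b) prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + (trials - successes)

def prob_b_beats_a(a_params, b_params, draws=20_000):
    """Estimate P(rate_B > rate_A) by sampling both posteriors."""
    wins = sum(
        random.betavariate(*b_params) > random.betavariate(*a_params)
        for _ in range(draws)
    )
    return wins / draws

# Hypothetical experiment: 15,000 users per arm.
a_post = posterior_params(successes=600, trials=15_000)   # control
b_post = posterior_params(successes=690, trials=15_000)   # variant

p_b_better = prob_b_beats_a(a_post, b_post)
print(f"P(B beats A) is roughly {p_b_better:.3f}")
```

The output is a direct probability statement about the variant, which is the quantity stakeholders usually want, rather than a p-value.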

💡 Real example

A spam filter starts with a prior spam rate, then updates the probability an email is spam as it reads each word — classic Bayesian updating.

🎤 Interview Q&A · 20 questions
What is Bayesian inference?

An approach that updates the probability of a hypothesis as new evidence arrives, combining prior belief with observed data.

State Bayes' theorem.

P(H|E) = P(E|H)·P(H) / P(E): posterior is proportional to likelihood times prior.

What is a prior?

Your belief about a parameter before seeing the new data, expressed as a probability distribution.

What is a posterior?

The updated belief about the parameter after combining the prior with the observed data.

What is the likelihood?

The probability of the observed data under a given value of the parameter.

Bayesian vs. frequentist statistics?

Frequentists treat parameters as fixed and data as random; Bayesians treat parameters as random with a probability distribution updated by data.

What is a conjugate prior?

A prior that yields a posterior of the same family (e.g. Beta prior with binomial likelihood → Beta posterior), making updates analytic.

What is an informative vs. uninformative prior?

Informative encodes real prior knowledge; uninformative (flat) expresses little, letting the data dominate.

How does Bayesian A/B testing differ from frequentist?

It directly reports the probability that the variant beats control and the expected loss, instead of a p-value.

What is a credible interval?

A range that contains the parameter with a stated probability given the data and prior — the Bayesian counterpart to a confidence interval.

Why can priors be controversial?

They're subjective; a strong, poorly-chosen prior can bias conclusions, especially with little data.

What happens to the posterior as data grows?

It concentrates around the true value and the prior's influence fades — data overwhelms the prior.

What is the base rate fallacy?

Ignoring the prior probability (base rate), e.g. overestimating disease likelihood from a positive test when the disease is rare.

Give the classic medical-test Bayesian example.

Even with a 99%-accurate test, a positive result for a rare disease often means a low actual probability of disease because the base rate is tiny.

What is marginal likelihood (evidence)?

P(E), the total probability of the data across all hypotheses; it normalises the posterior.

What is MCMC used for?

Markov Chain Monte Carlo samples from complex posteriors that can't be computed analytically.

What is a maximum a posteriori (MAP) estimate?

The most probable parameter value under the posterior — the Bayesian analogue of a point estimate.

When is the Bayesian approach especially useful?

With small data, when prior knowledge exists, or when you need direct probability statements for decisions.

How does Naive Bayes use Bayesian ideas?

It applies Bayes' theorem with a conditional-independence assumption to compute the posterior probability of each class.

Posterior odds vs. prior odds?

Posterior odds = prior odds × likelihood ratio (Bayes factor); the data shifts the odds by the strength of evidence.
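
The medical-test example and the odds form of Bayes' theorem from the Q&A above can be worked through numerically. The 1-in-1,000 prevalence is an assumed figure for illustration, and "99% accurate" is interpreted here as 99% sensitivity and 99% specificity.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                # P(positive and sick)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

def posterior_from_odds(prevalence, sensitivity, specificity):
    """Same answer through posterior odds = prior odds * likelihood ratio."""
    prior_odds = prevalence / (1 - prevalence)
    likelihood_ratio = sensitivity / (1 - specificity)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Assumed inputs: rare disease (0.1% prevalence), 99% sensitivity and specificity.
p = posterior_given_positive(0.001, 0.99, 0.99)
print(f"P(disease | positive) = {p:.3f}")
```

Under these assumptions a positive result still leaves the probability of disease at only about 9%, because false positives from the large healthy group outnumber true positives.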

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Bayesian A/B test. Implement a Beta-Binomial Bayesian test and report P(variant > control); compare conclusions with a frequentist test.

📂 Dataset: Conversion counts from an experiment

🧪

Inference & Testing

Prove it's real, not noise

Deciding whether a pattern is genuine: hypothesis tests, A/B experiments, correlation vs. causation, and regression to quantify relationships.

🧪

Hypothesis Testing

Inference

Is this difference real or just luck?

🟢 In simple words

You assume 'nothing is going on' (the null hypothesis), then check whether your data would be surprising under that assumption. If it's surprising enough, you reject the null and conclude something real is happening.

🔬 How it actually works

Set null and alternative hypotheses, pick a significance level α (often 0.05), compute a test statistic and its p-value, and reject the null if p < α. Common tests: t-test (means), chi-square (categorical association), ANOVA (3+ groups), z-test (proportions).
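
The logic of the steps above (assume the null, measure how surprising the data is, compare with α) can be shown with a permutation test. This is a swapped-in technique chosen because it needs only the standard library; it is not the t-test named above, though it answers the same question about a difference in means. The order values are made up.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided p-value for a difference in means under 'no difference'."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_b) - statistics.fmean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a]))
        if diff >= observed:
            extreme += 1
    # Add-one adjustment keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical average order values under the old and new checkout flows.
old_flow = [1180, 1210, 1195, 1220, 1205, 1190, 1215, 1200]
new_flow = [1260, 1285, 1270, 1295, 1250, 1280, 1275, 1290]

alpha = 0.05
p_value = permutation_p_value(old_flow, new_flow)
print(f"p = {p_value:.4f}:", "reject H0" if p_value < alpha else "fail to reject H0")
```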

💡 Real example

Testing whether a new checkout flow changed average order value: a t-test with p = 0.02 means a difference this large would be unlikely if the flow had no effect, so you reject the null at the 5% level.

🎤 Interview Q&A · 20 questions
What is hypothesis testing?

A procedure for deciding whether sample evidence is strong enough to reject a default assumption about a population.

What is the null hypothesis?

The default 'no effect / no difference' statement that the test tries to disprove (H₀).

What is the alternative hypothesis?

The claim you suspect is true — that there is an effect or difference (H₁).

What is a p-value?

The probability of observing data at least as extreme as yours if the null hypothesis were true.

What does p < 0.05 mean?

The result would occur less than 5% of the time under the null, so it's deemed statistically significant at the 5% level.

What is a common misinterpretation of the p-value?

It is NOT the probability the null is true, nor the probability the result is due to chance — it's conditional on the null being true.

What is the significance level (α)?

The threshold for rejecting the null and the accepted false-positive rate, set before testing (commonly 0.05).

What is a Type I error?

A false positive — rejecting a true null hypothesis; its rate is α.

What is a Type II error?

A false negative — failing to reject a false null hypothesis; its rate is β.

What is statistical power?

1 − β, the probability of correctly detecting a real effect; typically targeted at 80%.

What raises statistical power?

Larger sample size, bigger true effect, lower variance, and a higher α.

One-tailed vs. two-tailed test?

A one-tailed test looks for an effect in one specific direction; a two-tailed test looks for a difference in either direction and is the safer default.

When do you use a t-test?

To compare the means of one or two groups when the population standard deviation is unknown.

When do you use a chi-square test?

To test association between two categorical variables, or goodness-of-fit to expected counts.

What is ANOVA?

Analysis of Variance — tests whether the means of three or more groups differ, avoiding inflated error from many pairwise t-tests.

What is the multiple-comparisons problem?

Running many tests inflates the chance of a false positive; corrections like Bonferroni or FDR control it.

What is the Bonferroni correction?

Dividing α by the number of tests to keep the overall false-positive rate in check (conservative).

Statistical vs. practical significance?

A result can be statistically significant but too small to matter in practice; always look at effect size, not just the p-value.

What is effect size?

A measure of the magnitude of a difference (e.g. Cohen's d), independent of sample size.

What is a test statistic?

A standardised number (t, z, χ², F) summarising how far the data is from the null, mapped to a p-value.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Significance tester. Build a tool that picks and runs the right test (t/chi-square/ANOVA) for two columns and explains the result in plain English.

📂 Dataset: Any experiment or survey dataset

🆎

A/B Testing

Experimentation

Compare two versions, fairly.

🟢 In simple words

Show version A to half your users and version B to the other half at random, then measure which performs better. Randomisation makes the comparison fair: the two groups differ only in the version they see, so a difference too large to be luck can be attributed to the change.

🔬 How it actually works

Randomly assign users to control/variant, define one primary metric, compute the required sample size from your minimum detectable effect and power (usually 80%), run until that size, then test for significance. Watch for peeking, novelty effects, and multiple-comparison inflation.
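
The sample-size step above can be sketched with a common normal-approximation formula for comparing two proportions. This is one of several formulas in use, so treat the result as a planning estimate; the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of mde_abs.

    Uses n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    for a two-sided test.
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Hypothetical plan: 4.0% baseline conversion, detect a lift of 0.6 points.
n_needed = sample_size_per_group(baseline=0.04, mde_abs=0.006)
print(f"about {n_needed:,} users per group")

# Halving the MDE roughly quadruples the required sample.
n_half_mde = sample_size_per_group(baseline=0.04, mde_abs=0.003)
print(f"about {n_half_mde:,} users per group for half the MDE")
```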

💡 Real example

Testing a new 'Buy now' button colour: B lifts conversion from 4.0% to 4.6% with p = 0.01 over two weeks — ship B.

🎤 Interview Q&A · 20 questions
What is an A/B test?

A randomised controlled experiment comparing two versions (control A and variant B) to see which performs better on a chosen metric.

Why is randomisation essential?

It balances confounders across groups so any measured difference can be attributed to the change, not pre-existing differences.

What is a control group?

The group that sees the existing version, providing the baseline to compare the variant against.

How do you choose the primary metric?

Pick one metric that reflects the goal and is sensitive to the change; too many metrics invite cherry-picking.

What is minimum detectable effect (MDE)?

The smallest effect you care to detect; smaller MDEs require larger samples.

How do you determine sample size?

From the baseline rate, the MDE, the significance level (α), and desired power (usually 80%).

What is the peeking problem?

Repeatedly checking results and stopping when significant inflates false positives; fix with a fixed horizon or sequential testing.

What is a novelty effect?

A temporary behaviour change just because something is new, which can fade — run the test long enough to see past it.

What is a guardrail metric?

A metric you don't want to harm (e.g. latency, revenue) monitored alongside the primary metric.

What is the Simpson's paradox risk in A/B tests?

An overall winner can lose within every segment (or vice versa); segment the results to be sure.

What is statistical power in an A/B context?

The probability the test detects the MDE if it truly exists; underpowered tests miss real effects.

How do you analyse the result?

Compare the metric with an appropriate test (e.g. proportions z-test), report the lift with a confidence interval, and check guardrails.

What is the sample ratio mismatch (SRM) check?

Verifying the actual split matches the intended one (e.g. 50/50); a significant mismatch signals a broken assignment or logging pipeline, so the results should not be trusted until the cause is found.

What is an A/A test?

Running two identical versions to validate the experiment setup — it should show no significant difference.

Why not stop a test as soon as it's significant?

Early significance is often noise; stopping on it (peeking) greatly inflates the false-positive rate.

What is multivariate testing?

Testing multiple element combinations at once to find interactions, requiring much larger samples than a simple A/B test.

What are network effects in experiments?

When one user's treatment affects another's outcome (social apps), breaking independence; cluster-level randomisation helps.

How long should an A/B test run?

At least one full business cycle (often a week or two) to cover weekday/weekend behaviour and reach the planned sample size.

What is the difference between absolute and relative lift?

Absolute lift is the raw difference (4% → 5% = +1pp); relative lift expresses it as a percentage of baseline (+25%).

What does a non-significant result mean?

Not that there's no effect — only that you couldn't detect one; it may be too small or the test underpowered.
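
The analysis steps from the Q&A above (a proportions z-test, absolute and relative lift, and a confidence interval on the lift) can be combined in one sketch. The user counts are hypothetical, picked so the rates are 4.0% and 4.6%; the function name and return keys are illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test for a difference in conversion rates, plus lift CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval around the observed lift.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_a,
        "p_value": p_value,
        "ci": (diff - crit * se_diff, diff + crit * se_diff),
    }

# Hypothetical counts: 15,000 users per arm, 600 vs 690 conversions.
result = two_proportion_test(conv_a=600, n_a=15_000, conv_b=690, n_b=15_000)
print(f"absolute lift: {result['absolute_lift']:.4f}")
print(f"relative lift: {result['relative_lift']:.1%}")
print(f"p-value: {result['p_value']:.4f}")
print(f"95% CI on lift: {result['ci'][0]:.4f} to {result['ci'][1]:.4f}")
```

A ship/no-ship call would also look at guardrail metrics and the sample ratio, which this sketch leaves out.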

🛠 Project idea & 📚 resources

🛠 Build it — project idea

A/B test analyzer. Build an end-to-end analyzer: sample-size calculator, significance test, confidence interval on the lift, and a ship/no-ship verdict.

📂 Dataset: Kaggle A/B test datasets

🔗

Correlation vs. Causation

Inference

Moving together ≠ one causing the other.

🟢 In simple words

Ice-cream sales and drownings both rise in summer, but ice cream doesn't cause drowning — heat drives both. Correlation measures whether two things move together; causation means one actually drives the other.

🔬 How it actually works

Pearson's r measures linear correlation (−1 to 1); Spearman handles monotonic/ranked relationships. Correlation alone can't prove causation because of confounders, reverse causality, or coincidence. Causation needs a randomised experiment or careful causal-inference methods.
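
Here is a small from-scratch sketch of both coefficients, written with the standard library so the formulas stay visible. The helper names and data are hypothetical. Spearman is computed as Pearson's r on the ranks, with tied values sharing their average rank.

```python
import math
import statistics

def pearson(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Monotonic but non-linear: y always rises with x, just not in a straight line.
x_mono = [1, 2, 3, 4, 5, 6]
y_mono = [v ** 3 for v in x_mono]
print(pearson(x_mono, y_mono), spearman(x_mono, y_mono))

# U-shaped: a perfect relationship that linear correlation cannot see.
x_u = [-3, -2, -1, 0, 1, 2, 3]
y_u = [v ** 2 for v in x_u]
print(pearson(x_u, y_u))
```

The cubic data gives a Spearman value of 1 but a Pearson value below 1, and the U-shaped data gives a Pearson value of 0 despite y being fully determined by x.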

💡 Real example

A dashboard shows users who use feature X retain better. Before claiming X causes retention, check whether power users (a confounder) simply use everything more.

🎤 Interview Q&A · 20 questions
What is correlation?

A statistical measure of how two variables move together, ranging from −1 (perfect inverse) to +1 (perfect direct).

What is Pearson's correlation coefficient?

A measure of the strength and direction of a linear relationship between two continuous variables.

What is Spearman's correlation?

A rank-based correlation capturing monotonic relationships; it is less sensitive to outliers than Pearson's r and still works when the trend is non-linear but consistently rising or falling.

Why doesn't correlation imply causation?

Because a third variable (confounder), reverse causality, or pure coincidence can produce correlation without a causal link.

What is a confounding variable?

A variable that influences both the supposed cause and effect, creating a misleading association.

Give an example of spurious correlation.

Ice-cream sales and drowning deaths correlate because both rise with summer heat — the confounder — not because one causes the other.

How can you establish causation?

Best via a randomised controlled experiment; otherwise with causal-inference methods that control for confounders.

What is reverse causality?

When the assumed effect actually causes the assumed cause — e.g. 'support tickets correlate with churn' may run either direction.

What does a correlation of 0 mean?

No linear relationship — but a strong non-linear relationship (e.g. U-shaped) can still exist.

What is the range of a correlation coefficient?

−1 to +1; the sign gives direction and the magnitude gives strength.

Why visualise before trusting a correlation?

Anscombe's quartet shows datasets with identical correlations but wildly different shapes; a scatter plot reveals the truth.

What is a confounder-control technique without an experiment?

Stratification, multivariable regression, matching, or propensity scoring to adjust for known confounders.

What is the difference between association and causation?

Association means a statistical relationship exists; causation means changing one variable would change the other.

What are natural experiments?

Situations where an external event randomly assigns treatment, enabling causal estimates without a designed experiment.

What is selection bias's role in false causation?

If the groups compared differ systematically from the start, any difference can be wrongly attributed to the treatment.

What is the counterfactual in causal inference?

What would have happened to the same unit without the treatment — unobservable, so we estimate it via controls or randomisation.

What is a lurking variable?

Another term for an unmeasured confounder that drives an observed correlation.

How does correlation relate to regression?

Simple linear regression's slope is directly related to Pearson's correlation; both summarise a linear relationship.

Why is 'controlling for variables' important?

It isolates the relationship of interest from confounders, moving an analysis closer to a causal claim.

What is a randomised controlled trial (RCT)?

The gold standard for causation: random assignment to treatment/control balances confounders on average, so differences in outcomes can be attributed to the treatment.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Confounder hunt. Find a strong correlation in a dataset, then identify a plausible confounder and show how controlling for it changes the story.

📂 Dataset: Any multi-feature dataset

📉

Regression Analysis

Modelling

Quantify how drivers move an outcome.

🟢 In simple words

Regression draws the best line (or surface) relating inputs to an outcome, telling you both the direction and size of each driver's effect — e.g. 'each extra ad rupee adds 2 sales, holding price constant'.

🔬 How it actually works

Ordinary Least Squares fits coefficients minimising squared error; each coefficient is the effect of a one-unit change holding others constant. Check significance (p-values), fit (R²/adjusted R²), and assumptions (linearity, independence, constant variance, no severe multicollinearity).

💡 Real example

Modelling sales on price, ad spend, and season to attribute revenue drivers and simulate 'what if we cut price 5%?'.

🎤 Interview Q&A · 20 questions
What is regression analysis?

A method to model and quantify the relationship between a dependent variable and one or more independent variables.

What does a regression coefficient represent?

The expected change in the outcome for a one-unit increase in that predictor, holding the others constant.

What is the difference between simple and multiple regression?

Simple uses one predictor; multiple uses several, allowing you to control for confounders.

What does R² tell you?

The proportion of variance in the outcome explained by the model: 0 to 1 for OLS with an intercept on the data it was fitted to, though it can turn negative on held-out data.

Why prefer adjusted R²?

It penalises each added predictor, so it rises only when a new variable improves the fit by more than chance alone would be expected to.
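A short sketch of the standard adjustment, 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The function name and the R² values are illustrative assumptions.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R² for a model with an intercept and n_predictors predictors."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / dof

# Hypothetical case: a 4th predictor nudges plain R² from 0.800 to 0.801.
base = adjusted_r2(0.800, n_obs=50, n_predictors=3)
with_weak_predictor = adjusted_r2(0.801, n_obs=50, n_predictors=4)

print(round(base, 4), round(with_weak_predictor, 4))
```

Plain R² went up, but the adjusted value goes down, which signals that the extra variable did not earn its place.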

What is multicollinearity?

High correlation among predictors that destabilises coefficient estimates; detect it with the Variance Inflation Factor.

What are OLS assumptions?

Linearity, independent errors, homoscedasticity, and little multicollinearity; normally distributed residuals matter mainly for exact p-values and confidence intervals in small samples.

What is homoscedasticity?

Constant variance of the residuals across all fitted values; violating it is heteroscedasticity.

How do you interpret a coefficient's p-value?

It tests the null hypothesis that the predictor's coefficient is zero; a small p means the estimate would be surprising under that null, i.e. statistically (not necessarily practically) significant.

What are residuals?

The differences between observed and predicted values; their patterns reveal model problems.

How do dummy variables work?

Categorical variables are encoded as 0/1 indicators, with one category dropped as the reference baseline.

What is an interaction term?

A product of two predictors that lets the effect of one depend on the level of another.

Logistic vs. linear regression?

Linear predicts a continuous outcome; logistic predicts the probability of a binary outcome via the sigmoid.

What does a residuals-vs-fitted plot diagnose?

Non-linearity, heteroscedasticity, and outliers — ideally it shows random scatter around zero.

What is overfitting in regression?

Adding too many predictors so the model fits noise; it shows high training R² but poor out-of-sample performance.

How do you handle non-linear relationships?

Add polynomial or transformed terms (log, sqrt), or use a non-linear model.

What is regularisation (Ridge/Lasso)?

Penalising large coefficients to reduce overfitting; Lasso can also zero out weak predictors for selection.

What's the danger of extrapolation?

Predicting outside the range of the training data, where the fitted relationship may not hold.

What is an influential point?

An observation that disproportionately changes the fit; detect with leverage and Cook's distance.

How do you validate a regression model?

Check assumptions via residual plots and assess predictive accuracy on held-out data (cross-validation), not just R².

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Driver analysis. Fit a multiple regression on a business outcome, interpret each coefficient, and present the top levers with caveats.

📂 Dataset: Marketing / sales dataset

📈

Exploratory & Business Analytics Turn data into decisions

The day-to-day analyst toolkit: exploring data, defining metrics, and analysing users by cohort, funnel, value, and retention.

🔍

Exploratory Data Analysis (EDA)

Exploration

Interrogate the data before modelling.

🟢 In simple words

Before drawing conclusions, you explore: plot distributions, spot outliers, check missing values, and look at how variables relate. EDA is detective work that prevents garbage-in-garbage-out.

🔬 How it actually works

Univariate analysis (histograms, box plots) for each variable; bivariate/multivariate analysis (scatter plots, correlation heatmaps, group-bys) for relationships; plus missing-value and outlier audits. The goal is hypotheses and data-quality issues, not final answers.

💡 Real example

Plotting a revenue column reveals negative values (refunds coded wrong) and a long tail — fixing these before modelling avoids misleading results.

🎤 Interview Q&A · 20 questions
What is exploratory data analysis (EDA)?

The initial investigation of data to understand its structure, spot anomalies, test assumptions, and form hypotheses before modelling.

What are the main goals of EDA?

Understand distributions and relationships, find data-quality issues, detect outliers, and generate hypotheses.

What is univariate analysis?

Examining one variable at a time — its distribution, central tendency, and spread (histograms, box plots).

What is bivariate analysis?

Examining the relationship between two variables, e.g. with scatter plots, cross-tabs, or correlation.

How do you handle missing values during EDA?

Quantify them, understand why they're missing, then decide to drop, impute, or flag — never ignore them silently.

What is MCAR vs. MAR vs. MNAR?

Missing Completely At Random, Missing At Random (explained by other variables), and Missing Not At Random (related to the missing value itself) — each needs different handling.

How do you detect outliers in EDA?

Box plots, z-scores, the IQR rule, and scatter plots; then investigate whether they're errors or genuine extremes.
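As one concrete instance, the IQR rule can be sketched with the standard library. The data and the helper name `iqr_outliers` are made up, and the usual multiplier of 1.5 is a convention rather than a law.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

order_values = [10, 12, 11, 13, 12, 11, 10, 100]  # hypothetical, one extreme
flagged = iqr_outliers(order_values)
print(flagged)
```

Flagging is only the first half of the job: each flagged value still needs a decision about whether it is an error or a genuine extreme.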

What is a correlation heatmap used for?

Quickly spotting strongly related variables and potential multicollinearity across many features.

Why visualise distributions early?

Summary stats hide shape; plots reveal skew, multimodality, and gaps that change how you analyse the data.

What data-quality checks belong in EDA?

Duplicates, impossible values, inconsistent categories, wrong types, and out-of-range entries.

How do you explore categorical variables?

Frequency counts, bar charts, and cross-tabulations against the target.

What is feature engineering's link to EDA?

EDA surfaces patterns and relationships that suggest useful new features (ratios, bins, date parts).

What is the role of group-by in EDA?

Aggregating a metric across segments to compare behaviour and uncover where differences live.

How do you check for data leakage during EDA?

Look for features that wouldn't be available at prediction time or that suspiciously predict the target perfectly.

What is a pair plot?

A grid of scatter plots for every pair of variables, with distributions on the diagonal — a fast multivariate overview.

Why segment data before drawing conclusions?

Aggregates can hide or reverse patterns (Simpson's paradox); segmenting reveals the real story.
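A small sketch of how the reversal can happen, using illustrative numbers for two variants (A and B) split by a hypothetical device segment.

```python
# (conversions, visitors) per variant within each segment; illustrative numbers
segments = {
    "desktop": {"A": (81, 87), "B": (234, 270)},
    "mobile": {"A": (192, 263), "B": (55, 80)},
}

def rate(conversions, visitors):
    return conversions / visitors

for name, variants in segments.items():
    a, b = rate(*variants["A"]), rate(*variants["B"])
    print(f"{name}: A={a:.1%}  B={b:.1%}")  # A is ahead inside each segment

def overall(variant):
    """Pooled conversion rate for one variant across all segments."""
    conversions = sum(seg[variant][0] for seg in segments.values())
    visitors = sum(seg[variant][1] for seg in segments.values())
    return conversions / visitors

print(f"overall: A={overall('A'):.1%}  B={overall('B'):.1%}")  # B is ahead pooled
```

The flip comes from the mix: most of B's traffic sits in the high-converting segment, so the pooled rate rewards B for where its users came from rather than for how well it converts them.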

What is the first thing to check on a new dataset?

Shape, column types, a sample of rows, missing-value counts, and basic summary statistics.

How do automated profiling tools help EDA?

Tools like ydata-profiling generate distributions, correlations, and quality warnings in one report to speed exploration.

What is the difference between EDA and confirmatory analysis?

EDA is open-ended discovery; confirmatory analysis formally tests pre-specified hypotheses.

Why document EDA findings?

They justify cleaning and modelling decisions and prevent others (or future you) from repeating the investigation.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

EDA report. Produce a structured EDA notebook for a fresh dataset: quality checks, distributions, relationships, and 5 data-driven hypotheses.

📂 Dataset: Any unfamiliar Kaggle dataset

🪣

Cohort & Funnel Analysis

Product analytics

Where users group up — and drop off.

🟢 In simple words

A funnel shows how many users make it through each step (visit → sign-up → purchase) so you see where they leak. A cohort groups users by when they joined to compare how each group behaves over time.

🔬 How it actually works

Funnel analysis counts conversion between sequential steps to find the biggest drop-off. Cohort analysis buckets users (e.g. by signup month) and tracks a metric across their lifetime, separating 'are new users worse?' from 'is the product declining?'.

💡 Real example

A funnel reveals a 60% drop between 'add to cart' and 'checkout' → fix the shipping-cost surprise. A cohort table shows June signups retain better than May's after a feature launch.

🎤 Interview Q&A · 20 questions
What is a funnel analysis?

Measuring how many users progress through sequential steps of a process to find where the biggest drop-offs occur.

What is a cohort analysis?

Grouping users by a shared start characteristic (e.g. signup month) and tracking a metric across their lifetime.

Why use cohorts instead of overall averages?

Overall metrics mix old and new users; cohorts isolate how each group behaves over time, revealing real trends.

What is conversion rate in a funnel?

The percentage of users moving from one step to the next, or from the top to the bottom of the funnel.

How do you find the biggest opportunity in a funnel?

Identify the step with the largest drop-off relative to its potential impact and focus there.

What is retention cohort analysis?

Tracking what fraction of each signup cohort remains active in week/month 1, 2, 3… to see retention shape over time.

What is an acquisition cohort vs. behavioural cohort?

Acquisition cohorts group by when users joined; behavioural cohorts group by an action they took.

Why might a funnel step have >100% conversion?

Usually a tracking issue (double counting, users entering mid-funnel) — a signal to audit the event data.

How do cohorts help diagnose a metric drop?

They separate 'new users are worse' from 'the whole product is declining' by isolating each group's trajectory.

What is time-to-convert analysis?

Measuring how long users take to move through the funnel, revealing friction and natural decision lags.

What is a leaky funnel?

One where many users drop between steps; plugging the biggest leak yields the largest conversion gain.

How do you build a funnel in SQL?

Count distinct users at each event step (often with conditional aggregation or sequential joins on timestamps).
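A runnable sketch of the conditional-aggregation approach, using Python's built-in sqlite3 so the SQL can be tried without a database server. The `events` table, its columns, and the rows are hypothetical. Note that this version counts users per step without enforcing step order, so it behaves like an open funnel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "visit"), (1, "signup"), (1, "purchase"),
        (2, "visit"), (2, "signup"),
        (3, "visit"),
        (4, "visit"), (4, "signup"), (4, "purchase"),
        (4, "visit"),  # repeat event: DISTINCT keeps it from double counting
    ],
)

visits, signups, purchases = conn.execute(
    """
    SELECT
        COUNT(DISTINCT CASE WHEN event = 'visit'    THEN user_id END),
        COUNT(DISTINCT CASE WHEN event = 'signup'   THEN user_id END),
        COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END)
    FROM events
    """
).fetchone()
conn.close()

print(visits, signups, purchases)
print(f"visit to signup: {signups / visits:.0%}")
print(f"signup to purchase: {purchases / signups:.0%}")
```

A closed funnel would additionally compare timestamps so that a signup only counts if it follows that user's visit.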

What does a flattening retention curve indicate?

A stable core of loyal users — the plateau height is your realistic long-term retention ceiling.

Why segment funnels?

Conversion often varies by device, channel, or geography; segmenting reveals where to focus fixes.

What is the difference between open and closed funnels?

Closed funnels require strict step order; open funnels allow users to enter or skip steps, affecting how you count.

How do you visualise a cohort analysis?

A triangular heatmap (cohorts as rows, periods as columns) coloured by the retention/metric value.
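The grid behind that heatmap can be sketched in a few lines. The activity records below are hypothetical, with months stored as integer indices to keep the example short.

```python
from collections import defaultdict

# (user_id, signup_month, active_month); hypothetical records
activity = [
    ("a", 0, 0), ("a", 0, 1), ("a", 0, 2),
    ("b", 0, 0), ("b", 0, 1),
    ("c", 1, 1), ("c", 1, 2),
    ("d", 1, 1),
]

cohort_users = defaultdict(set)   # cohort -> every user who signed up in it
active_users = defaultdict(set)   # (cohort, age in months) -> users active then
for user, signup, month in activity:
    cohort_users[signup].add(user)
    active_users[(signup, month - signup)].add(user)

# Retention = share of the cohort active at each age since signup.
retention = {
    (cohort, age): len(users) / len(cohort_users[cohort])
    for (cohort, age), users in active_users.items()
}

for (cohort, age), value in sorted(retention.items()):
    print(f"cohort {cohort}, month +{age}: {value:.0%}")
```

The newer cohort has no month +2 entry yet, which is what produces the triangular shape and why cohorts are compared at the same age.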

What is survivorship bias in cohort analysis?

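Analysing only the users who are still around flatters a cohort, because those who churned have dropped out of the data. A related trap is cohort maturity: newer cohorts have less observed time, so compare cohorts at the same age, not the same calendar date.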

What business question does funnel analysis answer?

'Where are we losing users, and which step should we fix first?'

What business question does cohort analysis answer?

'Are users acquired recently better or worse than before, and how does engagement evolve?'

How do you tie funnel and cohort together?

Compare funnel conversion across cohorts to see whether product changes improved progression for newer users.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Funnel + cohort dashboard. Build a conversion funnel and a monthly retention cohort grid from event data, and recommend the highest-impact fix.

📂 Dataset: E-commerce or app event logs

🧩

RFM & Customer Segmentation

Customer analytics

Group customers by behaviour, then act.

🟢 In simple words

Not all customers are equal. RFM scores each by how Recently they bought, how Frequently, and how much Money they spend — turning a customer list into actionable groups like 'champions' and 'about to churn'.

🔬 How it actually works

Score each customer 1–5 on Recency, Frequency, and Monetary value (quantiles), then combine into segments. For richer segmentation, cluster on behavioural features with K-Means. The output drives targeted marketing and retention.

💡 Real example

Tagging high-frequency, high-value, recently-active customers as 'VIPs' for a loyalty offer, and lapsed high-spenders for a win-back campaign.

🎤 Interview Q&A · 20 questions
What is RFM analysis?

A customer-segmentation technique scoring each customer on Recency, Frequency, and Monetary value.

What does Recency measure?

How recently a customer last purchased — recent buyers are more likely to buy again.

What does Frequency measure?

How often a customer purchases within a period — frequent buyers are more engaged.

What does Monetary measure?

How much a customer spends in total — high spenders are most valuable.

How are RFM scores computed?

Each dimension is split into quantiles (e.g. 1–5), then combined into a score or segment label.
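A minimal sketch of the scoring step on a made-up transaction log. The helper `rank_score` is a hypothetical simplification: it bins customers by rank position, so tied customers can land in different bins, whereas a proper quantile cut would score ties equally.

```python
from collections import defaultdict
from datetime import date

transactions = [  # (customer_id, purchase_date, amount); hypothetical data
    ("c1", date(2024, 6, 28), 500.0),
    ("c1", date(2024, 6, 10), 300.0),
    ("c1", date(2024, 5, 2), 250.0),
    ("c2", date(2024, 3, 15), 900.0),
    ("c3", date(2024, 6, 1), 120.0),
    ("c3", date(2024, 4, 20), 80.0),
    ("c4", date(2024, 1, 5), 60.0),
    ("c5", date(2024, 5, 25), 400.0),
]
snapshot = date(2024, 7, 1)  # "today" for the recency calculation

last_purchase = {}
frequency = defaultdict(int)
monetary = defaultdict(float)
for customer, day, amount in transactions:
    last_purchase[customer] = max(day, last_purchase.get(customer, day))
    frequency[customer] += 1
    monetary[customer] += amount
recency = {c: (snapshot - d).days for c, d in last_purchase.items()}

def rank_score(values, higher_is_better=True, bins=5):
    """Score customers 1..bins by rank; the best customers get the top score."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {c: 1 + i * bins // len(ordered) for i, c in enumerate(ordered)}

r = rank_score(recency, higher_is_better=False)  # fewer days since purchase is better
f = rank_score(frequency)
m = rank_score(monetary)
rfm = {c: (r[c], f[c], m[c]) for c in recency}

for customer, scores in sorted(rfm.items()):
    print(customer, scores)
```

Segment labels such as 'champions' are then just rules over these triples, for example all three scores at 4 or above.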

Why segment customers at all?

Different groups need different treatment; targeting beats blanket campaigns on both cost and conversion.

What is a 'champions' segment?

Customers high on all three — recent, frequent, high-value — ideal for loyalty and advocacy programs.

What is an 'at-risk' segment?

Previously valuable customers who haven't purchased recently — prime targets for win-back campaigns.

RFM vs. K-Means segmentation?

RFM is rule-based and interpretable; K-Means clusters on many behavioural features for richer but less transparent segments.

What is customer lifetime value (CLV)?

The total value a customer is expected to generate over their relationship — guides acquisition spend.

Why scale features before clustering customers?

Distance-based clustering is dominated by large-range features (like monetary), so standardising is essential.

How do you choose the number of segments?

Balance statistical signals (elbow/silhouette for clustering) with how many groups the business can actually act on.

What makes a segmentation 'actionable'?

Segments that are distinct, sizable, stable, and map to a clear, different action.

What is behavioural segmentation?

Grouping by what users do (usage patterns, features used) rather than who they are demographically.

How does RFM drive marketing?

Each segment gets a tailored message — VIP perks for champions, reactivation offers for lapsed high-spenders.

What data do you need for RFM?

A transaction log with customer ID, purchase date, and amount.

What is a limitation of RFM?

It's backward-looking and ignores product mix, channel, and engagement beyond purchases.

How do you validate segments?

Profile each segment, check they differ on outcomes, and test whether targeted actions actually lift results.

What is the Pareto principle in customer analytics?

Often ~80% of revenue comes from ~20% of customers — segmentation finds and protects that 20%.

How often should segments be refreshed?

Regularly (e.g. monthly), because customers move between segments as their behaviour changes.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

RFM segmentation engine. Compute RFM scores from transaction data, define segments, and produce a per-segment action plan with sizes and value.

📂 Dataset: Online Retail (UCI) transactions

📉

Retention & Churn Analysis

Customer analytics

Who stays, who leaves, and why.

🟢 In simple words

Retention measures how many users keep coming back; churn is the opposite — who leaves. Since keeping a customer is far cheaper than winning a new one, understanding churn is one of the highest-value analyses.

🔬 How it actually works

Define active vs. churned with a clear window, then track retention curves and churn rate over time. Diagnose drivers with cohort comparisons and regression, and (optionally) predict churn risk with a classifier to target interventions.

💡 Real example

A retention curve flattening at 30% after month 3 sets the realistic ceiling; cohort analysis shows onboarding changes lifted month-1 retention by 8 points.

🎤 Interview Q&A · 20 questions
What is churn?

The rate at which customers stop using a product or cancel a subscription over a period.

What is retention?

The fraction of customers who continue to use the product over time; over a given period it is the complement of churn (retention = 1 − churn).

Why is retention so important?

Acquiring a new customer costs far more than keeping one, so small retention gains compound into large revenue.

How do you define an 'active' user?

With a clear, product-specific action and time window (e.g. logged in within 30 days) — the definition shapes every metric.

What is a retention curve?

A plot of the percentage of a cohort still active at each period after signup; its plateau is the loyal core.

What is voluntary vs. involuntary churn?

Voluntary is a deliberate cancel; involuntary is failed payments or expired cards — each needs a different fix.

What is gross vs. net revenue churn?

Gross counts lost revenue only; net subtracts expansion (upsells) and can even be negative if expansion outpaces losses.

How do you predict churn?

Train a classifier (e.g. logistic regression or gradient boosting) on behavioural and account features to score churn risk.

What features predict churn well?

Declining usage/engagement, support tickets, time since last activity, tenure, and plan/price changes.

How do you evaluate a churn model?

With precision/recall, ROC-AUC or PR-AUC (data is imbalanced), and the business lift from acting on the scores.

Why is accuracy a poor metric for churn?

Churn is imbalanced; predicting 'no churn' for everyone can be highly accurate yet useless.
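The trap is easy to reproduce. This sketch assumes a hypothetical 5% churn rate and a "model" that predicts no churn for everyone.

```python
actual = [1] * 5 + [0] * 95   # 1 = churned; hypothetical 5% churn rate
predicted = [0] * 100         # always predicts "no churn"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)   # share of real churners that were caught

print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}")
```

Accuracy looks excellent while recall shows that not a single churner was found, which is why precision/recall or PR-AUC are the more honest measures here.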

What is the difference between logo churn and revenue churn?

Logo (customer) churn counts accounts lost; revenue churn weights by the money those accounts represented.

How do you reduce churn?

Improve onboarding, target at-risk users with interventions, fix failed payments, and address the top drivers found in analysis.

What is N-day retention?

The share of users active exactly N days after signup (e.g. Day-7 retention).

What is rolling retention?

Counting a user as retained if active on or after day N, which is more forgiving than exact-day retention.
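The difference between the two definitions shows up clearly on a toy example. The activity data and function names below are illustrative assumptions.

```python
# user -> days since signup on which the user was active (hypothetical)
activity = {
    "u1": {0, 1, 7},
    "u2": {0, 3, 10},
    "u3": {0, 1},
    "u4": {0, 7, 30},
}

def n_day_retention(activity, n):
    """Share of users active exactly on day n."""
    return sum(n in days for days in activity.values()) / len(activity)

def rolling_retention(activity, n):
    """Share of users active on day n or any later day."""
    return sum(max(days) >= n for days in activity.values()) / len(activity)

print(n_day_retention(activity, 7))    # u1 and u4 were active on day 7
print(rolling_retention(activity, 7))  # u2 also counts: active on day 10
```

Rolling retention can never be lower than N-day retention for the same N, so the two should not be mixed in one chart.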

What is cohort-based churn analysis?

Comparing churn across signup cohorts to see whether product changes improved retention for newer users.

How does survival analysis apply to churn?

It models time-until-churn (e.g. Kaplan-Meier, Cox), handling customers who haven't churned yet (censoring).
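To show how censoring is handled, here is a hand-rolled sketch of the Kaplan-Meier estimator on made-up data. In practice a survival library would normally be used; the function below is an illustrative assumption, not any library's API.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at each time where at least one churn occurred.

    durations: time each customer was observed
    churned:   True if the customer churned at that time, False if still
               active when observation ended (censored)
    """
    survival, curve = 1.0, []
    event_times = sorted({d for d, c in zip(durations, churned) if c})
    for t in event_times:
        at_risk = sum(d >= t for d in durations)  # still under observation at t
        events = sum(d == t and c for d, c in zip(durations, churned))
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

durations = [2, 3, 3, 5, 8, 8]                      # months observed (hypothetical)
churned = [True, True, False, True, False, False]   # False = censored

for t, s in kaplan_meier(durations, churned):
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored customers stay in the at-risk count until their observation ends, so they contribute information without being treated as churned.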

What is the churn rate formula?

Customers lost in a period divided by customers at the start of that period.

How do you act on a churn-risk score?

Prioritise high-risk, high-value customers for targeted retention offers, balancing intervention cost against saved value.

What is negative churn?

When expansion revenue from existing customers exceeds revenue lost to churn — a strong growth signal.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Churn diagnosis + predictor. Compute retention curves and churn rate, find the top churn drivers, and build a simple churn-risk model to flag at-risk users.

📂 Dataset: Telco Customer Churn (Kaggle)

🎯

KPIs & Metrics Design

Measurement

Measure what actually matters.

🟢 In simple words

A KPI is the one number a team rallies around. Good metric design means picking measures that reflect real success and can't be gamed — and knowing the difference between a vanity metric and an actionable one.

🔬 How it actually works

Distinguish vanity vs. actionable metrics, leading vs. lagging indicators, and the 'North Star' that captures core value. Define each metric precisely (numerator, denominator, window), pair it with a guardrail metric, and segment it to avoid Simpson's-paradox traps.

💡 Real example

Choosing 'weekly active users who complete a core action' as a North Star instead of raw signups — which can rise while real engagement falls.

🎤 Interview Q&A · 20 questions
What is a KPI?

A Key Performance Indicator — a metric chosen to track progress toward a specific business objective.

Metric vs. KPI?

Every KPI is a metric, but a KPI is specifically tied to a goal and watched closely; not every metric is a KPI.

What is a vanity metric?

A number that looks impressive (total signups, page views) but doesn't inform decisions or reflect real value.

What is an actionable metric?

One that ties to a decision — when it moves, you know what to do about it.

What is a North Star metric?

The single metric that best captures the core value your product delivers, aligning the whole team.

Leading vs. lagging indicators?

Leading indicators predict future outcomes (e.g. trial usage); lagging indicators confirm past results (e.g. revenue).

Why pair a metric with a guardrail?

To ensure improving the target doesn't quietly harm something else (e.g. boosting engagement while hurting retention).

How do you define a metric precisely?

Specify the numerator, denominator, time window, and population so it's unambiguous and reproducible.

What is metric gaming (Goodhart's Law)?

'When a measure becomes a target, it ceases to be a good measure' — people optimise the number, not the goal.

Why segment a KPI?

Aggregates hide divergent group behaviour (Simpson's paradox); segmenting reveals what's really driving the number.

What is the difference between a ratio and a count metric?

Counts grow with scale and can mislead; ratios (rates, per-user) normalise and are more comparable over time.

What is the AARRR (pirate) metrics framework?

Acquisition, Activation, Retention, Referral, Revenue — a funnel framing for product/growth KPIs.

What is the HEART framework?

Google's UX metrics: Happiness, Engagement, Adoption, Retention, Task success.

How do you choose a North Star?

Pick the leading metric most correlated with long-term value and that the team can influence — not raw revenue or signups.

What is a counter-metric?

Another name for a guardrail — a metric watched to prevent unintended harm while optimising the primary KPI.

Why is 'active users' ambiguous?

Without defining the action and window, DAU/MAU can be inflated or inconsistent across teams.

What is DAU/MAU ratio?

Daily over Monthly Active Users — a stickiness measure of how often monthly users return.

How many KPIs should a team track?

Few — one North Star plus a handful of supporting and guardrail metrics; too many dilute focus.

What makes a good metric?

It's relevant to the goal, clearly defined, comparable over time, hard to game, and tied to action.

What is the difference between output and outcome metrics?

Output measures activity (features shipped); outcome measures impact (users helped) — outcomes matter more.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Metrics framework. For a product of your choice, design a North Star, supporting metrics, and guardrails, each with a precise definition and rationale.

📂 Dataset: Conceptual + a sample event dataset

⏱️

Time Series & Forecasting Patterns over time, and what's next

Working with data ordered in time — spotting trend and seasonality, and projecting future values with quantified uncertainty.

📅

Time Series Analysis

Temporal

Decompose data that moves over time.

🟢 In simple words

Data measured over time (daily sales, hourly traffic) has structure: a long-term trend, repeating seasonal cycles, and random noise. Pulling these apart tells you what's really changing versus what's just the usual weekly rhythm.

🔬 How it actually works

Decompose a series into trend, seasonality, and residual (additive or multiplicative). Check stationarity (e.g. ADF test), use rolling means/differencing to stabilise it, and inspect autocorrelation (ACF/PACF) to understand dependence on past values.

💡 Real example

Decomposing retail sales separates a steady upward trend from the December spike, so you don't mistake normal seasonality for real growth.

🎤 Interview Q&A · 20 questions
What is a time series?

A sequence of data points indexed in time order, such as daily sales or hourly temperature.

What are the components of a time series?

Trend (long-term direction), seasonality (regular cycles), and residual/noise (random variation).

What is trend?

The long-term increase or decrease in the series, independent of short cycles.

What is seasonality?

Patterns that repeat at fixed periods, like higher retail sales every December.

What is cyclicality vs. seasonality?

Seasonality has a fixed known period; cycles are longer, irregular fluctuations (like economic booms) with no fixed length.

What is stationarity?

A series whose statistical properties (mean, variance, autocorrelation structure) don't change over time; many models, such as ARMA, assume it.

How do you test for stationarity?

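Visually, plus tests like the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is non-stationarity (a unit root), or the KPSS test, whose null is stationarity; because the nulls are opposite, the two are often read together.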

How do you make a series stationary?

Differencing, removing trend/seasonality, or transforming (e.g. log) to stabilise variance.

What is differencing?

Subtracting the previous value from the current to remove trend and help achieve stationarity.
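A tiny sketch on a made-up series with a linear trend; the `lag` parameter is an illustrative addition that also covers seasonal differencing.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

sales = [100, 110, 120, 130, 140]   # hypothetical series with a steady trend
first_diff = difference(sales)
print(first_diff)                    # the trend collapses to a constant
```

With monthly data, `difference(series, lag=12)` would compare each month with the same month a year earlier, which targets yearly seasonality instead of trend.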

What is autocorrelation?

The correlation of a series with its own past values at various lags.
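A sketch of the usual sample autocorrelation at a given lag, applied to a made-up series that repeats every four periods. The function name is an illustrative assumption.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom

demand = [10, 12, 14, 30, 10, 12, 14, 30, 10, 12, 14, 30]  # hypothetical, period 4

for lag in (1, 2, 4):
    print(lag, round(autocorrelation(demand, lag), 3))
```

The spike at lag 4 is the seasonal period showing up; plotting this across many lags gives the ACF plot.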

What are the ACF and PACF?

Autocorrelation and Partial Autocorrelation Functions — plots used to identify the lag structure and choose ARIMA orders.

What is additive vs. multiplicative decomposition?

Additive when seasonal swings are constant in size; multiplicative when they grow proportionally with the level.

What is a moving average?

Averaging over a sliding window to smooth noise and reveal the underlying trend.

Why can't you use random train/test splits for time series?

It leaks future information into training; you must split chronologically to respect time order.

What is a lag feature?

A past value of the series used as a predictor for the current value.

What is a rolling statistic?

A statistic (mean, std) computed over a moving window, used for smoothing and feature engineering.

What is white noise?

A series of uncorrelated, zero-mean, constant-variance values — what residuals should look like after good modelling.

What is the difference between time series and cross-sectional data?

Time series tracks one entity over time; cross-sectional captures many entities at a single point in time.

Why is seasonality important to model?

Ignoring it makes you mistake normal cycles for real change and produces poor forecasts.

What is a structural break?

A sudden, lasting shift in the series' behaviour (e.g. a policy change) that breaks a single fitted model.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Sales decomposition. Decompose a sales series into trend/seasonality/residual, test stationarity, and summarise the underlying pattern.

📂 Dataset: Retail or web-traffic time series

🔮

Forecasting

Temporal

Project the future with error bars.

🟢 In simple words

Forecasting predicts future values — next month's sales, tomorrow's demand — and, crucially, how confident you are. A forecast without an uncertainty range is just a guess dressed up.

🔬 How it actually works

Methods range from simple (moving average, exponential smoothing/Holt-Winters) to ARIMA/SARIMA, Prophet, and ML models (e.g. gradient boosting on lag features). Validate with time-based splits (never random), report a prediction interval, and benchmark against a naive baseline.

💡 Real example

Forecasting weekly inventory demand with Holt-Winters to cut both stockouts and overstock, with an 80% interval guiding safety stock.

🎤 Interview Q&A · 20 questions
What is forecasting?

Predicting future values of a time series, ideally with a quantified uncertainty range.

Why include a prediction interval?

A point forecast hides risk; an interval communicates how confident the forecast is and supports planning.

What is a naive forecast?

Predicting that the next value equals the last observed value; it is the baseline every model must beat.

What is exponential smoothing?

Forecasting with weighted averages that give more weight to recent observations; Holt-Winters extends it to trend and seasonality.
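A minimal sketch of simple exponential smoothing (level only, no trend or seasonality) on made-up demand figures. Initialising the level with the first observation is one common choice among several.

```python
def simple_exp_smoothing(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing.

    alpha near 1 reacts quickly to recent values; alpha near 0 smooths heavily.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # assumption: start the level at the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

demand = [100, 120, 110]   # hypothetical weekly demand
forecast = simple_exp_smoothing(demand, alpha=0.5)
print(forecast)
```

With alpha = 1 the method reduces to the naive forecast, since the level simply becomes the last observed value.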

What is ARIMA?

AutoRegressive Integrated Moving Average — combines autoregression, differencing for stationarity, and moving-average terms.

What do the p, d, q in ARIMA mean?

p = autoregressive lags, d = differencing order, q = moving-average lags.

What is SARIMA?

ARIMA with added seasonal terms to capture repeating seasonal patterns.

How do you validate a forecast?

Use time-based splits or rolling/expanding-window backtesting — never random splits.

What metrics evaluate forecasts?

MAE, RMSE, MAPE, and sMAPE; MAPE is intuitive but is undefined when an actual value is zero and blows up when actuals are near zero.

What is walk-forward (rolling) validation?

Repeatedly training on data up to time t and testing on t+1, advancing through the series to mimic real use.
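An expanding-window version of that loop can be sketched as follows, scored with MAE. The series and the two toy forecasters are hypothetical; any function that maps a history to a one-step forecast would slot in.

```python
def walk_forward_mae(series, forecast_fn, min_train=3):
    """Mean absolute error of one-step forecasts made in time order."""
    errors = []
    for t in range(min_train, len(series)):
        history = series[:t]             # only data observed before time t
        prediction = forecast_fn(history)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)

def naive(history):
    return history[-1]                   # "use the last value" baseline

def historical_mean(history):
    return sum(history) / len(history)

series = [10, 12, 14, 16, 18, 20, 22]    # hypothetical, steadily trending

naive_mae = walk_forward_mae(series, naive)
mean_mae = walk_forward_mae(series, historical_mean)
print(naive_mae, mean_mae)
```

On this trending series the naive baseline wins comfortably, which is the kind of result that should stop a weaker model from shipping.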

When would you use Prophet?

For business series with strong seasonality, holidays, and missing data, where you want fast, robust forecasts with little tuning.

Can ML models forecast time series?

Yes — gradient boosting or neural nets on lag/rolling/calendar features, but you must prevent leakage with proper time splits.

What is overfitting in forecasting?

A model that fits historical noise and seasonal quirks but generalises poorly to the future.

Why benchmark against a naive model?

If a complex model can't beat 'use last value' or 'same as last year', it isn't adding value.

What is the difference between forecasting and prediction?

Forecasting specifically projects future values of a time-ordered series; prediction is the general term for any model output.

How do holidays affect forecasts?

They cause spikes/dips that regular seasonality misses; modelling them explicitly improves accuracy.

What is forecast horizon?

How far ahead you predict; accuracy generally degrades as the horizon lengthens.

What is a confidence/prediction band on a forecast?

The shaded range showing plausible future values; it widens with the horizon as uncertainty grows.

How do you handle a sudden regime change?

Detect the break, down-weight or drop pre-change data, or use adaptive models that adjust to the new level.

What is ensemble forecasting?

Combining multiple models' forecasts (e.g. averaging) to reduce variance and often beat any single model.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Demand forecaster. Build and back-test a forecast with a time-based split, compare against a naive baseline, and plot prediction intervals.

📂 Dataset: Store sales / demand dataset

🧰

Data Prep & Visualization Clean it, query it, show it

The craft around the analysis: cleaning messy data, querying with SQL, and communicating findings with clear charts and dashboards.

🧹

Data Cleaning & Wrangling

Data prep

Often most of the job: making data usable.

🟢 In simple words

Real data is messy — typos, missing values, duplicates, inconsistent formats. Cleaning is the unglamorous but essential work that decides whether your analysis is trustworthy.

🔬 How it actually works

Handle missing values (drop, impute by mean/median/model, or flag), de-duplicate, fix types and inconsistent categories, parse dates, treat outliers, and reshape between wide and long. Always keep a reproducible, documented pipeline rather than manual edits.

💡 Real example

Standardising 'IN', 'India', 'india' into one category and median-imputing missing ages so a downstream group-by isn't silently wrong.
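
The example above can be sketched in plain Python. The records, the `CANONICAL` mapping, and the `age_was_missing` flag are assumptions made for illustration; a real pipeline would more likely use pandas.

```python
from statistics import median

# Hypothetical raw records; None marks a missing age.
rows = [
    {"country": "IN", "age": 25},
    {"country": "India ", "age": None},
    {"country": "india", "age": 40},
    {"country": "US", "age": 31},
]

# Map known variants (after trimming and lower-casing) to one canonical label.
CANONICAL = {"in": "India", "india": "India", "us": "United States"}

def clean_country(value):
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())  # unknown values pass through trimmed

# Median of the ages that are actually present.
age_median = median(r["age"] for r in rows if r["age"] is not None)

cleaned = [
    {
        "country": clean_country(r["country"]),
        "age": r["age"] if r["age"] is not None else age_median,
        "age_was_missing": r["age"] is None,  # keep the imputation visible
    }
    for r in rows
]

# A group-by on the cleaned column now sees one 'India' bucket instead of three.
counts = {}
for r in cleaned:
    counts[r["country"]] = counts.get(r["country"], 0) + 1
print(counts)
```

Keeping a missingness flag alongside the imputed value lets later steps tell real ages from filled-in ones.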

🎤 Interview Q&A · 20 questions

What is data cleaning?

The process of fixing or removing incorrect, incomplete, duplicated, or inconsistent data to make it analysis-ready.

Why is data cleaning so important?

Analysis is only as good as the data; errors lead to wrong conclusions — garbage in, garbage out.

What are common ways to handle missing values?

Drop rows/columns, impute (mean/median/mode/model-based), or flag missingness as its own signal.

When should you drop vs. impute missing data?

Drop rows when only a tiny share is missing, and drop a column when it is mostly empty; impute when the field is valuable and the missingness itself isn't informative. If the missingness carries signal, flag it instead.

How do you detect duplicate records?

Check for exact or fuzzy matches on key fields; decide whether duplicates are true repeats or data-entry errors.

How do you handle inconsistent categories?

Standardise casing/spelling and map variants ('IN', 'India', 'india') to a single canonical value.

What is an outlier and how do you treat it?

An extreme value; investigate whether it's an error (fix/remove) or genuine (keep, maybe cap/transform).

What is data type coercion?

Converting columns to correct types (dates, numbers, categories) so operations behave correctly.

Why parse dates explicitly?

Dates stored as strings can't be sorted, differenced, or grouped by period correctly until parsed to datetime.
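
A small sketch of the problem, assuming invented day/month/year strings as they might arrive in a CSV:

```python
from datetime import datetime

# Hypothetical dd/mm/yyyy strings.
raw = ["05/01/2024", "12/11/2023", "01/03/2024"]

# Sorting the strings compares characters, not calendar order.
sorted_as_text = sorted(raw)

# Parse with an explicit format so day and month are never guessed.
parsed = [datetime.strptime(s, "%d/%m/%Y") for s in raw]
sorted_as_dates = sorted(parsed)

# Differences only make sense once the values are real dates.
span_days = (sorted_as_dates[-1] - sorted_as_dates[0]).days
```

As text, the earliest entry appears to be 1 March 2024; as dates, it is 12 November 2023.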

What is the difference between wide and long format?

Wide spreads variables across columns; long stacks them into key-value rows — many tools and plots expect long (tidy) data.
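
One way to picture the reshape, using a hand-written `melt` helper on invented store data (pandas provides `DataFrame.melt` for the same job; the helper here is only illustrative):

```python
# Hypothetical wide table: one row per store, one column per month.
wide = [
    {"store": "A", "jan": 10, "feb": 12},
    {"store": "B", "jan": 7, "feb": 9},
]

def melt(rows, id_key, value_keys, var_name="month", value_name="sales"):
    """Stack the value columns into one (variable, value) pair per output row."""
    return [
        {id_key: row[id_key], var_name: key, value_name: row[key]}
        for row in rows
        for key in value_keys
    ]

long_rows = melt(wide, id_key="store", value_keys=["jan", "feb"])
for r in long_rows:
    print(r)
```

Two wide rows with two month columns become four long rows, one per store-month observation.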

What is tidy data?

Data where each variable is a column, each observation a row, and each cell a single value.

How do you handle outliers without deleting them?

Cap (winsorise), transform (log), or use robust methods that down-weight extremes.
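
A sketch of the first two options on invented order values. The caps are picked by hand here as an assumption; in practice they often come from percentiles of the data.

```python
import math

# Hypothetical order values with one extreme order.
values = [120, 150, 130, 160, 140, 5000]

def winsorise(data, lower, upper):
    """Cap values outside [lower, upper] instead of dropping them."""
    return [min(max(x, lower), upper) for x in data]

# Capping keeps every row but limits the pull of the extreme value.
capped = winsorise(values, lower=120, upper=200)

# A log transform keeps the ordering but compresses the extreme value.
logged = [math.log10(x) for x in values]
```

Both approaches keep all six rows, so the extreme order still counts as an observation without dominating the mean.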

What is data validation?

Checking data against rules (ranges, types, uniqueness) to catch errors early and automatically.
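
A minimal rule-based check might look like the following. The field names, the 0 to 120 age range, and the rules themselves are assumptions chosen for illustration.

```python
# Hypothetical records to validate.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": -5, "email": "b@example.com"},
    {"id": 2, "age": 51, "email": None},
]

def validate(rows):
    """Return (row_index, problem) tuples; an empty list means every rule passed."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Type rule, then range rule.
        if not isinstance(row["age"], int):
            problems.append((i, "age is not an integer"))
        elif not 0 <= row["age"] <= 120:
            problems.append((i, "age out of range"))
        # Completeness rule.
        if row["email"] is None:
            problems.append((i, "email missing"))
        # Uniqueness rule.
        if row["id"] in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(row["id"])
    return problems

issues = validate(records)
for index, problem in issues:
    print(f"row {index}: {problem}")
```

Running checks like these at load time surfaces bad rows before they reach any aggregate.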

Why keep cleaning reproducible?

A documented, scripted pipeline lets others rerun it, audit decisions, and apply the same steps to new data — unlike manual edits.

What is imputation's main risk?

It can distort distributions and understate uncertainty if done carelessly (e.g. mean-imputing a skewed variable).

What is string normalisation?

Trimming whitespace, fixing case, and removing special characters so text values match consistently.

How do you deal with mixed units in a column?

Detect and convert all values to a single unit before any calculation.
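
For instance, a weight column mixing kilograms, grams, and pounds could be normalised roughly like this (the column values and the `to_kg` helper are hypothetical):

```python
import re

# Hypothetical weight column with mixed units and inconsistent spacing.
raw_weights = ["2 kg", "500 g", "1.5kg", "3 lb"]

# Conversion factors to kilograms (the pound is defined as 0.45359237 kg).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def to_kg(text):
    """Parse '<number><unit>' and convert to kilograms; fail loudly on surprises."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-zA-Z]+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised weight: {text!r}")
    number, unit = float(match.group(1)), match.group(2).lower()
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return number * TO_KG[unit]

weights_kg = [to_kg(w) for w in raw_weights]
```

Raising on an unknown unit, rather than guessing, keeps a silent unit mix-up from leaking into later calculations.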

What is the rough 80/20 split of an analyst's time?

Often ~80% is spent finding, cleaning, and preparing data, and ~20% on the actual analysis.

What is a data dictionary?

Documentation defining each field's meaning, type, and valid values — essential for correct cleaning and analysis.

How do you handle structural errors?

Fix typos, inconsistent naming, and mislabeled categories during a standardisation step before analysis.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Reusable cleaning pipeline. Write a documented pandas pipeline that takes a raw messy CSV to analysis-ready, with logging of every transformation.

📂 Dataset: Any deliberately-messy Kaggle dataset

🎨

Data Visualization Principles

Communication

Show the insight, not just the data.

🟢 In simple words

A good chart makes the point in seconds; a bad one hides it. Visualization is about choosing the right chart and removing clutter so the audience sees the story, not a wall of numbers.

🔬 How it actually works

Match chart to intent: bar for comparison, line for trend, scatter for relationship, histogram/box for distribution, heatmap for matrices. Avoid pie overuse and truncated/dual axes that mislead. Maximise data-ink, label clearly, and use colour purposefully (and accessibly).

💡 Real example

Replacing a 12-slice pie with a sorted bar chart instantly reveals the ranking that the pie obscured.

🎤 Interview Q&A · 20 questions

What is the goal of data visualization?

To communicate insight clearly and quickly — making patterns and comparisons obvious to the audience.

How do you choose the right chart type?

Match it to intent: bar for comparison, line for trend over time, scatter for relationship, histogram/box for distribution.

When should you avoid pie charts?

When there are many slices or similar sizes — humans compare angles poorly; a sorted bar chart is clearer.

What is the data-ink ratio?

Tufte's idea of maximising the ink that conveys data and minimising decorative 'chartjunk'.

Why can a truncated y-axis mislead?

Starting the axis above zero exaggerates small differences, distorting the reader's perception.
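
The size of the distortion is simple arithmetic. A sketch with two invented values, 98 and 100, comparing the ratio of drawn bar heights under each axis choice:

```python
def apparent_ratio(a, b, axis_start):
    """Ratio of drawn bar heights (b over a) when the y-axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical values: 98 vs 100, a true difference of about 2%.
true_ratio = apparent_ratio(98, 100, axis_start=0)        # axis from zero
truncated_ratio = apparent_ratio(98, 100, axis_start=95)  # axis from 95

print(f"axis from 0:  second bar looks {true_ratio:.2f}x the first")
print(f"axis from 95: second bar looks {truncated_ratio:.2f}x the first")
```

With the axis starting at 95, the bars are drawn 3 and 5 units tall, so a 2% gap looks like a 67% gap.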

What chart shows a distribution?

A histogram or box plot (or violin plot) — to reveal shape, spread, and outliers.

What chart shows a relationship between two variables?

A scatter plot, optionally with a trend line.

What chart shows change over time?

A line chart, which emphasises trend and continuity.

How should you use colour in charts?

Purposefully — to encode data or highlight, not decorate — and accessibly for colour-blind viewers.

What is chartjunk?

Unnecessary visual elements (3D effects, heavy gridlines, clip art) that distract from the data.

Why label directly instead of relying on legends?

Direct labels reduce the back-and-forth of decoding a legend, speeding comprehension.

What is a heatmap good for?

Showing magnitude across a matrix — e.g. correlations or a cohort retention grid.

What is the problem with dual-axis charts?

They can imply relationships that don't exist by arbitrarily scaling two axes; use with care.

How do you visualise part-to-whole?

Stacked bars or a treemap for several categories; pie only for a few large, distinct slices.

What is preattentive processing in viz?

Visual features (colour, size, position) the brain perceives instantly — use them to direct attention to the key point.

Why sort categorical bars?

Sorting by value reveals ranking immediately, unlike alphabetical order which hides it.

What makes a chart accessible?

Colour-blind-safe palettes, sufficient contrast, text labels, and not relying on colour alone to convey meaning.

What are small multiples?

A grid of small charts with the same axes, letting you compare many categories at a glance.

How do you avoid overplotting in a scatter?

Use transparency, sampling, binning (hexbin), or density contours when points overlap heavily.

What is the single most important question before charting?

'What is the one message I want the audience to take away?' — the chart should serve that.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Chart makeover. Take three misleading or cluttered charts and redesign them, writing a short note on why each redesign communicates better.

📂 Dataset: Public charts / any dataset

🗄️

SQL for Analytics

Querying

The language of getting the data.

🟢 In simple words

Almost every analytics job runs on SQL — the way you ask a database questions: filter rows, join tables, group and summarise. If data lives in a warehouse, SQL is how you reach it.

🔬 How it actually works

Core: SELECT/WHERE/GROUP BY/HAVING/ORDER BY and JOINs. Analyst-level: window functions (ROW_NUMBER, RANK, running totals, LAG/LEAD), CTEs for readable multi-step queries, CASE logic, and date functions for cohorts and funnels.

💡 Real example

A window function computes each customer's running 30-day spend and rank within their region — in one query, no export needed.
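
The ranking half of this example can be sketched with Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer). The `orders` table and its rows are invented, and the rolling 30-day spend is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, region TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('ana',  'north', '2024-01-05', 100),
  ('ana',  'north', '2024-01-20', 50),
  ('ben',  'north', '2024-01-10', 300),
  ('cara', 'south', '2024-01-07', 80);
""")

query = """
WITH totals AS (                      -- CTE: one row per customer
  SELECT customer, region, SUM(amount) AS total_spend
  FROM orders
  GROUP BY customer, region
)
SELECT customer, region, total_spend,
       RANK() OVER (PARTITION BY region ORDER BY total_spend DESC) AS region_rank
FROM totals
ORDER BY region, region_rank;
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

The CTE does the per-customer aggregation first, and the window function then ranks customers inside each region without collapsing the rows.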

🎤 Interview Q&A · 20 questions

Why is SQL essential for analysts?

Most business data lives in relational databases/warehouses, and SQL is the standard way to retrieve and aggregate it.

What is the order of SQL clause execution?

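Logically: FROM (including JOINs) → WHERE → GROUP BY → HAVING → SELECT (window functions are evaluated here) → DISTINCT → ORDER BY → LIMIT. This is the logical order, not the written order, which is why standard SQL generally doesn't let WHERE reference a SELECT alias.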

WHERE vs. HAVING?

WHERE filters rows before aggregation; HAVING filters groups after aggregation.
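
A small check of the difference using Python's built-in `sqlite3` module; the `sales` table and its values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('north', 50), ('north', 500), ('south', 40), ('south', 30);
""")

# WHERE drops individual rows before grouping: only sales of 40 or more are summed.
where_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount >= 40 "
    "GROUP BY region ORDER BY region"
).fetchall()

# HAVING drops whole groups after aggregation: only regions totalling 100 or more remain.
having_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region HAVING SUM(amount) >= 100 ORDER BY region"
).fetchall()
```

With WHERE, south survives with its single qualifying sale (40); with HAVING, south's total of 70 removes the whole group.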

What is a JOIN?

Combining rows from two tables based on a related column.

INNER vs. LEFT JOIN?

INNER keeps only matching rows; LEFT keeps all left-table rows, filling unmatched right columns with NULL.
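
The two joins side by side, sketched with Python's built-in `sqlite3` module on invented `customers` and `orders` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'ana'), (2, 'ben');
INSERT INTO orders VALUES (1, 100);
""")

# INNER JOIN: ben has no order, so he disappears from the result.
inner_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# LEFT JOIN: ben is kept, with NULL (None in Python) for the order amount.
left_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
```

The LEFT JOIN form is the one to reach for when "customers with no orders" is itself part of the question.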

What is a window function?

A function that computes across a set of rows related to the current row without collapsing them (e.g. running totals, ranks).

Give examples of window functions.

ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and aggregate windows like SUM() OVER(...).

What is the difference between RANK and DENSE_RANK?

RANK leaves gaps after ties (1,2,2,4); DENSE_RANK doesn't (1,2,2,3).
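
The tie behaviour can be reproduced with Python's built-in `sqlite3` module (SQLite 3.25 or newer); the `scores` table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (player TEXT, score INTEGER);
INSERT INTO scores VALUES ('a', 90), ('b', 80), ('c', 80), ('d', 70);
""")

rows = conn.execute("""
SELECT score,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores
ORDER BY score DESC
""").fetchall()

ranks = [r[1] for r in rows]        # gap after the tie
dense_ranks = [r[2] for r in rows]  # no gap after the tie
```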

What is a CTE?

A Common Table Expression (WITH clause) — a named, readable subquery that structures multi-step logic.

CTE vs. subquery?

Functionally similar, but CTEs are more readable, reusable within a query, and can be recursive.

How do you compute a running total in SQL?

SUM(value) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
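
That expression in a runnable sketch, using Python's built-in `sqlite3` module (SQLite 3.25 or newer) and an invented `daily` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily (day TEXT, value INTEGER);
INSERT INTO daily VALUES ('2024-01-01', 10), ('2024-01-02', 20), ('2024-01-03', 5);
""")

running = conn.execute("""
SELECT day,
       SUM(value) OVER (
         ORDER BY day
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM daily
ORDER BY day
""").fetchall()
```

Every input row is kept, and each one carries the sum of itself and all earlier rows.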

What does GROUP BY do?

Aggregates rows that share values in specified columns, enabling per-group summaries.

How do you handle NULLs in SQL?

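Use IS NULL/IS NOT NULL to test for NULL and COALESCE to supply defaults. Any comparison with NULL (even NULL = NULL) evaluates to unknown rather than true, so those rows are filtered out, and aggregates such as SUM, AVG, and COUNT(column) skip NULLs.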

What is the difference between COUNT(*) and COUNT(column)?

COUNT(*) counts all rows; COUNT(column) counts non-NULL values in that column.
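
The COUNT difference, along with basic NULL handling, can be checked with Python's built-in `sqlite3` module; the `users` table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, phone TEXT);
INSERT INTO users VALUES (1, '555-0100'), (2, NULL), (3, NULL);
""")

# COUNT(*) counts rows; COUNT(phone) skips rows where phone is NULL.
all_rows, with_phone = conn.execute(
    "SELECT COUNT(*), COUNT(phone) FROM users"
).fetchone()

# '= NULL' is never true, so it matches nothing; IS NULL is the correct test.
equals_null = conn.execute("SELECT COUNT(*) FROM users WHERE phone = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM users WHERE phone IS NULL").fetchone()[0]

# COALESCE substitutes a default wherever the value is NULL.
phones = [r[0] for r in conn.execute(
    "SELECT COALESCE(phone, 'unknown') FROM users ORDER BY id"
)]
```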

How do you find the top N per group?

Use ROW_NUMBER() OVER (PARTITION BY group ORDER BY metric DESC) and filter to rn <= N.
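
A sketch of that pattern for the top 2 products per category, using Python's built-in `sqlite3` module (SQLite 3.25 or newer) and an invented `products` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (category TEXT, product TEXT, revenue INTEGER);
INSERT INTO products VALUES
  ('toys', 'kite', 50), ('toys', 'ball', 90), ('toys', 'yo-yo', 20),
  ('books', 'atlas', 70), ('books', 'novel', 120), ('books', 'comic', 30);
""")

top_two = conn.execute("""
WITH ranked AS (
  SELECT category, product, revenue,
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM products
)
SELECT category, product, revenue
FROM ranked
WHERE rn <= 2            -- the window result must be filtered in an outer query
ORDER BY category, rn
""").fetchall()
```

The CTE is needed because a window function's result cannot be filtered in the same query level's WHERE clause.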

What does PARTITION BY do in a window function?

Restarts the window calculation for each partition (group), like a per-group running calculation.

What is the difference between UNION and UNION ALL?

UNION removes duplicate rows; UNION ALL keeps them and is usually faster because it skips the de-duplication step.
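
A quick comparison using Python's built-in `sqlite3` module; the two signup tables are invented, with one email appearing in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE web_signups (email TEXT);
CREATE TABLE app_signups (email TEXT);
INSERT INTO web_signups VALUES ('a@example.com'), ('b@example.com');
INSERT INTO app_signups VALUES ('b@example.com'), ('c@example.com');
""")

# UNION: the shared email appears once.
union_rows = conn.execute(
    "SELECT email FROM web_signups UNION SELECT email FROM app_signups ORDER BY email"
).fetchall()

# UNION ALL: every input row is kept, including the repeat.
union_all_rows = conn.execute(
    "SELECT email FROM web_signups UNION ALL SELECT email FROM app_signups ORDER BY email"
).fetchall()
```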

How do you build a funnel in SQL?

Count distinct users at each step using conditional aggregation or sequential timestamp joins, then compute step conversions.
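
A sketch of the conditional-aggregation variant, using Python's built-in `sqlite3` module. The `events` table and step names are invented, and this simple version does not check that steps happened in order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, step TEXT);
INSERT INTO events VALUES
  (1, 'visit'), (1, 'visit'), (1, 'cart'), (1, 'purchase'),
  (2, 'visit'), (2, 'cart'),
  (3, 'visit'),
  (4, 'visit');
""")

# CASE returns user_id only for the matching step; COUNT(DISTINCT ...) ignores the NULLs.
visited, carted, purchased = conn.execute("""
SELECT COUNT(DISTINCT CASE WHEN step = 'visit'    THEN user_id END),
       COUNT(DISTINCT CASE WHEN step = 'cart'     THEN user_id END),
       COUNT(DISTINCT CASE WHEN step = 'purchase' THEN user_id END)
FROM events
""").fetchone()

# Step-to-step conversion rates.
visit_to_cart = carted / visited
cart_to_purchase = purchased / carted
```

Counting distinct users, rather than events, keeps user 1's repeated visit from inflating the top of the funnel.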

What is LAG used for?

Accessing a previous row's value — e.g. computing period-over-period change without a self-join.
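
Month-over-month change with LAG, sketched using Python's built-in `sqlite3` module (SQLite 3.25 or newer) on an invented `monthly` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly (month TEXT, revenue INTEGER);
INSERT INTO monthly VALUES ('2024-01', 100), ('2024-02', 120), ('2024-03', 90);
""")

changes = conn.execute("""
SELECT month,
       revenue,
       revenue - LAG(revenue) OVER (ORDER BY month) AS delta
FROM monthly
ORDER BY month
""").fetchall()
```

The first month has no previous row, so LAG returns NULL and the delta comes back as `None`.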

How do you optimise a slow analytical query?

Filter early, select only needed columns, use indexes/partitions, avoid unnecessary DISTINCT, and pre-aggregate where possible.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Analytics query pack. Write a set of analyst queries on a sample warehouse: retention cohort, funnel, running totals, and top-N per group with window functions.

📂 Dataset: Any sample DB (e.g. Chinook, e-commerce)

📋

Dashboard Design

Communication

Self-serve answers, not chart soup.

🟢 In simple words

A dashboard is a living report stakeholders check themselves. The skill is designing it so the most important number is obvious and people can answer their own questions without asking you.

🔬 How it actually works

Start from the audience's key questions, lead with the headline metric, follow a visual hierarchy (top-left = most important), enable filters/drill-down, keep it uncluttered, and ensure consistent, trustworthy definitions. Tools: Tableau, Power BI, Looker, Metabase.

💡 Real example

A sales dashboard with the North Star at top, trend below, and region/product filters lets leaders answer 'why is this week down?' without a new request.

🎤 Interview Q&A · 20 questions

What is a dashboard?

A visual display of key metrics that lets stakeholders monitor performance and answer questions themselves.

What is the first step in dashboard design?

Understand the audience and the decisions they need to make; design around their key questions.

What is visual hierarchy in a dashboard?

Arranging elements so the most important metric draws the eye first, typically top-left.

Why lead with the headline metric?

Viewers scan briefly; the most important number must be instantly visible before any detail.

What makes a dashboard 'self-serve'?

Filters and drill-downs that let users answer follow-up questions without asking the analyst.

How do you avoid clutter?

Limit the number of charts, remove chartjunk, group related items, and use whitespace deliberately.

What is the difference between operational and strategic dashboards?

Operational dashboards track real-time, day-to-day metrics; strategic dashboards track high-level KPIs over longer horizons for leadership.

Why are consistent metric definitions critical?

If 'active user' means different things in different tiles, the dashboard loses trust and creates confusion.

What chart types suit dashboards?

Big-number KPIs, trend lines, ranked bars, and simple tables — clarity over novelty.

How do you choose what to put on a dashboard?

Only metrics that are actionable and tied to the audience's goals; everything else is noise.

What is the role of filters and drill-downs?

They let one dashboard serve many questions by slicing by date, region, segment, etc.

Common dashboard tools?

Tableau, Power BI, Looker, Metabase, and Looker Studio (formerly Google Data Studio).

How do you ensure a dashboard loads fast?

Pre-aggregate data, limit live queries, cache where possible, and avoid overly granular visuals.

What is alerting on a dashboard?

Automated notifications when a metric crosses a threshold, so issues are caught without constant watching.

How do you handle different audiences?

Provide a high-level summary view with the option to drill into detailed views per role.

Why include context like targets or benchmarks?

A number alone is meaningless; comparison to a goal, prior period, or benchmark tells whether it's good or bad.

What is a common dashboard mistake?

Cramming in every available metric ('chart soup') instead of focusing on the few that drive decisions.

How often should a dashboard refresh?

Match the decision cadence — real-time for operations, daily/weekly for strategic reporting.

How do you validate a dashboard's correctness?

Reconcile its numbers against a trusted source, test edge cases, and review definitions with stakeholders.

What is the difference between a report and a dashboard?

A report is usually static and detailed; a dashboard is live, visual, and focused on monitoring key metrics.

🛠 Project idea & 📚 resources

🛠 Build it — project idea

Stakeholder dashboard. Design and build a dashboard (Power BI/Tableau/Metabase) for a defined audience, justifying every chart and the layout hierarchy.

📂 Dataset: Sales / product metrics dataset