Descriptive Statistics
Foundations
Summarise a dataset in a few numbers.
🟢 In simple words
Before any fancy analysis, you describe what you have: what's the typical value (mean/median), how spread out is it (range, standard deviation), and what's the shape? It's the 'getting to know your data' step.
🔬 How it actually works
Measures of central tendency (mean, median, mode) capture the centre; measures of spread (variance, standard deviation, IQR, range) capture variability; and shape is described by skewness and kurtosis. The five-number summary (min, Q1, median, Q3, max) powers the box plot.
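The measures above can be sketched with Python's standard `statistics` module. The data values are made up for illustration, and the "inclusive" quantile method is just one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical sample: daily order counts (illustrative values only)
data = [12, 15, 15, 18, 21, 24, 30, 45]

# Central tendency
mean = statistics.mean(data)        # 22.5
median = statistics.median(data)    # 19.5
mode = statistics.mode(data)        # 15

# Spread
value_range = max(data) - min(data)           # 33
sample_variance = statistics.variance(data)   # divides by n - 1
sample_std = statistics.stdev(data)           # square root of the sample variance

# Five-number summary: quantiles(n=4) returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

print(five_number)  # (12, 15.0, 19.5, 25.5, 45)
print(iqr)          # 10.5
```

Here the mean (22.5) sits above the median (19.5), a first hint that the single large value of 45 is stretching the right tail.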
💡 Real example
Reporting that average order value is ₹1,200 but the median is ₹650 instantly tells you a few big orders are pulling the mean up — a skewed distribution.
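A small sketch of that effect: the order values below are hypothetical, chosen only so that the mean and median match the figures in the example.

```python
import statistics

# Hypothetical order values in rupees, picked to reproduce
# a mean of 1200 and a median of 650
orders = [400, 500, 600, 650, 700, 800, 4750]

mean_value = statistics.mean(orders)      # pulled up by the 4750 order
median_value = statistics.median(orders)  # middle value, unaffected by how large the top order is
print(mean_value, median_value)

# Dropping the single large order shows how much it moved the mean
trimmed = orders[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(trimmed_mean, trimmed_median)
```

Removing one order roughly halves the mean while the median barely moves, which is why the median is the safer "typical value" for skewed data.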
🎤 Interview Q&A (20 questions)
What is descriptive statistics?
Methods that summarise and describe the main features of a dataset — its centre, spread, and shape — without making inferences beyond the data.
Descriptive vs. inferential statistics?
Descriptive summarises the data you have; inferential uses a sample to draw conclusions about a larger population.
What are the measures of central tendency?
Mean (average), median (middle value), and mode (most frequent value).
When is the median better than the mean?
For skewed data or with outliers — the median resists extreme values, while the mean gets pulled toward them.
What is variance?
The average squared deviation from the mean — a measure of how spread out the values are.
What is standard deviation?
The square root of variance, expressing spread in the same units as the data, which makes it more interpretable.
What is the interquartile range (IQR)?
Q3 − Q1, the spread of the middle 50% of the data; it's robust to outliers.
What is the five-number summary?
Minimum, Q1, median, Q3, and maximum — the basis of a box plot.
What is skewness?
A measure of asymmetry; right-skew has a long right tail (typically mean > median), left-skew has a long left tail (typically mean < median).
What is kurtosis?
A measure of how heavy the tails are relative to a normal distribution; high kurtosis (excess kurtosis above 0) means extreme values occur more often than a normal distribution would produce.
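Skewness and kurtosis can be sketched from central moments. This is a minimal version using the population (biased) moment formulas; the helper names and sample data are illustrative, and statistical packages may apply small-sample corrections that give slightly different numbers.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    # Third central moment scaled by the standard deviation cubed
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment scaled by variance squared, minus 3
    # so that a normal distribution scores 0
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]   # one large value creates a long right tail

print(skewness(symmetric))         # 0.0 for perfectly symmetric data
print(skewness(right_skewed))      # positive: right-skewed
print(excess_kurtosis(symmetric))  # negative: flatter tails than a normal distribution
```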
How do you detect outliers statistically?
Common rules: values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, or values more than ~3 standard deviations from the mean (|z| > 3).
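Both rules can be sketched in a few lines. The function names, thresholds as defaults, and data are illustrative assumptions, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is only one convention.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / std > threshold]

# Hypothetical data with one obviously extreme value
data = [10, 12, 12, 13, 14, 15, 15, 16, 18, 95]

print(iqr_outliers(data))     # [95]
print(zscore_outliers(data))  # []
```

In this small sample the z-score rule flags nothing: the extreme value inflates the very standard deviation it is measured against, so its z-score stays below 3. That is one reason the IQR rule is often preferred for small or heavily skewed datasets.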
What is a percentile?
The value below which a given percentage of observations fall — the 90th percentile is exceeded by only 10% of values.
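A quick check of that definition, assuming made-up data (the integers 1 to 100 standing in for something like response times). Percentile conventions differ between tools, so the exact cut point can vary slightly.

```python
import statistics

# Hypothetical measurements: the integers 1..100, easy to verify by eye
latencies = list(range(1, 101))

# quantiles(n=100) returns 99 cut points; index 89 is the 90th percentile
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p90 = cuts[89]

# Share of observations that exceed the 90th percentile
share_above = sum(1 for x in latencies if x > p90) / len(latencies)

print(p90, share_above)
```

The share above the 90th percentile comes out at 10%, matching the definition.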
Population vs. sample statistics?
A population is the entire group of interest; a sample is a subset of it. Sample variance divides by n−1 instead of n (Bessel's correction) so that it is an unbiased estimate of the population variance.
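The difference between the two divisors can be sketched by hand and checked against the standard library. The data values are illustrative.

```python
import statistics

# Illustrative values with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
squared_devs = sum((x - mean) ** 2 for x in data)   # 32.0

population_variance = squared_devs / n       # treats the data as the whole population
sample_variance = squared_devs / (n - 1)     # Bessel's correction for a sample

# The statistics module exposes both versions
print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The sample variance is always a little larger than the population formula applied to the same numbers, and the gap shrinks as n grows.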
What is the coefficient of variation?
Standard deviation divided by the mean — a unitless measure for comparing variability across different scales.
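A minimal sketch of the calculation, with a hypothetical helper name and made-up data on two very different scales:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.stdev(values) / mean

# Hypothetical measurements on two different scales
small_scale = [10, 12, 14]          # stdev 2, mean 12
large_scale = [1000, 1200, 1400]    # stdev 200, mean 1200

print(coefficient_of_variation(small_scale))
print(coefficient_of_variation(large_scale))
```

The standard deviations differ by a factor of 100, but both datasets have the same coefficient of variation (about 0.167), so their relative variability is identical.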
Why can the mean be misleading?
It's sensitive to outliers and skew; a single huge value can make a 'typical' figure unrepresentative.
What is a box plot used for?
Visualising the five-number summary and outliers, and comparing distributions across groups at a glance.
Nominal vs. ordinal vs. interval vs. ratio data?
Nominal = unordered categories, ordinal = ordered categories, interval = ordered with equal gaps but no true zero, ratio = interval with a true zero.
What does a histogram show that summary stats don't?
The full shape of the distribution — modality, skew, and gaps that a mean and standard deviation alone hide.
What is the range and its weakness?
Max − min; it's simple but driven entirely by the two most extreme values, so it's unstable.
What is Simpson's paradox?
A trend that appears in groups but reverses when the groups are combined — a warning to always segment before concluding.
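The reversal is easy to reproduce with a toy example. The counts below are hypothetical (successes, trials) for two variants, A and B, across two segments; the segment sizes are deliberately unbalanced, which is what drives the paradox.

```python
# Hypothetical (successes, trials) per variant and segment
results = {
    "small": {"A": (90, 100),   "B": (800, 1000)},
    "large": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    return successes / trials

# Within each segment, A has the higher success rate
segment_rates = {
    segment: {variant: rate(*counts) for variant, counts in variants.items()}
    for segment, variants in results.items()
}

# Combined across segments, the comparison flips
overall = {}
for variant in ("A", "B"):
    successes = sum(results[segment][variant][0] for segment in results)
    trials = sum(results[segment][variant][1] for segment in results)
    overall[variant] = rate(successes, trials)

print(segment_rates)
print(overall)
```

A wins in both segments (0.90 vs 0.80 and 0.30 vs 0.20), yet B wins overall because most of B's trials fall in the easy "small" segment while most of A's fall in the hard "large" one.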
🛠 Project idea & 📚 resources
🛠 Build it — project idea
Automated dataset profiler. Build a script that ingests any CSV and outputs central tendency, spread, missing-value rates, and distribution plots per column.
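One possible starting point for the profiler, using only the standard library. The function name, the report layout, and the inline sample CSV are all assumptions for illustration; distribution plots are left out here and would need a plotting library on top.

```python
import csv
import io
import statistics

def profile_rows(rows):
    """Per-column summary for a list of dict rows (as produced by csv.DictReader)."""
    if not rows:
        return {}
    report = {}
    for column in rows[0].keys():
        raw = [row[column] for row in rows]
        # Empty strings and None both count as missing
        present = [v for v in raw if v is not None and v.strip() != ""]
        summary = {"missing_rate": 1 - len(present) / len(raw)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = None  # at least one non-numeric value: treat as categorical
        if numbers:
            summary.update(
                mean=statistics.mean(numbers),
                median=statistics.median(numbers),
                stdev=statistics.stdev(numbers) if len(numbers) > 1 else 0.0,
                min=min(numbers),
                max=max(numbers),
            )
        elif present:
            summary["mode"] = statistics.mode(present)
        report[column] = summary
    return report

# Hypothetical retail-style data; swap in open("your_file.csv", newline="") for a real file
sample_csv = """region,amount
north,100
south,300
north,
east,200
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = profile_rows(rows)
for column, summary in report.items():
    print(column, summary)
```

Numeric columns get mean, median, standard deviation, min and max; everything else gets its most frequent value. Both report the share of missing entries.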
📂 Dataset: Any Kaggle CSV (e.g. retail sales)