Statistics transforms raw data into meaningful insights by measuring patterns, relationships, and uncertainty. Every data analyst must master these concepts to validate findings and make confident recommendations.
| Measure | Formula | What It Tells You |
| --- | --- | --- |
| Mean | $\bar{x} = \frac{\sum x_i}{n}$ | Average value (sensitive to outliers) |
| Median | Middle value when sorted | Central tendency (robust to outliers) |
| Mode | Most frequent value | Most common observation |
| Range | Max − Min | Spread of data |
| Variance | $\sigma^2 = \frac{\sum(x_i - \bar{x})^2}{n}$ | Average squared deviation (population form; divide by $n - 1$ for a sample) |
| Std Dev | $\sigma = \sqrt{\text{Variance}}$ | Spread in original units |
Example: Employee salaries: [45K, 50K, 52K, 55K, 200K]
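Working through the salary example with Python's standard `statistics` module shows how a single outlier drags the mean but leaves the median alone. This is a minimal sketch: the variable names are illustrative, and the population formulas (dividing by n) are used to match the table above.

```python
import statistics

# Salaries from the example, in thousands
salaries = [45, 50, 52, 55, 200]

mean = statistics.mean(salaries)            # pulled up by the 200K outlier
median = statistics.median(salaries)        # middle value, unaffected by it
value_range = max(salaries) - min(salaries)
variance = statistics.pvariance(salaries)   # population variance (divide by n)
std_dev = statistics.pstdev(salaries)       # square root of the variance

print(f"mean={mean}K median={median}K range={value_range}K")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}K")
```

The mean (80.4K) sits above four of the five salaries, while the median (52K) describes a typical employee.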
P(A) = favorable_outcomes / total_outcomes # 0 to 1
P(A and B) = P(A) × P(B) # Independent events
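P(A or B) = P(A) + P(B) - P(A and B) # Addition rule
68-95-99.7 Rule: for normally distributed data, about 68% of values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.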
Measures linear relationship: r ranges from −1 (perfect negative) to +1 (perfect positive). r = 0 means no linear correlation. Correlation ≠ Causation!
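A small sketch of how r is computed may help make "linear" concrete. The function name `pearson_r` and both datasets are made up for illustration; in practice a library routine would usually be used instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (denominator of r)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up illustrative data
hours = [1, 2, 3, 4, 5]
perfect_line = [2 * h + 1 for h in hours]   # exact linear relationship
u_shape = [(h - 3) ** 2 for h in hours]     # strong but non-linear relationship

r_line = pearson_r(hours, perfect_line)
r_curve = pearson_r(hours, u_shape)
print(round(r_line, 3), round(r_curve, 3))
```

The U-shaped data gives r = 0 even though y is fully determined by x, which is why r = 0 should be read as "no linear correlation" and not as "no relationship".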
1. State null hypothesis (H₀): "No difference/effect"
2. State alternative (H₁): "There IS a difference"
3. Choose significance level: α = 0.05
4. Calculate test statistic (t-test, z-test)
5. Compare p-value to α:
p < 0.05 → Reject H₀ (statistically significant)
p ≥ 0.05 → Fail to reject H₀
Statistics is just the language for talking about uncertain data. You will use it to describe what's typical, spot what's unusual, and decide whether a difference between two groups is real or just noise. You don't need a math degree: four or five ideas, applied carefully, cover most analyst work.
Two halves you'll come back to forever:
| Measure | What it tells you | Watch out for |
| --- | --- | --- |
| Mean | Arithmetic average | Pulled hard by outliers |
| Median | Middle value when sorted | Robust to outliers |
| Mode | Most frequent value | Useful for categorical data |
Income data: mean salary in a startup of 9 engineers + 1 founder is misleading. The median is the honest number.
variance σ² = average squared distance from the mean
std dev σ = sqrt(variance) → same units as the data
IQR = Q3 − Q1 → robust spread, ignores tails
Rule of thumb on a roughly normal distribution: ~68% of values within 1σ, ~95% within 2σ, ~99.7% within 3σ (the empirical rule).
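The spread measures and the empirical rule can both be checked with the standard library. This is a sketch on a made-up sample; note that quartiles have several conventions, and `statistics.quantiles` uses its default ("exclusive") method here, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics
from statistics import NormalDist

# Made-up illustrative sample with one large value
values = [12, 15, 14, 10, 18, 20, 16, 13, 15, 40]

variance = statistics.pvariance(values)         # average squared distance from the mean
std_dev = statistics.pstdev(values)             # same units as the data
q1, _, q3 = statistics.quantiles(values, n=4)   # the three quartile cut points
iqr = q3 - q1                                   # spread of the middle 50%

print(f"variance={variance:.2f} std_dev={std_dev:.2f} IQR={iqr:.2f}")

# Empirical rule: share of a normal distribution within k standard deviations
standard_normal = NormalDist(mu=0, sigma=1)
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sigma: {share:.1%}")
```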
Many measurements — heights, test scores, errors — cluster near a mean and thin out at the extremes. A lot of statistical tools assume normality, so it's the first shape to recognise. Skewed data (income, page views) often needs a log transform before normal-style analysis works.
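To see what a log transform does to skewed data, the sketch below simulates right-skewed "incomes" and compares skewness before and after. The data is simulated as log-normal by construction (an assumption that makes the log transform work perfectly; real income data will be messier), and the `skewness` helper is written out here for illustration.

```python
import math
import random

def skewness(data):
    """Population skewness: mean cubed z-score (about 0 for symmetric data)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(((x - mean) / std) ** 3 for x in data) / n

random.seed(0)  # fixed seed so the run is repeatable

# Simulated right-skewed values; log-normal by construction (an assumption)
incomes = [random.lognormvariate(10, 1) for _ in range(10_000)]
log_incomes = [math.log(x) for x in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness well above zero signals a long right tail; after the transform it should land close to zero.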
1. Reporting only the mean. Always pair with a spread (std dev, IQR) and a sample size.
2. Confusing correlation with causation. "Ice cream sales correlate with drownings" — both are caused by summer.
3. Tiny samples, big claims. A survey of 12 friends is not the country.
4. Cherry-picking the time window. Pick the window before looking at the result.
5. Treating p-values as truth. p < 0.05 is a convention, not a magic threshold.
6. Ignoring outliers without investigating them. They're sometimes the most interesting row in the dataset.
Probability ranges from 0 (impossible) to 1 (certain).
P(A and B) = P(A) × P(B|A) → multiply for both happening
P(A or B) = P(A) + P(B) − P(A and B)
P(A|B) = P(A and B) / P(B) → conditional, the heart of Bayes
The most important everyday tool is conditional probability: "given that the user clicked the email, how likely are they to buy?".
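The email question can be answered straight from counts. The numbers below are hypothetical, invented only to show the arithmetic.

```python
# Hypothetical campaign counts, invented for illustration
total_users = 1000
clicked = 200              # users who clicked the email
clicked_and_bought = 30    # users who clicked AND bought

p_click = clicked / total_users
p_click_and_buy = clicked_and_bought / total_users

# P(buy | click) = P(click and buy) / P(click)
p_buy_given_click = p_click_and_buy / p_click

# The multiplication rule recovers the joint probability
p_joint_check = p_click * p_buy_given_click

print(f"P(buy | click) = {p_buy_given_click:.2f}")
```

Only 3% of all users bought after clicking, but among clickers the rate is 15%: conditioning changes the denominator.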
You rarely have the whole population. You have a *sample*. The CLT says:
> Whatever shape the population has, the distribution of the *sample mean* tends toward normal as sample size grows (n ≥ ~30 is the rule of thumb).
This is why methods built on normality work even when the underlying data is messy — you're using means.
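A quick simulation illustrates the claim. This sketch draws repeated samples from an exponential population (chosen here because it is strongly skewed; the number of repetitions is arbitrary) and looks at how the sample means behave.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30      # n from the rule of thumb
NUM_SAMPLES = 5_000   # how many repeated samples to draw (arbitrary choice)

def draw_sample_mean():
    # Exponential population with rate 1: mean 1, std dev 1, right-skewed
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return sum(sample) / SAMPLE_SIZE

sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

center = statistics.mean(sample_means)    # should sit near the population mean, 1
spread = statistics.stdev(sample_means)   # should sit near 1 / sqrt(30), about 0.18

print(f"mean of sample means: {center:.3f}")
print(f"std dev of sample means: {spread:.3f}")
```

The spread of the sample means shrinks like σ / √n, which is the same quantity that appears in confidence-interval formulas.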
Sampling methods worth knowing: simple random, stratified (proportional buckets), systematic (every kth), cluster (whole groups). Bad sampling → bad answer, no matter how clever the math.
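Three of these methods can be sketched with the standard `random` module. The population of "free" and "paid" users is hypothetical, and cluster sampling is left out to keep the example short.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: 80 "free" users and 20 "paid" users
population = [("free", i) for i in range(80)] + [("paid", i) for i in range(20)]
sample_size = 10

# Simple random: every unit has the same chance of selection
simple = random.sample(population, sample_size)

# Systematic: every kth unit after a random start
k = len(population) // sample_size
start = random.randrange(k)
systematic = population[start::k]

# Stratified (proportional): sample each group in proportion to its size
stratified = []
for plan in ("free", "paid"):
    group = [unit for unit in population if unit[0] == plan]
    share = round(sample_size * len(group) / len(population))
    stratified.extend(random.sample(group, share))

paid_in_stratified = sum(1 for plan, _ in stratified if plan == "paid")
print(len(simple), len(systematic), len(stratified), paid_in_stratified)
```

Stratified sampling guarantees the 80/20 split in the sample; simple random sampling only achieves it on average.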
A 95% CI says: *if we repeated this experiment 100 times, ~95 of the intervals we built would contain the true value*.
95% CI = x̄ ± 1.96 × (σ / √n) → z = 1.96 for 95%, 2.576 for 99%
Report intervals, not just point estimates. "Conversion lifted by 1.2% ± 0.4%" tells the reader far more than "+1.2%".
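The formula translates almost line for line into code. This is a sketch on made-up data: the sample standard deviation stands in for σ, and with only ten observations a t critical value would give a somewhat wider (and more defensible) interval than the z value used here.

```python
import math
import statistics
from statistics import NormalDist

# Made-up daily conversion rates (%), for illustration only
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2, 5.1, 5.0]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)         # sample std dev, standing in for sigma

z = NormalDist().inv_cdf(0.975)      # about 1.96 for a two-sided 95% interval
margin = z * s / math.sqrt(n)        # z * (sigma / sqrt(n))
lower, upper = mean - margin, mean + margin

print(f"95% CI: {mean:.2f} ± {margin:.2f}  ({lower:.2f} to {upper:.2f})")
```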
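You want to know if a change is real. Set up two competing claims: the null hypothesis H₀ (no difference or effect) and the alternative H₁ (there is a difference). Then pick the test that matches your data: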
| Test | Question |
| --- | --- |
| One-sample | Does this group's mean differ from a known value? |
| Two-sample (independent) | Do two groups (A vs B) have different means? |
| Paired | Did the same units change after a treatment? |
Use a t-test when comparing means with a small sample or an unknown population variance. For proportions ("% who clicked") use a z-test for proportions.
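An independent two-sample test might look like the sketch below. It assumes SciPy is installed and uses `scipy.stats.ttest_ind` with `equal_var=False` (Welch's variant, chosen here because it does not require equal variances); the measurements are made up, and Cohen's d is computed by hand as one common effect-size measure.

```python
import math
import statistics
from scipy import stats

# Made-up measurements for two independent groups
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

# Welch's t-test: does not assume the two groups share a variance
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Cohen's d: difference in means divided by the pooled standard deviation
var_a = statistics.variance(group_a)
var_b = statistics.variance(group_b)
n_a, n_b = len(group_a), len(group_b)
pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
cohens_d = (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_sd

alpha = 0.05
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t={result.statistic:.2f} p={result.pvalue:.4f} d={cohens_d:.2f} -> {decision}")
```

Reporting the effect size next to the p-value shows how large the difference is, not just whether it is detectable.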
Linear regression (y = a + bx): fits the best line, gives an interpretable slope.
Bayesian statistics updates a prior belief with evidence to get a posterior:
posterior ∝ prior × likelihood
In A/B testing it answers "what's the probability B is better than A?" directly, with no p-values and no awkward "fail to reject". Tools: PyMC, Stan; or just scipy.stats.beta for conversion-rate experiments.
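For conversion rates the update has a closed form, so a Beta posterior can be sampled without any special tooling. The sketch below uses the standard library's `random.betavariate` in place of the tools named above; the visitor counts are hypothetical, and the uniform Beta(1, 1) prior is an assumption.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical experiment results, invented for illustration
visitors_a, conversions_a = 1000, 50
visitors_b, conversions_b = 1000, 65

def posterior_draw(conversions, visitors):
    # Beta(1, 1) uniform prior + binomial likelihood
    # gives a Beta(1 + successes, 1 + failures) posterior
    return random.betavariate(1 + conversions, 1 + visitors - conversions)

draws = 20_000
b_wins = sum(
    posterior_draw(conversions_b, visitors_b) > posterior_draw(conversions_a, visitors_a)
    for _ in range(draws)
)
prob_b_better = b_wins / draws
print(f"P(B better than A) ≈ {prob_b_better:.3f}")
```

The result is a direct probability statement about B versus A, which is often easier to explain to stakeholders than a p-value.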
1. For a real dataset, compute mean / median / std dev / IQR; plot a histogram and a boxplot.
2. Build a 95% confidence interval for the mean using scipy.stats (or Excel CONFIDENCE.NORM) and explain it in one sentence.
3. Run an independent two-sample t-test on two groups, report the p-value and effect size.
4. Calculate the sample size you'd need for an A/B test with baseline 5%, MDE 0.5%, power 80%.
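As a starting point for task 4, here is one common normal-approximation formula for comparing two proportions. Several inputs are assumptions not stated in the task: the MDE is treated as absolute (5.0% to 5.5%), the test is two-sided at α = 0.05, and both groups are the same size. Treat the result as a rough planning figure.

```python
import math
from statistics import NormalDist

baseline = 0.05   # control conversion rate (from the task)
mde = 0.005       # minimum detectable effect, ASSUMED absolute (5.0% -> 5.5%)
alpha = 0.05      # ASSUMED two-sided significance level
power = 0.80      # from the task

p1, p2 = baseline, baseline + mde
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
z_power = NormalDist().inv_cdf(power)           # about 0.84

# n per group = (z_alpha + z_power)^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2
n_per_group = math.ceil(
    (z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
)
print(f"about {n_per_group:,} users per group")
```

If the MDE is meant as relative (0.5% of the 5% baseline), the required sample grows dramatically, so it is worth pinning down that definition before running the numbers.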