Statistics & Probability

Statistics & Probability — The Mathematical Foundation of Data Analytics

Statistics transforms raw data into meaningful insights by measuring patterns, relationships, and uncertainty. Every data analyst must master these concepts to validate findings and make confident recommendations.

Descriptive Statistics — Summarizing Data

| Measure | Formula | What It Tells You |
| --- | --- | --- |
| Mean | $\bar{x} = \frac{\sum x_i}{n}$ | Average value (sensitive to outliers) |
| Median | Middle value when sorted | Central tendency (robust to outliers) |
| Mode | Most frequent value | Most common observation |
| Range | Max − Min | Spread of data |
| Variance | $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$ | Average squared deviation (divide by n − 1 instead of n for a sample estimate) |
| Std Dev | $\sigma = \sqrt{\sigma^2}$ | Spread in original units |

Example: Employee salaries: [45K, 50K, 52K, 55K, 200K]

  • Mean = 80.4K (skewed by outlier) vs Median = 52K (better center)
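The salary example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries in thousands, copied from the example above.
salaries_k = [45, 50, 52, 55, 200]

mean_k = mean(salaries_k)      # pulled upward by the single 200K salary
median_k = median(salaries_k)  # middle value of the sorted list

print(f"mean = {mean_k}K, median = {median_k}K")
```

Four of the five salaries sit below the mean, which is why the median is the better summary here.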

Probability Basics

P(A) = favorable_outcomes / total_outcomes   # equally likely outcomes; always between 0 and 1
P(A and B) = P(A) × P(B)                    # Independent events
P(A or B) = P(A) + P(B) - P(A and B)        # Addition rule
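As a rough check of these rules, the sketch below enumerates every outcome of rolling two fair dice and counts favorable cases. The dice scenario and the helper `prob` are illustrative assumptions, not part of the original material.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event as favorable outcomes / total outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_six(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_six(o))

print(p_a_and_b == p_a * p_b)             # multiplication rule (independent events)
print(p_a_or_b == p_a + p_b - p_a_and_b)  # addition rule
```

Using `Fraction` keeps the arithmetic exact, so the two rules can be compared with `==` rather than a tolerance.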

Normal Distribution (Bell Curve)

68-95-99.7 Rule:

  • 68% of data falls within 1σ of the mean
  • 95% within 2σ
  • 99.7% within 3σ
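The three percentages follow from the normal cumulative distribution function. A minimal sketch with the standard library's `statistics.NormalDist` (the helper name `share_within` is an assumption for illustration):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.1%}")
```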

Correlation (Pearson's r)

Measures linear relationship: r ranges from −1 (perfect negative) to +1 (perfect positive). r = 0 means no linear correlation. Correlation ≠ Causation!
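To make the range of r concrete, here is a small sketch that computes Pearson's r from its definition. The data are made up for illustration. The U-shaped case shows that r = 0 only rules out a linear relationship, not every relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]      # perfect positive line
falling = [10 - 3 * v for v in x]    # perfect negative line
u_shape = [(v - 3) ** 2 for v in x]  # strong pattern, but not linear

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_u_shape = pearson_r(x, u_shape)
print(r_rising, r_falling, r_u_shape)
```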

Hypothesis Testing

1. State null hypothesis (H₀): "No difference/effect"
2. State alternative (H₁): "There IS a difference"
3. Choose significance level: α = 0.05
4. Calculate test statistic (t-test, z-test)
5. Compare p-value to α:
   p < 0.05 → Reject H₀ (statistically significant)
   p ≥ 0.05 → Fail to reject H₀

Detailed Theory

Statistics is just the language for talking about uncertain data. You will use it to describe what's typical, spot what's unusual, and decide whether a difference between two groups is real or just noise. You don't need a math degree — four or five ideas, applied carefully, cover most analyst work.

What Statistics Actually Is

Two halves you'll come back to forever:

  • Descriptive statistics — "summarise this dataset": mean, median, spread, distribution.
  • Inferential statistics — "draw a conclusion about the population from a sample": confidence intervals, hypothesis tests.
Descriptive answers *what is*; inferential answers *what's likely true beyond what we measured*.

Center: Mean vs Median vs Mode

| Measure | What it tells you | Watch out for |
| --- | --- | --- |
| Mean | Arithmetic average | Pulled hard by outliers |
| Median | Middle value when sorted | Robust to outliers |
| Mode | Most frequent value | Useful for categorical data |

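Income data: in a startup of 9 engineers and 1 founder, a single unusually high (or low) founder salary shifts the mean away from what a typical employee earns, so the mean is misleading. The median is the honest number.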

Spread: Variance, Std Dev, IQR

variance σ² = average squared distance from the mean
std dev  σ = sqrt(variance)              → same units as the data
IQR        = Q3 − Q1                     → robust spread, ignores tails
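The three spread measures can be computed with the standard library. This is a sketch on made-up numbers. It uses the population formulas (`pvariance`, `pstdev`) to match the definition above, and quartile conventions differ between tools, so the IQR may not match Excel or NumPy exactly.

```python
from statistics import pstdev, pvariance, quantiles

# Illustrative data only; not taken from the course material.
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = pvariance(values)       # average squared distance from the mean
std_dev = pstdev(values)           # square root of the variance, same units as the data
q1, _, q3 = quantiles(values, n=4) # quartile cut points (default "exclusive" method)
iqr = q3 - q1                      # spread of the middle half of the data

print(f"variance={variance}, std dev={std_dev}, IQR={iqr}")
```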

Rule of thumb on a roughly normal distribution: ~68% of values within 1σ, ~95% within 2σ, ~99.7% within 3σ (the empirical rule).

The Normal Distribution (Bell Curve)

Many measurements — heights, test scores, errors — cluster near a mean and thin out at the extremes. A lot of statistical tools assume normality, so it's the first shape to recognise. Skewed data (income, page views) often needs a log transform before normal-style analysis works.

Beginner Mistakes to Skip

1. Reporting only the mean. Always pair with a spread (std dev, IQR) and a sample size.
2. Confusing correlation with causation. "Ice cream sales correlate with drownings": both are caused by summer.
3. Tiny samples, big claims. A survey of 12 friends is not the country.
4. Cherry-picking the time window. Pick the window before looking at the result.
5. Treating p-values as truth. p < 0.05 is a convention, not a magic threshold.
6. Ignoring outliers without investigating them. They're sometimes the most interesting row in the dataset.

Intermediate: Probability Basics

Probability ranges from 0 (impossible) to 1 (certain).

P(A and B) = P(A) × P(B|A)        → multiply for both happening
P(A or B)  = P(A) + P(B) − P(A and B)
P(A|B)     = P(A and B) / P(B)   → conditional, the heart of Bayes

The most important everyday tool is conditional probability — "given that the user clicked the email, how likely are they to buy?".
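The email question can be answered directly from counts. The numbers below are hypothetical, chosen only to show the arithmetic of P(A|B) = P(A and B) / P(B).

```python
from fractions import Fraction

# Hypothetical campaign counts (assumptions for illustration).
total_users = 1000
clicked = 200
clicked_and_bought = 30

p_click = Fraction(clicked, total_users)
p_click_and_buy = Fraction(clicked_and_bought, total_users)

# Conditional probability: restrict attention to users who clicked.
p_buy_given_click = p_click_and_buy / p_click

print(float(p_buy_given_click))
```

Rearranged, the same numbers satisfy the multiplication rule: P(click) × P(buy | click) gives back P(click and buy).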

Intermediate: Sampling & The Central Limit Theorem

You rarely have the whole population. You have a *sample*. The CLT says:

> Whatever shape the population has (as long as its variance is finite), the distribution of the *sample mean* tends toward normal as sample size grows (n ≥ ~30 is a common rule of thumb; heavily skewed data can need more).

This is why methods built on normality work even when the underlying data is messy — you're using means.
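A quick simulation gives a feel for the theorem. This sketch draws repeated samples from a strongly right-skewed exponential distribution (an assumption chosen for illustration) and checks that the sample means cluster around the population mean of 1, with spread close to 1/√n.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
N_SAMPLES = 5000

def one_sample_mean(n):
    """Mean of n draws from an exponential distribution (mean 1, std dev 1)."""
    return mean([random.expovariate(1.0) for _ in range(n)])

sample_means = [one_sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]
center = mean(sample_means)
spread = stdev(sample_means)

# For a roughly normal shape, about 95% of values lie within 1.96 spreads of the center.
share_within = sum(abs(m - center) <= 1.96 * spread for m in sample_means) / N_SAMPLES

print(f"center={center:.3f}, spread={spread:.3f}, within 1.96 spreads={share_within:.1%}")
```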

Sampling methods worth knowing: simple random, stratified (proportional buckets), systematic (every kth), cluster (whole groups). Bad sampling → bad answer, no matter how clever the math.

Intermediate: Confidence Intervals

A 95% CI says: *if we repeated this experiment 100 times, ~95 of the intervals we built would contain the true value*.

95% CI = x̄ ± 1.96 × (σ / √n)        → z = 1.96 for 95%, 2.576 for 99%

Report intervals, not just point estimates. "Conversion lifted by 1.2% ± 0.4%" tells the reader far more than "+1.2%".
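The interval formula translates almost line for line into code. This is a sketch assuming the population standard deviation σ is known. With σ estimated from a small sample, a t critical value would replace z. The function name and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (low, high) for the mean, using the normal critical value."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: mean 50, sigma 10, 100 observations.
low, high = z_confidence_interval(50.0, 10.0, 100)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Quadrupling n halves the margin, because the margin shrinks with √n.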

Intermediate: Hypothesis Testing in Plain English

You want to know if a change is real. Set up two competing claims:

  • H₀ (null) — "there is no effect".
  • H₁ (alt) — "there is an effect".
Collect data, compute a test statistic, look up its p-value.

  • p-value < 0.05 → reject H₀ (data this extreme would be unlikely if there were no effect).
  • p-value ≥ 0.05 → fail to reject H₀ (cannot conclude an effect).
The p-value is not "probability the null is true" — it's "probability of seeing data this extreme *if* the null were true".
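The definition of a p-value can be computed exactly for a simple case. The sketch below uses a hypothetical coin-flip experiment with H₀ "the coin is fair" and adds up the probability of every outcome at least as far from the expected count as the one observed.

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """Exact two-sided p-value for H0: P(heads) = 0.5."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count all sequences whose head count is at least as extreme as the observed one.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p_value = fair_coin_p_value(60, 100)  # hypothetical result: 60 heads in 100 flips
print(f"p = {p_value:.4f}")
```

With these made-up numbers the p-value lands just above 0.05, a reminder that the threshold is a convention and that near-threshold results deserve caution either way.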

Intermediate: t-Tests — Three Flavours

| Test | Question |
| --- | --- |
| One-sample | Does this group's mean differ from a known value? |
| Two-sample (independent) | Do two groups (A vs B) have different means? |
| Paired | Did the same units change after a treatment? |

Use a t-test when comparing means and the sample is small or the population variance is unknown. For proportions ("% who clicked") use a z-test for proportions.
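For the proportions case, a two-proportion z-test needs only the standard library. This is a sketch using the pooled-proportion form of the test; the conversion counts are hypothetical. For a t-test on means, `scipy.stats.ttest_ind` is the usual tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # best estimate of the rate under H0
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B result: 200/4000 conversions vs 260/4000.
z_stat, p_val = two_proportion_z_test(200, 4000, 260, 4000)
print(f"z = {z_stat:.2f}, p = {p_val:.4f}")
```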

Advanced: Type I & Type II Errors, Power

  • Type I (α) — false positive. Saying there's an effect when there isn't. Usually capped at 5%.
  • Type II (β) — false negative. Missing a real effect.
  • Power = 1 − β — the chance you detect a real effect. Aim for 80%.
Small samples mean low power, and low power means you'll miss real wins. Always compute the required sample size before running an A/B test.
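One common approximation for the required sample size of a two-group conversion test is sketched below. It assumes a two-sided test, equal group sizes, and a minimum detectable effect (MDE) expressed in absolute percentage points. Other formulas and tools give slightly different answers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect baseline -> baseline + mde_abs."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired power (1 - beta)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)

# Hypothetical plan: 5% baseline, detect a lift to 5.5%, 80% power.
n_needed = sample_size_per_group(0.05, 0.005)
print(n_needed)
```

Under these assumptions the answer is in the low tens of thousands per group. Halving the MDE roughly quadruples the requirement, because the effect size is squared in the denominator.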

Advanced: Correlation, Regression, Causation

  • Pearson r — linear correlation, range −1 to +1. Sensitive to outliers.
  • Spearman ρ — rank-based, robust, captures monotonic relationships.
  • Linear regression (y = a + bx) — fits the best line, gives an interpretable slope.
A strong correlation never proves causation. To claim cause you need either an experiment (random assignment) or a causal-inference design (instrumental variables, diff-in-diff, RDD).
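The regression line y = a + bx mentioned above can be fitted with the ordinary least-squares formulas. This is a minimal sketch on made-up data; the variable names are illustrative. The fitted slope describes association only and carries the same causation caveat.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # change in y per one-unit change in x
    intercept = mean_y - slope * mean_x    # the line passes through (mean_x, mean_y)
    return intercept, slope

# Hypothetical data: ad spend (x) against sales (y).
ad_spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(ad_spend, sales)
print(f"sales ≈ {intercept:.2f} + {slope:.2f} × ad_spend")
```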

Advanced: A/B Testing Pitfalls

  • Peeking — stopping the test when p drops below 0.05 inflates false positives. Decide n in advance.
  • Multiple comparisons — testing 20 metrics at α=0.05 → ~one false positive expected by chance. Use Bonferroni or FDR correction.
  • Sample-ratio mismatch — if you split 50/50 but your data shows 47/53, randomization is broken — stop and investigate.
  • Novelty effect — short tests over-credit shiny new things. Run for at least one full business cycle (often a week).
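Two of the pitfalls above can be checked numerically. The sketch below computes the chance of at least one false positive across several independent tests, the Bonferroni-adjusted threshold, and a chi-square check for sample-ratio mismatch. The function names are illustrative, and the independence of the tests is an assumption.

```python
from math import erfc, sqrt

def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error at or below alpha."""
    return alpha / n_tests

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df

print(f"20 tests at 0.05: {family_wise_error(0.05, 20):.0%} chance of a false positive")
print(f"Bonferroni threshold: {bonferroni_alpha(0.05, 20)}")
print(f"47/53 split, 100 users:    p = {srm_p_value(47, 53):.3f}")
print(f"47/53 split, 10,000 users: p = {srm_p_value(4700, 5300):.2e}")
```

The same 47/53 split is unremarkable with 100 users and a clear warning sign with 10,000, so the mismatch check should always account for sample size.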

Advanced: Bayesian Thinking (One Page)

Bayesian stats updates a prior belief with evidence to get a posterior:

posterior ∝ prior × likelihood

In A/B testing it answers "what's the probability B is better than A?" directly — no p-values, no awkward "fail to reject". Tools: PyMC, Stan; or just scipy.stats.beta for conversion-rate experiments.
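For conversion rates, the posterior under a uniform Beta(1, 1) prior is itself a Beta distribution, so "probability B beats A" can be estimated by sampling. This sketch uses the standard library's `random.betavariate` rather than PyMC or Stan. The prior choice and the counts are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the estimate is repeatable

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # Posterior: Beta(1 + conversions, 1 + non-conversions)
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical A/B result: 200/4000 conversions vs 260/4000.
p_b_better = prob_b_beats_a(200, 4000, 260, 4000)
print(f"P(B better than A) ≈ {p_b_better:.3f}")
```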

Practice Path

1. For a real dataset, compute mean / median / std dev / IQR; plot a histogram and a boxplot.
2. Build a 95% confidence interval for the mean using scipy.stats (or Excel CONFIDENCE.NORM) and explain it in one sentence.
3. Run an independent two-sample t-test on two groups, report the p-value and effect size.
4. Calculate the sample size you'd need for an A/B test with baseline 5%, MDE 0.5%, power 80%.