Machine Learning Basics

Machine Learning Basics — Teaching Computers to Learn from Data

Machine Learning (ML) is the branch of AI where algorithms learn patterns from data instead of being explicitly programmed. As a data analyst, understanding ML allows you to build predictive models and automate decision-making.

Types of Machine Learning

| Type | Goal | Examples |
| --- | --- | --- |
| Supervised | Learn from labeled data | Regression, Classification |
| Unsupervised | Find hidden patterns | Clustering, Dimensionality Reduction |
| Reinforcement | Learn from rewards/penalties | Game AI, Robotics |

The ML Workflow

1. Collect & clean data
2. Split into train/test sets (typically 80/20)
3. Choose a model
4. Train on training data
5. Tune hyperparameters (using cross-validation or a validation split carved out of the training data)
6. Evaluate on test data
7. Deploy

Linear Regression

Predicts a continuous value by fitting a line through data points:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

The model minimizes the Mean Squared Error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
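As a minimal sketch of the two formulas above, here is the one-feature case fitted with the closed-form least-squares solution, which is the line that minimizes MSE. The data points are made up for illustration, and the helper names `fit_line` and `mse` are not from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y_hat = w0 + w1 * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1


def mse(ys, preds):
    """Mean Squared Error between true values and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Illustrative data only: roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w0, w1 = fit_line(xs, ys)
preds = [w0 + w1 * x for x in xs]
print(f"w0={w0:.3f}, w1={w1:.3f}, MSE={mse(ys, preds):.4f}")
```

With more than one feature you would normally reach for a library rather than hand-rolled sums; the point here is only to show what "fitting a line" computes.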

Logistic Regression

Despite the name, it's used for classification (binary outcomes). It applies a sigmoid function to output probabilities between 0 and 1:

$$P(y=1) = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$$
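A small sketch of the sigmoid formula above. The weights `w0` and `w1` are hypothetical values picked for illustration, not fitted from data, and the 0.5 cut-off is just the conventional default.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1).

    Not hardened for very large negative z, where math.exp can overflow.
    """
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w0, w1):
    """P(y=1) for a single feature x, following the formula above."""
    return sigmoid(w0 + w1 * x)


w0, w1 = -4.0, 2.0  # hypothetical weights, for illustration only

for x in (0.0, 2.0, 4.0):
    p = predict_proba(x, w0, w1)
    label = int(p >= 0.5)  # turn the probability into a class
    print(f"x={x}: P(y=1)={p:.3f}, predicted class={label}")
```

The model itself only outputs a probability; the class label comes from the threshold you apply afterwards.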

Evaluation Metrics

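| Metric | Use Case | Formula |
| --- | --- | --- |
| MSE | Regression | Mean of squared errors |
| RMSE | Regression | Square root of MSE |
| R² | Regression | 1 − (SS_res / SS_tot); at most 1, and negative when the model does worse than predicting the mean |
| Accuracy | Classification | Correct / Total |
| Precision | Classification | TP / (TP + FP) |
| Recall | Classification | TP / (TP + FN) |
| F1 Score | Classification | 2 × (P × R) / (P + R) |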
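To make the regression rows of the table concrete, here is a short sketch that computes MSE, RMSE and R² by hand. The numbers are invented, and `regression_metrics` is an illustrative helper, not a library function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 using the formulas from the table."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Predicting the mean for every row gives R^2 = 0 by construction.
mean_only = regression_metrics(y_true, [6.0] * 4)
print(mean_only["r2"])
```

The second call is a useful sanity check: any model scoring below that on held-out data is worse than guessing the average.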

Overfitting vs Underfitting

  • Overfitting: Model memorizes training data, fails on new data (high variance)
  • Underfitting: Model too simple, misses patterns (high bias)
  • Cross-validation (k-fold) helps detect overfitting by testing on multiple data splits

Detailed Theory

Machine learning sounds magical but is mostly "learn a function from examples". Give it labeled data and it finds patterns, then predicts on new data. As an analyst, you don't need to invent algorithms; you need to know which one to pick, how to evaluate it honestly, and how to avoid the traps that produce confident-but-wrong models.

What Machine Learning Actually Is

Three flavours you'll meet:

| Type | Goal | Example |
| --- | --- | --- |
| Supervised | predict a label given features | spam / not spam, price |
| Unsupervised | find structure without labels | customer segmentation |
| Reinforcement | learn by trial + reward | game-playing, ad bidding |

For most analyst work, you're doing supervised learning: either classification (discrete label) or regression (continuous number).

The Standard ML Workflow

1. Frame the problem  → what are you predicting? what's the metric?
2. Get & clean data
3. Split: train / validation / test (e.g. 70 / 15 / 15)
4. Train a baseline model
5. Evaluate on validation → tune
6. Final, untouched test set → honest score
7. Deploy + monitor

Skipping any step (especially the test split) is how junior analysts ship models that look great in the notebook and fail in production.
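Step 3 of the workflow above can be sketched in a few lines. This is a plain-Python illustration of a shuffled 70 / 15 / 15 split with a fixed seed; the function name and defaults are assumptions, and it is only appropriate when rows are independent (not for time series).

```python
import random


def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once with a fixed seed, then carve off test and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # local RNG so the split is reproducible
    n = len(rows)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Tune against `val` as often as you like; `test` stays untouched until the final score.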

Beginner Mistakes to Skip

1. Data leakage. Scaling/encoding the *whole* dataset before splitting leaks test info into training. Always split first, fit transformers only on train.
2. Touching the test set more than once. Every peek + tweak biases your final number. Use a separate validation set for tuning.
3. Accuracy on imbalanced data. 99% accuracy on fraud detection where 1% of cases are fraud can simply mean you predicted "not fraud" for everyone.
4. No baseline. Always compare to a dumb baseline (predict the mean / most-common class). If your fancy model isn't beating it, something's wrong.
5. Random splits on time-series. Future leaks into the past. Use time-based splits.
6. Ignoring class imbalance. Use stratified splits, class weights, or resampling.
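The data-leakage mistake is easiest to see in code. Below is a minimal sketch of standardising a feature the safe way: the mean and standard deviation come from the training rows only and are then reused on the test rows. The helper names and the numbers are illustrative, and the feature is assumed to be non-constant (otherwise the standard deviation is zero).

```python
def fit_scaler(values):
    """Learn mean and (population) standard deviation from training data only."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean, std


def transform(values, mean, std):
    """Apply previously learned statistics; never re-fit on new data."""
    return [(v - mean) / std for v in values]


train = [10.0, 12.0, 14.0, 16.0]
test = [40.0, 42.0]  # deliberately shifted, to mimic unseen data

mean, std = fit_scaler(train)            # fit on train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuse the train statistics

print(train_scaled)
print(test_scaled)
```

If you had fitted the scaler on `train + test` instead, the test rows would have pulled the mean toward themselves and looked far less unusual than they really are.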

Intermediate: Bias–Variance Tradeoff

For squared-error loss, a model's expected error breaks into:

Total Error = Bias² + Variance + Irreducible Noise

  • High bias → model too simple → *underfits*. Both train and val scores low.
  • High variance → model too complex → *overfits*. Train score great, val score bad.
  • Sweet spot → balance complexity, regularise, get more data.
Diagnostic: plot train vs validation score as you increase complexity; a widening gap between the two signals high variance.
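Written out more formally, the decomposition looks like this. It is the standard result for squared-error loss, under the assumptions that the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that the expectation is taken over training sets (and noise) at a fixed input $x$. The symbols $f$, $\hat{f}$ and $\sigma^2$ are introduced here for illustration.

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

Bias measures how far the average fitted model is from the truth; variance measures how much the fitted model moves when the training set changes; the noise term is there no matter what model you pick.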

Intermediate: Cross-Validation

One train/val split is noisy. K-Fold CV averages over multiple splits:

Fold 1: [val][train][train][train][train]
Fold 2: [train][val][train][train][train]
...
Score = mean(fold scores) ± std(fold scores)

Variants:

  • StratifiedKFold — preserves class ratios (classification).
  • TimeSeriesSplit — respects time order.
  • GroupKFold — keeps related rows (same patient, same user) in the same fold.
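To show what plain K-Fold does under the hood, here is a dependency-free sketch that generates the fold indices and averages a score across folds. It assumes the rows are already shuffled and independent; the "model" is just a predict-the-training-mean baseline, and all names are illustrative rather than a library API.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds covering every row once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_std(scores):
    """Mean and (population) standard deviation of the fold scores."""
    m = sum(scores) / len(scores)
    s = (sum((x - m) ** 2 for x in scores) / len(scores)) ** 0.5
    return m, s


ys = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]  # toy targets

scores = []
for train_idx, val_idx in k_fold_indices(len(ys), k=5):
    baseline = sum(ys[i] for i in train_idx) / len(train_idx)   # "fit" on train rows
    fold_mse = sum((ys[i] - baseline) ** 2 for i in val_idx) / len(val_idx)
    scores.append(fold_mse)

mean_score, std_score = mean_std(scores)
print(f"MSE = {mean_score:.2f} ± {std_score:.2f}")
```

The stratified, time-series and grouped variants change only how the index lists are built; the train-then-score loop stays the same.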

Intermediate: Choosing a Metric

The metric *defines* what "good" means — pick it before training.

| Task | Metric | When to use |
| --- | --- | --- |
| Regression | MAE | outliers should not dominate the score |
| Regression | RMSE | big errors hurt more |
| Regression | R² | how much variance explained |
| Classification | Accuracy | balanced classes only |
| Classification | Precision | false positives are costly |
| Classification | Recall | false negatives are costly |
| Classification | F1 | balance both |
| Classification | ROC-AUC | rank quality, threshold-free |

Intermediate: Confusion Matrix Mental Model

              predicted +    predicted -
actual +      TP             FN
actual -      FP             TN

  • Precision = TP / (TP + FP) — *of those I flagged, how many were right?*
  • Recall = TP / (TP + FN) — *of those that were positive, how many did I catch?*
  • F1 = harmonic mean of the two.
For cancer screening: optimise recall (don't miss real cases). For spam filters: optimise precision (don't junk real email).
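The confusion-matrix formulas can be checked with a few lines of code. This sketch counts the four cells from 0/1 labels and derives precision, recall and F1. The labels are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and their harmonic mean (F1)."""
    tp, fn, fp, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
```

Try replacing `y_pred` with all zeros: accuracy stays at 60% on this toy data while recall drops to zero, which is the imbalanced-accuracy trap in miniature.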

Advanced: Regularisation

Add a penalty on model complexity to fight overfitting:

  • L1 (Lasso) — λ·Σ|w| — zeroes out unimportant features (does feature selection).
  • L2 (Ridge) — λ·Σw² — shrinks all weights, keeps features.
  • Elastic Net — mix of both.
  • Dropout / weight decay — the deep-learning equivalents.
Rule: weaker model + regularisation > stronger model that overfits.
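The different behaviour of the two penalties shows up even with a single weight and no intercept. The sketch below uses the closed-form minimisers of `sum((y - w*x)^2) + penalty`; the objective scaling, data and function names are assumptions chosen to keep the algebra short, and real libraries scale the loss differently.

```python
def ridge_weight(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)  # shrinks toward 0 but never reaches it exactly


def lasso_weight(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * |w| for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # soft-thresholding
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Toy data with a weak relationship between x and y.
xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]

print("no penalty:", ridge_weight(xs, ys, 0.0))
print("ridge     :", ridge_weight(xs, ys, 14.0))
print("lasso     :", lasso_weight(xs, ys, 4.0))
```

Ridge halves the weight here but keeps it non-zero; lasso, once the penalty outweighs the feature's correlation with the target, sets it to exactly zero. That is the feature-selection effect mentioned above.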

Advanced: Hyperparameter Tuning

  • GridSearchCV — tries every combo. Slow, exhaustive.
  • RandomizedSearchCV — samples N combos. Often as good for far less compute.
  • Bayesian / Optuna — learns which hyperparams to try next. Production-grade.
Always tune on the validation fold(s), never the test set.
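Here is a bare-bones sketch of the random-search idea, without any ML library: sample a regularisation strength `lam` on a log scale, fit on the training set, and keep whichever value scores best on the validation set. The one-weight ridge model, the synthetic data and the search range are all assumptions made for illustration.

```python
import random


def ridge_weight(xs, ys, lam):
    """One-weight ridge fit: minimise sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, w):
    """Mean squared error of the model y_hat = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the search is reproducible


def make_data(n):
    """Synthetic data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


train_x, train_y = make_data(20)
val_x, val_y = make_data(20)

trials = []
for _ in range(25):
    lam = 10 ** rng.uniform(-3, 2)           # log-uniform sample in [1e-3, 1e2]
    w = ridge_weight(train_x, train_y, lam)   # fit on train only
    trials.append((mse(val_x, val_y, w), lam))  # score on validation only

best_val_mse, best_lam = min(trials)
print(f"best lam={best_lam:.4f}, validation MSE={best_val_mse:.4f}")
```

The test set never appears in this loop. It is scored once, afterwards, with the chosen `lam`.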

Advanced: Handling Imbalanced Classes

  • class_weight='balanced' — free, often enough.
  • SMOTE / oversampling — generates synthetic minority samples (apply *only on training fold*).
  • Threshold tuning — default 0.5 is rarely optimal; pick the threshold that maximises your business metric on validation.
  • Imbalance-aware metrics — use F1 / PR-AUC, not accuracy.
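Threshold tuning is the easiest of these to try. The sketch below scans candidate thresholds on a validation set and keeps the one with the best F1. F1 stands in for "your business metric" here, and the labels and predicted probabilities are invented for illustration.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))


# Validation labels (30% positive) and hypothetical model probabilities.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
p_val = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.40, 0.80]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
chosen = best_threshold(y_val, p_val, candidates)

print("F1 at 0.5   :", f1_at_threshold(y_val, p_val, 0.5))
print("chosen      :", chosen)
print("F1 at chosen:", f1_at_threshold(y_val, p_val, chosen))
```

On this toy data the default 0.5 catches only one of the three positives, while a lower threshold catches all three at the cost of one false alarm. Pick the threshold on validation data, then confirm it once on the test set.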

Advanced: From Notebook to Production

  • Snapshot the data + code + random seed so results reproduce.
  • Pin model + library versions (requirements.txt / environment.yml).
  • Monitor drift — input distribution and prediction distribution over time.
  • Have a fallback — if the model fails or returns garbage, default to the baseline.
  • Track decisions, not just predictions — log inputs, outputs, threshold, model version. Auditable AI is non-negotiable.
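The last bullet can be sketched as one JSON line per decision. Every field name, the example features and the version string below are hypothetical; adapt them to whatever your own logging setup expects.

```python
import json
from datetime import datetime, timezone


def decision_record(features, probability, threshold, model_version):
    """Bundle everything needed to audit one prediction later."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": features,
        "probability": probability,
        "threshold": threshold,
        "decision": probability >= threshold,  # what the system actually did
    }


record = decision_record(
    features={"amount": 120.5, "country": "DE"},  # hypothetical inputs
    probability=0.81,
    threshold=0.35,
    model_version="fraud-v1.2.0",                 # hypothetical version tag
)

line = json.dumps(record, sort_keys=True)  # one line per decision, easy to append to a log
print(line)
```

Logging the threshold and model version alongside the score matters because both change over time; without them you cannot reconstruct why a past decision was made.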

Practice Path

1. Pick a Kaggle classification dataset. Build a baseline (predict majority class). Beat it with a simple logistic regression.
2. Add 5-fold cross-validation; report mean ± std accuracy AND F1.
3. Diagnose bias vs variance with a learning curve (train + val score as data size grows).
4. Tune one hyperparameter with RandomizedSearchCV, check that the cross-validated improvement is larger than the fold-to-fold std (so it isn't just luck), then confirm it once on the test set.