Machine Learning (ML) is the branch of AI where algorithms learn patterns from data instead of being explicitly programmed. As a data analyst, understanding ML allows you to build predictive models and automate decision-making.
| Type | Goal | Examples |
| --- | --- | --- |
| Supervised | Learn from labeled data | Regression, Classification |
| Unsupervised | Find hidden patterns | Clustering, Dimensionality Reduction |
| Reinforcement | Learn from rewards/penalties | Game AI, Robotics |
1. Collect & clean data
2. Split into train/test sets (typically 80/20)
3. Choose a model
4. Train on training data
5. Evaluate on test data
6. Tune hyperparameters
7. Deploy

Linear regression predicts a continuous value by fitting a line through data points:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
The model minimizes the Mean Squared Error (MSE):
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
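As a rough illustration, the sketch below fits the one-feature case $\hat{y} = w_0 + w_1 x$ by ordinary least squares and then computes the MSE. The data points and the helper names `fit_line` and `mse` are invented for this example. In practice a library would do the fitting, so treat this as a way to see the formulas in action, not as a recommended implementation.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y_hat = w0 + w1 * x (one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    w1 = cov_xy / var_x           # slope
    w0 = y_mean - w1 * x_mean     # intercept
    return w0, w1


def mse(ys, preds):
    """Mean of squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (made up for illustration): roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w0, w1 = fit_line(xs, ys)
preds = [w0 + w1 * x for x in xs]
fit_error = mse(ys, preds)

print(f"w0={w0:.3f}  w1={w1:.3f}  MSE={fit_error:.4f}")
```

Any other line through the same points, for example $\hat{y} = 2x$, gives an MSE at least as large, which is what "minimizes the MSE" means here.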
Logistic regression, despite the name, is used for classification (binary outcomes). It applies a sigmoid function to a linear score to output probabilities between 0 and 1:
$$P(y=1) = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$$
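A minimal sketch of that formula, assuming a single feature and hand-picked weights (`w0 = -4.0`, `w1 = 2.0` are hypothetical, not learned from data). It shows how the linear score is squashed into a probability and how a 0.5 threshold turns that probability into a class label.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w0, w1):
    """P(y = 1) for a single feature x under a logistic model."""
    return sigmoid(w0 + w1 * x)


# Hypothetical weights chosen by hand for illustration only.
w0, w1 = -4.0, 2.0

probabilities = {}
for x in (0.0, 2.0, 4.0):
    p = predict_proba(x, w0, w1)
    probabilities[x] = p
    label = int(p >= 0.5)  # 0.5 is a common default threshold, not a rule
    print(f"x={x}: P(y=1)={p:.3f} -> class {label}")
```

With these weights the decision boundary sits at x = 2, where the linear score is zero and the probability is exactly 0.5.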
| Metric | Use Case | Formula |
| --- | --- | --- |
| MSE | Regression | Mean of squared errors |
| RMSE | Regression | Square root of MSE |
| R² | Regression | 1 − (SS_res / SS_tot); at most 1, and negative when the model does worse than predicting the mean |
| Accuracy | Classification | Correct / Total |
| Precision | Classification | TP / (TP + FP) |
| Recall | Classification | TP / (TP + FN) |
| F1 Score | Classification | 2 × (P × R) / (P + R) |
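The classification formulas in the table can be checked by hand on a small example. The sketch below uses made-up labels for 100 transactions, 5 of them fraud, and compares an "always predict not fraud" baseline with a hypothetical model. The function names and the convention of returning 0 when a denominator is zero are choices made for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Convention for this sketch: a metric with a zero denominator is 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up data: 5 fraud cases followed by 95 legitimate ones.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
# Hypothetical model: catches 3 of 5 frauds and raises 2 false alarms.
model_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, model_pred)
print("baseline:", baseline)
print("model:   ", model)
```

The two accuracies differ by a single percentage point, while recall goes from 0 to 0.6. That gap is why accuracy alone is a poor guide on imbalanced data.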
Machine learning sounds magical but is mostly "learn a function from examples". Give it labeled data, it finds patterns, then predicts on new data. As an analyst, you don't need to invent algorithms — you need to know which one to pick, how to evaluate it honestly, and how to avoid the traps that produce confident-but-wrong models.
Three flavours you'll meet:
| Type | Goal | Example |
| --- | --- | --- |
| Supervised | predict a label given features | spam / not spam, price |
| Unsupervised | find structure without labels | customer segmentation |
| Reinforcement | learn by trial + reward | game-playing, ad bidding |
For 90% of analyst work, you're doing supervised learning — either classification (discrete label) or regression (continuous number).
1. Frame the problem → what are you predicting? what's the metric?
2. Get & clean data
3. Split: train / validation / test (e.g. 70 / 15 / 15)
4. Train a baseline model
5. Evaluate on validation → tune
6. Final, untouched test set → honest score
7. Deploy + monitor

Skipping any step (especially the test split) is how junior analysts ship models that look great in the notebook and fail in production.
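Step 3 of the workflow can be sketched with the standard library alone. The helper name `train_val_test_split`, the fixed seed, and the integer "rows" are assumptions for illustration. A random shuffle like this suits independent rows only; it is the wrong tool for time-ordered data.

```python
import random


def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve off test and validation slices."""
    rows = list(rows)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


# Stand-in dataset: 100 row ids.
rows = list(range(100))
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so the test set stays the same rows every time the notebook is rerun.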
1. Data leakage. Scaling/encoding the *whole* dataset before splitting leaks test info into training. Always split first, fit transformers only on train.
2. Touching the test set more than once. Every peek + tweak biases your final number. Use a separate validation set for tuning.
3. Accuracy on imbalanced data. 99% accuracy on fraud detection where 1% are fraud means you predicted "not fraud" for everyone.
4. No baseline. Always compare to a dumb baseline (predict the mean / most-common class). If your fancy model isn't beating it, something's wrong.
5. Random splits on time-series. Future leaks into the past. Use time-based splits.
6. Ignoring class imbalance. Use stratified splits, class weights, or resampling.
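The first pitfall is easier to see with numbers. In this sketch, standardization statistics are fitted on the training values only and then reused on the test values. The tiny datasets and the helper names `fit_scaler` and `transform` are invented for the example. The "leaky" variant is included only to show how the statistics shift when test rows are allowed in.

```python
import statistics


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std or 1.0   # guard against a zero standard deviation


def transform(values, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Made-up feature values; the test set contains an unusually large value.
train = [10.0, 12.0, 11.0, 13.0, 14.0]
test = [30.0, 9.0]

# Correct: fit on train, apply the same statistics to both sets.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky: statistics computed on train + test, so test rows influence training.
leaky_mean, leaky_std = fit_scaler(train + test)

print(f"train-only: mean={mean:.2f} std={std:.2f}")
print(f"leaky:      mean={leaky_mean:.2f} std={leaky_std:.2f}")
```

The leaky statistics are pulled toward the large test value, so a model trained on data scaled that way has already seen something about the test set.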
Every model error breaks into:
Total Error = Bias² + Variance + Irreducible Noise

One train/val split is noisy. K-Fold CV averages over multiple splits:
Fold 1: [val][train][train][train][train]
Fold 2: [train][val][train][train][train]
...
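Score = mean(fold scores) ± std

Variants: stratified K-Fold (keeps class ratios the same in every fold) and time-based splits for time series.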
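The fold rotation above can be sketched in plain Python. Everything here is illustrative: `k_fold_indices` and `cross_validate` are hypothetical helper names, the "model" just predicts the mean of the training targets, and the score is mean absolute error. Rows are not shuffled in this sketch, so independent data should be shuffled before calling it.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); each sample lands in validation exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(ys, k=5):
    """Score a predict-the-training-mean baseline with MAE on each fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = statistics.fmean([ys[i] for i in train_idx])
        errors = [abs(ys[i] - prediction) for i in val_idx]
        scores.append(statistics.fmean(errors))
    return scores


# Toy targets, made up for illustration.
ys = [float(v) for v in range(10)]
scores = cross_validate(ys, k=5)
mean_score = statistics.fmean(scores)
std_score = statistics.stdev(scores)

print(f"MAE per fold: {scores}")
print(f"MAE = {mean_score:.2f} ± {std_score:.2f}")
```

Reporting the spread alongside the mean shows how much the score depends on which rows happened to land in validation.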
The metric *defines* what "good" means — pick it before training.
| Task | Metric | When |
| --- | --- | --- |
| Regression | MAE | outliers don't matter much |
| Regression | RMSE | big errors hurt more |
| Regression | R² | how much variance explained |
| Classification | Accuracy | balanced classes only |
| Classification | Precision | false positives are costly |
| Classification | Recall | false negatives are costly |
| Classification | F1 | balance both |
| Classification | ROC-AUC | rank quality, threshold-free |
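To make the MAE versus RMSE rows concrete, the sketch below scores two made-up sets of predictions that share the same total absolute error. One spreads it evenly and the other concentrates it in a single large miss. The numbers and function names are illustrative assumptions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses dominate."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
small_errors = [11.0, 13.0, 13.0, 15.0, 19.0]   # every prediction off by 1
one_big_miss = [10.0, 12.0, 14.0, 16.0, 23.0]   # four exact, one off by 5

for name, preds in (("small errors", small_errors),
                    ("one big miss", one_big_miss)):
    print(f"{name}: MAE={mae(y_true, preds):.3f} "
          f"RMSE={rmse(y_true, preds):.3f} R2={r2(y_true, preds):.3f}")
```

Both prediction sets have the same MAE, but RMSE and R² penalize the single large miss. This is the practical difference between "outliers don't matter much" and "big errors hurt more".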
| | predicted + | predicted - |
| --- | --- | --- |
| actual + | TP | FN |
| actual - | FP | TN |

Add a penalty on model complexity to fight overfitting:
L1 (Lasso), λ·Σ|w|: zeroes out unimportant features (does feature selection).

L2 (Ridge), λ·Σw²: shrinks all weights, keeps features.

For class imbalance: class_weight='balanced' is free and often enough.

requirements.txt / environment.yml).

1. Pick a Kaggle classification dataset. Build a baseline (predict majority class). Beat it with a simple logistic regression.
2. Add 5-fold cross-validation; report mean ± std accuracy AND F1.
3. Diagnose bias vs variance with a learning curve (train + val score as data size grows).
4. Tune one hyperparameter with RandomizedSearchCV and confirm the test-set score didn't change just from luck.