Scikit-Learn in Practice: End-to-End Machine Learning Pipelines

Scikit-Learn (sklearn) is one of the most widely used Python libraries for classical ML. It provides a consistent API for the entire workflow: preprocessing, model training, evaluation, and tuning. Most ML projects follow the same pipeline pattern.

The sklearn Pipeline

Load Data → Preprocess → Split → Train → Evaluate → Tune → Deploy

Preprocessing Tools

| Transformer | Purpose | When to Use |
|---|---|---|
| StandardScaler | Zero mean, unit variance | Distance-based models (KNN, SVM) |
| MinMaxScaler | Scale to [0, 1] range | Neural networks |
| LabelEncoder | Encode labels as integers | Target variable encoding |
| OneHotEncoder | Binary columns per category | Categorical features |
| SimpleImputer | Fill missing values | Handling NaN values |
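
To make the table concrete, here is a minimal sketch that applies four of these transformers to tiny made-up arrays. The data and variable names (ages, plans) are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy numeric column with one missing value (made-up data).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# SimpleImputer: replace NaN with the column mean of the observed values.
ages_filled = SimpleImputer(strategy="mean").fit_transform(ages)

# StandardScaler: shift and scale to zero mean, unit variance.
ages_std = StandardScaler().fit_transform(ages_filled)

# MinMaxScaler: rescale linearly to the [0, 1] range.
ages_01 = MinMaxScaler().fit_transform(ages_filled)

# OneHotEncoder: one binary column per category.
# The result is sparse by default, so convert it to a dense array to print it.
plans = np.array([["free"], ["pro"], ["free"], ["team"]])
plans_onehot = OneHotEncoder().fit_transform(plans).toarray()

print(ages_filled.ravel())
print(ages_std.mean(), ages_std.std())
print(ages_01.ravel())
print(plans_onehot)
```

In a real project these calls would be fitted on training data only and wrapped in a Pipeline; they are applied directly here just to show what each transformer does to its input.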

Common Algorithms

| Algorithm | Type | Best For |
|---|---|---|
| DecisionTree | Both | Interpretable rules |
| RandomForest | Both | General-purpose, robust |
| SVM | Both (SVC / SVR) | High-dimensional data |
| KNN | Both | Small datasets, simple patterns |
| GradientBoosting | Both | Competitions, often the strongest accuracy on tabular data |

The Consistent API

Every sklearn estimator follows the same pattern:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()          # 1. Instantiate
model.fit(X_train, y_train)               # 2. Train
predictions = model.predict(X_test)       # 3. Predict
score = model.score(X_test, y_test)       # 4. Evaluate
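
As a self-contained illustration of that pattern, the sketch below runs the same four steps over three different classifiers. The synthetic data from make_classification and the choice of candidate models are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Three different algorithms, one identical calling convention.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # train
    predictions = model.predict(X_test)         # predict
    scores[name] = model.score(X_test, y_test)  # evaluate (accuracy for classifiers)
    print(f"{name}: accuracy={scores[name]:.3f}")
```

Swapping an algorithm only changes the constructor line; the loop body stays the same.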

Hyperparameter Tuning with GridSearchCV

Instead of manually testing parameters, GridSearchCV exhaustively searches all combinations with cross-validation:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)   # Best combination found on this grid
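
The cost of an exhaustive search grows quickly, so it helps to count the fits before launching one. The following sketch uses the iris dataset and a deliberately small grid (both are assumptions for the example) to show how the number of candidates and fits is determined.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

X, y = load_iris(return_X_y=True)

# A small grid so the example runs quickly.
param_grid = {"max_depth": [3, 5], "n_estimators": [10, 25]}
n_folds = 5

# Every combination is one candidate: 2 * 2 = 4 here.
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold during the search.
n_cv_fits = n_candidates * n_folds

grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=n_folds, scoring="accuracy"
)
grid.fit(X, y)

print(n_candidates, n_cv_fits)
print(grid.best_params_, round(grid.best_score_, 3))
```

With the default refit=True, the best candidate is trained once more on all the data passed to fit and exposed as grid.best_estimator_.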

Model Persistence

Save trained models with joblib for deployment:

import joblib
joblib.dump(model, 'model.pkl')           # Save
loaded_model = joblib.load('model.pkl')    # Load

Detailed Theory

scikit-learn (sklearn) is the Swiss-army knife of classical ML in Python. Every model exposes the same three methods (fit, predict, score), so once you've trained a logistic regression, you've trained them all. The skill is in the *plumbing*: pipelines, encoding, validation, and avoiding leakage.

What sklearn Actually Is

A library of:

  • Estimators: anything with fit(X, y), e.g. LinearRegression, RandomForestClassifier, KMeans.
  • Transformers: anything with fit(X) + transform(X), e.g. StandardScaler, OneHotEncoder.
  • Pipelines: chain transformers + an estimator into one object.
  • Model selection helpers: train_test_split, cross_val_score, GridSearchCV.
Inputs are NumPy arrays or pandas DataFrames; X is 2D (rows = samples, cols = features), y is 1D.
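
The 2D requirement on X is a common stumbling block when there is only one feature. A minimal sketch, using made-up numbers (hours and exam_scores are illustrative names), shows the reshape that is needed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One feature stored as a flat 1D array (made-up data: y = 50 + 2 * hours).
hours = np.array([1.0, 2.0, 3.0, 4.0])
exam_scores = np.array([52.0, 54.0, 56.0, 58.0])

# Passing a 1D X is rejected: sklearn expects shape (n_samples, n_features).
try:
    LinearRegression().fit(hours, exam_scores)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# reshape(-1, 1) turns shape (4,) into (4, 1): four samples, one feature.
X = hours.reshape(-1, 1)
model = LinearRegression().fit(X, exam_scores)

# New inputs must be 2D as well, even for a single row.
pred = model.predict(np.array([[5.0]]))
print(rejected_1d, X.shape, exam_scores.shape, pred)
```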

The 5-Line Workflow

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print(model.score(X_te, y_te))

That's a working baseline. Everything else is making it *honest* and *reproducible*.

Beginner Mistakes to Skip

1. Fitting the scaler on full data. StandardScaler().fit(X) before splitting = leakage. Fit on X_train only.
2. One-hot encoding by hand in a notebook cell, using only the categories seen in train. The test set breaks as soon as it contains a category the encoder never saw. Use OneHotEncoder(handle_unknown='ignore') inside a Pipeline so unseen categories are handled.
3. No random_state. Results change every run; debugging becomes impossible.
4. Tree models with StandardScaler. Trees don't need scaling, so it's wasted work.
5. Calling .score() on the test set repeatedly while tuning. Use cross-validation on training data; touch the test set once.
6. fit_transform on test data. It's transform only on test, always.
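
Mistakes 1 and 6 in the list above come down to the same rule: statistics are learned from the training split and only reused on the test split. A minimal sketch on synthetic data (the random data and variable names are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features centred around 10 (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y = (X[:, 0] > 10.0).astype(int)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# Train columns have mean ~0 by construction; test columns are only close to 0,
# because they were scaled with statistics they did not contribute to.
print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```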

Intermediate: Pipelines (the Single Best Habit)

A Pipeline chains transformers + a final estimator. fit calls fit_transform on each transformer and then fit on the final estimator; predict calls transform on each transformer and then predict. No leakage, no boilerplate.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr).score(X_te, y_te)

Now cross-validation, grid-search, and pickling all work on the *whole* pipeline.
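
To see what the pipeline is doing on your behalf, the sketch below compares it with the equivalent manual steps. The breast-cancer dataset is an assumption chosen only because it ships with sklearn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pipeline version: one object, one fit call.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: the same steps written out by hand.
scaler = StandardScaler().fit(X_tr)                   # fit on train only
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))     # transform, then predict

print(np.array_equal(pipe_pred, manual_pred))
```

The two versions agree, but only the pipeline can be handed to cross_val_score or joblib as a single unit.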

Intermediate: ColumnTransformer for Mixed Types

Real datasets have numeric + categorical + text columns. ColumnTransformer applies different transforms to different columns:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('sc', StandardScaler()),
])
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh', OneHotEncoder(handle_unknown='ignore')),
])

pre = ColumnTransformer([
    ('num', num_pipe, ['age', 'income']),
    ('cat', cat_pipe, ['country', 'plan']),
])
model = Pipeline([('pre', pre), ('clf', RandomForestClassifier())])

This is the production-grade preprocessing template; memorise it.

Intermediate: The Big-5 Algorithms You Should Know

| Model | Use when |
|---|---|
| LogisticRegression | linear baseline classifier, interpretable |
| LinearRegression / Ridge / Lasso | regression baselines |
| RandomForestClassifier/Regressor | strong tabular baseline, no scaling needed |
| GradientBoostingClassifier / xgboost / lightgbm | best-in-class tabular |
| KMeans / DBSCAN | unsupervised clustering |

Rule of thumb: start with a logistic/linear baseline, then move to gradient-boosted trees if you need more accuracy.

Intermediate: Cross-Validation Done Right

from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
print(f"F1 = {scores.mean():.3f} ± {scores.std():.3f}")

Note: X, y is the *full* training set. The pipeline is re-fit per fold, so there is no leakage.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

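# Tune the RandomForest pipeline (`model` from the ColumnTransformer section);
# the LogisticRegression `pipe` has no n_estimators or max_depth to tune.
params = {
    'clf__n_estimators': [100, 300, 500],
    'clf__max_depth': [None, 6, 12],
    'clf__min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(model, params, n_iter=20, cv=cv, scoring='f1', random_state=42)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)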

Notice the clf__ prefix: it tells the search which step in the pipeline owns the param.
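
If you are unsure which prefixed names are valid, get_params() on the pipeline lists them. The sketch below assumes a small imputer + random forest pipeline (rf_pipe, with step names imp and clf) and synthetic data, all invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

rf_pipe = Pipeline([
    ("imp", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Step parameters are exposed as <step name>__<parameter name>.
valid_names = set(rf_pipe.get_params().keys())
print("clf__max_depth" in valid_names)   # prefixed name is tunable
print("max_depth" in valid_names)        # bare name is not a pipeline parameter

params = {
    "clf__n_estimators": [10, 20, 30],
    "clf__max_depth": [None, 3, 6],
}
cv = StratifiedKFold(3, shuffle=True, random_state=42)

# n_iter=4 samples 4 of the 9 possible combinations.
search = RandomizedSearchCV(
    rf_pipe, params, n_iter=4, cv=cv, scoring="f1", random_state=42
)
search.fit(X, y)
print(search.best_params_)
```

The same naming works for preprocessing steps, so a key such as imp__strategy could be searched alongside the model's own parameters.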

Advanced: Feature Importance & Interpretability

import pandas as pd
imp = pd.Series(model.named_steps['clf'].feature_importances_,
                index=model.named_steps['pre'].get_feature_names_out())
imp.nlargest(15).plot.barh()

For non-tree models, use permutation_importance (model-agnostic) or SHAP (per-prediction explanations). Watch for *suspiciously* important features; they are often a sign of data leakage.
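
A minimal sketch of permutation_importance on a non-tree model follows. The synthetic data is an assumption built so that only the first two columns carry signal (shuffle=False keeps the informative features first), which makes the result easy to sanity-check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Columns 0 and 1 are informative; columns 2 to 5 are pure noise.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, shuffle=False, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = Pipeline([
    ("sc", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}")

top_feature = int(np.argmax(result.importances_mean))
```

Computing the importances on a validation split rather than the training data keeps them tied to generalisation rather than memorisation.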

Advanced: Calibration & Threshold Tuning

A classifier's predict_proba may not give well-calibrated probabilities. CalibratedClassifierCV can correct this. The default 0.5 threshold is also rarely optimal, especially with imbalanced classes:

from sklearn.metrics import precision_recall_curve
prob = model.predict_proba(X_val)[:, 1]
p, r, t = precision_recall_curve(y_val, prob)
# pick threshold where F1 / business cost is best
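
One way to finish that last step is to compute F1 at every candidate threshold and take the maximum. This is a sketch under assumptions: the data is synthetic and imbalanced, and F1 stands in for whatever business cost you actually care about.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 10% positives (illustrative only).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_val)[:, 1]

p, r, t = precision_recall_curve(y_val, prob)

# p and r have one more entry than t: the final point (p=1, r=0) has no threshold.
# The clip avoids division by zero where precision and recall are both 0.
f1 = 2 * p[:-1] * r[:-1] / np.clip(p[:-1] + r[:-1], 1e-12, None)
best_threshold = t[np.argmax(f1)]

# Entry i of the curve describes predictions made with `prob >= t[i]`.
pred_default = (prob >= 0.5).astype(int)
pred_tuned = (prob >= best_threshold).astype(int)

f1_default = f1_score(y_val, pred_default)
f1_tuned = f1_score(y_val, pred_tuned)
print(f"threshold={best_threshold:.3f}  F1@0.5={f1_default:.3f}  F1@tuned={f1_tuned:.3f}")
```

The threshold is chosen on a validation split here; the reported number is therefore optimistic, and the final check belongs on the untouched test set.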

Advanced: Persistence & Deployment

import joblib
joblib.dump(model, 'model.joblib')   # whole pipeline, including preprocessing
model = joblib.load('model.joblib')

Because the pipeline embeds preprocessing, the production code becomes one line: model.predict(new_df). Pin the sklearn version, because pickled models are version-sensitive.
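
One lightweight way to act on the version warning is to store the training-time version next to the model and compare it on load. The bundle dictionary and its keys below are a hypothetical convention for this sketch, not a joblib or sklearn feature.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = Pipeline([
    ("sc", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Hypothetical bundle layout: the fitted pipeline plus the version that produced it.
bundle = {"model": model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

# Refuse to serve predictions if the runtime version differs from training.
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError(
        f"Model trained with sklearn {loaded['sklearn_version']}, "
        f"running {sklearn.__version__}"
    )

same_predictions = np.array_equal(loaded["model"].predict(X), model.predict(X))
print(same_predictions)
```

Pinning the version in your dependency file remains the primary safeguard; the check above only makes a mismatch fail loudly instead of silently.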

Advanced: Common Pitfalls

  • CV inside CV when tuning + reporting: use cross_val_score *over* a GridSearchCV ("nested CV") for an unbiased estimate.
  • Class-imbalance + accuracy: use scoring='f1' / 'roc_auc' and stratify=y.
  • Time-series: use TimeSeriesSplit, never random shuffles.
  • Categorical with many levels: use TargetEncoder or hashing instead of one-hot.
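
The first and third pitfalls in the list above can be sketched in a few lines. The data, the grid over C, and the fold counts are assumptions picked to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, TimeSeriesSplit, cross_val_score,
)

X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Nested CV: the inner loop tunes, the outer loop scores the whole tuning procedure.
inner_cv = StratifiedKFold(3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="f1",
)
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(f"nested F1 = {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

# TimeSeriesSplit: every training index precedes every test index in each fold.
time_ordered = np.arange(12).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=3).split(time_ordered))
for train_idx, test_idx in splits:
    print("train up to", train_idx.max(), "| test", test_idx.min(), "to", test_idx.max())
```

Each outer fold runs its own grid search on its training portion, so the reported score never comes from data that was used to pick C.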

Practice Path

1. Build a single Pipeline that imputes, scales numerics, one-hot encodes categoricals, then trains a logistic regression on a public dataset.
2. Wrap it in 5-fold cross-validation; report F1 ± std.
3. Run RandomizedSearchCV over 3 hyperparameters and check the best params on the held-out test set.
4. Save the pipeline with joblib and reload it; confirm a fresh row gives the same prediction.