Scikit-Learn (sklearn) is one of the most widely used Python ML libraries, providing a consistent API for the entire ML workflow: preprocessing, model training, evaluation, and tuning. Most ML projects follow the same pipeline pattern:
Load Data → Preprocess → Split → Train → Evaluate → Tune → Deploy

| Transformer | Purpose | When to Use |
|---|---|---|
| StandardScaler | Zero mean, unit variance | Distance-based models (KNN, SVM) |
| MinMaxScaler | Scale to [0, 1] range | Neural networks |
| LabelEncoder | Encode labels as integers | Target variable encoding |
| OneHotEncoder | Binary columns per category | Categorical features |
| SimpleImputer | Fill missing values | Handling NaN values |

Common Algorithms

| Algorithm | Type | Best For |
|---|---|---|
| DecisionTree | Both | Interpretable rules |
| RandomForest | Both | General-purpose, robust |
| SVM | Classification | High-dimensional data |
| KNN | Both | Small datasets, simple patterns |
| GradientBoosting | Both | Competitions, best accuracy |

Every sklearn estimator follows the same pattern:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier() # 1. Instantiate
model.fit(X_train, y_train) # 2. Train
predictions = model.predict(X_test) # 3. Predict
score = model.score(X_test, y_test) # 4. Evaluate
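
To make the shared instantiate/fit/predict/score pattern concrete, here is a minimal sketch that runs the same four steps over three different classifiers. The iris dataset and the particular estimators are illustrative choices, not something the text above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: any (X, y) pair with 2D X and 1D y would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Three unrelated algorithms, one identical calling convention.
estimators = {
    "random_forest": RandomForestClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, est in estimators.items():
    est.fit(X_train, y_train)                 # train
    predictions = est.predict(X_test)         # predict
    scores[name] = est.score(X_test, y_test)  # evaluate (mean accuracy)
    print(f"{name}: {scores[name]:.3f}")
```

Swapping an algorithm means changing one line in the dictionary; the loop body stays the same.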
Instead of manually testing parameters, GridSearchCV exhaustively searches all combinations with cross-validation:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_) # Optimal hyperparameters

Save trained models with joblib for deployment:
import joblib
joblib.dump(model, 'model.pkl') # Save
loaded_model = joblib.load('model.pkl') # Load

scikit-learn (sklearn) is the Swiss-army knife of classical ML in Python. Every supervised model exposes the same three methods (fit, predict, score), so once you've trained a logistic regression, you've trained them all. The skill is in the *plumbing*: pipelines, encoding, validation, and avoiding leakage.
A library of:
- Estimators with fit(X, y): LinearRegression, RandomForestClassifier, KMeans.
- Transformers with fit(X) + transform(X): StandardScaler, OneHotEncoder.
- Model-selection utilities: train_test_split, cross_val_score, GridSearchCV.

Data convention: X is 2D (rows = samples, cols = features), y is 1D.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
That's a working baseline. Everything else is making it *honest* and *reproducible*.
1. Fitting the scaler on full data. StandardScaler().fit(X) before splitting = leakage. Fit on X_train only.
2. One-hot encoding in a notebook cell when some categories exist only in train. The test matrix ends up with different columns and the model fails. Use OneHotEncoder(handle_unknown='ignore') inside a Pipeline so unseen categories are handled.
3. No random_state. Results change every run; debugging impossible.
4. Tree models with StandardScaler. Trees don't need scaling, so it's wasted work.
5. Calling .score() on the test set repeatedly while tuning. Use cross-validation on training data; touch test once.
6. fit_transform on test data. It's transform only on test, always.
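
Mistakes 1 and 6 come down to one rule: statistics are learned from the training rows only and then reused on everything else. A minimal sketch of the leakage-free version follows; the breast cancer dataset is an illustrative choice, not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split FIRST, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics; never re-fit

# Training columns are centred by construction; test columns are only
# approximately centred, which is what an honest evaluation looks like.
print("train column means ~ 0:", np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print("largest |test column mean|:", np.abs(X_te_scaled.mean(axis=0)).max())
```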
A Pipeline chains transformers + a final estimator. fit calls fit_transform on each step then fit on the model; predict calls transform then predict. No leakage, no boilerplate.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr).score(X_te, y_te)
Now cross-validation, grid-search, and pickling all work on the *whole* pipeline.
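
To see what the Pipeline does on fit and predict, the sketch below compares it against the same steps written by hand. It assumes the breast cancer dataset purely for illustration; the step names match the ones used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pipeline version: one object, one fit, one predict.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version of the same steps.
scaler = StandardScaler()
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.fit_transform(X_tr), y_tr)          # fit: fit_transform, then fit
manual_pred = clf.predict(scaler.transform(X_te))  # predict: transform, then predict

same = np.array_equal(pipe_pred, manual_pred)
print("pipeline matches manual steps:", same)
```

The manual version works, but it leaves two objects to keep in sync; the Pipeline keeps them together so the scaler cannot accidentally be fit on the wrong data.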
Real datasets have numeric + categorical + text columns. ColumnTransformer applies different transforms to different columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())])
cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
('oh', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([
('num', num_pipe, ['age', 'income']),
('cat', cat_pipe, ['country', 'plan']),
])
model = Pipeline([('pre', pre), ('clf', RandomForestClassifier())])
This is the production-grade preprocessing template; memorise it.
| Model | Use when |
|---|---|
| LogisticRegression | linear baseline classifier, interpretable |
| LinearRegression / Ridge / Lasso | regression baselines |
| RandomForestClassifier/Regressor | strong tabular baseline, no scaling needed |
| GradientBoostingClassifier / xgboost / lightgbm | best-in-class tabular |
| KMeans / DBSCAN | unsupervised clustering |

Rule of thumb: start logistic/linear, jump to gradient-boosted trees if you need accuracy.
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
print(f"F1 = {scores.mean():.3f} ± {scores.std():.3f}")

Note: X, y is the *full* training set. The pipeline is re-fit per fold, so no leakage.
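from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# These are RandomForest hyperparameters, so search over `model` (the Pipeline
# whose 'clf' step is a RandomForestClassifier), not the LogisticRegression `pipe`.
params = {
    'clf__n_estimators': [100, 300, 500],
    'clf__max_depth': [None, 6, 12],
    'clf__min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(model, params, n_iter=20, cv=cv, scoring='f1', random_state=42)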
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
Notice the clf__ prefix: it tells the search which step in the pipeline owns the param.
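
The prefix convention can be inspected directly. This small sketch lists the parameter names a pipeline exposes and sets two of them through set_params, which is the same mechanism the search uses; the two-step pipeline here is an illustrative stand-in.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Every tunable parameter is addressed as <step name>__<parameter name>.
tunable = sorted(name for name in pipe.get_params() if name.startswith('clf__'))
print(tunable[:5])

# Setting a prefixed parameter forwards it to the matching step.
pipe.set_params(clf__n_estimators=300, clf__max_depth=6)
print(pipe.named_steps['clf'].n_estimators, pipe.named_steps['clf'].max_depth)
```

A misspelled step or parameter name raises an error in set_params, which is a quick way to validate a search grid before launching a long run.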
import pandas as pd
imp = pd.Series(model.named_steps['clf'].feature_importances_,
index=model.named_steps['pre'].get_feature_names_out())
imp.nlargest(15).plot.barh()

For non-tree models, use permutation_importance (model-agnostic) or SHAP (per-prediction explanations). Watch for *suspiciously* important features; they are often a sign of data leakage.
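
For a model without feature_importances_, a permutation-based ranking might look like the sketch below. It assumes a scaled logistic regression on the breast cancer dataset as an illustrative setup, and scores on a held-out split so the ranking reflects generalisation rather than memorisation.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_val, y_tr, y_val = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# Shuffle one column at a time and measure how much the F1 score drops.
result = permutation_importance(
    pipe, X_val, y_val, n_repeats=5, scoring='f1', random_state=42
)

imp = pd.Series(result.importances_mean, index=data.feature_names)
print(imp.sort_values(ascending=False).head(5))
```

Because the whole pipeline is passed in, the columns are permuted in their raw form and the ranking refers to the original feature names.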
A classifier's predict_proba may not give honest probabilities. CalibratedClassifierCV fixes this. And the default 0.5 threshold is rarely optimal:
from sklearn.metrics import precision_recall_curve
prob = model.predict_proba(X_val)[:, 1]
p, r, t = precision_recall_curve(y_val, prob)
# pick threshold where F1 / business cost is best

import joblib
joblib.dump(model, 'model.joblib') # whole pipeline, including preprocessing
model = joblib.load('model.joblib')

Because the pipeline embeds preprocessing, the production code becomes one line: model.predict(new_df). Pin the sklearn version, since pickled models are version-sensitive.
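
A quick round-trip check is a cheap safeguard before shipping a saved pipeline. The sketch below assumes the iris dataset and a temporary directory for illustration, and prints the installed sklearn version so it can be recorded next to the artifact.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Save and reload in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.joblib')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly.
same = np.array_equal(pipe.predict(X[:5]), restored.predict(X[:5]))
print("round trip OK:", same)
print("trained with scikit-learn", sklearn.__version__)
```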
- Run cross_val_score *over* a GridSearchCV ("nested CV") for an unbiased estimate.
- For imbalanced classes, use scoring='f1' / 'roc_auc' and stratify=y.
- For time-ordered data, use TimeSeriesSplit, never random shuffles.
- For high-cardinality categoricals, use TargetEncoder or hashing instead of one-hot.

1. Build a single Pipeline that imputes, scales numerics, one-hot encodes categoricals, then trains a logistic regression on a public dataset.
2. Wrap it in 5-fold cross-validation; report F1 ± std.
3. Run RandomizedSearchCV over 3 hyperparameters and evaluate the best estimator once on the held-out test set.
4. Save the pipeline with joblib and reload it; confirm a fresh row gives the same prediction.
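
The nested CV idea mentioned earlier (cross_val_score over a GridSearchCV) can be sketched as follows. The dataset, the grid over clf__C, and the fold counts are illustrative assumptions; the point is only that the inner loop tunes while the outer loop scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Inner loop: picks hyperparameters using only the outer-training rows.
inner_cv = StratifiedKFold(3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=inner_cv, scoring='f1')

# Outer loop: scores the *tuning procedure* on rows it never saw.
outer_cv = StratifiedKFold(5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='f1')

print(f"nested F1 = {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```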