Scikit-Learn in Practice — End-to-End Machine Learning Pipelines

Scikit-Learn (sklearn) is one of the most widely used Python ML libraries. It provides a consistent API for the whole ML workflow: preprocessing, model training, evaluation, and tuning. Most supervised ML projects follow the same pipeline pattern.

The sklearn Pipeline

Load Data → Preprocess → Split → Train → Evaluate → Tune → Deploy

Preprocessing Tools

Transformer	Purpose	When to Use
StandardScaler	Zero mean, unit variance	Distance-based models (KNN, SVM)
MinMaxScaler	Scale to [0, 1] range	Neural networks
LabelEncoder	Encode labels as integers	Target variable encoding
OneHotEncoder	Binary columns per category	Categorical features
SimpleImputer	Fill missing values	Handling NaN values

Common Algorithms

Algorithm	Type	Best For
DecisionTree	Both	Interpretable rules
RandomForest	Both	General-purpose, robust
SVM	Both (SVC, SVR)	High-dimensional data
KNN	Both	Small datasets, simple patterns
GradientBoosting	Both	Tabular data and competitions, often top accuracy

The Consistent API

Every sklearn estimator follows the same pattern:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()    # 1. Instantiate
model.fit(X_train, y_train)         # 2. Train
predictions = model.predict(X_test)  # 3. Predict
score = model.score(X_test, y_test)  # 4. Evaluate
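To make the shared pattern concrete, here is a minimal runnable sketch. It assumes the built-in iris dataset and an 80/20 split, neither of which comes from the text above; the point is only that two unrelated model families are driven by identical calls.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: a small built-in dataset so the example runs without external files.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Two different model families, used through the same three calls.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                  # train
    predictions = model.predict(X_test)          # predict
    scores[name] = model.score(X_test, y_test)   # evaluate (mean accuracy for classifiers)
    print(f"{name}: accuracy={scores[name]:.3f}")
```

Swapping in another classifier only changes the dictionary entry; the loop body stays the same.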

Hyperparameter Tuning with GridSearchCV

Instead of manually testing parameters, GridSearchCV exhaustively searches all combinations with cross-validation:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)   # Best combination found on this grid
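A self-contained version of the same idea might look like the sketch below. The iris dataset and the deliberately small grid are assumptions chosen to keep the run short; they are not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: a tiny 2 x 2 grid; every combination is cross-validated 5 times.
param_grid = {"max_depth": [3, 5], "n_estimators": [25, 50]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train, y_train)

n_candidates = len(grid.cv_results_["params"])   # number of combinations tried
print(n_candidates, grid.best_params_, round(grid.best_score_, 3))

# With the default refit=True, grid now behaves like the best model refit on all of X_train.
print(round(grid.score(X_test, y_test), 3))
```

The cost grows multiplicatively: combinations times folds, plus one final refit.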

Model Persistence

Save trained models with joblib for deployment:

import joblib
joblib.dump(model, 'model.pkl')           # Save
loaded_model = joblib.load('model.pkl')    # Load

scikit-learn (sklearn) is the Swiss-army knife of classical ML in Python. Supervised models expose the same three methods (fit, predict, score), so once you've trained a logistic regression, you know how to train the rest. The skill is in the *plumbing*: pipelines, encoding, validation, and avoiding leakage.

What sklearn Actually Is

A library of:

Estimators — anything with fit(X, y): LinearRegression, RandomForestClassifier, KMeans.
Transformers — anything with fit(X) + transform(X): StandardScaler, OneHotEncoder.
Pipelines — chain transformers + an estimator into one object.
Model selection helpers — train_test_split, cross_val_score, GridSearchCV.

All input is NumPy arrays / DataFrames; X is 2D (rows = samples, cols = features), y is 1D.
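The shape convention and the transformer/estimator split can be checked on a tiny hand-made array. The numbers below are made up for illustration (y is exactly 2x), so the fitted line is easy to verify by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# X must be 2D even with a single feature: 4 samples, 1 column.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
# y is 1D: one target per sample.
y = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit learns the column mean and std, transform applies them.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimator: fit learns the mapping from (X, y), predict applies it.
reg = LinearRegression().fit(X_scaled, y)
pred = reg.predict(scaler.transform(np.array([[5.0]])))

print(X.shape, y.shape)
print(pred)   # close to 10.0, since y = 2x in this toy data
```

A 1D feature array can be made 2D with `array.reshape(-1, 1)` before passing it as X.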

The 5-Line Workflow

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print(model.score(X_te, y_te))

That's a working baseline. Everything else is making it *honest* and *reproducible*.

Beginner Mistakes to Skip

1. Fitting the scaler on the full data. StandardScaler().fit(X) before splitting leaks test-set statistics into training. Fit on X_train only.
2. One-hot encoding by hand in a notebook cell. When a category appears in only one of train and test, the encoded columns no longer line up and the test step fails. Put OneHotEncoder(handle_unknown='ignore') inside a Pipeline so unseen categories are handled.
3. No random_state. Results change on every run, which makes debugging nearly impossible.
4. Tree models with StandardScaler. Trees don't need scaling, so it's wasted work.
5. Calling .score() on the test set repeatedly while tuning. Use cross-validation on the training data and touch the test set once.
6. fit_transform on test data. Use transform only on test data, always.
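Two of the mistakes above (fitting on the wrong rows, and unseen categories) can be reproduced in a few lines. This is a sketch on made-up arrays; the country codes are hypothetical values, not data from the article.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scaler: learn statistics from the training rows only.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)       # mean and std come from train only
X_test_scaled = scaler.transform(X_test)     # transform, never fit_transform, on test
print(scaler.mean_, X_test_scaled)

# Encoder: a category that appears only at prediction time.
train_cats = np.array([["IN"], ["US"]])
test_cats = np.array([["FR"]])               # never seen during fit

strict = OneHotEncoder().fit(train_cats)     # default is handle_unknown='error'
try:
    strict.transform(test_cats)
    raised = False
except ValueError:
    raised = True

tolerant = OneHotEncoder(handle_unknown="ignore").fit(train_cats)
encoded = tolerant.transform(test_cats).toarray()   # unseen category becomes an all-zero row
print(raised, encoded)
```

With `handle_unknown='ignore'` the unseen value is silently encoded as zeros, which keeps prediction running but is worth monitoring if it happens often.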

Intermediate: Pipelines (the Single Best Habit)

A Pipeline chains transformers + a final estimator. fit calls fit_transform on each step then fit on the model; predict calls transform then predict. No leakage, no boilerplate.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr).score(X_te, y_te)

Now cross-validation, grid-search, and pickling all work on the *whole* pipeline.

Intermediate: ColumnTransformer for Mixed Types

Real datasets have numeric + categorical + text columns. ColumnTransformer applies different transforms to different columns:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric columns: fill gaps with the median, then scale.
num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())])
# Categorical columns: fill gaps with the mode, then one-hot encode.
cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                     ('oh',  OneHotEncoder(handle_unknown='ignore'))])

pre = ColumnTransformer([
    ('num', num_pipe, ['age', 'income']),
    ('cat', cat_pipe, ['country', 'plan']),
])
model = Pipeline([('pre', pre), ('clf', RandomForestClassifier())])

This is the production-grade preprocessing template — memorise it.

Intermediate: The Big-5 Algorithms You Should Know

Model	Use when
`LogisticRegression`	linear baseline classifier, interpretable
`LinearRegression` / `Ridge` / `Lasso`	regression baselines
`RandomForestClassifier/Regressor`	strong tabular baseline, no scaling needed
`GradientBoostingClassifier` / `xgboost` / `lightgbm`	often the strongest choice on tabular data (xgboost and lightgbm are separate libraries with sklearn-style APIs)
`KMeans` / `DBSCAN`	unsupervised clustering

Rule of thumb: start with a logistic or linear baseline, then move to gradient-boosted trees if you need more accuracy.

Intermediate: Cross-Validation Done Right

from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
print(f"F1 = {scores.mean():.3f} ± {scores.std():.3f}")

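Note: X, y here should be the *full* training set (X_tr, y_tr if you have held out a test set). The pipeline is re-fit inside each fold, so preprocessing statistics never leak from the validation fold.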

Advanced: Hyperparameter Search

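from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# These parameter names belong to a tree ensemble, so search over the
# RandomForest pipeline (`model` from the ColumnTransformer section),
# not the LogisticRegression `pipe`.
params = {
    'clf__n_estimators': [100, 300, 500],
    'clf__max_depth':    [None, 6, 12],
    'clf__min_samples_split': [2, 5, 10],
}
search = RandomizedSearchCV(model, params, n_iter=20, cv=cv, scoring='f1', random_state=42)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)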

Notice the clf__ prefix — it tells the search which step in the pipeline owns the param.
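The naming rule can be inspected directly with `get_params` and `set_params`. The two-step pipeline below is an assumption made for the demonstration; the step names `scaler` and `clf` are arbitrary labels chosen here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Every tunable name has the form '<step name>__<parameter name>'.
tunable = sorted(name for name in pipe.get_params() if name.startswith("clf__"))
print(tunable[:5])

# set_params routes each value to the step that owns it;
# a search object does the same thing for every candidate it tries.
pipe.set_params(clf__max_depth=6, scaler__with_mean=False)
print(pipe.named_steps["clf"].max_depth, pipe.named_steps["scaler"].with_mean)
```

Nested objects chain the prefixes, so a step inside a ColumnTransformer inside a Pipeline gets a name with several double underscores.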

Advanced: Feature Importance & Interpretability

import pandas as pd
imp = pd.Series(model.named_steps['clf'].feature_importances_,
                index=model.named_steps['pre'].get_feature_names_out())
imp.nlargest(15).plot.barh()

For non-tree models, use permutation_importance (model-agnostic) or SHAP (per-prediction explanations). Watch for *suspiciously* important features — they're usually data leakage.
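For a model without `feature_importances_`, permutation importance can be sketched as follows. The breast-cancer dataset, the 75/25 split, and `n_repeats=5` are assumptions picked to keep the example fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=42
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in score.
result = permutation_importance(pipe, X_te, y_te, n_repeats=5, random_state=42)

ranking = result.importances_mean.argsort()[::-1]   # most important first
for i in ranking[:5]:
    print(f"{data.feature_names[i]}: "
          f"{result.importances_mean[i]:.4f} ± {result.importances_std[i]:.4f}")
```

Because the whole pipeline is passed in, the importances refer to the raw input columns rather than the scaled ones. Strongly correlated features can share importance and each look weaker than they are.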

Advanced: Calibration & Threshold Tuning

A classifier's predict_proba may not give well-calibrated probabilities. CalibratedClassifierCV can correct this. The default 0.5 threshold is also rarely optimal:

from sklearn.metrics import precision_recall_curve
prob = model.predict_proba(X_val)[:, 1]
p, r, t = precision_recall_curve(y_val, prob)
# pick threshold where F1 / business cost is best
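One way to fill in the last step is to compute F1 at every candidate threshold and keep the best one. This sketch assumes a synthetic imbalanced dataset from `make_classification` and uses F1 as the criterion; a real project would plug in its own validation set and cost function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with roughly 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_val)[:, 1]

p, r, t = precision_recall_curve(y_val, prob)
# p and r have one more entry than t; drop the final (precision=1, recall=0) point.
f1 = 2 * p[:-1] * r[:-1] / np.maximum(p[:-1] + r[:-1], 1e-12)
best_threshold = t[np.argmax(f1)]

f1_default = f1_score(y_val, (prob >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (prob >= best_threshold).astype(int))
print(f"threshold={best_threshold:.3f}  F1 at 0.5={f1_default:.3f}  F1 tuned={f1_tuned:.3f}")
```

The threshold is itself fitted to the validation set, so the tuned F1 is optimistic; confirm it once on data that played no part in choosing it.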

Advanced: Persistence & Deployment

import joblib
joblib.dump(model, 'model.joblib')   # whole pipeline, including preprocessing
model = joblib.load('model.joblib')

Because the pipeline embeds preprocessing, the production code becomes one line: model.predict(new_df). Pin sklearn version — pickled models are version-sensitive.
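A round trip that also guards against version drift might look like the sketch below. Storing the version string next to the model in a dict is a convention assumed here, not something joblib or sklearn requires; the temporary directory just keeps the example from leaving files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    # Assumed convention: save the library version alongside the fitted pipeline.
    joblib.dump({"model": model, "sklearn_version": sklearn.__version__}, path)
    bundle = joblib.load(path)

# Fail loudly instead of predicting with a model trained under another version.
if bundle["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("model was saved with a different scikit-learn version")

restored = bundle["model"]
same = np.array_equal(model.predict(X[:5]), restored.predict(X[:5]))
print(same)
```

Only load joblib or pickle files from sources you trust, since loading can execute arbitrary code.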

Advanced: Common Pitfalls

CV inside CV when tuning + reporting — use cross_val_score *over* a GridSearchCV ("nested CV") for an unbiased estimate.
Class-imbalance + accuracy — use scoring='f1' / 'roc_auc' and stratify=y.
Time-series — use TimeSeriesSplit, never random shuffles.
Categorical with many levels — use TargetEncoder or hashing instead of one-hot.
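The first and third pitfalls above can be sketched together. The breast-cancer dataset, the grid over `C`, and the fold counts are assumptions for the demonstration; the second half uses a plain 0..11 index to stand in for time-ordered rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Nested CV: the inner loop picks hyperparameters, the outer loop scores the whole procedure.
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

inner = StratifiedKFold(3, shuffle=True, random_state=0)
outer = StratifiedKFold(5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=inner, scoring="f1")
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="f1")
print(f"nested F1 = {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

# Time series: every training window ends before its test window starts.
ordered = np.arange(12).reshape(-1, 1)
splits = [(train.tolist(), test.tolist())
          for train, test in TimeSeriesSplit(n_splits=3).split(ordered)]
for train, test in splits:
    print(train, "->", test)
```

The nested score estimates how well "tune, then fit" generalises; it does not by itself tell you which hyperparameters to ship, so run the search once more on all training data for that.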

Practice Path

1. Build a single Pipeline that imputes, scales numerics, one-hot encodes categoricals, then trains a logistic regression on a public dataset.
2. Wrap it in 5-fold cross-validation; report F1 ± std.
3. Run RandomizedSearchCV over 3 hyperparameters and check the best params on the held-out test set.
4. Save the pipeline with joblib and reload it; confirm a fresh row gives the same prediction.