CampusCrate is a student operating system for B.Tech students in India, combining opportunities, academic resources, communities, learning hubs, and career tools in one platform.

Who is CampusCrate built for?

CampusCrate is built for Indian engineering and B.Tech students who need a structured place to discover opportunities, access college resources, join societies, learn technical skills, and build their career profile.

What can students find on CampusCrate?

Students can find hackathons, internships, competitions, notes, test papers, cheatsheets, college communities, DSA and development learning tracks, roadmaps, profiles, and AI career tools.

Learn Code and Practice

Python for Data — Pandas, NumPy & Data Cleaning

Python is the #1 programming language for data analytics. With Pandas for data manipulation and NumPy for numerical computing, you can clean, transform, and analyze datasets of any size.

Pandas — The Data Analysis Workhorse

Operation	Code	Purpose
Load CSV	`pd.read_csv('data.csv')`	Import data from file
Preview	`df.head()`, `df.info()`, `df.describe()`	Quick data overview
Select columns	`df['col']` or `df[['col1','col2']]`	Column access
Filter rows	`df[df['salary'] > 50000]`	Boolean indexing
Loc / Iloc	`df.loc[0:5, 'name']` / `df.iloc[0:5, 0]`	Label vs position indexing
Group & Aggregate	`df.groupby('dept')['salary'].mean()`	Summary stats by group
Merge	`pd.merge(df1, df2, on='id')`	SQL-like joins
Pivot Table	`df.pivot_table(values, index, columns, aggfunc)`	Cross-tabulation

Data Cleaning Essentials

# Check for missing values
df.isnull().sum()
# Fill missing values
df['salary'].fillna(df['salary'].median(), inplace=True)
# Drop rows with any NaN
df.dropna(subset=['email'], inplace=True)
# Remove duplicates
df.drop_duplicates(subset=['employee_id'], inplace=True)# Fix data types
df['hire_date'] = pd.to_datetime(df['hire_date'])
df['salary'] = df['salary'].astype(float)

NumPy — Fast Numerical Computing

Feature	Example	Result
Create array	`np.array([1, 2, 3])`	1D array
Zeros/Ones	`np.zeros((3, 4))`	3×4 matrix of zeros
Vectorized math	`arr * 2 + 1`	Applies to every element
Statistics	`np.mean(arr)`, `np.std(arr)`	Mean, standard deviation
Broadcasting	`matrix + np.array([1,2,3])`	Add row to every row
Boolean mask	`arr[arr > 5]`	Filter elements

Data Cleaning Workflow

1. Load → pd.read_csv() 2. Inspect → info(), describe(), isnull().sum() 3. Clean → Handle NULLs, fix types, remove duplicates 4. Transform → Create new columns, merge tables, pivot 5. Analyze → groupby(), aggregations, statistical tests 6. Export → to_csv(), to_excel()

Python is the analyst's swiss-army knife. With Pandas and NumPy you can load CSVs, clean them, reshape them, run statistics, and produce charts — all in a notebook, all reproducible. Excel handles 100k rows; Pandas handles 10M without breaking a sweat.

What Pandas Actually Is

NumPy gives you fast typed arrays (ndarray). Pandas wraps those arrays with row/column labels and gives you two structures:

Series — one labeled column.
DataFrame — a dictionary of Series sharing an index. Think of it as a programmable Excel sheet.

import pandas as pd
df = pd.read_csv('sales.csv')   # load
df.head()                       # peek first 5 rows
df.info()                       # types + null counts
df.describe()                   # summary stats for numeric cols
df.shape                        # (rows, cols)

Almost every analysis starts with these four lines.

Selecting & Filtering

df['revenue']                            # one column → Series
df[['name', 'revenue']]                  # multiple columns → DataFrame
df.loc[df['country'] == 'IN']            # rows where country = IN
df.loc[df['revenue'] > 1000, ['name', 'revenue']]
df.iloc[0:5, :3]                         # first 5 rows, first 3 cols

Rule of thumb: loc uses labels (names, conditions) and is inclusive on both ends; iloc uses integer positions and is exclusive on the end (like Python slicing). Mixing them causes bugs — pick one consciously.

Beginner Mistakes to Skip

1. for looping over rows with iterrows. 100× slower than vectorised ops. df['bonus'] = df['salary'] * 0.1 is the fast way. 2. Chained assignment — df[df.x > 0]['y'] = 5 produces the famous *SettingWithCopyWarning* and silently does nothing. Use df.loc[df.x > 0, 'y'] = 5. 3. Forgetting to reset the index after filtering, then being surprised df.iloc[0] is row 7. 4. Comparing with == np.nan. NaN is never equal to anything. Use .isna() / .notna(). 5. inplace=True everywhere. It's being deprecated; just reassign: df = df.dropna(). 6. Reading dirty CSVs without dtypes. A leading-zero ID becomes an int and loses zeros. Pass dtype={'id': str} or parse_dates=[...].

Intermediate: Cleaning the Data

df = df.drop_duplicates()
df = df.dropna(subset=['email'])              # drop rows missing email
df['age'] = df['age'].fillna(df['age'].median())
df['email'] = df['email'].str.strip().str.lower()
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df = df.rename(columns={'cust_nm': 'customer_name'})
df['country'] = df['country'].astype('category')   # save memory

Rule of thumb: spend 60% of your time here. Garbage in = garbage out.

Intermediate: GroupBy — Split, Apply, Combine

The most important Pandas pattern. Three steps under the hood:

1. Split rows into groups by a key. 2. Apply a function to each group (mean, sum, custom). 3. Combine the per-group results into a new frame.

df.groupby('country')['revenue'].sum().sort_values(ascending=False)df.groupby('country').agg(
    total_rev=('revenue', 'sum'),
    customers=('id', 'nunique'),
    avg_order=('revenue', 'mean'),
)

Multi-key groupby is just groupby(['country', 'category']) — you get a MultiIndex back.

Intermediate: Merging & Joining

Pandas joins behave like SQL JOINs:

orders.merge(users, on='user_id', how='left')
orders.merge(products, left_on='sku', right_on='product_sku', how='inner')
pd.concat([df_jan, df_feb], ignore_index=True)   # stack vertically

Watch for many-to-many joins exploding row counts — always check .shape before and after.

Intermediate: Reshape — pivot, melt, stack

# Wide → long
df_long = df.melt(id_vars='country', var_name='month', value_name='revenue')# Long → wide
df_wide = df_long.pivot_table(index='country', columns='month', values='revenue', aggfunc='sum')

Long format is what charting libraries (seaborn, plotly) prefer. Wide format is what humans read.

Intermediate: Apply, Map & Vectorisation

df['tier'] = pd.cut(df['revenue'], bins=[0, 100, 1000, 1e6], labels=['low', 'mid', 'high'])
df['country'] = df['country'].map({'IN': 'India', 'US': 'United States'})
df['log_rev'] = np.log1p(df['revenue'])

Reach for np.where, pd.cut, .map before resorting to apply(lambda ...). Built-in vectorised ops are 10–100× faster.

Advanced: NumPy Broadcasting

Broadcasting lets you operate on arrays of different shapes without writing loops:

import numpy as np
A = np.ones((3, 4))     # 3 rows, 4 cols
row = np.array([1, 2, 3, 4])
A + row                  # row added to every row of A

Rules: shapes are compared right-to-left; dimensions match if they're equal or one is 1; missing dimensions are treated as 1. Once you understand this, the entire scientific Python ecosystem clicks.

Advanced: Method Chaining (Tidy Pandas)

Mature Pandas reads top-to-bottom like SQL CTEs:

result = (
    df.query("status == 'paid' and revenue > 0")
      .assign(month=lambda d: d['created_at'].dt.to_period('M'))
      .groupby(['country', 'month'], as_index=False)
      .agg(revenue=('revenue', 'sum'))
      .sort_values(['country', 'month'])
)

No intermediate variables, easy to comment in/out a step.

Advanced: Time Series Helpers

df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts').sort_index()
df['rev'].resample('D').sum()                 # daily totals
df['rev'].rolling('7D').mean()                 # 7-day moving average
df['rev'].pct_change().shift(-1)               # next-day return

These turn time-series work into one-liners.

Advanced: Performance & Memory

Convert low-cardinality strings to category — saves 80–90% memory.
Use pd.read_csv(..., usecols=..., dtype=...) to read only what you need.
For >1M rows or >1GB files, switch to DuckDB or Polars — same DataFrame mental model, 10–100× faster.
Profile with %timeit, %prun, and memory_profiler in Jupyter.

Advanced: Visualisation Hooks

df.plot(x='month', y='revenue', kind='line')         # quick check
import seaborn as sns
sns.boxplot(data=df, x='country', y='revenue')         # publication
import plotly.express as px
px.scatter(df, x='age', y='revenue', color='country')  # interactive

For reports stick with seaborn; for dashboards, plotly; for quick eyeballing, df.plot.

Practice Path

1. Load a real CSV with pd.read_csv, set proper dtypes, run info() and fix any wrong types. 2. Clean the data: drop duplicates, fill NaNs sensibly, normalise email + country columns. 3. Build a groupby that returns total revenue and unique customers per country, sorted desc. 4. Convert the same pipeline into a single method-chained block; add a 7-day rolling mean over a time index.