CampusCrate is a student operating system for B.Tech students in India, combining opportunities, academic resources, communities, learning hubs, and career tools in one platform.

Who is CampusCrate built for?

CampusCrate is built for Indian engineering and B.Tech students who need a structured place to discover opportunities, access college resources, join societies, learn technical skills, and build their career profile.

What can students find on CampusCrate?

Students can find hackathons, internships, competitions, notes, test papers, cheatsheets, college communities, DSA and development learning tracks, roadmaps, profiles, and AI career tools.

Learn Code and Practice

Big Data Tools — Processing Data That Doesn't Fit in Memory

When datasets grow from gigabytes to terabytes and petabytes, traditional tools like Pandas and Excel break down. Big Data technologies enable distributed processing across clusters of machines, making it possible to analyze massive datasets in reasonable time.

The 3 Vs of Big Data

V	Description	Example
Volume	Massive data size	Terabytes of daily logs
Velocity	Speed of data generation	Real-time sensor streams
Variety	Multiple data formats	JSON, CSV, images, logs

Hadoop Ecosystem

Hadoop Stack:
┌─────────────────────────────────┐
│  Hive / Pig  (Query Layer)      │
├─────────────────────────────────┤
│  MapReduce / Spark (Processing) │
├─────────────────────────────────┤
│  YARN (Resource Management)     │
├─────────────────────────────────┤
│  HDFS (Distributed Storage)     │
└─────────────────────────────────┘

HDFS: Hadoop Distributed File System — splits files into 128MB blocks stored across nodes with replication for fault tolerance
MapReduce: Two-step processing: Map (filter/transform in parallel) → Reduce (aggregate results)
YARN: Manages cluster resources (CPU, memory) across jobs

Apache Spark — The Modern Standard

Spark is 100x faster than MapReduce for iterative tasks because it processes data in-memory instead of writing to disk between steps.

Spark Component	Purpose
RDD	Low-level resilient distributed dataset
DataFrame	Structured data with schema (like Pandas)
SparkSQL	SQL queries on distributed data
MLlib	Machine learning at scale
Streaming	Real-time data processing	When to Use What	Tool	Data Size	Use Case
Pandas	< 10 GB	Local analysis, prototyping
PySpark	10 GB – PB	Distributed processing, ETL
BigQuery	Any size	Serverless SQL analytics
Data Lake	Raw storage	Store everything, process later
Data Warehouse	Structured	Clean, query-optimized storage

"Big data" means the data no longer fits on one machine — not in RAM, not on disk, not in a single query. When that happens, you stop reaching for Pandas and start reaching for distributed engines: Spark, BigQuery, Snowflake. The mental model is the same (rows, columns, transformations) but the compute is split across a cluster.

When You Actually Need Big Data Tools

Rule of thumb based on data size:

Size	Use
<100 MB	Pandas / Excel
100 MB – 5 GB	Pandas + sampling, DuckDB, Polars
5 GB – 1 TB	DuckDB, BigQuery, Snowflake (single warehouse)
>1 TB	Spark, BigQuery, Snowflake clusters

Most analyst "big data" problems are solved by switching from Pandas to DuckDB or Polars without ever touching a cluster.

What Spark Actually Is

A distributed compute engine. Architecture:

Driver  (your code) → Cluster Manager → Executors (workers)
                                            │
                                            └─ partitions of data

Key ideas:

Data is split into partitions, processed in parallel on executors.
Operations are lazy — transformations build a DAG, nothing runs until an action (show, count, write).
The Catalyst optimiser rewrites your DataFrame query into an efficient physical plan.
Shuffles (joins, groupBys) move data across the network — the slowest part. Minimise them.

The PySpark API You'll Use Daily

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName('demo').getOrCreate()
df = spark.read.parquet('s3://bucket/orders/')(df.filter(F.col('status') == 'paid')
   .groupBy('country', F.date_trunc('month', 'created_at').alias('month'))
   .agg(F.sum('amount').alias('revenue'),
        F.countDistinct('user_id').alias('customers'))
   .orderBy('month')
   .write.mode('overwrite').parquet('s3://bucket/marts/monthly_revenue/'))

Note how similar this looks to Pandas — same verbs (filter, groupBy, agg), distributed underneath.

Beginner Mistakes to Skip

1. .collect() on huge datasets. Pulls everything to the driver — OOM crash. Use .show(), .take(n), or write to storage. 2. CSV instead of Parquet. Parquet is columnar + compressed, often 10× faster to read. 3. Lots of tiny files. Each file = a task; thousands of tiny files = scheduler hell. Coalesce to ~128MB files. 4. Joins without thinking about skew. One hot key ("unknown") on millions of rows blocks the whole job. 5. Wide transforms in tight loops. Each groupBy + join shuffles. Plan the pipeline once, run it once. 6. Mixing Python UDFs into hot paths. They serialise to Python and break Catalyst optimisation. Prefer built-in pyspark.sql.functions.

Intermediate: RDD vs DataFrame vs Dataset

RDD — low-level, untyped, no Catalyst. Almost never use directly.
DataFrame — columnar, optimised, what you write 99% of the time.
Dataset — typed DataFrame (Scala/Java only). Python is DataFrame-only.

Intermediate: Partitioning & File Layout

Good file layout = fast queries:

s3://bucket/orders/
   year=2026/
      month=04/
         country=IN/
            part-0000.snappy.parquet

Queries that filter on year=2026 AND country='IN' only scan that prefix — partition pruning. Combine with bucketing (hash splitting) on join keys to remove shuffles.

Intermediate: Data Lake vs Data Warehouse

Data Lake	Data Warehouse
Format	files on object storage (Parquet on S3)	proprietary columnar (Snowflake, BigQuery)
Schema	schema-on-read	schema-on-write
Cost	cheap storage, BYO compute	bundled storage + compute
Best at	raw data, ML, semi-structured	BI dashboards, fast SQL

Modern stack often = Lakehouse (Delta Lake / Iceberg / Hudi): data-lake economics + warehouse-grade transactions and time-travel.

Intermediate: The Modern Cloud Toolkit

AWS S3 + Athena / EMR / Redshift — storage + serverless SQL + Spark + warehouse.
GCP BigQuery — serverless SQL warehouse, pay per scanned byte.
Snowflake — cross-cloud warehouse, separates storage and compute.
Databricks — managed Spark + Delta Lake on any cloud, notebook-first.
dbt — SQL transformation framework on top of any of the above.

For most analyst stacks today, a warehouse + dbt combo is enough — Spark only when warehouse SQL can't express it or compute is too expensive.

Advanced: Performance Tuning Spark

Broadcast joins — if one side is small (<10MB after filters), F.broadcast(small_df) skips the shuffle.
Cache strategically — df.cache() only when you reuse the same intermediate ≥2 times.
Adaptive Query Execution (AQE) — enabled by default in Spark 3+, auto-coalesces partitions and handles skew.
Salting skewed keys — append a random suffix to the hot key, join, then aggregate — spreads load across executors.
Read the Spark UI — stage timing, shuffle read/write, skew. The single best tuning tool.

Advanced: BigQuery / Snowflake Patterns

Cluster by the columns you filter on most (analogous to Spark partitioning).
Materialised views for hot aggregates; auto-incremented behind the scenes.
Slot-based pricing (BigQuery) vs warehouse sizes (Snowflake) — cost = compute × time, so smaller queries beat big ones.
Avoid SELECT * — columnar engines bill by bytes scanned.

Advanced: Streaming & Near-Real-Time

Spark Structured Streaming — same DataFrame API, source = Kafka, sink = Delta. Exactly-once with checkpoints.
Flink — lower-latency streaming, more complex.
Kafka is the durable event log; Spark/Flink the processor; warehouse the destination.

For analysts: most "real-time dashboards" are fine with 5-minute micro-batches — cheaper and simpler than true streaming.

Advanced: Cost & Governance

Tag every job with team / pipeline; cloud bills are merciless.
Auto-suspend warehouses; default-off is a habit, not a setting.
Use column-level access controls + row-level filters for PII.
Track lineage with dbt / OpenLineage so a broken upstream table is traceable.

Practice Path

1. Take a Pandas pipeline you have. Rewrite it in DuckDB SQL, then in Polars. Time all three on a 1GB CSV. 2. Spin up a free Databricks/BigQuery sandbox; load a Parquet dataset partitioned by date and run a partition-pruned query. 3. In PySpark, write a groupBy + count to S3 in Parquet; inspect the Spark UI and identify the shuffle stage. 4. Build a dbt model on top of a warehouse table that produces a daily revenue mart with tests for not-null and unique on the date column.