Notifications

No notifications

Big Data Tools — Processing Data That Doesn't Fit in Memory

When datasets grow from gigabytes to terabytes and petabytes, traditional tools like Pandas and Excel break down. Big Data technologies enable distributed processing across clusters of machines, making it possible to analyze massive datasets in reasonable time.

The 3 Vs of Big Data

VDescriptionExample
VolumeMassive data sizeTerabytes of daily logs
VelocitySpeed of data generationReal-time sensor streams
VarietyMultiple data formatsJSON, CSV, images, logs

Hadoop Ecosystem

Hadoop Stack:
┌─────────────────────────────────┐
│  Hive / Pig  (Query Layer)      │
├─────────────────────────────────┤
│  MapReduce / Spark (Processing) │
├─────────────────────────────────┤
│  YARN (Resource Management)     │
├─────────────────────────────────┤
│  HDFS (Distributed Storage)     │
└─────────────────────────────────┘

  • HDFS: Hadoop Distributed File System — splits files into 128MB blocks stored across nodes with replication for fault tolerance
  • MapReduce: Two-step processing: Map (filter/transform in parallel) → Reduce (aggregate results)
  • YARN: Manages cluster resources (CPU, memory) across jobs

Apache Spark — The Modern Standard

Spark is 100x faster than MapReduce for iterative tasks because it processes data in-memory instead of writing to disk between steps.

Spark ComponentPurpose
RDDLow-level resilient distributed dataset
DataFrameStructured data with schema (like Pandas)
SparkSQLSQL queries on distributed data
MLlibMachine learning at scale
StreamingReal-time data processing

When to Use What

ToolData SizeUse Case
Pandas< 10 GBLocal analysis, prototyping
PySpark10 GB – PBDistributed processing, ETL
BigQueryAny sizeServerless SQL analytics
Data LakeRaw storageStore everything, process later
Data WarehouseStructuredClean, query-optimized storage

On this page

Detailed Theory

"Big data" means the data no longer fits on one machine — not in RAM, not on disk, not in a single query. When that happens, you stop reaching for Pandas and start reaching for distributed engines: Spark, BigQuery, Snowflake. The mental model is the same (rows, columns, transformations) but the compute is split across a cluster.

When You Actually Need Big Data Tools

Rule of thumb based on data size:

SizeUse
<100 MBPandas / Excel
100 MB – 5 GBPandas + sampling, DuckDB, Polars
5 GB – 1 TBDuckDB, BigQuery, Snowflake (single warehouse)
>1 TBSpark, BigQuery, Snowflake clusters

Most analyst "big data" problems are solved by switching from Pandas to DuckDB or Polars without ever touching a cluster.

What Spark Actually Is

A distributed compute engine. Architecture:

Driver  (your code) → Cluster Manager → Executors (workers)
                                            │
                                            └─ partitions of data

Key ideas:

  • Data is split into partitions, processed in parallel on executors.
  • Operations are lazy — transformations build a DAG, nothing runs until an action (show, count, write).
  • The Catalyst optimiser rewrites your DataFrame query into an efficient physical plan.
  • Shuffles (joins, groupBys) move data across the network — the slowest part. Minimise them.

The PySpark API You'll Use Daily

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName('demo').getOrCreate()

df = spark.read.parquet('s3://bucket/orders/')

(df.filter(F.col('status') == 'paid') .groupBy('country', F.date_trunc('month', 'created_at').alias('month')) .agg(F.sum('amount').alias('revenue'), F.countDistinct('user_id').alias('customers')) .orderBy('month') .write.mode('overwrite').parquet('s3://bucket/marts/monthly_revenue/'))

Note how similar this looks to Pandas — same verbs (filter, groupBy, agg), distributed underneath.

Beginner Mistakes to Skip

1. .collect() on huge datasets. Pulls everything to the driver — OOM crash. Use .show(), .take(n), or write to storage. 2. CSV instead of Parquet. Parquet is columnar + compressed, often 10× faster to read. 3. Lots of tiny files. Each file = a task; thousands of tiny files = scheduler hell. Coalesce to ~128MB files. 4. Joins without thinking about skew. One hot key ("unknown") on millions of rows blocks the whole job. 5. Wide transforms in tight loops. Each groupBy + join shuffles. Plan the pipeline once, run it once. 6. Mixing Python UDFs into hot paths. They serialise to Python and break Catalyst optimisation. Prefer built-in pyspark.sql.functions.

Intermediate: RDD vs DataFrame vs Dataset

  • RDD — low-level, untyped, no Catalyst. Almost never use directly.
  • DataFrame — columnar, optimised, what you write 99% of the time.
  • Dataset — typed DataFrame (Scala/Java only). Python is DataFrame-only.

Intermediate: Partitioning & File Layout

Good file layout = fast queries:

s3://bucket/orders/
   year=2026/
      month=04/
         country=IN/
            part-0000.snappy.parquet

Queries that filter on year=2026 AND country='IN' only scan that prefix — partition pruning. Combine with bucketing (hash splitting) on join keys to remove shuffles.

Intermediate: Data Lake vs Data Warehouse

Data LakeData Warehouse
Formatfiles on object storage (Parquet on S3)proprietary columnar (Snowflake, BigQuery)
Schemaschema-on-readschema-on-write
Costcheap storage, BYO computebundled storage + compute
Best atraw data, ML, semi-structuredBI dashboards, fast SQL

Modern stack often = Lakehouse (Delta Lake / Iceberg / Hudi): data-lake economics + warehouse-grade transactions and time-travel.

Intermediate: The Modern Cloud Toolkit

  • AWS S3 + Athena / EMR / Redshift — storage + serverless SQL + Spark + warehouse.
  • GCP BigQuery — serverless SQL warehouse, pay per scanned byte.
  • Snowflake — cross-cloud warehouse, separates storage and compute.
  • Databricks — managed Spark + Delta Lake on any cloud, notebook-first.
  • dbt — SQL transformation framework on top of any of the above.
For most analyst stacks today, a warehouse + dbt combo is enough — Spark only when warehouse SQL can't express it or compute is too expensive.

Advanced: Performance Tuning Spark

  • Broadcast joins — if one side is small (<10MB after filters), F.broadcast(small_df) skips the shuffle.
  • Cache strategicallydf.cache() only when you reuse the same intermediate ≥2 times.
  • Adaptive Query Execution (AQE) — enabled by default in Spark 3+, auto-coalesces partitions and handles skew.
  • Salting skewed keys — append a random suffix to the hot key, join, then aggregate — spreads load across executors.
  • Read the Spark UI — stage timing, shuffle read/write, skew. The single best tuning tool.

Advanced: BigQuery / Snowflake Patterns

  • Cluster by the columns you filter on most (analogous to Spark partitioning).
  • Materialised views for hot aggregates; auto-incremented behind the scenes.
  • Slot-based pricing (BigQuery) vs warehouse sizes (Snowflake) — cost = compute × time, so smaller queries beat big ones.
  • Avoid SELECT * — columnar engines bill by bytes scanned.

Advanced: Streaming & Near-Real-Time

  • Spark Structured Streaming — same DataFrame API, source = Kafka, sink = Delta. Exactly-once with checkpoints.
  • Flink — lower-latency streaming, more complex.
  • Kafka is the durable event log; Spark/Flink the processor; warehouse the destination.
For analysts: most "real-time dashboards" are fine with 5-minute micro-batches — cheaper and simpler than true streaming.

Advanced: Cost & Governance

  • Tag every job with team / pipeline; cloud bills are merciless.
  • Auto-suspend warehouses; default-off is a habit, not a setting.
  • Use column-level access controls + row-level filters for PII.
  • Track lineage with dbt / OpenLineage so a broken upstream table is traceable.

Practice Path

1. Take a Pandas pipeline you have. Rewrite it in DuckDB SQL, then in Polars. Time all three on a 1GB CSV. 2. Spin up a free Databricks/BigQuery sandbox; load a Parquet dataset partitioned by date and run a partition-pruned query. 3. In PySpark, write a groupBy + count to S3 in Parquet; inspect the Spark UI and identify the shuffle stage. 4. Build a dbt model on top of a warehouse table that produces a daily revenue mart with tests for not-null and unique on the date column.