When datasets grow from gigabytes to terabytes and petabytes, traditional tools like Pandas and Excel break down. Big Data technologies enable distributed processing across clusters of machines, making it possible to analyze massive datasets in reasonable time.
| V | Description | Example |
| Volume | Massive data size | Terabytes of daily logs |
| Velocity | Speed of data generation | Real-time sensor streams |
| Variety | Multiple data formats | JSON, CSV, images, logs |
Hadoop Stack:
┌─────────────────────────────────┐
│ Hive / Pig (Query Layer) │
├─────────────────────────────────┤
│ MapReduce / Spark (Processing) │
├─────────────────────────────────┤
│ YARN (Resource Management) │
├─────────────────────────────────┤
│ HDFS (Distributed Storage) │
└─────────────────────────────────┘
Spark can be up to 100x faster than MapReduce on iterative workloads because it keeps intermediate data in memory instead of writing it to disk between steps.
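To make the processing layer concrete, here is a minimal single-machine sketch of the MapReduce model (map, shuffle, reduce) as a word count. The function names are illustrative only and are not part of Hadoop or Spark; a real cluster runs each phase in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical log lines, used only for illustration.
logs = ["error disk full", "warn disk slow", "error network down"]
print(word_count(logs))
```

In classic MapReduce the output of each phase is written to disk before the next one starts. Spark keeps such intermediate results in memory where it can, which is why chaining many steps is cheaper there.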
| Spark Component | Purpose |
| RDD | Low-level resilient distributed dataset |
| DataFrame | Structured data with schema (like Pandas) |
| SparkSQL | SQL queries on distributed data |
| MLlib | Machine learning at scale |
| Streaming | Real-time data processing |
When to Use What
| Tool | Data Size | Use Case |
| Pandas | < 10 GB | Local analysis, prototyping |
| PySpark | 10 GB – PB | Distributed processing, ETL |
| BigQuery | Any size | Serverless SQL analytics |
| Data Lake | Raw storage | Store everything, process later |
| Data Warehouse | Structured | Clean, query-optimized storage |
"Big data" means the data no longer fits on one machine — not in RAM, not on disk, not in a single query. When that happens, you stop reaching for Pandas and start reaching for distributed engines: Spark, BigQuery, Snowflake. The mental model is the same (rows, columns, transformations) but the compute is split across a cluster.
Rule of thumb based on data size:
| Size | Use |
| <100 MB | Pandas / Excel |
| 100 MB – 5 GB | Pandas + sampling, DuckDB, Polars |
| 5 GB – 1 TB | DuckDB, BigQuery, Snowflake (single warehouse) |
| >1 TB | Spark, BigQuery, Snowflake clusters |
Most analyst "big data" problems are solved by switching from Pandas to DuckDB or Polars without ever touching a cluster.
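The size rule of thumb can be written down as a small lookup. This is only a sketch: the thresholds are copied from the table above, the function name `suggest_tools` is made up for illustration, and real decisions also depend on available RAM, query shape, and budget.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# (exclusive upper bound in bytes, suggested tools), taken from the table above.
# These are rough guides, not hard limits.
SIZE_TIERS = [
    (100 * MB, "Pandas / Excel"),
    (5 * GB, "Pandas + sampling, DuckDB, Polars"),
    (1 * TB, "DuckDB, BigQuery, Snowflake (single warehouse)"),
]
LARGEST_TIER = "Spark, BigQuery, Snowflake clusters"


def suggest_tools(size_bytes: int) -> str:
    """Return the rule-of-thumb tool suggestion for a dataset of this size."""
    for upper_bound, tools in SIZE_TIERS:
        if size_bytes < upper_bound:
            return tools
    return LARGEST_TIER


print(suggest_tools(2 * GB))
```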
Spark is a distributed compute engine. Architecture:
Driver (your code) → Cluster Manager → Executors (workers)
                                       │
                                       └─ partitions of data
Key ideas:
- Lazy evaluation: transformations only build a plan; nothing runs until an action is called (show, count, write).
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName('demo').getOrCreate()
# Read the raw orders (Parquet files on object storage).
df = spark.read.parquet('s3://bucket/orders/')
# Monthly revenue and distinct customers per country, paid orders only.
(df.filter(F.col('status') == 'paid')
   .groupBy('country', F.date_trunc('month', 'created_at').alias('month'))
   .agg(F.sum('amount').alias('revenue'),
        F.countDistinct('user_id').alias('customers'))
   .orderBy('month')
   .write.mode('overwrite').parquet('s3://bucket/marts/monthly_revenue/'))
Note how similar this looks to Pandas — same verbs (filter, groupBy, agg), distributed underneath.
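One difference from Pandas is that Spark does no work until an action such as show, count, or write is called. The toy class below imitates that behaviour in plain Python so you can see it without a cluster. `LazyDataset` is a hypothetical stand-in, not a Spark API, and it ignores partitioning and query optimisation.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for a Spark DataFrame: records a plan, runs it on demand."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new dataset with a longer plan, does no work.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only extends the plan.
        return LazyDataset(self._rows, self._plan + (("map", func),))

    def _execute(self):
        # Apply every recorded step to each row, one row at a time.
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                yield row

    def count(self):
        # Action: this is the point where the plan actually runs.
        return sum(1 for _ in self._execute())

    def take(self, n):
        # Action: runs the plan only until n rows have been produced.
        return list(islice(self._execute(), n))


calls = []  # records which order ids the filter has looked at


def is_paid(order):
    calls.append(order["id"])
    return order["status"] == "paid"


# Hypothetical sample rows.
orders = [
    {"id": 1, "status": "paid", "amount": 10.0},
    {"id": 2, "status": "refunded", "amount": 5.0},
    {"id": 3, "status": "paid", "amount": 7.5},
]

paid_amounts = LazyDataset(orders).filter(is_paid).map(lambda o: o["amount"])
print(len(calls))            # the filter has not been called yet
print(paid_amounts.count())  # the action triggers execution
print(len(calls))            # now every row has been inspected
```

This is also why a slow Spark line is often not the one that looks expensive: the cost shows up at the action, not at the transformation that caused it.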
1. .collect() on huge datasets. Pulls everything to the driver — OOM crash. Use .show(), .take(n), or write to storage.
2. CSV instead of Parquet. Parquet is columnar + compressed, often 10× faster to read.
3. Lots of tiny files. Each file = a task; thousands of tiny files = scheduler hell. Coalesce to ~128MB files.
4. Joins without thinking about skew. One hot key ("unknown") on millions of rows blocks the whole job.
5. Wide transforms in tight loops. Each groupBy + join shuffles. Plan the pipeline once, run it once.
6. Mixing Python UDFs into hot paths. They serialise to Python and break Catalyst optimisation. Prefer built-in pyspark.sql.functions.
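For the tiny-files pitfall, the arithmetic behind "coalesce to ~128MB files" is simple enough to sketch. The helper name and the sample numbers are hypothetical; in PySpark the resulting count would typically be passed to `df.coalesce(n)` or `df.repartition(n)` before writing.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 ** 2  # ~128 MB, the target size mentioned above


def target_file_count(total_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Number of output files needed so each one is at most about target_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


# Hypothetical example: 20,000 input files of 512 KB each.
small_file_count = 20_000
total_bytes = small_file_count * 512 * 1024

print(small_file_count, "files in,", target_file_count(total_bytes), "files out")
```

The total data volume is unchanged, but the number of files, and therefore read tasks, drops by more than two orders of magnitude in this example.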
Good file layout = fast queries:
s3://bucket/orders/
  year=2026/
    month=04/
      country=IN/
        part-0000.snappy.parquet
Queries that filter on year=2026 AND country='IN' only scan the matching prefixes (partition pruning). Combine with bucketing (hash splitting) on join keys to remove shuffles.
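Partition pruning works because the filter values are encoded in the directory names, so an engine can discard files by path alone. The sketch below mimics that on a list of paths. The helper names and the second and third paths are invented for illustration, and real engines do this inside the query planner rather than with string matching.

```python
def parse_partitions(path: str) -> dict:
    """Extract Hive-style key=value segments from a file path."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        path
        for path in paths
        if all(parse_partitions(path).get(key) == value for key, value in wanted.items())
    ]


# Hypothetical listing; only the first path appears in the layout above.
files = [
    "s3://bucket/orders/year=2026/month=04/country=IN/part-0000.snappy.parquet",
    "s3://bucket/orders/year=2026/month=04/country=US/part-0000.snappy.parquet",
    "s3://bucket/orders/year=2025/month=12/country=IN/part-0000.snappy.parquet",
]

to_scan = prune(files, year="2026", country="IN")
print(to_scan)  # the other two files are never opened
```

Note that pruning only helps for columns that appear in the path. A filter on a non-partition column such as status still has to read every file under the matching prefixes.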
| | Data Lake | Data Warehouse |
| Format | files on object storage (Parquet on S3) | proprietary columnar (Snowflake, BigQuery) |
| Schema | schema-on-read | schema-on-write |
| Cost | cheap storage, BYO compute | bundled storage + compute |
| Best at | raw data, ML, semi-structured | BI dashboards, fast SQL |
Modern stack often = Lakehouse (Delta Lake / Iceberg / Hudi): data-lake economics + warehouse-grade transactions and time-travel.
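The Schema row in the table above is the least intuitive of the four, so here is a small sketch of the difference. The schema, field names, and function names are hypothetical. The point is only where validation happens: at load time for a warehouse, at query time for a lake.

```python
import json

# Hypothetical schema: field name -> Python type.
SCHEMA = {"order_id": int, "amount": float}


def write_with_schema(table: list, record: dict) -> None:
    """Schema-on-write: reject a record at load time if it does not fit."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"{field!r} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines: list) -> list:
    """Schema-on-read: store anything, apply the schema only when querying."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({field: cast(record[field]) for field, cast in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # malformed rows only surface at read time
    return rows


warehouse_table = []
write_with_schema(warehouse_table, {"order_id": 1, "amount": 9.5})

# The lake accepted both lines when they were written; the second is unusable.
lake_files = ['{"order_id": 1, "amount": "9.5"}', '{"order_id": "oops"}']
print(read_with_schema(lake_files))
```

Schema-on-write pushes data-quality errors to the producer, while schema-on-read keeps ingestion cheap and leaves cleanup to whoever queries the data.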
- Broadcast small tables in joins: F.broadcast(small_df) skips the shuffle.
- Call df.cache() only when you reuse the same intermediate ≥2 times.
- Avoid SELECT *: columnar engines bill by bytes scanned.
- Track cost by team / pipeline; cloud bills are merciless.
1. Take a Pandas pipeline you have. Rewrite it in DuckDB SQL, then in Polars. Time all three on a 1 GB CSV.
2. Spin up a free Databricks/BigQuery sandbox; load a Parquet dataset partitioned by date and run a partition-pruned query.
3. In PySpark, write a groupBy + count to S3 in Parquet; inspect the Spark UI and identify the shuffle stage.
4. Build a dbt model on top of a warehouse table that produces a daily revenue mart with tests for not-null and unique on the date column.