Notifications

No notifications

/Phase 4

Pandas for Data Analysis

Pandas — Data Analysis Powerhouse 🐼

Pandas provides two core data structures — Series (1D labeled array) and DataFrame (2D labeled table) — that make data cleaning, transformation, and analysis intuitive and fast.

Creating DataFrames

import pandas as pd

# From a dictionary df = pd.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "salary": [70000, 80000, 90000] })

# From a CSV file df = pd.read_csv("data.csv")

# From JSON df = pd.read_json("data.json")

Selection Methods

MethodSyntaxUse Case
Columndf["col"] or df.colSingle column (Series)
Multiple colsdf[["a", "b"]]Subset of columns
locdf.loc[row, col]Label-based selection
ilocdf.iloc[row, col]Integer-position selection
Booleandf[df["age"] > 30]Filter rows by condition

Essential Operations

# Add a column
df["bonus"] = df["salary"] * 0.1

# Remove a column df = df.drop(columns=["bonus"])

# Sort df = df.sort_values("salary", ascending=False)

# Group and aggregate df.groupby("department")["salary"].mean()

Handling Missing Data

MethodDescription
df.isna()Detect NaN values
df.dropna()Remove rows with NaN
df.fillna(value)Replace NaN with value
df.interpolate()Fill NaN by interpolation

Merging DataFrames

# SQL-style join
result = pd.merge(df1, df2, on="id", how="left")

# Concatenate vertically combined = pd.concat([df1, df2], ignore_index=True)

> Tip: Use df.describe() for a quick statistical summary. Use df.info() to inspect dtypes and memory usage.

On this page

Detailed Theory

Pandas is *the* tool for tabular data in Python — CSVs, Excel sheets, SQL exports, log files, scraped tables. If your data fits in RAM and looks like a spreadsheet, pandas is the answer. It's the workhorse of data analysis, ML data prep, and reporting.

What pandas Actually Is

Two core types:

  • Series — a 1-D labelled column (like a NumPy array + an index).
  • DataFrame — a 2-D labelled table: rows have an index, columns have names.
Under the hood: each column is (usually) a NumPy array, so column-wise math is NumPy-fast.

Creating & Loading

import pandas as pd

df = pd.DataFrame({ "name": ["Alice", "Bob", "Cara"], "age": [30, 25, 35], "city": ["NYC", "LA", "NYC"], })

df = pd.read_csv("users.csv") df = pd.read_excel("book.xlsx", sheet_name="Sheet1") df = pd.read_json("users.json") df = pd.read_parquet("big.parquet") # fast, compressed, typed

Writers mirror the readers: df.to_csv(...), to_parquet, to_json. For anything > a few hundred MB, parquet beats CSV every time.

Inspecting a DataFrame

df.head(), df.tail()
df.shape, df.columns, df.dtypes
df.info()                # types + memory
df.describe()             # numeric summary
df.isna().sum()           # nulls per column

First thing to do with any new dataset: info() + describe() + isna().sum(). Saves hours.

Selecting Data

df["age"]                  # one column → Series
df[["name", "age"]]        # multiple columns → DataFrame
df.loc[0]                   # row by label
df.iloc[0]                  # row by position
df.loc[df["age"] > 30]      # boolean mask
df.loc[df["city"] == "NYC", ["name", "age"]]

loc = label-based, iloc = integer-position-based. Mix them up and you'll get headaches; pick deliberately.

Beginner Mistakes to Skip

1. for i, row in df.iterrows(): for everything. Painfully slow. Vectorise with column ops or groupby. 2. Chained assignment like df[df.x > 0]["y"] = 1. Triggers SettingWithCopyWarning and silently does nothing. Use df.loc[df.x > 0, "y"] = 1. 3. Ignoring dtypes. A column read from CSV may be object (string) when you expect numbers. Convert with astype / pd.to_numeric / pd.to_datetime. 4. Using == to compare with NaN. x == NaN is always False. Use x.isna() / x.notna(). 5. Mixing loc and iloc mentally. Pick one per operation; never index by integer with loc. 6. Holding the whole CSV in memory "just in case". usecols=, dtype=, nrows= keep read_csv lean.

Intermediate: Cleaning

df.drop_duplicates()
df.dropna(subset=["email"])
df.fillna({"age": df["age"].median(), "city": "unknown"})
df["name"] = df["name"].str.strip().str.lower()
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
df = df.rename(columns={"old": "new"})

String methods live under .str, datetime methods under .dt. Vectorised, no loops.

Intermediate: Transforming

df["is_adult"] = df["age"] >= 18
df["name_len"] = df["name"].str.len()
df["bracket"]  = pd.cut(df["age"], bins=[0,18,40,100], labels=["kid","adult","senior"])

If you find yourself reaching for .apply(lambda row: ...), pause — there's almost always a vectorised version that's 10–100× faster.

Intermediate: GroupBy (Split-Apply-Combine)

df.groupby("city")["age"].mean()

df.groupby("city").agg( avg_age=("age", "mean"), n_users=("name", "count"), youngest=("age", "min"), )

df.groupby("city")["age"].transform("mean") # broadcast back to rows

GroupBy is pandas' superpower. Almost every dashboard query is a groupby.

Intermediate: Joins (Merge)

merged = users.merge(orders, on="user_id", how="left")

how = "inner" / "left" / "right" / "outer". Same mental model as SQL JOIN. Validate with indicator=True to see which rows matched.

Advanced: Method Chaining & pipe

result = (
    df
    .query("age > 25")
    .assign(tax=lambda d: d["salary"] * 0.3)
    .groupby("department", as_index=False)["tax"].sum()
    .sort_values("tax", ascending=False)
)

Reads top-to-bottom like a pipeline. pipe(func) lets you slot custom functions into the chain. Resist mutating intermediate variables — chains are easier to refactor.

Advanced: Indexes & Time Series

df = df.set_index("signup").sort_index()
df.loc["2026-01"]                     # all of January
df.resample("W")["sales"].sum()       # weekly sums
df.rolling("7D")["sales"].mean()      # 7-day moving avg

A well-chosen index (especially a sorted datetime) unlocks slicing by date and resample/rolling windows that would be brutal in raw NumPy.

Advanced: Memory & Performance

  • Convert low-cardinality strings to category (df["city"] = df["city"].astype("category")) — huge memory + groupby wins.
  • Downcast numerics: pd.to_numeric(s, downcast="integer").
  • Read only needed columns: pd.read_csv(..., usecols=["a","b"]).
  • Avoid apply with Python functions; prefer column ops, groupby.agg, or NumPy ufuncs.
  • For multi-GB data: use Polars (lazy, columnar, extremely fast) or DuckDB (SQL on parquet); switch back to pandas only at the boundaries.

Advanced: When *Not* to Use pandas

  • Tiny scripts (< 100 rows) — a list of dicts is simpler.
  • Streaming / out-of-memory data — Polars, Dask, DuckDB.
  • Heavy SQL transforms — just run SQL.
  • Tensor maths / ML training — NumPy / PyTorch.
Pandas is for *labelled, in-memory, mixed-type tables*. Use it where it shines.

Practice Path

1. Load any CSV (titanic, iris, your own) and produce a mini report: info, describe, null counts, top 5 rows. 2. Compute a per-group mean using groupby(...).agg(...) and verify with a vectorised mask. 3. Convert a slow for-loop transformation into a method chain (assign + boolean masks). 4. Save a DataFrame to parquet, reload it, and confirm dtypes survive (csv would lose them).