Last 30 Days
No notifications
Pandas provides two core data structures — Series (1D labeled array) and DataFrame (2D labeled table) — that make data cleaning, transformation, and analysis intuitive and fast.
import pandas as pd# From a dictionary
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [70000, 80000, 90000]
})
# From a CSV file
df = pd.read_csv("data.csv")
# From JSON
df = pd.read_json("data.json")
| Method | Syntax | Use Case |
| Column | df["col"] or df.col | Single column (Series) |
| Multiple cols | df[["a", "b"]] | Subset of columns |
| loc | df.loc[row, col] | Label-based selection |
| iloc | df.iloc[row, col] | Integer-position selection |
| Boolean | df[df["age"] > 30] | Filter rows by condition |
# Add a column
df["bonus"] = df["salary"] * 0.1# Remove a column
df = df.drop(columns=["bonus"])
# Sort
df = df.sort_values("salary", ascending=False)
# Group and aggregate
df.groupby("department")["salary"].mean()
| Method | Description |
df.isna() | Detect NaN values |
df.dropna() | Remove rows with NaN |
df.fillna(value) | Replace NaN with value |
df.interpolate() | Fill NaN by interpolation |
# SQL-style join
result = pd.merge(df1, df2, on="id", how="left")# Concatenate vertically
combined = pd.concat([df1, df2], ignore_index=True)
> Tip: Use df.describe() for a quick statistical summary. Use df.info() to inspect dtypes and memory usage.
Pandas is *the* tool for tabular data in Python — CSVs, Excel sheets, SQL exports, log files, scraped tables. If your data fits in RAM and looks like a spreadsheet, pandas is the answer. It's the workhorse of data analysis, ML data prep, and reporting.
Two core types:
Series — a 1-D labelled column (like a NumPy array + an index).DataFrame — a 2-D labelled table: rows have an index, columns have names.import pandas as pddf = pd.DataFrame({
"name": ["Alice", "Bob", "Cara"],
"age": [30, 25, 35],
"city": ["NYC", "LA", "NYC"],
})
df = pd.read_csv("users.csv")
df = pd.read_excel("book.xlsx", sheet_name="Sheet1")
df = pd.read_json("users.json")
df = pd.read_parquet("big.parquet") # fast, compressed, typed
Writers mirror the readers: df.to_csv(...), to_parquet, to_json. For anything > a few hundred MB, parquet beats CSV every time.
df.head(), df.tail()
df.shape, df.columns, df.dtypes
df.info() # types + memory
df.describe() # numeric summary
df.isna().sum() # nulls per columnFirst thing to do with any new dataset: info() + describe() + isna().sum(). Saves hours.
df["age"] # one column → Series
df[["name", "age"]] # multiple columns → DataFrame
df.loc[0] # row by label
df.iloc[0] # row by position
df.loc[df["age"] > 30] # boolean mask
df.loc[df["city"] == "NYC", ["name", "age"]]loc = label-based, iloc = integer-position-based. Mix them up and you'll get headaches; pick deliberately.
1. for i, row in df.iterrows(): for everything. Painfully slow. Vectorise with column ops or groupby.
2. Chained assignment like df[df.x > 0]["y"] = 1. Triggers SettingWithCopyWarning and silently does nothing. Use df.loc[df.x > 0, "y"] = 1.
3. Ignoring dtypes. A column read from CSV may be object (string) when you expect numbers. Convert with astype / pd.to_numeric / pd.to_datetime.
4. Using == to compare with NaN. x == NaN is always False. Use x.isna() / x.notna().
5. Mixing loc and iloc mentally. Pick one per operation; never index by integer with loc.
6. Holding the whole CSV in memory "just in case". usecols=, dtype=, nrows= keep read_csv lean.
df.drop_duplicates()
df.dropna(subset=["email"])
df.fillna({"age": df["age"].median(), "city": "unknown"})
df["name"] = df["name"].str.strip().str.lower()
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
df = df.rename(columns={"old": "new"})String methods live under .str, datetime methods under .dt. Vectorised, no loops.
df["is_adult"] = df["age"] >= 18
df["name_len"] = df["name"].str.len()
df["bracket"] = pd.cut(df["age"], bins=[0,18,40,100], labels=["kid","adult","senior"])If you find yourself reaching for .apply(lambda row: ...), pause — there's almost always a vectorised version that's 10–100× faster.
df.groupby("city")["age"].mean()df.groupby("city").agg(
avg_age=("age", "mean"),
n_users=("name", "count"),
youngest=("age", "min"),
)
df.groupby("city")["age"].transform("mean") # broadcast back to rows
GroupBy is pandas' superpower. Almost every dashboard query is a groupby.
merged = users.merge(orders, on="user_id", how="left")how = "inner" / "left" / "right" / "outer". Same mental model as SQL JOIN. Validate with indicator=True to see which rows matched.
piperesult = (
df
.query("age > 25")
.assign(tax=lambda d: d["salary"] * 0.3)
.groupby("department", as_index=False)["tax"].sum()
.sort_values("tax", ascending=False)
)Reads top-to-bottom like a pipeline. pipe(func) lets you slot custom functions into the chain. Resist mutating intermediate variables — chains are easier to refactor.
df = df.set_index("signup").sort_index()
df.loc["2026-01"] # all of January
df.resample("W")["sales"].sum() # weekly sums
df.rolling("7D")["sales"].mean() # 7-day moving avgA well-chosen index (especially a sorted datetime) unlocks slicing by date and resample/rolling windows that would be brutal in raw NumPy.
category (df["city"] = df["city"].astype("category")) — huge memory + groupby wins.pd.to_numeric(s, downcast="integer").pd.read_csv(..., usecols=["a","b"]).apply with Python functions; prefer column ops, groupby.agg, or NumPy ufuncs.1. Load any CSV (titanic, iris, your own) and produce a mini report: info, describe, null counts, top 5 rows.
2. Compute a per-group mean using groupby(...).agg(...) and verify with a vectorised mask.
3. Convert a slow for-loop transformation into a method chain (assign + boolean masks).
4. Save a DataFrame to parquet, reload it, and confirm dtypes survive (csv would lose them).