Notifications

No notifications

/Phase 2

Python for Data

Python for Data — Pandas, NumPy & Data Cleaning

Python is the #1 programming language for data analytics. With Pandas for data manipulation and NumPy for numerical computing, you can clean, transform, and analyze datasets of any size.

Pandas — The Data Analysis Workhorse

OperationCodePurpose
Load CSVpd.read_csv('data.csv')Import data from file
Previewdf.head(), df.info(), df.describe()Quick data overview
Select columnsdf['col'] or df[['col1','col2']]Column access
Filter rowsdf[df['salary'] > 50000]Boolean indexing
Loc / Ilocdf.loc[0:5, 'name'] / df.iloc[0:5, 0]Label vs position indexing
Group & Aggregatedf.groupby('dept')['salary'].mean()Summary stats by group
Mergepd.merge(df1, df2, on='id')SQL-like joins
Pivot Tabledf.pivot_table(values, index, columns, aggfunc)Cross-tabulation

Data Cleaning Essentials

# Check for missing values
df.isnull().sum()

# Fill missing values df['salary'].fillna(df['salary'].median(), inplace=True)

# Drop rows with any NaN df.dropna(subset=['email'], inplace=True)

# Remove duplicates df.drop_duplicates(subset=['employee_id'], inplace=True)

# Fix data types df['hire_date'] = pd.to_datetime(df['hire_date']) df['salary'] = df['salary'].astype(float)

NumPy — Fast Numerical Computing

FeatureExampleResult
Create arraynp.array([1, 2, 3])1D array
Zeros/Onesnp.zeros((3, 4))3×4 matrix of zeros
Vectorized matharr * 2 + 1Applies to every element
Statisticsnp.mean(arr), np.std(arr)Mean, standard deviation
Broadcastingmatrix + np.array([1,2,3])Add row to every row
Boolean maskarr[arr > 5]Filter elements

Data Cleaning Workflow

1. Loadpd.read_csv() 2. Inspectinfo(), describe(), isnull().sum() 3. Clean → Handle NULLs, fix types, remove duplicates 4. Transform → Create new columns, merge tables, pivot 5. Analyzegroupby(), aggregations, statistical tests 6. Exportto_csv(), to_excel()

On this page

Detailed Theory

Python is the analyst's swiss-army knife. With Pandas and NumPy you can load CSVs, clean them, reshape them, run statistics, and produce charts — all in a notebook, all reproducible. Excel handles 100k rows; Pandas handles 10M without breaking a sweat.

What Pandas Actually Is

NumPy gives you fast typed arrays (ndarray). Pandas wraps those arrays with row/column labels and gives you two structures:

  • Series — one labeled column.
  • DataFrame — a dictionary of Series sharing an index. Think of it as a programmable Excel sheet.
import pandas as pd
df = pd.read_csv('sales.csv')   # load
df.head()                       # peek first 5 rows
df.info()                       # types + null counts
df.describe()                   # summary stats for numeric cols
df.shape                        # (rows, cols)

Almost every analysis starts with these four lines.

Selecting & Filtering

df['revenue']                            # one column → Series
df[['name', 'revenue']]                  # multiple columns → DataFrame
df.loc[df['country'] == 'IN']            # rows where country = IN
df.loc[df['revenue'] > 1000, ['name', 'revenue']]
df.iloc[0:5, :3]                         # first 5 rows, first 3 cols

Rule of thumb: loc uses labels (names, conditions) and is inclusive on both ends; iloc uses integer positions and is exclusive on the end (like Python slicing). Mixing them causes bugs — pick one consciously.

Beginner Mistakes to Skip

1. for looping over rows with iterrows. 100× slower than vectorised ops. df['bonus'] = df['salary'] * 0.1 is the fast way. 2. Chained assignmentdf[df.x > 0]['y'] = 5 produces the famous *SettingWithCopyWarning* and silently does nothing. Use df.loc[df.x > 0, 'y'] = 5. 3. Forgetting to reset the index after filtering, then being surprised df.iloc[0] is row 7. 4. Comparing with == np.nan. NaN is never equal to anything. Use .isna() / .notna(). 5. inplace=True everywhere. It's being deprecated; just reassign: df = df.dropna(). 6. Reading dirty CSVs without dtypes. A leading-zero ID becomes an int and loses zeros. Pass dtype={'id': str} or parse_dates=[...].

Intermediate: Cleaning the Data

df = df.drop_duplicates()
df = df.dropna(subset=['email'])              # drop rows missing email
df['age'] = df['age'].fillna(df['age'].median())
df['email'] = df['email'].str.strip().str.lower()
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df = df.rename(columns={'cust_nm': 'customer_name'})
df['country'] = df['country'].astype('category')   # save memory

Rule of thumb: spend 60% of your time here. Garbage in = garbage out.

Intermediate: GroupBy — Split, Apply, Combine

The most important Pandas pattern. Three steps under the hood:

1. Split rows into groups by a key. 2. Apply a function to each group (mean, sum, custom). 3. Combine the per-group results into a new frame.

df.groupby('country')['revenue'].sum().sort_values(ascending=False)

df.groupby('country').agg( total_rev=('revenue', 'sum'), customers=('id', 'nunique'), avg_order=('revenue', 'mean'), )

Multi-key groupby is just groupby(['country', 'category']) — you get a MultiIndex back.

Intermediate: Merging & Joining

Pandas joins behave like SQL JOINs:

orders.merge(users, on='user_id', how='left')
orders.merge(products, left_on='sku', right_on='product_sku', how='inner')
pd.concat([df_jan, df_feb], ignore_index=True)   # stack vertically

Watch for many-to-many joins exploding row counts — always check .shape before and after.

Intermediate: Reshape — pivot, melt, stack

# Wide → long
df_long = df.melt(id_vars='country', var_name='month', value_name='revenue')

# Long → wide df_wide = df_long.pivot_table(index='country', columns='month', values='revenue', aggfunc='sum')

Long format is what charting libraries (seaborn, plotly) prefer. Wide format is what humans read.

Intermediate: Apply, Map & Vectorisation

df['tier'] = pd.cut(df['revenue'], bins=[0, 100, 1000, 1e6], labels=['low', 'mid', 'high'])
df['country'] = df['country'].map({'IN': 'India', 'US': 'United States'})
df['log_rev'] = np.log1p(df['revenue'])

Reach for np.where, pd.cut, .map before resorting to apply(lambda ...). Built-in vectorised ops are 10–100× faster.

Advanced: NumPy Broadcasting

Broadcasting lets you operate on arrays of different shapes without writing loops:

import numpy as np
A = np.ones((3, 4))     # 3 rows, 4 cols
row = np.array([1, 2, 3, 4])
A + row                  # row added to every row of A

Rules: shapes are compared right-to-left; dimensions match if they're equal or one is 1; missing dimensions are treated as 1. Once you understand this, the entire scientific Python ecosystem clicks.

Advanced: Method Chaining (Tidy Pandas)

Mature Pandas reads top-to-bottom like SQL CTEs:

result = (
    df.query("status == 'paid' and revenue > 0")
      .assign(month=lambda d: d['created_at'].dt.to_period('M'))
      .groupby(['country', 'month'], as_index=False)
      .agg(revenue=('revenue', 'sum'))
      .sort_values(['country', 'month'])
)

No intermediate variables, easy to comment in/out a step.

Advanced: Time Series Helpers

df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts').sort_index()
df['rev'].resample('D').sum()                 # daily totals
df['rev'].rolling('7D').mean()                 # 7-day moving average
df['rev'].pct_change().shift(-1)               # next-day return

These turn time-series work into one-liners.

Advanced: Performance & Memory

  • Convert low-cardinality strings to category — saves 80–90% memory.
  • Use pd.read_csv(..., usecols=..., dtype=...) to read only what you need.
  • For >1M rows or >1GB files, switch to DuckDB or Polars — same DataFrame mental model, 10–100× faster.
  • Profile with %timeit, %prun, and memory_profiler in Jupyter.

Advanced: Visualisation Hooks

df.plot(x='month', y='revenue', kind='line')         # quick check
import seaborn as sns
sns.boxplot(data=df, x='country', y='revenue')         # publication
import plotly.express as px
px.scatter(df, x='age', y='revenue', color='country')  # interactive

For reports stick with seaborn; for dashboards, plotly; for quick eyeballing, df.plot.

Practice Path

1. Load a real CSV with pd.read_csv, set proper dtypes, run info() and fix any wrong types. 2. Clean the data: drop duplicates, fill NaNs sensibly, normalise email + country columns. 3. Build a groupby that returns total revenue and unique customers per country, sorted desc. 4. Convert the same pipeline into a single method-chained block; add a 7-day rolling mean over a time index.