Last 30 Days
No notifications
Python is the #1 programming language for data analytics. With Pandas for data manipulation and NumPy for numerical computing, you can clean, transform, and analyze datasets of any size.
| Operation | Code | Purpose |
| Load CSV | pd.read_csv('data.csv') | Import data from file |
| Preview | df.head(), df.info(), df.describe() | Quick data overview |
| Select columns | df['col'] or df[['col1','col2']] | Column access |
| Filter rows | df[df['salary'] > 50000] | Boolean indexing |
| Loc / Iloc | df.loc[0:5, 'name'] / df.iloc[0:5, 0] | Label vs position indexing |
| Group & Aggregate | df.groupby('dept')['salary'].mean() | Summary stats by group |
| Merge | pd.merge(df1, df2, on='id') | SQL-like joins |
| Pivot Table | df.pivot_table(values, index, columns, aggfunc) | Cross-tabulation |
# Check for missing values
df.isnull().sum()# Fill missing values
df['salary'].fillna(df['salary'].median(), inplace=True)
# Drop rows with any NaN
df.dropna(subset=['email'], inplace=True)
# Remove duplicates
df.drop_duplicates(subset=['employee_id'], inplace=True)
# Fix data types
df['hire_date'] = pd.to_datetime(df['hire_date'])
df['salary'] = df['salary'].astype(float)
| Feature | Example | Result |
| Create array | np.array([1, 2, 3]) | 1D array |
| Zeros/Ones | np.zeros((3, 4)) | 3×4 matrix of zeros |
| Vectorized math | arr * 2 + 1 | Applies to every element |
| Statistics | np.mean(arr), np.std(arr) | Mean, standard deviation |
| Broadcasting | matrix + np.array([1,2,3]) | Add row to every row |
| Boolean mask | arr[arr > 5] | Filter elements |
1. Load → pd.read_csv()
2. Inspect → info(), describe(), isnull().sum()
3. Clean → Handle NULLs, fix types, remove duplicates
4. Transform → Create new columns, merge tables, pivot
5. Analyze → groupby(), aggregations, statistical tests
6. Export → to_csv(), to_excel()
Python is the analyst's swiss-army knife. With Pandas and NumPy you can load CSVs, clean them, reshape them, run statistics, and produce charts — all in a notebook, all reproducible. Excel handles 100k rows; Pandas handles 10M without breaking a sweat.
NumPy gives you fast typed arrays (ndarray). Pandas wraps those arrays with row/column labels and gives you two structures:
Series — one labeled column.DataFrame — a dictionary of Series sharing an index. Think of it as a programmable Excel sheet.import pandas as pd
df = pd.read_csv('sales.csv') # load
df.head() # peek first 5 rows
df.info() # types + null counts
df.describe() # summary stats for numeric cols
df.shape # (rows, cols)Almost every analysis starts with these four lines.
df['revenue'] # one column → Series
df[['name', 'revenue']] # multiple columns → DataFrame
df.loc[df['country'] == 'IN'] # rows where country = IN
df.loc[df['revenue'] > 1000, ['name', 'revenue']]
df.iloc[0:5, :3] # first 5 rows, first 3 colsRule of thumb: loc uses labels (names, conditions) and is inclusive on both ends; iloc uses integer positions and is exclusive on the end (like Python slicing). Mixing them causes bugs — pick one consciously.
1. for looping over rows with iterrows. 100× slower than vectorised ops. df['bonus'] = df['salary'] * 0.1 is the fast way.
2. Chained assignment — df[df.x > 0]['y'] = 5 produces the famous *SettingWithCopyWarning* and silently does nothing. Use df.loc[df.x > 0, 'y'] = 5.
3. Forgetting to reset the index after filtering, then being surprised df.iloc[0] is row 7.
4. Comparing with == np.nan. NaN is never equal to anything. Use .isna() / .notna().
5. inplace=True everywhere. It's being deprecated; just reassign: df = df.dropna().
6. Reading dirty CSVs without dtypes. A leading-zero ID becomes an int and loses zeros. Pass dtype={'id': str} or parse_dates=[...].
df = df.drop_duplicates()
df = df.dropna(subset=['email']) # drop rows missing email
df['age'] = df['age'].fillna(df['age'].median())
df['email'] = df['email'].str.strip().str.lower()
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df = df.rename(columns={'cust_nm': 'customer_name'})
df['country'] = df['country'].astype('category') # save memoryRule of thumb: spend 60% of your time here. Garbage in = garbage out.
The most important Pandas pattern. Three steps under the hood:
1. Split rows into groups by a key. 2. Apply a function to each group (mean, sum, custom). 3. Combine the per-group results into a new frame.
df.groupby('country')['revenue'].sum().sort_values(ascending=False)df.groupby('country').agg(
total_rev=('revenue', 'sum'),
customers=('id', 'nunique'),
avg_order=('revenue', 'mean'),
)
Multi-key groupby is just groupby(['country', 'category']) — you get a MultiIndex back.
Pandas joins behave like SQL JOINs:
orders.merge(users, on='user_id', how='left')
orders.merge(products, left_on='sku', right_on='product_sku', how='inner')
pd.concat([df_jan, df_feb], ignore_index=True) # stack verticallyWatch for many-to-many joins exploding row counts — always check .shape before and after.
# Wide → long
df_long = df.melt(id_vars='country', var_name='month', value_name='revenue')# Long → wide
df_wide = df_long.pivot_table(index='country', columns='month', values='revenue', aggfunc='sum')
Long format is what charting libraries (seaborn, plotly) prefer. Wide format is what humans read.
df['tier'] = pd.cut(df['revenue'], bins=[0, 100, 1000, 1e6], labels=['low', 'mid', 'high'])
df['country'] = df['country'].map({'IN': 'India', 'US': 'United States'})
df['log_rev'] = np.log1p(df['revenue'])Reach for np.where, pd.cut, .map before resorting to apply(lambda ...). Built-in vectorised ops are 10–100× faster.
Broadcasting lets you operate on arrays of different shapes without writing loops:
import numpy as np
A = np.ones((3, 4)) # 3 rows, 4 cols
row = np.array([1, 2, 3, 4])
A + row # row added to every row of ARules: shapes are compared right-to-left; dimensions match if they're equal or one is 1; missing dimensions are treated as 1. Once you understand this, the entire scientific Python ecosystem clicks.
Mature Pandas reads top-to-bottom like SQL CTEs:
result = (
df.query("status == 'paid' and revenue > 0")
.assign(month=lambda d: d['created_at'].dt.to_period('M'))
.groupby(['country', 'month'], as_index=False)
.agg(revenue=('revenue', 'sum'))
.sort_values(['country', 'month'])
)No intermediate variables, easy to comment in/out a step.
df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts').sort_index()
df['rev'].resample('D').sum() # daily totals
df['rev'].rolling('7D').mean() # 7-day moving average
df['rev'].pct_change().shift(-1) # next-day returnThese turn time-series work into one-liners.
category — saves 80–90% memory.pd.read_csv(..., usecols=..., dtype=...) to read only what you need.%timeit, %prun, and memory_profiler in Jupyter.df.plot(x='month', y='revenue', kind='line') # quick check
import seaborn as sns
sns.boxplot(data=df, x='country', y='revenue') # publication
import plotly.express as px
px.scatter(df, x='age', y='revenue', color='country') # interactiveFor reports stick with seaborn; for dashboards, plotly; for quick eyeballing, df.plot.
1. Load a real CSV with pd.read_csv, set proper dtypes, run info() and fix any wrong types.
2. Clean the data: drop duplicates, fill NaNs sensibly, normalise email + country columns.
3. Build a groupby that returns total revenue and unique customers per country, sorted desc.
4. Convert the same pipeline into a single method-chained block; add a 7-day rolling mean over a time index.