Notifications

No notifications

/Phase 2

SQL Fundamentals

SQL Fundamentals — Querying Relational Databases

SQL (Structured Query Language) is the universal language for working with relational databases. Every data analyst role requires strong SQL skills — it's how you extract, filter, and summarize business data at scale.

Core Query Structure

SELECT column1, column2       -- What columns to show
FROM table_name                -- Which table to query
WHERE condition                -- Filter rows
GROUP BY column                -- Group for aggregation
HAVING aggregate_condition     -- Filter groups
ORDER BY column DESC           -- Sort results
LIMIT 10;                      -- Restrict row count

Filtering Operators

OperatorExamplePurpose
=, !=, <, >WHERE salary > 50000Basic comparison
AND / OR / NOTWHERE dept = 'Sales' AND salary > 60000Combine conditions
BETWEENWHERE hire_date BETWEEN '2023-01-01' AND '2023-12-31'Range filter
INWHERE region IN ('North', 'South')Match any in list
LIKEWHERE name LIKE 'A%'Pattern matching
IS NULLWHERE manager_id IS NULLCheck for missing data

Aggregate Functions

FunctionPurposeExample
COUNT(*)Count all rowsSELECT COUNT(*) FROM orders
SUM(col)Total a numeric columnSELECT SUM(amount) FROM orders
AVG(col)Average valueSELECT AVG(salary) FROM employees
MIN(col) / MAX(col)Smallest / LargestSELECT MIN(price), MAX(price) FROM products
DISTINCTRemove duplicatesSELECT DISTINCT department FROM employees

NULL Handling

NULLs represent missing data and require special handling:

-- Check for NULL
SELECT * FROM employees WHERE manager_id IS NULL;

-- Replace NULL with a default using COALESCE SELECT name, COALESCE(phone, 'N/A') AS phone FROM employees;

-- COALESCE picks the first non-NULL value SELECT COALESCE(nickname, first_name, 'Unknown') AS display_name FROM users;

Column Aliases

Use AS to rename output columns for readability:

SELECT department AS dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

On this page

Detailed Theory

SQL is the language you'll use every day as an analyst. Excel can show you 100 rows; SQL lets you ask questions of 100 million. Once you can write a clean SELECT ... WHERE ... GROUP BY, you can answer most business questions yourself instead of waiting for an engineer.

What SQL Actually Is

A database stores data in tables — rows and columns with strict types. SQL is the language you use to ask the database for data. You describe *what* you want, the database figures out *how*.

SELECT name, email
FROM users
WHERE country = 'IN'
ORDER BY created_at DESC
LIMIT 10;

Read it like English: "give me name and email from users in India, newest first, just 10".

The Six Clauses That Cover 80% of Queries

ClauseJob
SELECTwhich columns to return
FROMwhich table to read
WHEREfilter individual rows
GROUP BYbucket rows into groups
ORDER BYsort the result
LIMITcap the number of rows

Learn these, in this order. JOINs, subqueries, and window functions all build on top.

How SQL Actually Runs (Logical Order)

You *write* SQL in one order, the database *runs* it in another:

FROM → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT

This explains the #1 beginner gotcha: a column alias from SELECT can't be used in WHERE because WHERE runs before SELECT. Use HAVING (after aggregation) or repeat the expression.

Beginner Mistakes to Skip

1. SELECT * in dashboards / production. Wastes bandwidth, breaks when columns are added. List the columns you need. 2. Forgetting WHERE on UPDATE / DELETE. DELETE FROM users; deletes the whole table. Always test with a SELECT first. 3. = NULL. NULL means *unknown*. Use IS NULL / IS NOT NULL. 4. Mixing data types in comparisons. WHERE id = '42' may quietly do a string compare and never use the index. 5. Sorting before filtering. ORDER BY then LIMIT 10 on millions of rows is slow if there's no index. Filter first. 6. Trusting COUNT(*) and COUNT(col) to mean the same thing. COUNT(*) counts rows; COUNT(col) skips NULLs.

Intermediate: WHERE Patterns

WHERE age BETWEEN 18 AND 30           -- inclusive on both ends
WHERE country IN ('IN', 'US', 'UK')
WHERE email LIKE '%@gmail.com'        -- % = any chars, _ = one char
WHERE deleted_at IS NULL
WHERE created_at >= '2026-01-01'
  AND status = 'active'
  AND (role = 'admin' OR role = 'owner')

Mix conditions with AND / OR. Parentheses matter — A AND B OR C is ambiguous, wrap groups.

Intermediate: Aggregates & GROUP BY

Aggregates squash many rows into one summary row:

SELECT
  country,
  COUNT(*)            AS user_count,
  AVG(age)            AS avg_age,
  MIN(created_at)     AS first_signup,
  MAX(created_at)     AS latest_signup
FROM users
WHERE deleted_at IS NULL
GROUP BY country
HAVING COUNT(*) > 100
ORDER BY user_count DESC;

Key rules:

  • Every column in SELECT must be either inside an aggregate or in GROUP BY.
  • WHERE filters rows; HAVING filters groups (after aggregation).
  • COUNT(DISTINCT col) for unique counts.

Intermediate: ORDER BY, LIMIT & Pagination

ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 40       -- page 3, 20 per page

A secondary tiebreaker (id DESC) makes ordering deterministic when timestamps tie.

Intermediate: Date / String / Number Helpers (Postgres-flavoured)

DATE_TRUNC('month', created_at)         -- first day of that month
EXTRACT(YEAR FROM created_at)
NOW(), CURRENT_DATE, AGE(birthday)
LOWER(email), TRIM(name), CONCAT(a, ' ', b)
ROUND(price, 2), CEIL(x), FLOOR(x)
COALESCE(nickname, name, 'unknown')     -- first non-NULL

COALESCE is a lifesaver for handling NULLs in dashboards.

Intermediate: CASE — If/Else in SQL

SELECT name,
  CASE
    WHEN age < 18 THEN 'minor'
    WHEN age < 60 THEN 'adult'
    ELSE 'senior'
  END AS age_group,
  CASE WHEN is_paid THEN 'paying' ELSE 'free' END AS plan
FROM users;

Works anywhere expressions are allowed — inside SELECT, WHERE, ORDER BY, even SUM(CASE WHEN ... THEN 1 ELSE 0 END) for conditional counts.

Advanced: DISTINCT vs GROUP BY

Both deduplicate, but they're not interchangeable:

  • SELECT DISTINCT country FROM users — just unique values.
  • SELECT country, COUNT(*) FROM users GROUP BY country — unique values + per-group math.
DISTINCT is shorthand for GROUP BY over every selected column. On big tables, both can be slow without an index.

Advanced: NULL Logic (The 3-Valued Truth)

SQL has three truth values: TRUE / FALSE / UNKNOWN.

NULL = NULL    → UNKNOWN
NULL <> NULL   → UNKNOWN
NULL IS NULL   → TRUE
5 + NULL       → NULL
NULL OR TRUE   → TRUE
NULL AND FALSE → FALSE

This is why WHERE col = NULL always returns 0 rows. Use IS [NOT] NULL and reach for COALESCE to substitute defaults.

Advanced: INSERT / UPDATE / DELETE Safely

INSERT INTO users (name, email) VALUES ('Alice', 'a@t.com');

UPDATE users SET status = 'inactive' WHERE last_login < NOW() - INTERVAL '180 days';

DELETE FROM sessions WHERE expires_at < NOW();

Golden habit: always run SELECT with the same WHERE first. If it returns the rows you intended to change, only then rewrite as UPDATE / DELETE. Wrap in BEGIN; ... COMMIT; so you can ROLLBACK.

Practice Path

1. Pick a table with at least 5 columns (users, orders, anything). Write 5 SELECTs that each use a different clause: WHERE, ORDER BY, LIMIT, GROUP BY, HAVING. 2. Build a "users per country" report and add a HAVING COUNT(*) > 10 filter. 3. Use CASE to bucket users into age groups and count each bucket. 4. Write an UPDATE that flips all expired sessions to status = 'expired', dry-run with SELECT first, run inside a transaction.