Notifications

No notifications

/Phase 2

Subqueries & Set Operations

A subquery is a query inside another query. They let you compose questions: 'find users older than the average', 'show posts by users from India', 'list customers who never ordered'. Add set operations (UNION, INTERSECT, EXCEPT) and you can stitch result sets together like sets of rows.

On this page

Detailed Theory

Three places subqueries live

PositionLooks likeReturns
In SELECTscalar subqueryexactly one value
In FROMderived tablea virtual table
In WHEREpredicate subqueryone or many values, used by IN/EXISTS/comparisons

-- 1. Scalar subquery — returns ONE value
SELECT name, age, (SELECT AVG(age) FROM users) AS company_avg
FROM   users;

-- 2. Derived table — used like any other table SELECT t.country, t.users FROM (SELECT country, COUNT(*) AS users FROM users GROUP BY country) AS t WHERE t.users >= 10;

-- 3. Predicate subquery — feeds an IN / EXISTS / comparison SELECT * FROM users WHERE id IN (SELECT user_id FROM posts WHERE published);

Scalar subquery rules

A scalar subquery must return exactly one row and one column, otherwise it errors at runtime. Wrap with LIMIT 1 or aggregates if uncertain:

SELECT name, (SELECT MAX(amount) FROM orders WHERE user_id = u.id) AS biggest_order
FROM   users u;

Derived tables (FROM-clause subqueries)

Anything you can write as a query, you can wrap in parentheses, alias, and use as a table:

SELECT country, AVG(age_decade) AS avg_decade
FROM (
  SELECT country, FLOOR(age/10)*10 AS age_decade FROM users
) AS u
GROUP BY country;

The alias is mandatory in most engines. CTEs (next topic) are usually a cleaner alternative when the subquery is reused or deeply nested.

IN, NOT IN, EXISTS, NOT EXISTS

When to usePitfall
IN (subquery)small list / set membershipNULL in list → trouble for NOT IN
NOT IN (subquery)row not in setNOT IN with any NULL returns nothing
EXISTS (subquery)"is there at least one match?"none — preferred for correlated checks
NOT EXISTS (subquery)anti-joinnone — NULL-safe

-- All users with at least one post
SELECT * FROM users u
WHERE  EXISTS (SELECT 1 FROM posts p WHERE p.user_id = u.id);

-- All users with NO posts (NULL-safe!) SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM posts p WHERE p.user_id = u.id);

The famous trap:

-- ⚠️ If posts.user_id has any NULL, this returns ZERO rows
SELECT * FROM users
WHERE  id NOT IN (SELECT user_id FROM posts);

Why? x NOT IN (1, 2, NULL) evaluates to x <> 1 AND x <> 2 AND x <> NULL — and x <> NULL is UNKNOWN, never TRUE. The whole AND chain becomes UNKNOWN, the row is dropped. Use NOT EXISTS for safety.

Correlated subqueries

A correlated subquery references a column from the outer query — it re-runs once per outer row:

-- Users older than the average for their country
SELECT name, age, country
FROM   users u
WHERE  age > (SELECT AVG(age)
              FROM   users
              WHERE  country = u.country);

Conceptually expensive (one inner run per outer row), but optimisers often rewrite to a join + aggregate. Still, prefer joins or window functions if the dataset is large.

Subquery vs JOIN — when to choose what

ScenarioBetter with
"Does X exist?"EXISTS (subquery)
"Get matching rows + columns from both sides"JOIN
"Compare against an aggregate of another set"scalar subquery / window
"Combine result sets"UNION / UNION ALL

Set operations

Combine whole result sets that have the same column shape:

OperatorReturnsRemoves duplicates?
UNIONrows in A or Byes (sorts → slow)
UNION ALLrows in A or Bno (fast)
INTERSECTrows in bothyes
EXCEPT (MINUS in Oracle)rows in A not in Byes

-- Active or premium users (deduped)
SELECT id FROM users WHERE is_active
UNION
SELECT id FROM users WHERE is_premium;

-- Same, but keep duplicates (much faster on big tables) SELECT id FROM users WHERE is_active UNION ALL SELECT id FROM users WHERE is_premium;

-- Customers who logged in AND placed an order today SELECT user_id FROM logins WHERE day = CURRENT_DATE INTERSECT SELECT user_id FROM orders WHERE day = CURRENT_DATE;

-- Customers who logged in but did NOT order SELECT user_id FROM logins WHERE day = CURRENT_DATE EXCEPT SELECT user_id FROM orders WHERE day = CURRENT_DATE;

Use UNION ALL whenever you don't need dedup. UNION always sorts internally; UNION ALL is a cheap concatenation.

Rules for set operations

1. Same number of columns in every branch. 2. Compatible types per column position. 3. Column names come from the first branch. 4. ORDER BY / LIMIT belong at the very end, applied to the combined result.

SELECT id, name, 'employee' AS kind FROM employees
UNION ALL
SELECT id, name, 'contractor' AS kind FROM contractors
ORDER BY name;

Quick decision guide

  • "Does any matching row exist?" → EXISTS.
  • "Find rows missing from another table?" → NOT EXISTS or LEFT JOIN … IS NULL.
  • "Compare to an aggregate?" → scalar subquery or window function.
  • "Stack two result sets?" → UNION ALL (or UNION if you need dedup).
  • "Rows in both?" → INTERSECT.
  • "Rows in A only?" → EXCEPT.