PSA: Don't use UNIONs in SQL queries

The title is hyperbolic, but you should definitely use them less. I'll tell you why. Imagine we have a database schema that represents a toy store:

(customers table)
id (INTEGER PK) name (TEXT)
1 Olive
2 James
3 Tyler
4 Lydia
(products table)
id (INTEGER PK) name (TEXT) price (DECIMAL)
1 Baby Doll 10.00
2 Fingerboard 3.00
3 Robot (Laser Eyes) 5.00
4 Guitar 100.00
(transactions table)
id (INTEGER PK) customer_id (INTEGER FK) product_id (INTEGER FK)
1 1 2
2 1 3
3 2 4
4 3 3

So given this schema, how would you determine the number of products each customer has bought using a SQL query? Here's one method:

SELECT
  customers.name AS CUSTOMER,
  COUNT(*) AS ITEMS_BOUGHT
FROM customers
INNER JOIN transactions
  ON transactions.customer_id = customers.id
GROUP BY customers.name;
(query output)
CUSTOMER ITEMS_BOUGHT
Olive 2
James 1
Tyler 1

Does this give us a correct answer? No it doesn't. We want to get the number of products each customer has bought, but this query doesn't tell us how many products Lydia has bought! Here's a SQL query that fixes that problem by using a UNION to join the zero and non-zero sets together:

(
  SELECT
    customers.name AS CUSTOMER,
    COUNT(*) AS ITEMS_BOUGHT
  FROM customers
  INNER JOIN transactions
    ON transactions.customer_id = customers.id
  GROUP BY customers.name
) UNION (
  SELECT
    customers.name,
    0 AS ITEMS_BOUGHT
  FROM customers
  WHERE NOT EXISTS (
    SELECT 1 FROM transactions
    WHERE transactions.customer_id = customers.id
  )
)
(query output)
CUSTOMER ITEMS_BOUGHT
Olive 2
James 1
Tyler 1
Lydia 0

Yay, it works! But there's a problem. When you use a UNION to join a zero set and non-zero set you end up duplicating a lot of code. As a result, anytime you change the query you have to update the other side of the union to make sure the condition is inverted correctly. The query ends up maintaining this shape:

(
  -- QUERY
) UNION (
  -- INVERTED QUERY
)

If the top or bottom part of the query gets out of sync you end up with a broken query that returns incorrect results. So how can we make this better? Well here's what I think is the best method:

SELECT
  customers.name AS CUSTOMER,
  COUNT(transactions.id) AS ITEMS_BOUGHT
FROM customers
LEFT JOIN transactions
  ON transactions.customer_id = customers.id
GROUP BY customers.name;
(query output)
CUSTOMER ITEMS_BOUGHT
Olive 2
James 1
Tyler 1
Lydia 0

This method takes advantage of a not-so-obvious feature of the COUNT aggregate function. COUNT, when used with an asterisk (like this `COUNT(*)`) counts every row in a group created by a GROUP BY. When COUNT is used on an expression (like this `COUNT(transaction.id)`) then it only counts non-NULL values. We can exploit this functionality to avoid performing UNION's and instead keep our queries short.

Okay but what's the catch? There is no catch. This feature is available in every single widely used SQL database.

Happy querying :)