PSA: Don't use UNIONs in SQL queries
The title is hyperbolic, but you should definitely use them less. I'll tell you why. Imagine we have a database schema that represents a toy store:
id (INTEGER PK) | name (TEXT) |
---|---|
1 | Olive |
2 | James |
3 | Tyler |
4 | Lydia |
id (INTEGER PK) | name (TEXT) | price (DECIMAL) |
---|---|---|
1 | Baby Doll | 10.00 |
2 | Fingerboard | 3.00 |
3 | Robot (Laser Eyes) | 5.00 |
4 | Guitar | 100.00 |
id (INTEGER PK) | customer_id (INTEGER FK) | product_id (INTEGER FK) |
---|---|---|
1 | 1 | 2 |
2 | 1 | 3 |
3 | 2 | 4 |
4 | 3 | 3 |
So given this schema, how would you determine the number of products each customer has bought using a SQL query? Here's one method:
SELECT
customers.name AS CUSTOMER,
COUNT(*) AS ITEMS_BOUGHT
FROM customers
INNER JOIN transactions
ON transactions.customer_id = customers.id
GROUP BY customers.name;
CUSTOMER | ITEMS_BOUGHT |
---|---|
Olive | 2 |
James | 1 |
Tyler | 1 |
Does this give us a correct answer? No it doesn't. We want to get the number of products each customer has bought, but this query doesn't tell us how many products Lydia has bought! Here's a SQL query that fixes that problem by using a UNION to join the zero and non-zero sets together:
(
SELECT
customers.name AS CUSTOMER,
COUNT(*) AS ITEMS_BOUGHT
FROM customers
INNER JOIN transactions
ON transactions.customer_id = customers.id
GROUP BY customers.name
) UNION (
SELECT
customers.name,
0 AS ITEMS_BOUGHT
FROM customers
WHERE NOT EXISTS (
SELECT 1 FROM transactions
WHERE transactions.customer_id = customers.id
)
)
CUSTOMER | ITEMS_BOUGHT |
---|---|
Olive | 2 |
James | 1 |
Tyler | 1 |
Lydia | 0 |
Yay, it works! But there's a problem. When you use a UNION to join a zero set and non-zero set you end up duplicating a lot of code. As a result, anytime you change the query you have to update the other side of the union to make sure the condition is inverted correctly. The query ends up maintaining this shape:
(
-- QUERY
) UNION (
-- INVERTED QUERY
)
If the top or bottom part of the query gets out of sync you end up with a broken query that returns incorrect results. So how can we make this better? Well here's what I think is the best method:
SELECT
customers.name AS CUSTOMER,
COUNT(transactions.id) AS ITEMS_BOUGHT
FROM customers
LEFT JOIN transactions
ON transactions.customer_id = customers.id
GROUP BY customers.name;
CUSTOMER | ITEMS_BOUGHT |
---|---|
Olive | 2 |
James | 1 |
Tyler | 1 |
Lydia | 0 |
This method takes advantage of a not-so-obvious feature of the COUNT aggregate function. COUNT, when used with an asterisk (like this `COUNT(*)`) counts every row in a group created by a GROUP BY. When COUNT is used on an expression (like this `COUNT(transaction.id)`) then it only counts non-NULL values. We can exploit this functionality to avoid performing UNION's and instead keep our queries short.
Okay but what's the catch? There is no catch. This feature is available in every single widely used SQL database.
Happy querying :)