A database that can only hand back individual rows is like a calculator that only displays the
numbers you typed in. The real power of SQL appears when you stop asking "show me the rows" and
start asking "summarize them for me" — how many orders were placed, what the average spend was,
who the top customer is. That is the job of aggregate functions and the GROUP BY clause.
Why Aggregation Exists
Imagine a sales manager at the end of a busy quarter. On her desk sits a stack of ten thousand receipts. Nobody expects her to read each one. What she actually wants is a single sheet that says: total revenue per region, the average order value, the largest single sale, and how many orders each customer placed. She wants summaries, not rows.
Aggregation is exactly this act of collapsing many rows into a single meaningful number. Instead of returning five order rows, an aggregate returns one number that describes all five. The two sample tables below will be used in all examples throughout this article.
customers
| customer_id | name | city |
|---|---|---|
| 1 | Alice | London |
| 2 | Bob | Paris |
| 3 | Carla | Berlin |
| 4 | Diego | Madrid |
orders
| order_id | customer_id | total | status |
|---|---|---|---|
| 101 | 1 | 89.50 | shipped |
| 102 | 1 | 42.00 | pending |
| 103 | 2 | 150.00 | shipped |
| 104 | 3 | 27.75 | cancelled |
| 105 | 5 | 60.00 | shipped |
The Five Core Aggregate Functions
Five functions handle the vast majority of everyday summarizing. Each takes a set of values and returns exactly one value.
COUNT — and the critical COUNT(*) vs COUNT(column) difference
COUNT tells you how many rows exist. But there are two distinct flavors, and confusing
them is one of the most common beginner errors.
COUNT(*) counts rows. COUNT(column) counts only the rows where that
specific column is not NULL. In our table both return 5 because no values are missing
— but the moment a column contains NULLs, the two diverge sharply. That difference is
the source of countless reporting bugs.
SUM — adding it all up
SUM adds every non-NULL value in a numeric column. It is the most
direct way to compute a grand total.
AVG — the arithmetic mean
AVG computes the arithmetic mean of all non-NULL values. The phrase
"non-null" is the part beginners overlook: AVG does not divide by the number of
rows, it divides by the number of rows that actually have a value. We return to why this matters
in the NULL section below.
MIN and MAX — the extremes
MIN and MAX return the smallest and largest value respectively. They
are not limited to numbers — on text columns they return the alphabetically first and last value,
and on dates the earliest and latest. Like all other aggregates, they ignore NULLs.
GROUP BY — Splitting Rows Into Groups
So far every aggregate has collapsed the entire table into one number. That is rarely
what you want. The sales manager wants the total per customer, the count
per status. GROUP BY splits rows into buckets so an aggregate is computed
once per bucket rather than once for the whole table.
Basic syntax
This creates one bucket per customer_id and sums the totals within each. Customer 1
returns 131.50 (two orders), customer 2 returns 150.00, customer 3 returns 27.75, and customer 5
returns 60.00.
The golden rule
Every column in your SELECT list that is not wrapped in an aggregate function
must appear in the GROUP BY clause. The reason is logical: a group of many rows has
only one customer_id (so SQL can show it), but many different total
values (so the only way to show "the total" is to aggregate them). The database enforces this rule
strictly — leaving a non-aggregated column out of GROUP BY is an error in PostgreSQL,
SQL Server, and Redshift.
Grouping by multiple columns
SQL creates one bucket for each unique combination of the grouped columns. Customer 1 placed one shipped order and one pending order, so they produce two separate rows in the result.
Where GROUP BY sits in the logical execution order
Although we write SELECT first, the database evaluates clauses in this order:
FROM → WHERE → GROUP BY → HAVING →
SELECT → ORDER BY → LIMIT. GROUP BY runs
well before SELECT. In strict standard SQL this means a column alias defined in
SELECT is not yet available when GROUP BY or HAVING are
evaluated. In practice, however, several popular engines — including MySQL,
PostgreSQL, and BigQuery — extend the standard and do allow you
to reference a SELECT alias inside GROUP BY and HAVING as
a convenience. SQL Server follows the standard strictly and does not permit this.
When writing portable code, it is safest to repeat the expression rather than rely on the alias.
The scenario where this bites you is grouping by a computed expression — say, bucketing orders by
the month they were placed. Compare the two approaches below. The first leans on the
SELECT alias month inside GROUP BY; it reads beautifully and
works on some engines, but throws an error on stricter ones because the alias does not yet exist
when GROUP BY runs. The second repeats the full expression — more verbose, but it
runs identically everywhere.
Both queries return the exact same result. The only difference is portability: reach for option ① when you know your engine allows it and you value readability; reach for option ② when your SQL must run unchanged across different databases. When in doubt, repeat the expression — it never fails.
HAVING — Filtering Groups After Aggregation
Once you have groups, you often want to keep only some of them — say, customers who spent more
than 100. You cannot use WHERE for this, because WHERE runs before
grouping and has no concept of a group total yet. The tool for filtering groups is
HAVING.
WHERE vs HAVING — the essential distinction
-
WHEREfilters individual rows, before grouping. It cannot use aggregate functions. -
HAVINGfilters groups, after aggregation. It can — and usually does — reference aggregate functions.
HAVING just because it
works. Conditions on raw column values belong in WHERE, which runs before the more
expensive grouping step. The simple rule: if the condition tests a raw column value, use
WHERE; if it tests an aggregate result, use HAVING.
WHERE + GROUP BY + HAVING together
WHERE discards order 104 (cancelled); GROUP BY buckets the rest by
customer; HAVING keeps only customers whose qualifying total exceeds 100. Customers 1
(131.50) and 2 (150.00) qualify; customer 5 (60.00) does not.
NULL Behavior in Aggregates
Here is a fact worth memorizing: every aggregate function except COUNT(*)
ignores NULL values. Microsoft's SQL Server documentation states it plainly:
except for COUNT(*), aggregate functions skip null values entirely.
Why does this matter? Suppose three of ten orders have a NULL total because the amount
was never recorded. AVG(total) will sum the seven known values and divide by
seven, not ten. If you assumed it divided by ten, your average is quietly wrong. Similarly,
COUNT(total) reports seven while COUNT(*) reports ten.
SUM or AVG over a
set that contains only NULLs, the result is not 0 — it is NULL.
Only COUNT returns 0 when there are no qualifying rows. If you need missing values
treated as zero, convert them first with COALESCE(total, 0) before aggregating.
Combining Aggregates with JOINs
In the real world, the numbers you want to sum live in one table while the labels you want to group by live in another. JOIN-then-aggregate is therefore the most common reporting pattern of all.
Result:
| name | total_spent |
|---|---|
| Alice | 131.50 |
| Bob | 150.00 |
| Carla | 27.75 |
Notice who is missing. Diego (customer 4) placed no orders, so an inner join produces no rows for
him and he vanishes. Order 105 also disappears because its customer_id matches nobody.
If you wanted Diego to appear with a zero total, you would use a LEFT JOIN from
customers and wrap the sum in COALESCE(SUM(o.total), 0).
Useful Extras
COUNT(DISTINCT column)
Adding DISTINCT inside COUNT counts only unique, non-NULL
values. Our five orders were placed by four distinct customers (customer 1 placed two), so the
result is 4. This is the standard answer to "how many different customers ordered?"
Grouping by an expression
You are not limited to raw columns — you can group by the result of a function. This is how time-series reports are built.
In MySQL or SQL Server you would use YEAR(), MONTH(), or similar
functions to achieve the same bucketing by time period.
ROLLUP — subtotals and a grand total in one query
Sometimes you want both the per-group totals and a grand total in a single result.
ROLLUP does exactly that: it runs the same GROUP BY query you already
know, but automatically adds one extra summary row per grouping level. It is supported in
PostgreSQL, MySQL, SQL Server, and Redshift.
The second query produces the following result. The extra row at the bottom, where name is NULL, is the grand total automatically added by ROLLUP:
| name | spent | note |
|---|---|---|
| Alice | 131.50 | subtotal for Alice |
| Bob | 150.00 | subtotal for Bob |
| Carla | 27.75 | subtotal for Carla |
| NULL | 309.25 | ← grand total, added by ROLLUP |
The NULL in the name column is not a missing value — it is
ROLLUP's marker for "this row summarises all groups above." You can replace it
with a label using COALESCE(c.name, 'Grand Total') for a cleaner report.
Putting It All Together
Here is a single query that exercises every clause covered in this article. The goal: among non-cancelled orders, find customers whose qualifying spend exceeds 100, ranked highest to lowest, returning only the top result.
Traced in logical execution order:
- FROM & JOIN — assemble matched customer–order pairs. Order 105 has no matching customer and drops out.
- WHERE — discard the cancelled order 104, leaving rows for Alice and Bob.
- GROUP BY — bucket survivors by customer name: Alice (two orders, 131.50 total) and Bob (one order, 150.00).
- HAVING — keep only groups whose sum exceeds 100; both qualify.
- SELECT — compute the display columns: name, spent, order count.
- ORDER BY — sort by spent descending: Bob (150.00) before Alice (131.50).
- LIMIT — return only the first row: Bob, 150.00, 1 order.
Trace that sequence whenever a grouped query produces an unexpected result. Nearly every "why did I get that?" question dissolves once you walk the clauses in their true execution order rather than the written order.
Main References
-
PostgreSQL Global Development Group. PostgreSQL Documentation: Aggregate Functions.
https://www.postgresql.org/docs/current/functions-aggregate.html -
PostgreSQL Global Development Group. PostgreSQL Documentation: GROUP BY and HAVING Clauses.
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUP -
Oracle Corporation. MySQL 8.0 Reference Manual: GROUP BY Modifiers.
https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.html -
Microsoft. Aggregate Functions (Transact-SQL) — SQL Server Documentation.
https://learn.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql -
Amazon Web Services. Amazon Redshift Developer Guide: Aggregate Functions.
https://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html