A database that can only hand back individual rows is like a calculator that only displays the numbers you typed in. The real power of SQL appears when you stop asking "show me the rows" and start asking "summarize them for me" — how many orders were placed, what the average spend was, who the top customer is. That is the job of aggregate functions and the GROUP BY clause.

Why Aggregation Exists

Imagine a sales manager at the end of a busy quarter. On her desk sits a stack of ten thousand receipts. Nobody expects her to read each one. What she actually wants is a single sheet that says: total revenue per region, the average order value, the largest single sale, and how many orders each customer placed. She wants summaries, not rows.

Aggregation is exactly this act of collapsing many rows into a single meaningful number. Instead of returning five order rows, an aggregate returns one number that describes all five. The two sample tables below will be used in all examples throughout this article.

customers

customer_idnamecity
1AliceLondon
2BobParis
3CarlaBerlin
4DiegoMadrid

orders

order_idcustomer_idtotalstatus
101189.50shipped
102142.00pending
1032150.00shipped
104327.75cancelled
105560.00shipped

The Five Core Aggregate Functions

Five functions handle the vast majority of everyday summarizing. Each takes a set of values and returns exactly one value.

COUNT — and the critical COUNT(*) vs COUNT(column) difference

COUNT tells you how many rows exist. But there are two distinct flavors, and confusing them is one of the most common beginner errors.

-- Counts every row, regardless of NULLs SELECT COUNT(*) FROM orders; -- result: 5 -- Counts only rows where the column is NOT NULL SELECT COUNT(status) FROM orders; -- result: 5 (no NULLs here)

COUNT(*) counts rows. COUNT(column) counts only the rows where that specific column is not NULL. In our table both return 5 because no values are missing — but the moment a column contains NULLs, the two diverge sharply. That difference is the source of countless reporting bugs.

SUM — adding it all up

SELECT SUM(total) FROM orders; -- 89.50 + 42.00 + 150.00 + 27.75 + 60.00 = 369.25

SUM adds every non-NULL value in a numeric column. It is the most direct way to compute a grand total.

AVG — the arithmetic mean

SELECT AVG(total) FROM orders; -- 73.85

AVG computes the arithmetic mean of all non-NULL values. The phrase "non-null" is the part beginners overlook: AVG does not divide by the number of rows, it divides by the number of rows that actually have a value. We return to why this matters in the NULL section below.

MIN and MAX — the extremes

SELECT MIN(total), MAX(total) FROM orders; -- MIN = 27.75, MAX = 150.00

MIN and MAX return the smallest and largest value respectively. They are not limited to numbers — on text columns they return the alphabetically first and last value, and on dates the earliest and latest. Like all other aggregates, they ignore NULLs.

GROUP BY — Splitting Rows Into Groups

So far every aggregate has collapsed the entire table into one number. That is rarely what you want. The sales manager wants the total per customer, the count per status. GROUP BY splits rows into buckets so an aggregate is computed once per bucket rather than once for the whole table.

Input rows Alice 89.50 Alice 42.00 Bob 150.00 Carla 27.75 Grouped Alice 89.50 42.00 Bob 150.00 Carla 27.75 SUM(total) Alice 131.50 Bob 150.00 Carla 27.75 1. Rows arrive → 2. GROUP BY buckets them by name → 3. SUM computes one value per bucket
Figure 1 — GROUP BY collapses many rows into groups; the aggregate then produces a single value per group.

Basic syntax

SELECT customer_id, SUM(total) AS spent FROM orders GROUP BY customer_id;

This creates one bucket per customer_id and sums the totals within each. Customer 1 returns 131.50 (two orders), customer 2 returns 150.00, customer 3 returns 27.75, and customer 5 returns 60.00.

The golden rule

Every column in your SELECT list that is not wrapped in an aggregate function must appear in the GROUP BY clause. The reason is logical: a group of many rows has only one customer_id (so SQL can show it), but many different total values (so the only way to show "the total" is to aggregate them). The database enforces this rule strictly — leaving a non-aggregated column out of GROUP BY is an error in PostgreSQL, SQL Server, and Redshift.

Grouping by multiple columns

SELECT customer_id, status, COUNT(*) AS n FROM orders GROUP BY customer_id, status;

SQL creates one bucket for each unique combination of the grouped columns. Customer 1 placed one shipped order and one pending order, so they produce two separate rows in the result.

Where GROUP BY sits in the logical execution order

Although we write SELECT first, the database evaluates clauses in this order: FROMWHEREGROUP BYHAVINGSELECTORDER BYLIMIT. GROUP BY runs well before SELECT. In strict standard SQL this means a column alias defined in SELECT is not yet available when GROUP BY or HAVING are evaluated. In practice, however, several popular engines — including MySQL, PostgreSQL, and BigQuery — extend the standard and do allow you to reference a SELECT alias inside GROUP BY and HAVING as a convenience. SQL Server follows the standard strictly and does not permit this. When writing portable code, it is safest to repeat the expression rather than rely on the alias.

The scenario where this bites you is grouping by a computed expression — say, bucketing orders by the month they were placed. Compare the two approaches below. The first leans on the SELECT alias month inside GROUP BY; it reads beautifully and works on some engines, but throws an error on stricter ones because the alias does not yet exist when GROUP BY runs. The second repeats the full expression — more verbose, but it runs identically everywhere.

-- ① Relies on the SELECT alias "month" -- Convenient, but rejected by engines that follow the standard strictly SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY month; -- ← uses the alias -- ② Repeats the full expression -- Portable and safe — works regardless of the engine SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY DATE_TRUNC('month', placed_at); -- ← repeats the expression

Both queries return the exact same result. The only difference is portability: reach for option ① when you know your engine allows it and you value readability; reach for option ② when your SQL must run unchanged across different databases. When in doubt, repeat the expression — it never fails.

HAVING — Filtering Groups After Aggregation

Once you have groups, you often want to keep only some of them — say, customers who spent more than 100. You cannot use WHERE for this, because WHERE runs before grouping and has no concept of a group total yet. The tool for filtering groups is HAVING.

WHERE vs HAVING — the essential distinction

  • WHERE filters individual rows, before grouping. It cannot use aggregate functions.
  • HAVING filters groups, after aggregation. It can — and usually does — reference aggregate functions.
ALL ROWS 101 · shipped 102 · pending 103 · shipped 104 · cancelled 105 · shipped 5 rows total WHERE Filters rows before grouping status <> 'cancelled' row 104 removed 4 rows remain GROUP BY Groups rows, runs aggregates Alice 131.50 Bob 150.00 cust.5 60.00 3 groups HAVING Filters groups after aggregation SUM(total) > 100 Alice · Bob ✓ 2 groups remain WHERE — filter rows before grouping Use for raw column values: status, city, total WHERE status = 'shipped' · WHERE total > 50 HAVING — filter groups after grouping Use for aggregate results: SUM, COUNT, AVG HAVING SUM(total) > 100 · HAVING COUNT(*) > 1
Figure 2 — WHERE filters individual rows before grouping; HAVING filters groups after aggregation.
A common mistake: putting every filter in HAVING just because it works. Conditions on raw column values belong in WHERE, which runs before the more expensive grouping step. The simple rule: if the condition tests a raw column value, use WHERE; if it tests an aggregate result, use HAVING.

WHERE + GROUP BY + HAVING together

SELECT customer_id, SUM(total) AS spent FROM orders WHERE status <> 'cancelled' -- filter rows first GROUP BY customer_id HAVING SUM(total) > 100; -- then filter groups

WHERE discards order 104 (cancelled); GROUP BY buckets the rest by customer; HAVING keeps only customers whose qualifying total exceeds 100. Customers 1 (131.50) and 2 (150.00) qualify; customer 5 (60.00) does not.

NULL Behavior in Aggregates

Here is a fact worth memorizing: every aggregate function except COUNT(*) ignores NULL values. Microsoft's SQL Server documentation states it plainly: except for COUNT(*), aggregate functions skip null values entirely.

Why does this matter? Suppose three of ten orders have a NULL total because the amount was never recorded. AVG(total) will sum the seven known values and divide by seven, not ten. If you assumed it divided by ten, your average is quietly wrong. Similarly, COUNT(total) reports seven while COUNT(*) reports ten.

The surprising edge case: if you run SUM or AVG over a set that contains only NULLs, the result is not 0 — it is NULL. Only COUNT returns 0 when there are no qualifying rows. If you need missing values treated as zero, convert them first with COALESCE(total, 0) before aggregating.

Combining Aggregates with JOINs

In the real world, the numbers you want to sum live in one table while the labels you want to group by live in another. JOIN-then-aggregate is therefore the most common reporting pattern of all.

SELECT c.name, SUM(o.total) AS total_spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.name;

Result:

nametotal_spent
Alice131.50
Bob150.00
Carla27.75

Notice who is missing. Diego (customer 4) placed no orders, so an inner join produces no rows for him and he vanishes. Order 105 also disappears because its customer_id matches nobody. If you wanted Diego to appear with a zero total, you would use a LEFT JOIN from customers and wrap the sum in COALESCE(SUM(o.total), 0).

Useful Extras

COUNT(DISTINCT column)

SELECT COUNT(DISTINCT customer_id) FROM orders; -- 4

Adding DISTINCT inside COUNT counts only unique, non-NULL values. Our five orders were placed by four distinct customers (customer 1 placed two), so the result is 4. This is the standard answer to "how many different customers ordered?"

Grouping by an expression

You are not limited to raw columns — you can group by the result of a function. This is how time-series reports are built.

-- Monthly revenue report (PostgreSQL) SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY DATE_TRUNC('month', placed_at) ORDER BY month;

In MySQL or SQL Server you would use YEAR(), MONTH(), or similar functions to achieve the same bucketing by time period.

ROLLUP — subtotals and a grand total in one query

Sometimes you want both the per-group totals and a grand total in a single result. ROLLUP does exactly that: it runs the same GROUP BY query you already know, but automatically adds one extra summary row per grouping level. It is supported in PostgreSQL, MySQL, SQL Server, and Redshift.

-- Without ROLLUP: one row per customer SELECT c.name, SUM(o.total) AS spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.name; -- With ROLLUP: same rows + an extra grand-total row SELECT c.name, SUM(o.total) AS spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY ROLLUP(c.name);

The second query produces the following result. The extra row at the bottom, where name is NULL, is the grand total automatically added by ROLLUP:

namespentnote
Alice131.50subtotal for Alice
Bob150.00subtotal for Bob
Carla27.75subtotal for Carla
NULL309.25← grand total, added by ROLLUP

The NULL in the name column is not a missing value — it is ROLLUP's marker for "this row summarises all groups above." You can replace it with a label using COALESCE(c.name, 'Grand Total') for a cleaner report.

Putting It All Together

Here is a single query that exercises every clause covered in this article. The goal: among non-cancelled orders, find customers whose qualifying spend exceeds 100, ranked highest to lowest, returning only the top result.

SELECT c.name, SUM(o.total) AS spent, COUNT(*) AS order_count FROM customers c JOIN orders o ON o.customer_id = c.customer_id WHERE o.status <> 'cancelled' GROUP BY c.name HAVING SUM(o.total) > 100 ORDER BY spent DESC LIMIT 1;

Traced in logical execution order:

  1. FROM & JOIN — assemble matched customer–order pairs. Order 105 has no matching customer and drops out.
  2. WHERE — discard the cancelled order 104, leaving rows for Alice and Bob.
  3. GROUP BY — bucket survivors by customer name: Alice (two orders, 131.50 total) and Bob (one order, 150.00).
  4. HAVING — keep only groups whose sum exceeds 100; both qualify.
  5. SELECT — compute the display columns: name, spent, order count.
  6. ORDER BY — sort by spent descending: Bob (150.00) before Alice (131.50).
  7. LIMIT — return only the first row: Bob, 150.00, 1 order.

Trace that sequence whenever a grouped query produces an unexpected result. Nearly every "why did I get that?" question dissolves once you walk the clauses in their true execution order rather than the written order.

Main References

  1. PostgreSQL Global Development Group. PostgreSQL Documentation: Aggregate Functions.
    https://www.postgresql.org/docs/current/functions-aggregate.html
  2. PostgreSQL Global Development Group. PostgreSQL Documentation: GROUP BY and HAVING Clauses.
    https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUP
  3. Oracle Corporation. MySQL 8.0 Reference Manual: GROUP BY Modifiers.
    https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.html
  4. Microsoft. Aggregate Functions (Transact-SQL) — SQL Server Documentation.
    https://learn.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql
  5. Amazon Web Services. Amazon Redshift Developer Guide: Aggregate Functions.
    https://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html
← Back to all articles