From Rows to Insights: Aggregate Functions

A database that can only hand back individual rows is like a calculator that only displays the numbers you typed in. The real power of SQL appears when you stop asking "show me the rows" and start asking "summarize them for me" — how many orders were placed, what the average spend was, who the top customer is. That is the job of aggregate functions and the GROUP BY clause.

Why Aggregation Exists

Imagine a sales manager at the end of a busy quarter. On her desk sits a stack of ten thousand receipts. Nobody expects her to read each one. What she actually wants is a single sheet that says: total revenue per region, the average order value, the largest single sale, and how many orders each customer placed. She wants summaries, not rows.

Aggregation is exactly this act of collapsing many rows into a single meaningful number. Instead of returning five order rows, an aggregate returns one number that describes all five. The two sample tables below will be used in all examples throughout this article.

customers

customer_id	name	city
1	Alice	London
2	Bob	Paris
3	Carla	Berlin
4	Diego	Madrid

orders

order_id	customer_id	total	status
101	1	89.50	shipped
102	1	42.00	pending
103	2	150.00	shipped
104	3	27.75	cancelled
105	5	60.00	shipped

The Five Core Aggregate Functions

Five functions handle the vast majority of everyday summarizing. Each takes a set of values and returns exactly one value.

COUNT — and the critical COUNT(*) vs COUNT(column) difference

COUNT tells you how many rows exist. But there are two distinct flavors, and confusing them is one of the most common beginner errors.

-- Counts every row, regardless of NULLs SELECT COUNT(*) FROM orders; -- result: 5 -- Counts only rows where the column is NOT NULL SELECT COUNT(status) FROM orders; -- result: 5 (no NULLs here)

COUNT(*) counts rows. COUNT(column) counts only the rows where that specific column is not NULL. In our table both return 5 because no values are missing — but the moment a column contains NULLs, the two diverge sharply. That difference is the source of countless reporting bugs.

SUM — adding it all up

SELECT SUM(total) FROM orders; -- 89.50 + 42.00 + 150.00 + 27.75 + 60.00 = 369.25

SUM adds every non-NULL value in a numeric column. It is the most direct way to compute a grand total.

AVG — the arithmetic mean

SELECT AVG(total) FROM orders; -- 73.85

AVG computes the arithmetic mean of all non-NULL values. The phrase "non-null" is the part beginners overlook: AVG does not divide by the number of rows, it divides by the number of rows that actually have a value. We return to why this matters in the NULL section below.

MIN and MAX — the extremes

SELECT MIN(total), MAX(total) FROM orders; -- MIN = 27.75, MAX = 150.00

MIN and MAX return the smallest and largest value respectively. They are not limited to numbers — on text columns they return the alphabetically first and last value, and on dates the earliest and latest. Like all other aggregates, they ignore NULLs.

GROUP BY — Splitting Rows Into Groups

So far every aggregate has collapsed the entire table into one number. That is rarely what you want. The sales manager wants the total per customer, the count per status. GROUP BY splits rows into buckets so an aggregate is computed once per bucket rather than once for the whole table.

Figure 1 — GROUP BY collapses many rows into groups; the aggregate then produces a single value per group.

Basic syntax

SELECT customer_id, SUM(total) AS spent FROM orders GROUP BY customer_id;

This creates one bucket per customer_id and sums the totals within each. Customer 1 returns 131.50 (two orders), customer 2 returns 150.00, customer 3 returns 27.75, and customer 5 returns 60.00.

The golden rule

Every column in your SELECT list that is not wrapped in an aggregate function must appear in the GROUP BY clause. The reason is logical: a group of many rows has only one customer_id (so SQL can show it), but many different total values (so the only way to show "the total" is to aggregate them). The database enforces this rule strictly — leaving a non-aggregated column out of GROUP BY is an error in PostgreSQL, SQL Server, and Redshift.

Grouping by multiple columns

SELECT customer_id, status, COUNT(*) AS n FROM orders GROUP BY customer_id, status;

SQL creates one bucket for each unique combination of the grouped columns. Customer 1 placed one shipped order and one pending order, so they produce two separate rows in the result.

Where GROUP BY sits in the logical execution order

Although we write SELECT first, the database evaluates clauses in this order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. GROUP BY runs well before SELECT. In strict standard SQL this means a column alias defined in SELECT is not yet available when GROUP BY or HAVING are evaluated. In practice, however, several popular engines — including MySQL, PostgreSQL, and BigQuery — extend the standard and do allow you to reference a SELECT alias inside GROUP BY and HAVING as a convenience. SQL Server follows the standard strictly and does not permit this. When writing portable code, it is safest to repeat the expression rather than rely on the alias.

The scenario where this bites you is grouping by a computed expression — say, bucketing orders by the month they were placed. Compare the two approaches below. The first leans on the SELECT alias month inside GROUP BY; it reads beautifully and works on some engines, but throws an error on stricter ones because the alias does not yet exist when GROUP BY runs. The second repeats the full expression — more verbose, but it runs identically everywhere.

-- ① Relies on the SELECT alias "month" -- Convenient, but rejected by engines that follow the standard strictly SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY month; -- ← uses the alias -- ② Repeats the full expression -- Portable and safe — works regardless of the engine SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY DATE_TRUNC('month', placed_at); -- ← repeats the expression

Both queries return the exact same result. The only difference is portability: reach for option ① when you know your engine allows it and you value readability; reach for option ② when your SQL must run unchanged across different databases. When in doubt, repeat the expression — it never fails.

HAVING — Filtering Groups After Aggregation

Once you have groups, you often want to keep only some of them — say, customers who spent more than 100. You cannot use WHERE for this, because WHERE runs before grouping and has no concept of a group total yet. The tool for filtering groups is HAVING.

WHERE vs HAVING — the essential distinction

WHERE filters individual rows, before grouping. It cannot use aggregate functions.
HAVING filters groups, after aggregation. It can — and usually does — reference aggregate functions.

Figure 2 — WHERE filters individual rows before grouping; HAVING filters groups after aggregation.

A common mistake: putting every filter in HAVING just because it works. Conditions on raw column values belong in WHERE, which runs before the more expensive grouping step. The simple rule: if the condition tests a raw column value, use WHERE; if it tests an aggregate result, use HAVING.

WHERE + GROUP BY + HAVING together

SELECT customer_id, SUM(total) AS spent FROM orders WHERE status <> 'cancelled' -- filter rows first GROUP BY customer_id HAVING SUM(total) > 100; -- then filter groups

WHERE discards order 104 (cancelled); GROUP BY buckets the rest by customer; HAVING keeps only customers whose qualifying total exceeds 100. Customers 1 (131.50) and 2 (150.00) qualify; customer 5 (60.00) does not.

NULL Behavior in Aggregates

Here is a fact worth memorizing: every aggregate function except COUNT(*) ignores NULL values. Microsoft's SQL Server documentation states it plainly: except for COUNT(*), aggregate functions skip null values entirely.

Why does this matter? Suppose three of ten orders have a NULL total because the amount was never recorded. AVG(total) will sum the seven known values and divide by seven, not ten. If you assumed it divided by ten, your average is quietly wrong. Similarly, COUNT(total) reports seven while COUNT(*) reports ten.

The surprising edge case: if you run SUM or AVG over a set that contains only NULLs, the result is not 0 — it is NULL. Only COUNT returns 0 when there are no qualifying rows. If you need missing values treated as zero, convert them first with COALESCE(total, 0) before aggregating.

Combining Aggregates with JOINs

In the real world, the numbers you want to sum live in one table while the labels you want to group by live in another. JOIN-then-aggregate is therefore the most common reporting pattern of all.

SELECT c.name, SUM(o.total) AS total_spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.name;

Result:

name	total_spent
Alice	131.50
Bob	150.00
Carla	27.75

Notice who is missing. Diego (customer 4) placed no orders, so an inner join produces no rows for him and he vanishes. Order 105 also disappears because its customer_id matches nobody. If you wanted Diego to appear with a zero total, you would use a LEFT JOIN from customers and wrap the sum in COALESCE(SUM(o.total), 0).

Useful Extras

COUNT(DISTINCT column)

SELECT COUNT(DISTINCT customer_id) FROM orders; -- 4

Adding DISTINCT inside COUNT counts only unique, non-NULL values. Our five orders were placed by four distinct customers (customer 1 placed two), so the result is 4. This is the standard answer to "how many different customers ordered?"

Grouping by an expression

You are not limited to raw columns — you can group by the result of a function. This is how time-series reports are built.

-- Monthly revenue report (PostgreSQL) SELECT DATE_TRUNC('month', placed_at) AS month, SUM(total) AS revenue FROM orders GROUP BY DATE_TRUNC('month', placed_at) ORDER BY month;

In MySQL or SQL Server you would use YEAR(), MONTH(), or similar functions to achieve the same bucketing by time period.

ROLLUP — subtotals and a grand total in one query

Sometimes you want both the per-group totals and a grand total in a single result. ROLLUP does exactly that: it runs the same GROUP BY query you already know, but automatically adds one extra summary row per grouping level. It is supported in PostgreSQL, MySQL, SQL Server, and Redshift.

-- Without ROLLUP: one row per customer SELECT c.name, SUM(o.total) AS spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.name; -- With ROLLUP: same rows + an extra grand-total row SELECT c.name, SUM(o.total) AS spent FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY ROLLUP(c.name);

The second query produces the following result. The extra row at the bottom, where name is NULL, is the grand total automatically added by ROLLUP:

name	spent	note
Alice	131.50	subtotal for Alice
Bob	150.00	subtotal for Bob
Carla	27.75	subtotal for Carla
NULL	309.25	← grand total, added by ROLLUP

The NULL in the name column is not a missing value — it is ROLLUP's marker for "this row summarises all groups above." You can replace it with a label using COALESCE(c.name, 'Grand Total') for a cleaner report.

Putting It All Together

Here is a single query that exercises every clause covered in this article. The goal: among non-cancelled orders, find customers whose qualifying spend exceeds 100, ranked highest to lowest, returning only the top result.

SELECT c.name, SUM(o.total) AS spent, COUNT(*) AS order_count FROM customers c JOIN orders o ON o.customer_id = c.customer_id WHERE o.status <> 'cancelled' GROUP BY c.name HAVING SUM(o.total) > 100 ORDER BY spent DESC LIMIT 1;

Traced in logical execution order:

FROM & JOIN — assemble matched customer–order pairs. Order 105 has no matching customer and drops out.
WHERE — discard the cancelled order 104, leaving rows for Alice and Bob.
GROUP BY — bucket survivors by customer name: Alice (two orders, 131.50 total) and Bob (one order, 150.00).
HAVING — keep only groups whose sum exceeds 100; both qualify.
SELECT — compute the display columns: name, spent, order count.
ORDER BY — sort by spent descending: Bob (150.00) before Alice (131.50).
LIMIT — return only the first row: Bob, 150.00, 1 order.

Trace that sequence whenever a grouped query produces an unexpected result. Nearly every "why did I get that?" question dissolves once you walk the clauses in their true execution order rather than the written order.

Main References

PostgreSQL Global Development Group. PostgreSQL Documentation: Aggregate Functions.
https://www.postgresql.org/docs/current/functions-aggregate.html
PostgreSQL Global Development Group. PostgreSQL Documentation: GROUP BY and HAVING Clauses.
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUP
Oracle Corporation. MySQL 8.0 Reference Manual: GROUP BY Modifiers.
https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.html
Microsoft. Aggregate Functions (Transact-SQL) — SQL Server Documentation.
https://learn.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql
Amazon Web Services. Amazon Redshift Developer Guide: Aggregate Functions.
https://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html

From Rows to Insights: Aggregate Functions

Why Aggregation Exists

customers

orders

The Five Core Aggregate Functions

COUNT — and the critical COUNT(*) vs COUNT(column) difference

SUM — adding it all up

AVG — the arithmetic mean

MIN and MAX — the extremes

GROUP BY — Splitting Rows Into Groups

Basic syntax

The golden rule

Grouping by multiple columns

Where GROUP BY sits in the logical execution order

HAVING — Filtering Groups After Aggregation

WHERE vs HAVING — the essential distinction

WHERE + GROUP BY + HAVING together

NULL Behavior in Aggregates

Combining Aggregates with JOINs

Useful Extras

COUNT(DISTINCT column)

Grouping by an expression

ROLLUP — subtotals and a grand total in one query

Putting It All Together

Main References

SQL from Scratch: History, Theory, and Why It Remains the Language of Data

The SELECT Statement: Reading Data from a Relational Database