Glenn Yonemitsu - software engineer. yonemitsu@gmail.com

Advanced Grouping with SQL's GROUP BY

When looking at data, a common question we have is "For all of X, what is the Y?". SQL has a great feature to easily query this information with the GROUP BY clause.

In fact there are many ways we can inspect the data with this.

What is a Group?

First let's define what a group in the GROUP BY clause is.

These are relevant only when using aggregate functions. Basically these are functions that give a scalar value for a set of rows. Without the GROUP BY you get the value for the entire table. You might have done this before:

SELECT  COUNT(*)
FROM    my_table

This gets the count across the entire table. When you use GROUP BY like so:

SELECT   department, COUNT(*)
FROM     my_table
GROUP BY department

The query calculates the COUNT for each department.

Some other common aggregate functions are:

With that in mind let's look at the different ways we can use GROUP BY.

For the rest of this post, we'll work with a couple of tables for a basketball team.

players

player_id player_name position
1 Player A G
2 Player B F
3 Player C C

game_stats

game_id player_id opponent points assists game_date
101 1 Team X 30 9 2024-10-25
101 2 Team X 22 12 2024-10-25
102 1 Team Y 25 7 2024-10-27

Using Multiple Aggregates: AVG() and SUM()

What if we want more than one summary statistic for a group? We can calculate several at once. Let's find the average and total points scored for each position.

SELECT      p.position,
            AVG(gs.points) AS avg_points,
            SUM(gs.points) AS total_points
FROM        game_stats AS gs
JOIN        players AS p
  ON        gs.player_id = p.player_id
GROUP BY    p.position

Here, we group all game stats by the player's position. With this we can understand if the guards are scoring more points than the centers, for example.

Finding a Range with MIN() and MAX()

We can also use GROUP BY to find the range of values within our groups. Let's find each player's best and worst scoring game of the season.

SELECT      p.player_id,
            p.player_name,
            MIN(gs.points) AS season_low,
            MAX(gs.points) AS season_high
FROM        game_stats AS gs
JOIN        players AS p
  ON        gs.player_id = p.player_id
GROUP BY    p.player_id

By grouping by the player's name, we can now see their individual scoring range over the season.

Filtering Groups with the HAVING Clause

The WHERE clause is for filtering rows before grouping. But what if we want to filter the groups themselves? For that, we use HAVING. Let's find players who average over 20 points per game.

SELECT      p.player_name,
            AVG(gs.points) AS avg_points
FROM        game_stats AS gs
JOIN        players AS p
  ON        gs.player_id = p.player_id
GROUP BY    p.player_name
HAVING      AVG(gs.points) > 20

With HAVING, we can apply a condition after the aggregation, so only the groups that meet our criteria (average points > 20) are returned.

Counting Unique Values with COUNT(DISTINCT)

Sometimes we need to count unique items, not just rows. How many different teams has each player scored against? Using COUNT(*) would count every game, but with COUNT(DISTINCT ...) we can count only the unique opponents.

SELECT      p.player_name,
            COUNT(DISTINCT gs.opponent) AS unique_opponents
FROM        game_stats AS gs
JOIN        players AS p
  ON        gs.player_id = p.player_id
GROUP BY    p.player_name

This is a common pattern we'll need for analytics where counting unique entities is more important than counting total events.

Grouping by Multiple Columns

We aren't limited to a single column in our GROUP BY clause. We can group by multiple columns to create more detailed summaries. Let's break down the average points scored first by position, and then by player.

SELECT      p.position,
            p.player_name,
            AVG(gs.points) AS avg_points
FROM        game_stats AS gs
JOIN        players AS p
  ON        gs.player_id = p.player_id
GROUP BY    p.position,
            p.player_name
ORDER BY    p.position,
            avg_points DESC

This creates subgroups, which lets us see how individual players stack up within their positional group.

References