PostgreSQL GROUP BY & DISTINCT: Aggregate Data & Remove Duplicates

PostgreSQL DISTINCT and GROUP BY: Remove Duplicates & Summarize Data

Sometimes you don’t want individual rows—you want summaries. How many employees are in each department? What’s the average salary per department? Those questions need DISTINCT or GROUP BY.

What You’ll Learn

This guide covers:

DISTINCT for removing duplicate rows
GROUP BY for grouping data
Aggregate functions (COUNT, SUM, AVG, MIN, MAX)
HAVING for filtering groups
Combining GROUP BY with ORDER BY and LIMIT

Quick Review: DISTINCT vs. GROUP BY

They sound similar but serve different purposes:

DISTINCT: Removes duplicate rows from results
GROUP BY: Groups rows and lets you aggregate

DISTINCT: Removing Duplicates

DISTINCT shows each unique value once.

Simple DISTINCT

How many different departments do we have?

SELECT DISTINCT dept_id FROM employees;

Result:

 dept_id
---------
       1
       2
       3
       4
(4 rows)

Without DISTINCT, this returns 30 rows (one per employee). With DISTINCT, you get 4 unique values.

DISTINCT on Multiple Columns

Find unique department-manager combinations:

SELECT DISTINCT dept_id, manager_id FROM departments
ORDER BY dept_id;

Result:

 dept_id | manager_id
---------+------------
       1 |          1
       2 |          7
       3 |         15
       4 |         21
(4 rows)

Each unique (dept_id, manager_id) pair appears once.

DISTINCT with WHERE

Find unique departments where employees earn over $90,000:

SELECT DISTINCT dept_id FROM employees
WHERE salary > 90000
ORDER BY dept_id;

Result:

 dept_id
---------
       2
(1 row)

Only Engineering (department 2) has employees over $90,000.

Performance Note on DISTINCT

DISTINCT scans the entire result set to find duplicates. On large datasets, it can be slow:

-- Slow: processes all rows, then removes duplicates
SELECT DISTINCT dept_id FROM employees;

-- Better: GROUP BY is often faster
SELECT dept_id FROM employees GROUP BY dept_id;

GROUP BY with a proper index can be faster than DISTINCT.

GROUP BY: Grouping and Aggregating Data

GROUP BY groups rows by one or more columns, allowing you to aggregate them.

COUNT with GROUP BY

How many employees are in each department?

SELECT dept_id, COUNT(*) FROM employees
GROUP BY dept_id
ORDER BY dept_id;

Breaking it down:

COUNT(*) — Count the rows in each group
GROUP BY dept_id — Group by department

Result:

 dept_id | count
---------+-------
       1 |     6
       2 |     8
       3 |     6
       4 |    10
(4 rows)

6 sales employees, 8 engineering, 6 marketing, 10 operations.

SUM with GROUP BY

Total salary per department:

SELECT dept_id, SUM(salary) as total_salary FROM employees
GROUP BY dept_id
ORDER BY dept_id;

Result:

 dept_id | total_salary
---------+--------------
       1 |    445000.00
       2 |    764000.00
       3 |    471000.00
       4 |    615000.00
(4 rows)

Engineering has the highest total salary ($764,000).

AVG with GROUP BY

Average salary per department:

SELECT dept_id, ROUND(AVG(salary), 2) as avg_salary FROM employees
GROUP BY dept_id
ORDER BY dept_id;

Result:

 dept_id | avg_salary
---------+------------
       1 |   74083.33
       2 |   95500.00
       3 |   78500.00
       4 |   61500.00
(4 rows)

Engineering employees average $95,500 (the highest).

MIN and MAX with GROUP BY

Find salary range per department:

SELECT dept_id, MIN(salary) as min_salary, MAX(salary) as max_salary,
       MAX(salary) - MIN(salary) as salary_range
FROM employees
GROUP BY dept_id
ORDER BY dept_id;

Result:

 dept_id | min_salary | max_salary | salary_range
---------+------------+------------+--------------
       1 |   70000.00 |   78000.00 |    8000.00
       2 |   90000.00 |  105000.00 |   15000.00
       3 |   75000.00 |   82000.00 |    7000.00
       4 |   64000.00 |   72000.00 |    8000.00
(4 rows)

Engineering has the widest salary range ($15,000).

GROUP BY Multiple Columns

Find count of employees hired per year per department:

SELECT dept_id, EXTRACT(YEAR FROM hire_date) as hire_year, COUNT(*) as count
FROM employees
GROUP BY dept_id, EXTRACT(YEAR FROM hire_date)
ORDER BY dept_id, hire_year;

Result:

 dept_id | hire_year | count
---------+-----------+-------
       1 |      2019 |     1
       1 |      2020 |     1
       1 |      2021 |     2
       1 |      2022 |     2
       2 |      2018 |     1
       2 |      2019 |     1
       2 |      2020 |     1
       2 |      2021 |     3
       2 |      2022 |     2
(9 rows shown)

You can see hiring patterns by department and year.

HAVING: Filtering Groups

WHERE filters individual rows. HAVING filters groups.

-- WHERE filters rows before grouping
-- HAVING filters groups after grouping

HAVING with COUNT

Find departments with more than 6 employees:

SELECT dept_id, COUNT(*) as emp_count FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 6
ORDER BY emp_count DESC;

Result:

 dept_id | emp_count
---------+-----------
       4 |        10
       2 |         8
(2 rows)

Departments 4 (Operations) and 2 (Engineering) have more than 6 employees.

HAVING with AVG

Find departments where average salary exceeds $75,000:

SELECT dept_id, ROUND(AVG(salary), 2) as avg_salary FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 75000
ORDER BY avg_salary DESC;

Result:

 dept_id | avg_salary
---------+------------
       2 |   95500.00
       3 |   78500.00
       1 |   74083.33
(3 rows)

Only Engineering and Marketing meet this threshold.

Combining WHERE and HAVING

Find departments where average salary of highly-paid employees exceeds $95,000:

SELECT dept_id, ROUND(AVG(salary), 2) as avg_salary FROM employees
WHERE salary > 90000
GROUP BY dept_id
HAVING AVG(salary) > 95000
ORDER BY avg_salary DESC;

Breaking it down:

WHERE salary > 90000 — Filter to highly-paid employees only
GROUP BY dept_id — Group remaining employees by department
HAVING AVG(salary) > 95000 — Filter to departments where the average is still over $95k

Result:

 dept_id | avg_salary
---------+------------
       2 |   98500.00
(1 row)

Only Engineering’s highly-paid employees average over $95,000.

Aggregate Functions Reference

Function	Purpose	Example
`COUNT(*)`	Count all rows	`COUNT(*)`
`COUNT(column)`	Count non-NULL values	`COUNT(email)`
`SUM(column)`	Total of values	`SUM(salary)`
`AVG(column)`	Average value	`AVG(salary)`
`MIN(column)`	Minimum value	`MIN(salary)`
`MAX(column)`	Maximum value	`MAX(salary)`
`STRING_AGG(column, delimiter)`	Concatenate strings	`STRING_AGG(first_name, ', ')`
`ARRAY_AGG(column)`	Array of values	`ARRAY_AGG(email)`

STRING_AGG Example

Get comma-separated list of employees per department:

SELECT dept_id, STRING_AGG(first_name, ', ') as employees
FROM employees
GROUP BY dept_id
ORDER BY dept_id;

Result:

 dept_id |                        employees
---------+----------------------------------------------------------
       1 | James, Mary, Robert, Patricia, Michael, Linda
       2 | David, Jennifer, Richard, Barbara, Charles, Susan, Joseph, Jessica
       3 | Thomas, Karen, Christopher, Nancy, Daniel, Lisa
       4 | Matthew, Betty, Anthony, Margaret, Mark, Sandra, Steven, Ashley, Paul
(4 rows)

Common Errors & Solutions

Error: “Column must appear in GROUP BY clause”

Problem:

SELECT dept_id, first_name, COUNT(*) FROM employees GROUP BY dept_id;

You’re selecting first_name but grouping by dept_id. PostgreSQL doesn’t know which first_name to return for each department.

Solution:

Either add first_name to GROUP BY (if meaningful):

SELECT dept_id, first_name, COUNT(*) FROM employees
GROUP BY dept_id, first_name;

Or remove it from SELECT:

SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id;

Error: “WHERE cannot use aggregate functions”

Problem:

SELECT dept_id, COUNT(*) FROM employees
WHERE COUNT(*) > 5
GROUP BY dept_id;

WHERE filters rows before grouping, so COUNT doesn’t exist yet.

Solution:

Use HAVING instead:

SELECT dept_id, COUNT(*) FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 5;

Error: “NULL appears in GROUP BY results”

Problem:

SELECT manager_id, COUNT(*) FROM departments GROUP BY manager_id;

If some managers are NULL, you get a row with manager_id = NULL.

Solution:

This isn’t necessarily wrong—it shows departments without managers. But if you want to exclude them:

SELECT manager_id, COUNT(*) FROM departments
WHERE manager_id IS NOT NULL
GROUP BY manager_id;

Performance Tips

1. GROUP BY is Often Faster Than DISTINCT

-- Slower
SELECT DISTINCT dept_id FROM employees;

-- Faster (if dept_id is indexed)
SELECT dept_id FROM employees GROUP BY dept_id;

The query planner might even internally convert DISTINCT to GROUP BY.

2. Use HAVING After GROUP BY, Not WHERE

-- Correct
SELECT dept_id, COUNT(*) FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 5

-- Wrong (WHERE filters individual rows, not groups)
SELECT dept_id, COUNT(*) FROM employees
WHERE COUNT(*) > 5
GROUP BY dept_id

3. Index Columns You Group By

If you frequently GROUP BY dept_id, create an index:

CREATE INDEX idx_employees_dept_id ON employees(dept_id);

This speeds up grouping.

4. Order Groups Carefully

-- Efficient: orders aggregated results
SELECT dept_id, COUNT(*) as emp_count FROM employees
GROUP BY dept_id
ORDER BY emp_count DESC;

-- Less efficient: orders before aggregating
SELECT * FROM (
    SELECT dept_id, COUNT(*) FROM employees
    GROUP BY dept_id
) ORDER BY dept_id;

The first is clearer and faster.

FAQ

Q: Can I use COUNT(DISTINCT column)? A: Yes! Count unique values:

SELECT COUNT(DISTINCT dept_id) FROM employees;
-- Returns 4 (four unique departments)

Q: What’s the difference between COUNT(*) and COUNT(column)? A: COUNT(*) counts all rows. COUNT(column) counts non-NULL values in that column.

Q: Can I use multiple aggregate functions in one query? A: Absolutely:

SELECT dept_id, COUNT(*), SUM(salary), AVG(salary)
FROM employees
GROUP BY dept_id;

Q: Why does GROUP BY without aggregates behave like DISTINCT? A: GROUP BY collapses rows to unique groups. Even without aggregates, you get unique values. It’s the same as DISTINCT.

Q: Can I use LIMIT with GROUP BY? A: Yes:

SELECT dept_id, COUNT(*) FROM employees
GROUP BY dept_id
LIMIT 2;

This returns the first 2 groups.

Q: What order should clauses appear in? A: SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT

Next Steps

You now understand how to summarize data. Your next learning steps:

LIMIT/OFFSET vs. Pagination Patterns - Paginate large result sets efficiently
PostgreSQL JOINs Complete Guide - Combine data from multiple tables
Subqueries in SELECT - Nest queries for complex logic

GROUP BY and aggregates are fundamental to analytical queries. Master them, and you can answer almost any business question with data!