Counting Distinct Values Across More Than One Column

As data analysts and database administrators, we often encounter situations where we need to perform aggregations across multiple columns. In this post, we’ll explore a common problem: counting distinct values that appear in more than one column.

Problem Statement

Given a table with multiple columns, we want to count the number of distinct values that appear in each combination of two or more columns and calculate the total cost for each project.

For example, consider a table table with columns year, project, staff 1, staff 2, staff 3, and cost. We want to count the distinct staff members across all three staff columns for each project and calculate the total cost for each project.

Current Approach

One common approach is to use a combination of UNION ALL, GROUP BY, and aggregate functions. The query below demonstrates this approach:

SELECT project
    , COUNT(DISTINCT staff) AS num_distinct_staff
    , SUM(cost) AS total_cost
FROM (
    SELECT project, staff_1 AS staff, cost AS cost FROM table UNION ALL
    SELECT project, staff_2 AS staff, NULL AS cost FROM table UNION ALL
    SELECT project, staff_3 AS staff, NULL AS cost FROM table
) AS sub
GROUP BY project

This query uses UNION ALL to combine the rows from each staff column, treats NULL values as distinct identifiers, and then groups by project.

However, this approach has limitations:

It assumes that all columns are equally important and doesn’t account for asymmetric data distribution.
It may double-count identical combinations of values due to the order in which they appear in the UNION ALL statement.
It uses aggregate functions (COUNT(DISTINCT) and SUM) to calculate distinct staff counts and total costs, which can lead to incorrect results if there are NULL values present.

Alternative Approach: Using Table Value Constructors

A more elegant approach is to use a Table Value Constructor (TVC) with the VALUES keyword. This allows us to convert columns to rows and perform aggregations in a more flexible and efficient manner.

Here’s an example query that uses TVCs to achieve our desired result:

-- DDL and sample data population, start
DECLARE @tbl TABLE (project CHAR(1), staff_1 VARCHAR(20), staff_2 VARCHAR(20), staff_3 VARCHAR(20), cost INT);
INSERT @tbl (project, staff_1, staff_2, staff_3, cost) VALUES
('A', NULL,  'Ian',  NULL, 100),
('A', 'Jim', 'Anne', NULL, 200),
('A', 'Anne', 'Jim', 'Peter', 300),
('B', 'Anne', 'Sue', NULL, 400),
('B', NULL,   NULL, 'Dave', 500);
-- DDL and sample data population, end

SELECT t1.project
    , COUNT(DISTINCT t1.staff) AS num_distinct_staff
    , SUM(t1.cost) AS total_cost
FROM @tbl AS t
CROSS APPLY (VALUES
    (project, staff_1, cost),
    (project, staff_2, NULL),
    (project, staff_3, NULL)) AS t1(project, staff, cost)
GROUP BY t1.project;

This query uses a CROSS APPLY to expand the table into three rows per row in @tbl, one for each combination of two or more columns. It then groups by project and calculates distinct staff counts and total costs.

Advantages

The TVC approach offers several advantages over the traditional UNION ALL method:

Improved performance: By treating columns as values, the query can utilize indexing on both the column and value levels.
Reduced risk of double-counting: The order of the VALUES clause doesn’t matter because each row is treated independently.
Flexibility in handling NULL values: We don’t need to treat NULL as a distinct identifier; we can simply ignore it when counting staff members.

Conclusion

Counting distinct values across multiple columns requires careful consideration of data distribution and aggregation strategies. The traditional UNION ALL approach has limitations, but the Table Value Constructor method provides an elegant solution that leverages TVCs to achieve better performance and reduced double-counting risk.

Whether you’re working with existing data or designing a new database schema, understanding how to perform aggregations across multiple columns is essential for extracting meaningful insights from your data.

Last modified on 2023-09-25