EF Core - SELECT DISTINCT on a Few Columns Only

Konstantin Triger
5y
57.6k
0
3

Article

Introduction

In some cases, we want to ensure uniqueness only on a subset of row columns.

There are many different solutions proposed, some of them are simple and others are less. I didn't find a generic one. So let's state the problem first, then analyze it and define a generic solution.

Problem

DISTINCT is not a function, we cannot "apply" it on a subset of columns. Some folks try to substitute it with GROUPBY. But GROUPBY requires that all the SELECTed columns will be either grouped by or aggregated. Let's assume we use a FIRST/MIN/MAX aggregate. In this case, all the columns that are not grouped by will have the same values. I can hardly believe this is desired.

Analysis

Let's understand better why the original DISTINCT approach does not work. We get too many rows since all those not important columns are returned with their multiple values, therefore we have duplicates on the important columns. Fortunately, there is an OVER clause, which comes to solve this exact problem!

Solution

The OVER clause lets the first partition (group) the rows and only then apply the aggregate function. Assuming we want to SELECT fields A, B, C, D and want that all the A-B pairs will be distinct. So we probably want to partition all the rows to groups of unique A-B permutations and within each group peek a random C (or maybe the first C ordered by D (or even C)). For example:

Assuming we have a Promotions entity:

public partial class Promotions
{
public int PromotionId { get; set; }
public string PromotionName { get; set; }
public decimal? Discount { get; set; }
public DateTime StartDate { get; set; }
public DateTime ExpiredDate { get; set; }
}

We want to partition (group) our promotions by StartDate-ExpiredDate pairs and then return those with the highest Discount:

DbContext.Promotions.Query((Promotions promo) => {
// partition by StartDate-ExpiredDate and order by Discount
var window = PARTITION(BY(promo.StartDate), BY(promo.ExpiredDate)).
ORDER(BY(promo.Discount).DESC);
var r = SELECT(DISTINCT<Promotions>(promo.StartDate.@as(), promo.ExpiredDate.@as(),
// take the FIRST_VALUE which corresponds to the row with the highest discount
AggregateBy(FIRST_VALUE(promo.Discount)).OVER(window).@as(promo.Discount),
AggregateBy(FIRST_VALUE(promo.PromotionId)).OVER(window).@as(promo.PromotionId),
AggregateBy(FIRST_VALUE(promo.PromotionName)).OVER(window).
@as(promo.PromotionName)));
FROM(promo);
return r;
});

Performance Note

In case we don't care which row is returned in a partition, it's better to ORDER BY PK because usually rows are already ordered by it. Also needless to say that it's much faster to PARTITION BY indexed columns.

Summary

With this technique, we can require the uniqueness of an arbitrary column set and return the "right" row in case some group has multiple rows. A working example is hosted on GitHub in the DistinctOn class or online.