Introduction
In some cases, we want to ensure uniqueness only on a subset of row columns.
There are many different solutions proposed, some of them are simple and others are less. I didn't find a generic one. So let's state the problem first, then analyze it and define a generic solution.
Problem
DISTINCT is not a function, we cannot "apply" it on a subset of columns. Some folks try to substitute it with GROUPBY. But GROUPBY requires that all the SELECTed columns will be either grouped by or aggregated. Let's assume we use a FIRST/MIN/MAX aggregate. In this case, all the columns that are not grouped by will have the same values. I can hardly believe this is desired.
Analysis
Let's understand better why the original DISTINCT approach does not work. We get too many rows since all those not important columns are returned with their multiple values, therefore we have duplicates on the important columns. Fortunately, there is an
OVER clause, which comes to solve this exact problem!
Solution
The OVER clause lets the first partition (group) the rows and only then apply the aggregate function. Assuming we want to SELECT fields A, B, C, D and want that all the A-B pairs will be distinct. So we probably want to partition all the rows to groups of unique A-B permutations and within each group peek a random C (or maybe the first C ordered by D (or even C)). For example:
Assuming we have a Promotions entity:
- public partial class Promotions
- {
- public int PromotionId { get; set; }
- public string PromotionName { get; set; }
- public decimal? Discount { get; set; }
- public DateTime StartDate { get; set; }
- public DateTime ExpiredDate { get; set; }
- }
We want to partition (group) our promotions by StartDate-ExpiredDate pairs and then return those with the highest Discount:
- DbContext.Promotions.Query((Promotions promo) => {
-
- var window = PARTITION(BY(promo.StartDate), BY(promo.ExpiredDate)).
- ORDER(BY(promo.Discount).DESC);
-
- var r = SELECT(DISTINCT<Promotions>(promo.StartDate.@as(), promo.ExpiredDate.@as(),
-
- AggregateBy(FIRST_VALUE(promo.Discount)).OVER(window).@as(promo.Discount),
- AggregateBy(FIRST_VALUE(promo.PromotionId)).OVER(window).@as(promo.PromotionId),
- AggregateBy(FIRST_VALUE(promo.PromotionName)).OVER(window).
- @as(promo.PromotionName)));
- FROM(promo);
- return r;
- });
Performance Note
In case we don't care which row is returned in a partition, it's better to ORDER BY PK because usually rows are already ordered by it. Also needless to say that it's much faster to PARTITION BY indexed columns.
Summary
With this technique, we can require the uniqueness of an arbitrary column set and return the "right" row in case some group has multiple rows. A working example is hosted on
GitHub in the
DistinctOn class or
online.