Aggregate Functions/Group Functions
Aggregate functions are actually the built-in functions in SQL. They are used for some kind of specific operations, like to compute the average of numbers, the total count of the records, the total sum of the numbers etc. These are also called Group functions because these functions apply on the group of data. For example - if you want to compute the maximum salary among different records, then obviously, there will be more than 1 record from which you’re finding the maximum salary. Similarly, if you want to compute the sum of the salary, you’ll have a collection of salaries where you’ll apply this function to get the output.
An example of Aggregate Functions is given below.
- Sum(), Min(), Max(), Count(), Avg()
- SELECT COUNT(*) AS InvoiceCount
- , SUM(Total) AS TotalAllInvoices
- , AVG(Total) AS AverageTotal
- , MAX(Total) AS MaxInvoices
- , MIN(Total) AS MinInvoices
- FROM Invoice
- WHERE BillingCountry = 'USA'
Let me tell you how this query works. First of all, the SQL Engine extracts the record from the Invoice table on the basis of WHERE clause and then, these above GROUP functions will apply to the extracted group of records. Here, we’re getting the results after applying the group functions on the complete filtered table.
Now, let’s say we want to get the results on some groups of records of a relation (whether it is filtered or non-filtered). So, we use GROUP BY clause in SQL.
- SELECT COUNT(*) as Records
- FROM Invoice
- WHERE Total > 2
- GROUP BY BillingCountry
In this query, we’re getting the total counts of records on the basis of same BillingCountry attribute values. And when we run this query, we’ll get this result.
As you can see, we can’t understand the result just by output values so it is a best practice to also print the attribute on the basis of what we’re getting so as to make the result more meaningful.
- SELECT BillingCountry, COUNT(*) as Records
- FROM Invoice
- WHERE Total > 2
- GROUP BY BillingCountry
And with the help of this query, we’ll get some result.
Now, let’s see one more example.
- SELECT CustomerId
- , AVG(Total)
- FROM Invoice
- GROUP BY CustomerId
Look, we put Non-Aggregated attributes with GROUP BY. Now, when this query runs, it will collect the same number of Customer IDs and then take the average of all the column values of Customer Id 2 in the Invoice table. This is how it works.
Now, let’s suppose we’ve another attribute name in the script.
- SELECT CustomerId
- , InvoiceDate
- , AVG(Total)
- FROM Invoice
- GROUP BY CustomerId
When you run this statement, you’ll see the error in the Messages window stating - ‘Column 'Invoice.InvoiceDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.’
Because Avg() is an aggregate function which applies on the collection and InvoiceDate is a DateTime field whose value is always different, so if we’ve just applied the Group By on CustomerID, it is not enough because we need to cover InvoiceDate as well. So,
- SELECT CustomerId
- , InvoiceDate
- , AVG(Total)
- FROM Invoice
-
- WHERE CustomerId = 2
- GROUP BY CustomerId, InvoiceDate
Look what we get.
Before InvoiceDate, we were getting only 1 record of CustomerId 2 because 2 is the discrete value. And after putting InvoiceDate in the query we’re getting nearly all the records because it is continuous. Only that InvoiceDate record is skipped here when InvoiceDate value matches with other any record of InvoiceDate plus both records contain CustomerId 2. And we’re getting the Average value of all those matching records.
Let’s see one more example.
- SELECT BillingCountry, BillingCity, COUNT(*) as Records
- FROM Invoice
- WHERE Total > 2
- GROUP BY BillingCountry, BillingCity
- ORDER BY BillingCountry
Now, SQL Engine first of all gets the collection of records after applying [FROM and WHERE clause] and then we group the records on the basis of BillingCountry and then it further group the records on the basis of BillingCity which means we’ll get the number of records [COUNT(*) as Records] of each country and of each city [GROUP BY BillingCountry, BillingCity].
Having
Actually, we can’t apply the condition to Aggregate functions with the help of WHERE clause. So, we need Having here.
- SELECT CustomerId
- , AVG(Total)
- FROM Invoice
- GROUP BY CustomerId
- HAVING AVG(Total) > 6
And if we try to do this with WHERE clause,
‘Incorrect syntax near the keyword 'WHERE'.’
This is what we get.
- SELECT [State]
- , City
- , AVG(Total) AS CityAverage
- , SUM(Total) AS CityTotal
- FROM Customer
- JOIN Invoice
- ON Customer.CustomerId = Invoice.CustomerId
- WHERE [State] <> NULL
- GROUP BY [State], City
- HAVING SUM(Total) > 40
- ORDER BY [State], City
You might be wondering why we put State column name in square brackets. It is because sometimes these words are reserved in SQL Server. And if you try to run the query without putting State attribute name in square brackets, you’ll get the error. Because SQL Server became confused about whether it is your table attribute name or its reserved State keyword. So we put it into square brackets.
HAVING & BETWEEN
We already know that we use BETWEEN to retrieve the records within a specific range. So,
- SELECT InvoiceDate
- , AVG(Total) AS DateAverage
- , SUM(Total) AS DateTotal
- FROM Invoice
- GROUP BY InvoiceDate
- HAVING InvoiceDate BETWEEN '2012-01-01' AND '2012-12-31'
- ORDER BY InvoiceDate DESC
Now you might be thinking InvoiceDate is obviously not an aggregate function and it is working. Yes, it also works with non-aggregate but it would be better to use WHERE clause for better performance.
- SELECT InvoiceDate
- , AVG(Total) AS DateAverage
- , SUM(Total) AS DateTotal
- FROM Invoice
- WHERE InvoiceDate BETWEEN '2012-01-01' AND '2012-12-31'
- GROUP BY InvoiceDate
- ORDER BY InvoiceDate DESC
Always put WHERE clause before GROUP BY and HAVING clause after GROUP BY otherwise it will not accept and show an error in the Messages window.
Reporting
Mostly, when we’re doing some kind of reporting in SQL Server and we want to get the results on the basis of collections of records we’ve some functions in SQL Server to summarize the results.
- GROUPING SETS
- ROLLUP
- CUBE
- OVER
These functions are really very important when we’re especially working with Reports.
GROUPING SETS
To understand the GROUPING SETS, let’s create a new table of Employees.
- Create Table Employees
- (
- Id int primary key identity(1, 1),
- Name nvarchar(50),
- Gender nvarchar(10),
- Salary int,
- Country nvarchar(10)
- )
And now, we put the values in this Employees table.
- Insert Into Employees Values ('Usama', 'Male', 5000, 'USA')
- Insert Into Employees Values ('Safwan', 'Male', 4500, 'India')
- Insert Into Employees Values ('Gulraiz', 'Female', 5500, 'USA')
- Insert Into Employees Values ('Ayesha', 'Female', 4000, 'India')
- Insert Into Employees Values ('Anas', 'Male', 3500, 'India')
- Insert Into Employees Values ('Areeha', 'Female', 5000, 'UK')
- Insert Into Employees Values ('Raza', 'Male', 6500, 'UK')
- Insert Into Employees Values ('Eeman', 'Female', 7000, 'USA')
- Insert Into Employees Values ('Faseeh', 'Male', 5500, 'UK')
- Insert Into Employees Values ('Hassan', 'Male', 5000, 'USA')
Now let’s generate the Gender and the Sum of Salaries. So what will we do?
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender
And it will shows us the result.
Now we want to show the total salaries of individual countries as well. So,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender
-
- UNION ALL
-
- SELECT Country, NULL, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country
And it shows us this result,
Now we also want total salaries by Gender. So,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender
-
- UNION ALL
-
- SELECT Country, NULL, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country
-
- UNION ALL
-
- SELECT NULL, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Gender
And now if we watch the results in Output,
Look we’re making some kind of reports with all statistics. Now let’s we get the final total salaries without any kind of classification. So,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender
-
- UNION ALL
-
- SELECT Country, NULL, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country
-
- UNION ALL
-
- SELECT NULL, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Gender
-
- UNION ALL
-
- SELECT NULL, NULL, SUM(Salary) as TotalSalary
- FROM Employees
And the results we get now,
Now with this final query, we have 2 problems.
- The query is getting so huge and complex. And if we need some more kind of reporting feature then we’ll also need some more UNION ALL.
- We’re request again and again to the Employees table which is really very bad.
So the solution to this problem is,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY
- GROUPING SETS (
- (Country, Gender),
- (Country),
- (Gender),
- ()
- )
Now let’s take a look on the results,
Now this is the time to order the results.
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY
- GROUPING SETS (
- (Country, Gender),
- (Country),
- (Gender),
- ()
- )
- ORDER BY GROUPING(Country), GROUPING(Gender)
And now our results are in orders. And the concept of GROUPING SETS is clear.
ROLLUP
Let’s rewind once again GROUP BY clause
- SELECT Country, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country
If we run this statement, looks what we get,
Now let’s use ROLLUP and see what happens.
- SELECT Country, SUM(Salary) as TotalSalary, AVG(Salary) AS AverageSalary
- FROM Employees
- GROUP BY ROLLUP(Country)
And this is the result we get,
It means ROLLUP adds a single row if we’re Grouping one column. And obviously we can also achieve the same results with GROUPING SETS
- SELECT Country, SUM(Salary) as TotalSalary, AVG(Salary) AS AverageSalary
- FROM Employees
- GROUP BY
- GROUPING SETS (
- (Country),
- ()
- )
Yes we can also apply any number of arguments inside ROLLUP()
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY ROLLUP(Country, Gender)
And this query is equal to,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY
- GROUPING SETS (
- (Country, Gender),
- (Country),
- ()
- )
Here we’re grouping the Salary by Country and Gender and it will also display the Subtotal of Country salaries and then the Grand Total.
The classical way of writing this rollup query is,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender WITH ROLLUP
So don’t confuse this.
CUBE
This is used to get the Sum of Salaries grouped by all the combinations of Country and Gender. So,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY CUBE(Country, Gender)
And it will show the result in output window,
And now we need to order the results again,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY CUBE(Country, Gender)
- ORDER BY GROUPING(Country), GROUPING(Gender)
Now you can guess from the results that where we get these results before. Yes it is equivalent to this query.
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY
- GROUPING SETS (
- (Country, Gender),
- (Country),
- (Gender),
- ()
- )
- ORDER BY GROUPING(Country), GROUPING(Gender)
The classical way of this CUBE sql statement is,
- SELECT Country, Gender, SUM(Salary) as TotalSalary
- FROM Employees
- GROUP BY Country, Gender WITH CUBE
And we’re using the latest approach above.
Note
If you’re applying ROLLUP & CUBE on the single column, you won’t see any difference. Both queries will show you the same results. ROLLUP and CUBE are both performance tools actually. If you want to display the data hierarchically then you should use ROLLUP and if you want to display all the combinations of data then you should use CUBE.
For example,
Suppose we want to get the total population of a country, state and city. So if we use ROLLUP here, it will calculate the population of country, state and city first and then it comes to country and state and then it calculates the population of country and then sum all the grand total population. On the other hand, if we’re working with CUBE. It will calculate all the combinations of data and then sum it like,
- Country, State, City
- Country, State
- Country, City
- Country
- State, City
- State
- City
- (ALL) GrandTotal
So the conclusion is it all depends upon your data. If you have hierarchical data (Country > State > City or Department > Manager > Salesman) then obviously you’ll use the hierarchical results in most of the cases. And for hierarchical data we need ROLLUP. And if we have non-hierarchical data like (City, Gender, Nationality) then we’ll use non-hierarchical results and for this we’ll use CUBE.
OVER
Over clause actually allows us to use the aggregate functions without GROUP BY.
- SELECT Name, Salary, Gender,
- GenderTotal = COUNT(Gender) OVER (),
- Average = AVG(Salary) OVER (),
- MinimumSalary = MIN(Salary) OVER (),
- MaxmimumSalary = MAX(Salary) OVER ()
- FROM Employees
Now we want to get all these aggregate functions results for each Employee with respect to its Gender. Like this:
- SELECT Name, Salary, Gender,
- GenderTotal = COUNT(Gender) OVER (PARTITION BY Gender),
- Average = AVG(Salary) OVER (PARTITION BY Gender),
- MinimumSalary = MIN(Salary) OVER (PARTITION BY Gender),
- MaxmimumSalary = MAX(Salary) OVER (PARTITION BY Gender)
- FROM Employees