Optimizing LINQ Queries In C#.NET For MS SQL Server

Introduction

LINQ was added to .NET as a new powerful data manipulation language. LINQ to SQL allows you to talk with DBMS quite conveniently by using Entity Framework, for example. But often when using it developers forget to consider what kind of SQL query will be generated by the queryable provider (Entity Framework in our example case). In this article, we’ll look at how exactly you can optimize the performance of LINQ queries.

Implementation

Let’s examine the two most important points by using an example.

First, we’ll need to create the Test database in SQL Server. In this database, we’ll create two tables by running the following query.s

USE [TEST]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Ref](
[ID] [int] NOT NULL,
[ID2] [int] NOT NULL,
[Name] [nvarchar](255) NOT NULL,
[InsertUTCDate] [datetime] NOT NULL,
CONSTRAINT [PK_Ref] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Ref] ADD CONSTRAINT [DF_Ref_InsertUTCDate] DEFAULT (getutcdate()) FOR [InsertUTCDate]
GO
USE [TEST]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Customer](
[ID] [int] NOT NULL,
[Name] [nvarchar](255) NOT NULL,
[Ref_ID] [int] NOT NULL,
[InsertUTCDate] [datetime] NOT NULL,
[Ref_ID2] [int] NOT NULL,
CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Customer] ADD CONSTRAINT [DF_Customer_Ref_ID] DEFAULT ((0)) FOR [Ref_ID]
GO
ALTER TABLE [dbo].[Customer] ADD CONSTRAINT [DF_Customer_InsertUTCDate] DEFAULT (getutcdate()) FOR [InsertUTCDate]
GO

Now, let’s fill the Ref table with the help of the following script. Wait... I just ran a script, but didn’t save it. In such cases, SQL Complete by Devart can be very convenient. It integrates with SSMS and Visual Studio and has the Execution History feature,

Execution history

Fig 1. Execution History feature

This functionality displays the history of queries that were executed in SSMS,

Functionality display

Fig.2. History of queries

Take note of how the window consists of the following elements.

  1. Search box for filtering the results
  2. Date range box for filtering the results
  3. Results are presented in a table. You can sort data by this table’s columns (by using the SHIFT key, you can choose a set of columns for sorting)
  4. Code of the selected row

The result table contains a history of scripts executed in SSMS and comprises the following tables.

  1. Status: Shows whether the script was executed successfully
  2. Query Text: code of the script
  3. Size (Bytes): Size of the script in bytes
  4. Executed On: Date and time at which the script was executed
  5. Duration: How long it took for the script to be executed
  6. File: Name of a file or a tab in SSMS followed by the name of the SQL Server instance on which the script was executed
  7. Server: Name of the SQL Server instance on which the script was executed
  8. User: The login under which the script was executed
  9. Database: The database context in which the script was executed

We can find the required query in this history table.

USE [TEST]
GO
DECLARE @ind INT = 1;
WHILE (@ind < 1200000)
BEGIN
    INSERT INTO [dbo].[Ref]
           ([ID]
           ,[ID2]
           ,[Name])
    SELECT @ind,
           @ind,
           CAST(@ind AS NVARCHAR(255));
    SET @ind = @ind + 1;
END
GO

In a similar way, we can populate the Customer table by using the following script

USE [TEST]
GO

DECLARE @ind INT = 1;
DECLARE @ind_ref INT = 1;

WHILE (@ind <= 12000000)
BEGIN
    IF (@ind % 3 = 0)
        SET @ind_ref = 1;
    ELSE IF (@ind % 5 = 0)
        SET @ind_ref = 2;
    ELSE IF (@ind % 7 = 0)
        SET @ind_ref = 3;
    ELSE IF (@ind % 11 = 0)
        SET @ind_ref = 4;
    ELSE IF (@ind % 13 = 0)
        SET @ind_ref = 5;
    ELSE IF (@ind % 17 = 0)
        SET @ind_ref = 6;
    ELSE IF (@ind % 19 = 0)
        SET @ind_ref = 7;
    ELSE IF (@ind % 23 = 0)
        SET @ind_ref = 8;
    ELSE IF (@ind % 29 = 0)
        SET @ind_ref = 9;
    ELSE IF (@ind % 31 = 0)
        SET @ind_ref = 10;
    ELSE IF (@ind % 37 = 0)
        SET @ind_ref = 11;
    ELSE
        SET @ind_ref = @ind % 1190000;

    INSERT INTO [dbo].[Customer]
                ([ID]
                ,[Name]
                ,[Ref_ID]
                ,[Ref_ID2])
    SELECT @ind,
           CAST(@ind AS NVARCHAR(255)),
           @ind_ref,
           @ind_ref;

    SET @ind = @ind + 1;
END
GO

The SQL Complete tool can help with keeping the code of your scripts neatly formatted.

In this way, we created two tables – one of them has over 1 million rows, and the other has more than 10 million.

Now we need to create a test project in Visual Studio. This one will be a Visual C# Console App (.NET Framework):

Next, we’ll need to add a library for the Entity Framework so that we can interact with the database.

To add this library, right-click the project and choose ‘Manage NuGet Packages...’ in the context menu,

Manage Nuget

Fig.3. ‘Manage NuGet Packages...’ in the context menu

In the window that will be opened, let’s enter ‘Entity Framework’ in the search box, choose the Entity Framework package, and install it:

Fig.4. Entity Framework package installation

Next, in the App.config file, we’ll add the following block after the configSections element.

<connectionStrings>
    <add name="DBConnection" connectionString="data source=MSSQL_INSTANCE_NAME;Initial Catalog=TEST;Integrated Security=True;" providerName="System.Data.SqlClient" />
</connectionStrings>

Make sure the connection string is entered in connectionString.

Now, let’s create 3 interfaces in separate files,

IBaseEntityID

namespace TestLINQ
{
    public interface IBaseEntityID
    {
        int ID { get; set; }
    }
}

IBaseEntityName

namespace TestLINQ
{
public interface IBaseEntityName
{
string Name { get; set; }
}
}

IBaseNameInsertUTCDate

namespace TestLINQ
{
    public interface IBaseNameInsertUTCDate
    {
        DateTime InsertUTCDate { get; set; }
    }
}

In a separate file, create a basic class BaseEntity for our two entities, which will contain their common fields.

namespace TestLINQ
{
    public class BaseEntity : IBaseEntityID, IBaseEntityName, IBaseNameInsertUTCDate
    {
        public int ID { get; set; }
        public string Name { get; set; }
        public DateTime InsertUTCDate { get; set; }
    }
}

Next, we’ll create our two entities, each one in a separate file,

Ref

using System.ComponentModel.DataAnnotations.Schema;

namespace TestLINQ
{
    [Table("Ref")]
    public class Ref : BaseEntity
    {
        public int ID2 { get; set; }
    }
}

Customer

using System.ComponentModel.DataAnnotations.Schema;
namespace TestLINQ
{
    [Table("Customer")]
    public class Customer : BaseEntity
    {
        public int Ref_ID { get; set; }
        public int Ref_ID2 { get; set; }
    }
}

Finally, let’s create a context UserContext in a separate file.

using System.Data.Entity;

namespace TestLINQ
{
    public class UserContext : DbContext
    {
        public UserContext()
            : base("DbConnection")
        {
            Database.SetInitializer<UserContext>(null);
        }
        public DbSet<Customer> Customer { get; set; }
        public DbSet<Ref> Ref { get; set; }
    }
}

In this way, we get a solution for performing optimization tests with the help of LINQ to SQL through Entity Framework for MS SQL Server:

Package

Fig.5. Solution for optimization tests

Now, let’s enter the following code in the Program.cs file.

using System;
using System.Collections.Generic;
using System.Linq;
namespace TestLINQ
{
    class Program
    {
        static void Main(string[] args)
        {
            using (UserContext db = new UserContext())
            {
                var dblog = new List<string>();
                db.Database.Log = dblog.Add;

                var query = from e1 in db.Customer
                            from e2 in db.Ref
                            where (e1.Ref_ID == e2.ID)
                                && (e1.Ref_ID2 == e2.ID2)
                            select new { Data1 = e1.Name, Data2 = e2.Name };
                var result = query.Take(1000).ToList();
                Console.WriteLine(dblog[1]);
                Console.ReadKey();
            }
        }
    }
}

When we run the project, this is the output we’ll see in the console as a result

SELECT TOP (1000)
    [Extent1].[Ref_ID] AS [Ref_ID],
    [Extent1].[Name] AS [Name],
    [Extent2].[Name] AS [Name1]
FROM 
    [dbo].[Customer] AS [Extent1]
INNER JOIN 
    [dbo].[Ref] AS [Extent2] ON ([Extent1].[Ref_ID] = [Extent2].[ID]) AND ([Extent1].[Ref_ID2] = [Extent2].[ID2])

As you can see, a LINQ query has efficiently generated an SQL query to the MS SQL Server DBMS.

Now, let’s change the AND condition to OR in the LINQ query.

var query = from e1 in db.Customer
            from e2 in db.Ref
            where (e1.Ref_ID == e2.ID)
                || (e1.Ref_ID2 == e2.ID2)
            select new { Data1 = e1.Name, Data2 = e2.Name };

Execute the application again.

An exception will be thrown. From the description of the error, we’ll see that the operation has timed out after 30 seconds,

Exception unhandled

Fig 6. Error description

This is the query that LINQ has created.

Query

Fig 7. LINQ query

We can see that the selection is performed through a cartesian product of two sets(tables)

SELECT TOP (1000)
    [Extent1].[Ref_ID] AS [Ref_ID],
    [Extent1].[Name] AS [Name],
    [Extent2].[Name] AS [Name1]
FROM 
    [dbo].[Customer] AS [Extent1]
CROSS JOIN 
    [dbo].[Ref] AS [Extent2]
WHERE 
    [Extent1].[Ref_ID] = [Extent2].[ID] OR [Extent1].[Ref_ID2] = [Extent2].[ID2]

Let’s rewrite the LINQ query like this.

var query = (
                from e1 in db.Customer
                join e2 in db.Ref
                on e1.Ref_ID equals e2.ID
                select new { Data1 = e1.Name, Data2 = e2.Name }
            )
            .Union(
                from e1 in db.Customer
                join e2 in db.Ref
                on e1.Ref_ID2 equals e2.ID2
                select new { Data1 = e1.Name, Data2 = e2.Name }
            );

This is the SQL query we’ll get as a result

SELECT
    [Limit1].[C1] AS [C1],
    [Limit1].[C2] AS [C2],
    [Limit1].[C3] AS [C3]
FROM
    (SELECT DISTINCT TOP (1000)
        [UnionAll1].[C1] AS [C1],
        [UnionAll1].[Name] AS [C2],
        [UnionAll1].[Name1] AS [C3]
    FROM
        (SELECT
            1 AS [C1],
            [Extent1].[Name] AS [Name],
            [Extent2].[Name] AS [Name1]
        FROM
            [dbo].[Customer] AS [Extent1]
        INNER JOIN
            [dbo].[Ref] AS [Extent2] ON [Extent1].[Ref_ID] = [Extent2].[ID]
        UNION ALL
        SELECT
            1 AS [C1],
            [Extent3].[Name] AS [Name],
            [Extent4].[Name] AS [Name1]
        FROM
            [dbo].[Customer] AS [Extent3]
        INNER JOIN
            [dbo].[Ref] AS [Extent4] ON [Extent3].[Ref_ID2] = [Extent4].[ID2]) AS [UnionAll1]
    ) AS [Limit1]

Unfortunately, there can only be one connection condition in one LINQ query, so we can achieve the results we need by creating one query for each of the two conditions and then combining them by using Union to delete repeating lines.

Yes, queries will be non-equivalent in most cases, considering that full row duplicates can be returned. However, in real life duplicate rows are not needed and usually they’re something you’ll want to get rid of.

Now let’s compare the execution plans of these two queries:

Average execution time for CROSS JOIN is 195s,

Execution time

Fig. 8. Execution time for CROSS JOIN

Average execution time for INNER JOIN-UNION is less than 24s

Average execution

Fig. 9. Execution time for INNER JOIN-UNION

As we can see from the results, the optimized LINQ query works several times faster than the unoptimized one in these two tables with millions of records.

For the version with the AND condition, a LINQ query will look like this

var query = from e1 in db.Customer
            from e2 in db.Ref
            where (e1.Ref_ID == e2.ID)
               && (e1.Ref_ID2 == e2.ID2)
            select new { Data1 = e1.Name, Data2 = e2.Name };

Almost always in this case, a correct SQL query will be generated, with the execution time of approximately 24 seconds:

Execution second

Fig 10. Correct SQL query

Also, for LINQ to Objects operations, instead of a query that looks like this,

var query = from e1 in seq1
            from e2 in seq2
            where (e1.Key1 == e2.Key1)
               && (e1.Key2 == e2.Key2)
            select new { Data1 = e1.Data, Data2 = e2.Data };

we can use a query similar to the following,

var query = from e1 in seq1
            join e2 in seq2
            on new { e1.Key1, e1.Key2 } equals new { e2.Key1, e2.Key2 }
            select new { Data1 = e1.Data, Data2 = e2.Data };

where

Para[] seq1 = new[] { new Para { Key1 = 1, Key2 = 2, Data = "777" }, 
                      new Para { Key1 = 2, Key2 = 3, Data = "888" }, 
                      new Para { Key1 = 3, Key2 = 4, Data = "999" } };
Para[] seq2 = new[] { new Para { Key1 = 1, Key2 = 2, Data = "777" }, 
                      new Para { Key1 = 2, Key2 = 3, Data = "888" }, 
                      new Para { Key1 = 3, Key2 = 5, Data = "999" } };

The Para type is defined in the following way.

class Para
{
    public int Key1, Key2;
    public string Data;
}

Conclusion

We looked at some aspects of LINQ query optimization for MS SQL Server. Also, SQL Complete helped us a lot with searching through the query history and with formatting the scripts we used in the course of this article.

Unfortunately, even experienced .NET developers often forget that it’s necessary to understand what the instructions they use do in the background. Otherwise, they can become configurators and set a figurative time bomb in the future – both when the solution is scaled and when the environment’s external conditions are slightly changed.

Source files for the test – the project itself, table creation in the TEST database, and the population of these tables with data can be found here.

Also, the Plans folder from this repository contains execution plans of queries with OR conditions.

Additionally, there is an excellent solution called dotConnect - it’s a line of data access components from Devart for various DBMS. Among others, dotConnect components support such ORM tools as Entity Framework Core and LinqConnect which allows you to work with LINQ to SQL classes.

References

  1. https://habr.com/ru/post/459716/
  2. https://stackoverflow.com/questions/56940996/how-to-optimally-combine-two-collections-in-linq-with-several-conditions-for-ms
  3. https://github.com/jobgemws/TestLINQ