Combination Of Id-ParentId And HierarchyId Approaches To Hierarchical Data

Vadim Loboda
7y
33.7k
0
2

Article

Introduction to Hierarchical and alphabetical sorting

To sort the data hierarchically and alphabetically means sorting a tree using depth-first search algorithm and selecting nodes with the same parent in the alphabetical order of their names.

The hierarchical sorting along with the level of nesting is a good way to emulate an expanded tree or a table of a book's contents using just a flat list of objects without the necessity to build a tree of objects.

Just fetch the records, sort them hierachically and alphabetically, indent each record n times ( where n is the level of nesting), and you'll get the expanded tree for UI and reports.

Two approaches

There are several approaches to structure a relational database to store the hierarchical data. The one, which is being used more frequently, is the Id-ParentId approach.

The Id-ParentId approach has a few drawbacks that make it necessary to seek alternative methods of storing hierarchical data.

The most difficult and expensive tasks in Id-ParentId approach are those which require recursive methods. For example -

find all descendants,
find all ancestors,
calculate the level of nesting,
sort flat records hierarchically and then alphabetically.

As a response to these difficulties, Microsoft added the HierarchyId data type to SQL Server since version 2008.

The HierarchyId data type,

helps find descendants and ancestors without recursion,
has fast GetLevel() method,
contains the information about a node position in hierarchy,
contains the information about a node order within its parent children.

If you want to build a clustered index on HierarchyId, then all the node-records will physically be stored in the order in which they are most frequently fetched.

But the HierarchyId data type still has drawbacks, which are the trade-offs for its useful properties,

Impossibility to use HierarchyId as a foreign key to a parent, i.e. self-referencing; (actially a foreign key to a parent can be substituted with PERSISTED REFERENCES as described in this article, but all the pros and cons of such a solution are waiting for their researchers)
Difficult ways to generate HierarchyId for added and changed records to keep required order, SQL Server does not do this automatically, it is up to programmer to manage it.
Difficult use of the HierarchyId as a primary and foreign key: it may regularly change,
It has no corresponding type in JavaScript (for example).

So what to choose: Id-ParentId or HierarchyId?

An acceptable solution to this dilemma of choice may be the combination of two approaches: Id-ParentId and HierarchyId.

This combination will allow us to,

Use Id as an integer, ever increasing and unchangeable primary key,
Use ParentId as a self-referencing foreign key,
Use HierrarchyId to keep the records physically sorted in hierarchical and alphabetical order.

This article describes the way to unite the both approaches and ensure their normal operation.

Some Facts about HierarchyId

Column and field names in this article and source code,

Hid — column of HierarchyId data type
HidPath — string column for human-readable path of HierarchyId
HidCode — a binary presentation of HierarchyId converted to a string and without leading 0x.

To human-readable format

The HierarchyId data type is a materialized path encoded into binary format. It can be decoded to human-readable string path with ToString() method,

select Hid, Hid.ToString() as HidPath, Hid.GetLevel() as Level from Folders order by Hid;
-- Note: ToString() and GetLevel() are case sensitive!!!

Results----------------------------------- Hid HidPath Level ---------- ------------- -------- 0x / 0
0x58 /1/ 1
0x5AC0 /1/1/ 2
0x5AD6 /1/1/1/ 3
0x5AD6B0 /1/1/1/1/ 4
0x5AD6B580 /1/1/1/1/1/ 5
0x5AD6B680 /1/1/1/1/2/ 5
0x5AD6B780 /1/1/1/1/3/ 5
0x5AD6D0 /1/1/1/2/ 4
0x5AD6D580 /1/1/1/2/1/ 5
0x5AD6D680 /1/1/1/2/2/ 5
0x5AD6D780 /1/1/1/2/3/ 5
...

From human-readable format

The opposite operation from string path to binary format works also.

declare @Hid hierarchyid = '/1/1/1/1/3/';
select @Hid

Results
------------
0x5AD6B780

The above is the implicit conversion. HierarchyId has the explicit Parse() method, although a call format is a little strange.

declare @Hid hierarchyid = hierarchyid::Parse('/1/1/1/1/3/');
-- Parse() is case sensitive!!!
select @Hid

Results
------------
0x5AD6B780

Tricky numbering system of paths

The interesting thing about the HierarchyId is the way how it manages the numeration of a node inserted between two other nodes.

To get a Hid for a new node, HierarchyId uses GetDescendant() method from parent Hid and uses two Hids of nodes between which the new has to be inserted.

declare @Parent hierarchyid = 0x;
print @Parent.GetDescendant('/1/', '/2/').ToString() -- /1.1/
print @Parent.GetDescendant('/1/', '/1.1/').ToString() -- /1.0/
print @Parent.GetDescendant('/1/', '/1.0/').ToString() -- /1.-1/
print @Parent.GetDescendant('/1/', '/1.-1/').ToString() -- /1.-2/
print @Parent.GetDescendant('/1.1/', '/2/').ToString() -- /1.2/
print @Parent.GetDescendant('/1.2/', '/2/').ToString() -- /1.3/
print @Parent.GetDescendant('/1.3/', '/2/').ToString() -- /1.4/
print @Parent.GetDescendant('/1.3/', '/1.4/').ToString() -- /1.3.1/
print @Parent.GetDescendant('/1.2.3.4.5.6.7.8/', '/1.2.3.4.5.6.7.9/').ToString()
-- /1.2.3.4.5.6.7.8.1/
-- by the way
declare @Hid hierarchyid = '/1.2.3.4.5.6.7.8.1/';
select @Hid; -- 0x63A08A49A85258
declare @Hid hierarchyid = '/-1.-2.-3.-4.-5.-6.-7.-8.-1234567890/';
select @Hid; -- 0x41F8F87A3C1D8E87216D9A81A73A
-- special cases with null
print @Parent.GetDescendant(null, null).ToString() -- /1/
print @Parent.GetDescendant('/1/', null).ToString() -- /2/
print @Parent.GetDescendant(null, '/1/').ToString() -- /0/

Why do we need binary encoding?

If we can build a string path, why do we need this HierarchyId binary encoding?

Could we just sort the hierarchical data by this string path?

Look at this example,

select hierarchyid::Parse('/1/') as Hid, '/1/' as HidPath union all
select hierarchyid::Parse('/2/') , '/2/' union all
select hierarchyid::Parse('/10/') , '/10/'
order by HidPath;

Results
--------------
Hid HidPath
----- --------
0x58 /1/
0xAA /10/
0x68 /2/

The same query but ordered by Hid makes the right sorting:

Results
--------------
Hid HidPath ----- -------- 0x58 /1/ 0x68 /2/ 0xAA /10/

Microsoft uses the sophisticated algorithm of HierarchyId to encode the string path so that it can be used for hierarchical sorting just by the Hid column value.

The Solution

Master and slave

So, let's combine Id-ParentId self referencing approach with HierarchyId.

The Id-ParentId part will be responsible for Id primary key, self-reference ParentId foreign key and be the leading, or master. The HierarchyId part will be a calculated field, or slave.

Of course, Hid as a persistent calculated field is the denormalization. But this is a conscious step and a compromise for the advantages of HierarchyId.

The best place and moment to calculate the HierarchyId and keep it in coordinated state is a stored procedure and while saving the node.

User Catalog for experiments

Let's imagine we have a multi-user application where each of user keeps its own Catalog. This Catalog keeps information in hierarchical Folders.

The table for these Folders might look like this,

create table dbo.Folders
(
UserId int not null ,
Hid hierarchyid not null ,
Id int not null identity,
ParentId int null ,
Name nvarchar(50) not null ,
constraint CU_Folders unique clustered (UserId asc, Hid asc),
constraint PK_Folders primary key nonclustered (Id asc),
constraint FK_Folders_UserId foreign key (UserId) references dbo.Users (Id),
constraint FK_Folders_ParentId foreign key (ParentId) references dbo.Folders (Id)
constraint CH_Folders_ParentId check (Hid = 0x and ParentId is null or Hid <> 0x and ParentId is not null)
);

Note, that the clustered index is build on UserId and Hid, but the primary key is build on Id column.

This allows to keep all the records of the user's folders hierarchically sorted physically. And, at the same time, use the integer primary key as a foreign key for another tables and DTOs.

It is well known, that the hierarchy built on HierarchyId may have one root only.

Because the clustered key is the combination of UserId and Hid columns, a table dbo.Folders may have one root for each user.

The root node is mandatory, every user's catalog must have one, so the root is mostly a system record that is not edited by users.

Once the root node is a system record, every other nodes must have a parent and ParentId must be not null.

The best place to make calculation

As Microsoft states it here: "It is up to the application to generate and assign HierarchyId values in such a way that the desired relationship between rows is reflected in the values."

The best way to create and keep integrity and consistency of hierarchical data in our case is always use the same stored procedure to insert and update a Folder node.

A user-defined table type that represents saving Folder may serve as a parameter to the SaveFolder stored procedure,

create type dbo.FolderTableType as table
(
Id int not null primary key clustered,
ParentId int not null ,
Name nvarchar(50) not null
);

And here is the SaveFolder stored procedure itself,

create procedure dbo.SaveFolder
@Folder dbo.FolderTableType readonly
as
begin
set nocount on;
begin -- variable declaration
declare
@ParamId int ,
@ParentId int ,
@UserId int ,
@ParentHid hierarchyid ,
@StartTranCount int ,
@OldHid hierarchyid ,
@NewHid hierarchyid ;
declare @FolderIds table (
InsertedId int not null,
OldHid hierarchyid null,
NewHid hierarchyid null
);
end;
begin try
set @StartTranCount = @@trancount;
if @StartTranCount = 0
begin
set transaction isolation level serializable;
begin transaction;
end;
begin -- init variables and lock parent for update
select
@ParamId = Id,
@ParentId = ParentId
from
@Folder;
select
@UserId = UserId,
@ParentHid = Hid
from
dbo.Folders
where
Id = @ParentId;
end;
begin -- save the @Folder
merge into dbo.Folders as target
using
(
-- full join in this 'select' allows to see picture as if
-- the change already applied,
-- thus coalesce(t.Id, f.Id) take new value
-- if t.Id exists and old value if not
select
-- LAG and LEAD functions help to find the previous
-- and next Hid values if to sort by Name
Hid = @ParentHid.GetDescendant (
LAG (case when t.Id is null then f.Hid end)
over(order by coalesce(t.Name, f.Name)),
LEAD(case when t.Id is null then f.Hid end)
over(order by coalesce(t.Name, f.Name))
),
Id = coalesce( t.Id , f.Id ),
ParentId = coalesce( t.ParentId , f.ParentId ),
Name = coalesce( t.Name , f.Name )
from
(select * from dbo.Folders where ParentId = @ParentId) f
full join @Folder t on t.Id = f.Id
)
-- for LAG and LEAD functions we need all children of the @ParentId
-- but for merge we need the Folder where source.Id = @ParamId
as source on source.Id = @ParamId and source.Id = target.Id
-- target.UserId = @UserId here just to make sure
-- that we do not reparent a Folder to another user
when matched and target.UserId = @UserId then
update set
Hid = source.Hid ,
Name = source.Name ,
ParentId = source.ParentId
-- source.Id = 0 here just to make sure that we insert a new Folder
-- and not already deleted one
-- not matched Id can be deleted by another user and we can try to add it again
when not matched by target and source.Id = 0 then
insert (
UserId ,
Hid ,
ParentId ,
Name )
values (
@UserId ,
source.Hid ,
source.ParentId ,
source.Name )
output
inserted.Id ,
deleted.Hid ,
inserted.Hid
into @FolderIds (
InsertedId ,
OldHid ,
NewHid );
end;
begin -- reparent children
-- saving folder might change its ParentId
-- in that case Hid of all its children should reflect the change also
select top 1 @OldHid = OldHid, @NewHid = NewHid from @FolderIds;
if @OldHid <> @NewHid
update dbo.Folders set
Hid = Hid.GetReparentedValue(@OldHid, @NewHid)
where
UserId = @UserId
and Hid.IsDescendantOf(@OldHid) = 1;
end;
if @StartTranCount = 0 commit transaction;
end try
begin catch
if xact_state() <> 0 and @StartTranCount = 0 rollback transaction;
declare @ErrorMessage nvarchar(4000) = dbo.GetErrorMessage();
raiserror (@ErrorMessage, 16, 1);
return;
end catch;
end;

If you want to save each Folder by the above stored procedure only, then the Hid column will always have actual and consistent information.

Fetch a Folder with its descendants in hierarchical order

Thus the following stored procedure fetches a Folder with all its descendants and sorts them hierarchically and then alphabetically as if they are an expanded tree,

create procedure dbo.GetFolderWithSubFolders
@FolderId int
as
begin
set nocount on;
select
d.Id ,
d.ParentId ,
d.Name ,
[Level] = d.Hid.GetLevel(),
HidCode = dbo.GetHidCode(d.Hid),
HidPath = d.Hid.ToString()
from
dbo.Folders f
inner join dbo.Folders d on d.UserId = f.UserId and d.Hid.IsDescendantOf(f.Hid) = 1
where
f.Id = @FolderId
order by
d.Hid;
end;

For example

exec dbo.GetFolderWithSubFolders @FolderId = 40

Results ---------------------------------------------------- Id ParentId Name Level HidCode HidPath
----- --------- ----- ------ ---------- ------------
40 13 3A 3 5AD6 /1/1/1/
121 40 4A 4 5AD6B0 /1/1/1/1/
364 121 5A 5 5AD6B580 /1/1/1/1/1/
365 121 5B 5 5AD6B680 /1/1/1/1/2/
366 121 5C 5 5AD6B780 /1/1/1/1/3/
122 40 4B 4 5AD6D0 /1/1/1/2/
367 122 5A 5 5AD6D580 /1/1/1/2/1/
368 122 5B 5 5AD6D680 /1/1/1/2/2/
369 122 5C 5 5AD6D780 /1/1/1/2/3/
123 40 4C 4 5AD6F0 /1/1/1/3/
370 123 5A 5 5AD6F580 /1/1/1/3/1/
371 123 5B 5 5AD6F680 /1/1/1/3/2/
372 123 5C 5 5AD6F780 /1/1/1/3/3/

What is the use of HidCode?

As it stated above, HidCode is a binary presentation of HierarchyId converted to varchar and without leading 0x.

Here is the dbo.GetHidCode scalar SQL function,

create function dbo.GetHidCode
(
@Hid hierarchyid
)
returns varchar(1000)
as
begin
return replace(convert(varchar(1000), cast(@Hid as varbinary(892)), 2), '0x', '');
end;

The HidCode string can be used on a client side to sort a flat list of Folders hierarchically.

Chapter for the perfectionists

The above method of Hid calculation works well, but it has a tiny drawback.

imagine you have two sibling folders
Id ParentId Name Level HidCode HidPath
----- --------- ----- ------ ---------- ------------
364 121 5A 5 5AD6B580 /1/1/1/1/1/
365 121 5B 5 5AD6B680 /1/1/1/1/2/
and you want to insert a new folder
Id ParentId Name
----- --------- -----
0 121 5Aaa
the result will be
Id ParentId Name Level HidCode HidPath
----- --------- ----- ------ ---------- --------------
364 121 5A 5 5AD6B580 /1/1/1/1/1/
1234 121 5Aaa 5 5AD6B62C /1/1/1/1/1.1/
365 121 5B 5 5AD6B680 /1/1/1/1/2/

See this sequence in HidPath: 1 -> 1.1 -> 2 instead of 1 -> 2 -> 3

In a time this ugly sequence may become more uglier.

If you OK with that then just skip this chapter.

But for the others I would suggest one more option for keeping the hierarchical data in actual state.

Here is the stored procedure which recalculates the HierarchyId so the path consists of integer numbers in a continuous sequence.

create procedure dbo.SaveFolderWithHidReculc
@Folder dbo.FolderTableType readonly
as
begin
set nocount on;
begin -- variable declaration
declare
@ParentId int ,
@UserId int ,
@ParentHid hierarchyid ,
@ParentHidStr varchar(1000) ,
@StartTranCount int ,
@OldParentId int ,
@OldParentHid hierarchyid ;
declare @FolderIds table (
InsertedId int not null,
OldParentId int null,
OldParentHid hierarchyid null
);
end;
begin try
set @StartTranCount = @@trancount;
if @StartTranCount = 0
begin
set transaction isolation level serializable;
begin transaction;
end;
begin -- init variables and lock parent for update
select @ParentId = ParentId from @Folder;
select
@UserId = UserId ,
@ParentHid = Hid ,
@ParentHidStr = cast(Hid as varchar(1000))
from
dbo.Folders
where
Id = @ParentId;
end;
begin -- merge calculated hierarchical data with existing folders
merge into dbo.Folders as target
using
(
select
Hid = cast(concat(@ParentHidStr, -1, '/') as varchar(1000)),
Id ,
ParentId ,
Name
from
@Folder
)
as source on source.Id = target.Id
when matched and target.UserId = @UserId then
update set
ParentId = source.ParentId ,
Name = source.Name
when not matched by target and source.Id = 0 then
insert (
UserId ,
Hid ,
ParentId ,
Name )
values (
@UserId ,
source.Hid ,
source.ParentId ,
source.Name )
output
inserted.Id,
deleted.ParentId,
deleted.Hid.GetAncestor(1)
into
@FolderIds (
InsertedId ,
OldParentId ,
OldParentHid );
end
begin -- reculculate SubFolder Hids
select top 1
@OldParentId = OldParentId ,
@OldParentHid = OldParentHid
from
@FolderIds;
exec dbo.ReculcSubFolderHids @UserId, @ParentId, @ParentHid, @OldParentId, @OldParentHid ;
end;
if @StartTranCount = 0 commit transaction;
end try
begin catch
if xact_state() <> 0 and @StartTranCount = 0 rollback transaction;
declare @ErrorMessage nvarchar(4000) = dbo.GetErrorMessage();
raiserror (@ErrorMessage, 16, 1);
return;
end catch;
end;

The trick here is in call of ReculcSubFolderHids stored procedure after the @Folder save,

create procedure dbo.ReculcSubFolderHids
@UserId int ,
@ParentId int ,
@ParentHid hierarchyid ,
@OldParentId int = null,
@OldParentHid hierarchyid = null
as
begin
declare @ParentHidStr varchar(1000) = cast(@ParentHid as varchar(1000));
declare @OldParentHidStr varchar(1000) = cast(@OldParentId as varchar(1000));
with Recursion as
(
select
Id ,
ParentId ,
Name ,
OldHid = cast(Hid as varchar(1000)),
NewHid = cast(
concat(
case when ParentId = @ParentId then @ParentHidStr else @OldParentHidStr end,
row_number() over (order by Name, Id),
'/'
)
as varchar(1000)
)
from
dbo.Folders
where
ParentId in (@ParentId, @OldParentId)
union all
select
Id = f.Id ,
ParentId = f.ParentId ,
Name = f.Name ,
OldHid = cast(f.Hid as varchar(1000)),
NewHid = cast(
concat(
r.NewHid,
row_number() over (partition by f.ParentId order by f.Name, f.Id),
'/'
)
as varchar(1000)
)
from
Recursion r
inner join dbo.Folders f on f.ParentId = r.Id
where
r.OldHid <> r.NewHid
)
update f set
Hid = r.NewHid
from
dbo.Folders f
inner join Recursion r on r.Id = f.Id and f.Hid <> r.NewHid
where
f.UserId = @UserId;
end;

The above stored procedure calculated the paths which consists of integer numbers in a continuous sequence.

So the result of the example in the section beginning with SaveFolderWithHidReculc will be

Id ParentId Name Level HidCode HidPath
----- --------- ----- ------ ---------- ------------
364 121 5A 5 5AD6B580 /1/1/1/1/1/
1234 121 5Aaa 5 5AD6B680 /1/1/1/1/2/
365 121 5B 5 5AD6B780 /1/1/1/1/3/

This method has a drawback also. It works well until the recalculated branch consists of up to about 10,000 nodes, when the time for calculation does not exceed a second. 100,000 nodes' recalculation may take up to 15 seconds or more.

Read a Tree in C#

The implemented Id-ParentId reference allows to build a tree of Folders in C#.

Moreover, once the Folders are sorted hierarchically, the tree can be built for one iteration through the flat list of Folders.

Suppose we have a Folder class in C#,

public class Folder
{
public Int32 Id { get; set; }
public Int32? ParentId { get; set; }
public String Name { get; set; }
public IList<Folder> SubFolders { get; set; }
}

So, the following method builds a tree for one pass through the hierarchically sorted Folder list,

public static IList<Folder> ConvertHierarchicallySortedFolderListToTrees(IEnumerable<Folder> folders)
{
var parentStack = new Stack<Folder>();
var parent = default(Folder);
var prevNode = default(Folder);
var rootNodes = new List<Folder>();
foreach (var folder in folders)
{
if (parent == null || folder.ParentId == null)
{
rootNodes.Add(folder);
parent = folder;
}
else if (folder.ParentId == parent.Id)
{
if (parent.SubFolders == null)
parent.SubFolders = new List<Folder>();
parent.SubFolders.Add(folder);
}
else if (folder.ParentId == prevNode.Id)
{
parentStack.Push(parent);
parent = prevNode;
if (parent.SubFolders == null)
parent.SubFolders = new List<Folder>();
parent.SubFolders.Add(folder);
}
else
{
var parentFound = false;
while(parentStack.Count > 0 && parentFound == false)
{
parent = parentStack.Pop();
if (folder.ParentId != null && folder.ParentId.Value == parent.Id)
{
parent.SubFolders.Add(folder);
parentFound = true;
}
}
if (parentFound == false)
{
rootNodes.Add(folder);
parent = folder;
}
}
prevNode = folder;
}
return rootNodes;
}

The above method works two times faster than the following method for an unsorted Folder list,

public static IList<Folder> ConvertHierarchicallyUnsortedFolderListToTrees(IEnumerable<Folder> folders)
{
var dictionary = folders.ToDictionary(n => n.Id, n => n);
var rootFolders = new List<Folder>();
foreach (var folder in dictionary.Select(item => item.Value))
{
Folder parent;
if (folder.ParentId.HasValue && dictionary.TryGetValue(folder.ParentId.Value, out parent))
{
if (parent.SubFolders == null)
parent.SubFolders = new List<Folder>();
parent.SubFolders.Add(folder);
}
else
{
rootFolders.Add(folder);
}
}
return rootFolders;
}

Both methods can build multiple roots, so the return type is IList<Folder>. That is useful when you need to fetch several bunches detached from the tree or several independent trees.

About source code

The attached archive contains the Hierarchy solution, created in Visual Studio 2015, which consists of two projects,

DB — SSDT project to create database for SQL Server 2016,
Tests — Test project to call the repository methods and see the result.

The solution contains the examples of,

How to fetch a Folder by Id with HidCode, HidPath and full path from the root Folder;
How to fetch immediate SubFolders of a Folder;
How to fetch a Folder with all its descendants as a flat list and as a tree;
How to fetch a Folder with all its parents as a flat list and as a tree;
How to add a new Folder with Hid calculation on the fly;
How to edit a Folder keeping right hierarchical and alphabetical order;
How to reparent a Folder, assign a Folder to another Parent in edit method;
How to delete a Folder with all its descendants;
How to find Folders by Name and build branches from the found Folders to the root as it realized in Visual Studio is Search Solution Explorer.

In order to install the database and run the tests, change the connection string in file Hierarchy.DB.publish.xml and in App.config to yours.

The Database project contains the dbo.GenerateFolders stored procedure which generates hierarchical data for tests. This generation occurs automatically during the database publishing phase.