We’ve recently looked at ways to work with parent-child hierarchies, particularly in reporting scenarios. Regular parent-child hierarchies are great when working with dimensions that are ragged, but they have a critical limitation – any given node in the tree can only have a single parent node. A great solution to this problem is a DAG – directed acyclic graph.
Example: a directory structure
Imagine a parent-child structure as a directory structure in a file system. Every directory (except the root directory, of course) has a parent directory. Here’s how the parent-child table would look.
CREATE TABLE dbo.Directories ( [ID] int NOT NULL, parentID int NULL, [path] varchar(1000) NULL, CONSTRAINT PK_Directories PRIMARY KEY CLUSTERED ([ID]), CONSTRAINT FK_Directories_parentID FOREIGN KEY (parentID) REFERENCES dbo.Directories ([ID]) ); INSERT INTO dbo.Directories ([ID], parentID, [path]) VALUES (1, NULL, 'MSSQL12.SQL2014\MSSQL'), (2, 1, 'MSSQL12.SQL2014\MSSQL\Backup'), (3, 1, 'MSSQL12.SQL2014\MSSQL\Binn'), (4, 1, 'MSSQL12.SQL2014\MSSQL\DATA'), (5, 1, 'MSSQL12.SQL2014\MSSQL\Install'), (6, 1, 'MSSQL12.SQL2014\MSSQL\JOBS'), (7, 1, 'MSSQL12.SQL2014\MSSQL\Log'), (8, 3, 'MSSQL12.SQL2014\MSSQL\Binn\DllTmp32'), (9, 3, 'MSSQL12.SQL2014\MSSQL\Binn\DllTmp64'), (10, 3, 'MSSQL12.SQL2014\MSSQL\Binn\Resources'), (11, 3, 'MSSQL12.SQL2014\MSSQL\Binn\Templates'), (12, 3, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp'), (13, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1028'), (14, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1029'), (15, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1030'), (16, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1031'), (17, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1032'), (18, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1033'), (19, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1035'), (20, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1036'), (21, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1038'), (22, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1040'), (23, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1041'), (24, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1042'), (25, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1043'), (26, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1044'), (27, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1045'), (28, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1046'), (29, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1049'), (30, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1053'), (31, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1055'), (32, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\2052'), (33, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\2070'), (34, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\3082'), (35, 12, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen'), (36, 12, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC'), (37, 35, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen\include'), (38, 35, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen\lib'), (39, 36, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\bin'), (40, 36, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\lib'), (41, 39, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\bin\1033'), (42, 4, 'MSSQL12.SQL2014\MSSQL\DATA\xtp');
Now imagine that you want to add a junction point to your directory structure, otherwise known as a “symbolic link“. A junction point works like a shortcut in Windows, except it’s actually integrated in the file system, not just a “.lnk” file. The result is that a given folder or file can be accessed in two or more locations.
In this example, I want the directory MSSQL\Binn\Xtp\gen\include to be available by the alias MSSQL\include as well as the original directory location. This problem isn’t solvable in the traditional parent-child table we’ve set up here. Sure, you could move the directory or you could create another directory with the same name and the new parent, but the directory cannot actually have two parents.
Here’s where we create a DAG structure. Because a single node (a directory in this case) can have multiple parents in a DAG, we’re going to need two tables – one for the nodes and one that describes the relations between them.
--- Here are all the nodes in the DAG - without their internal relations: CREATE TABLE dbo.DirectoryNodes ( [ID] int NOT NULL, [path] varchar(1000) NULL, CONSTRAINT PK_DirectoryNodes PRIMARY KEY CLUSTERED ([ID]) ); INSERT INTO dbo.DirectoryNodes ([ID], [path]) SELECT [ID], [path] FROM dbo.Directories; --- Here are the relations between all the nodes in the DAG: CREATE TABLE dbo.DirectoryRelations ( parentID int NOT NULL, [ID] int NOT NULL, CONSTRAINT PK_DirectoryRelations PRIMARY KEY CLUSTERED (parentID, [ID]), CONSTRAINT FK_DirectoryRelations_ID FOREIGN KEY ([ID]) REFERENCES dbo.DirectoryNodes ([ID]), CONSTRAINT FK_DirectoryRelations_parentID FOREIGN KEY (parentID) REFERENCES dbo.DirectoryNodes ([ID]) ); INSERT INTO dbo.DirectoryRelations (parentID, [ID]) SELECT parentID, [ID] FROM dbo.Directories WHERE parentID IS NOT NULL;
Now, to add a symbolic link that places MSSQL\Binn\Xtp\gen\include (node ID 37) as a child to the MSSQL directory (node ID 1), all we have to do is add a record to the relation table:
INSERT INTO dbo.DirectoryRelations (parentID, [ID]) VALUES (1, 37);
Challenges with DAGs
DAG structures are inherently a lot more powerful but also more complex than traditional parent-child structures. This introduces a number of challenges that you will have to deal with along the way.
Inheritance
In any hierarchy, you will sometimes want to inherit attributes from one node to its children, grand children, etc. In a DAG, however, having potentially multiple parents (and ancestors, implicitly) means that any given child node can inherit conflicting attributes from different (equally distant) parent nodes. You may need to manage this by constructing a deterministic set of rules that determine which attributes are used in case of such a conflict.
Aggregations
Using parent-child hierarchies as well as DAGs, it’s likely that you will at some point need to accumulate data up the hierarchy. However, because DAGs allow for multiple parents for a single node, the amount from a single node can potentially be included two or more times in the aggregate of a single parent node. To manage this, you can build a DISTINCT lookup table that contains all ancestor relations between nodes (just the parent and child IDs).
The ancestor lookup table is a relatively simple recursive CTE:
WITH tree (parentID, [ID], [level]) AS (--- Each node is an anchor, with level=0: SELECT [ID] AS parentID, [ID], 0 AS [level] FROM dbo.DirectoryNodes UNION ALL --- Recursion, one level at a time: SELECT tree.parentID, rel.[ID], tree.[level]+1 FROM tree INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID) --- Return all ancestor relations: SELECT * FROM tree;
This query can also be applied to solve the inheritance of attributes in the tree, although it doesn’t provide a solution for potential conflicting attributes on equi-distant ancestors.
To make this ancestor list DISTINCT, simply add a DISTINCT keyword and eliminate the level column from the result set:
WITH tree (parentID, [ID]) AS (--- Each node is an anchor: SELECT [ID] AS parentID, [ID] FROM dbo.DirectoryNodes UNION ALL --- Recursion, one level at a time: SELECT tree.parentID, rel.[ID] FROM tree INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID) --- Return all DISTINCT ancestor relations (which means excluding "level") SELECT DISTINCT parentID, [ID] FROM tree;
Orphans
Because the nodes and relations are split into two tables, it’s possible to get orphaned nodes – i.e. nodes that have no parent node. Orphaned nodes come in two types: nodes that don’t have an immediate parent and implicitly orphaned nodes where the node itself may have a parent but an ancestor of the node is an orphan. The easiest way to manage orphans is to treat them as root nodes (as we do in the example above).
The following simple query will return any orphans (or root nodes) in your tree:
SELECT * FROM dbo.DirectoryNodes WHERE [ID] NOT IN ( SELECT [ID] FROM dbo.DirectoryRelations);
Circular references
By definition, an directed acyclic graph does may contain any cyclic references, known otherwise as circular references. A circular reference is when a node is its own ancestor. For instance, if node A is the parent of node B, which is the parent of node C, which is the parent of node A. Finding such circular references can be a real challenge. This can be accomplished with a recursive common table expression.
WITH tree (parentID, [ID]) AS (--- Each relation is an anchor: SELECT parentID, [ID] FROM dbo.DirectoryRelations UNION ALL --- Recursion, one level at a time: SELECT tree.parentID, rel.[ID] FROM tree INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID --- Stop if a circular reference has been found! WHERE tree.parentID!=tree.[ID]) --- Return only circular references: SELECT * FROM tree WHERE tree.parentID=tree.[ID];
Take circular references very seriously. They will kill your queries dead or, even worse, fill your tempdb and thereby potentially crash your server (if you’ve disabled the maximum number of recursions).
More properties
There’s more that can be done on the DAG setup in this tutorial. You may want to order the nodes in the tree – this is probably best done in the relation table, because it allows you to order the nodes differently under each parent node. Also, because a node can have multiple parents, you may want a flag to designate which parent relation that counts as the “primary” one, from the child node’s perspective.
Credits
This post was inspired by the clever metadata model used by Hypergene. Do check out their brilliant BI product.
1 comment