Directed acyclic graphs vs parent-child hierarchies

We’ve recently looked at ways to work with parent-child hierarchies, particularly in reporting scenarios. Regular parent-child hierarchies are great when working with dimensions that are ragged, but they have a critical limitation – any given node in the tree can only have a single parent node. A great solution to this problem is a DAGdirected acyclic graph.

Example: a directory structure

Imagine a parent-child structure as a directory structure in a file system. Every directory (except the root directory, of course) has a parent directory. Here’s how the parent-child table would look.

CREATE TABLE dbo.Directories (
    [ID]          int NOT NULL,
    parentID      int NULL,
    [path]        varchar(1000) NULL,
    CONSTRAINT PK_Directories PRIMARY KEY CLUSTERED ([ID]),
    CONSTRAINT FK_Directories_parentID FOREIGN KEY (parentID)
        REFERENCES dbo.Directories ([ID])
);

INSERT INTO dbo.Directories ([ID], parentID, [path])
VALUES (1, NULL, 'MSSQL12.SQL2014\MSSQL'),
       (2,   1, 'MSSQL12.SQL2014\MSSQL\Backup'),
       (3,   1, 'MSSQL12.SQL2014\MSSQL\Binn'),
       (4,   1, 'MSSQL12.SQL2014\MSSQL\DATA'),
       (5,   1, 'MSSQL12.SQL2014\MSSQL\Install'),
       (6,   1, 'MSSQL12.SQL2014\MSSQL\JOBS'),
       (7,   1, 'MSSQL12.SQL2014\MSSQL\Log'),
       (8,   3, 'MSSQL12.SQL2014\MSSQL\Binn\DllTmp32'),
       (9,   3, 'MSSQL12.SQL2014\MSSQL\Binn\DllTmp64'),
       (10,  3, 'MSSQL12.SQL2014\MSSQL\Binn\Resources'),
       (11,  3, 'MSSQL12.SQL2014\MSSQL\Binn\Templates'),
       (12,  3, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp'),
       (13, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1028'),
       (14, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1029'),
       (15, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1030'),
       (16, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1031'),
       (17, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1032'),
       (18, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1033'),
       (19, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1035'),
       (20, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1036'),
       (21, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1038'),
       (22, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1040'),
       (23, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1041'),
       (24, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1042'),
       (25, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1043'),
       (26, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1044'),
       (27, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1045'),
       (28, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1046'),
       (29, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1049'),
       (30, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1053'),
       (31, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\1055'),
       (32, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\2052'),
       (33, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\2070'),
       (34, 10, 'MSSQL12.SQL2014\MSSQL\Binn\Resources\3082'),
       (35, 12, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen'),
       (36, 12, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC'),
       (37, 35, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen\include'),
       (38, 35, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\gen\lib'),
       (39, 36, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\bin'),
       (40, 36, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\lib'),
       (41, 39, 'MSSQL12.SQL2014\MSSQL\Binn\Xtp\VC\bin\1033'),
       (42,  4, 'MSSQL12.SQL2014\MSSQL\DATA\xtp');

Now imagine that you want to add a junction point to your directory structure, otherwise known as a “symbolic link“. A junction point works like a shortcut in Windows, except it’s actually integrated in the file system, not just a “.lnk” file. The result is that a given folder or file can be accessed in two or more locations.

In this example, I want the directory MSSQL\Binn\Xtp\gen\include to be available by the alias MSSQL\include as well as the original directory location. This problem isn’t solvable in the traditional parent-child table we’ve set up here. Sure, you could move the directory or you could create another directory with the same name and the new parent, but the directory cannot actually have two parents.

Here’s where we create a DAG structure. Because a single node (a directory in this case) can have multiple parents in a DAG, we’re going to need two tables – one for the nodes and one that describes the relations between them.

--- Here are all the nodes in the DAG - without their internal relations:
CREATE TABLE dbo.DirectoryNodes (
    [ID]          int NOT NULL,
    [path]        varchar(1000) NULL,
    CONSTRAINT PK_DirectoryNodes PRIMARY KEY CLUSTERED ([ID])
);

INSERT INTO dbo.DirectoryNodes ([ID], [path])
SELECT [ID], [path]
FROM dbo.Directories;

--- Here are the relations between all the nodes in the DAG:
CREATE TABLE dbo.DirectoryRelations (
    parentID      int NOT NULL,
    [ID]          int NOT NULL,
    CONSTRAINT PK_DirectoryRelations PRIMARY KEY CLUSTERED (parentID, [ID]),
    CONSTRAINT FK_DirectoryRelations_ID FOREIGN KEY ([ID])
        REFERENCES dbo.DirectoryNodes ([ID]),
    CONSTRAINT FK_DirectoryRelations_parentID FOREIGN KEY (parentID)
        REFERENCES dbo.DirectoryNodes ([ID])
);

INSERT INTO dbo.DirectoryRelations (parentID, [ID])
SELECT parentID, [ID]
FROM dbo.Directories
WHERE parentID IS NOT NULL;

Now, to add a symbolic link that places MSSQL\Binn\Xtp\gen\include (node ID 37) as a child to the MSSQL directory (node ID 1), all we have to do is add a record to the relation table:

INSERT INTO dbo.DirectoryRelations (parentID, [ID])
VALUES (1, 37);

Challenges with DAGs

DAG structures are inherently a lot more powerful but also more complex than traditional parent-child structures. This introduces a number of challenges that you will have to deal with along the way.

Inheritance

In any hierarchy, you will sometimes want to inherit attributes from one node to its children, grand children, etc. In a DAG, however, having potentially multiple parents (and ancestors, implicitly) means that any given child node can inherit conflicting attributes from different (equally distant) parent nodes. You may need to manage this by constructing a deterministic set of rules that determine which attributes are used in case of such a conflict.

Aggregations

Using parent-child hierarchies as well as DAGs, it’s likely that you will at some point need to accumulate data up the hierarchy. However, because DAGs allow for multiple parents for a single node, the amount from a single node can potentially be included two or more times in the aggregate of a single parent node. To manage this, you can build a DISTINCT lookup table that contains all ancestor relations between nodes (just the parent and child IDs).

The ancestor lookup table is a relatively simple recursive CTE:

WITH tree (parentID, [ID], [level])
AS (--- Each node is an anchor, with level=0:
    SELECT [ID] AS parentID, [ID], 0 AS [level]
    FROM dbo.DirectoryNodes

    UNION ALL

    --- Recursion, one level at a time:
    SELECT tree.parentID, rel.[ID], tree.[level]+1
    FROM tree
    INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID)

--- Return all ancestor relations:
SELECT *
FROM tree;

This query can also be applied to solve the inheritance of attributes in the tree, although it doesn’t provide a solution for potential conflicting attributes on equi-distant ancestors.

To make this ancestor list DISTINCT, simply add a DISTINCT keyword and eliminate the level column from the result set:

WITH tree (parentID, [ID])
AS (--- Each node is an anchor:
    SELECT [ID] AS parentID, [ID]
    FROM dbo.DirectoryNodes

    UNION ALL

    --- Recursion, one level at a time:
    SELECT tree.parentID, rel.[ID]
    FROM tree
    INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID)

--- Return all DISTINCT ancestor relations (which means excluding "level")
SELECT DISTINCT parentID, [ID]
FROM tree;

Orphans

Because the nodes and relations are split into two tables, it’s possible to get orphaned nodes – i.e. nodes that have no parent node. Orphaned nodes come in two types: nodes that don’t have an immediate parent and implicitly orphaned nodes where the node itself may have a parent but an ancestor of the node is an orphan. The easiest way to manage orphans is to treat them as root nodes (as we do in the example above).

The following simple query will return any orphans (or root nodes) in your tree:

SELECT *
FROM dbo.DirectoryNodes
WHERE [ID] NOT IN (
    SELECT [ID]
    FROM dbo.DirectoryRelations);

Circular references

By definition, an directed acyclic graph does may contain any cyclic references, known otherwise as circular references. A circular reference is when a node is its own ancestor. For instance, if node A is the parent of node B, which is the parent of node C, which is the parent of node A. Finding such circular references can be a real challenge. This can be accomplished with a recursive common table expression.

WITH tree (parentID, [ID])
AS (--- Each relation is an anchor:
    SELECT parentID, [ID]
    FROM dbo.DirectoryRelations

    UNION ALL

    --- Recursion, one level at a time:
    SELECT tree.parentID, rel.[ID]
    FROM tree
    INNER JOIN dbo.DirectoryRelations AS rel ON tree.[ID]=rel.parentID
    --- Stop if a circular reference has been found!
    WHERE tree.parentID!=tree.[ID])

--- Return only circular references:
SELECT *
FROM tree
WHERE tree.parentID=tree.[ID];

Take circular references very seriously. They will kill your queries dead or, even worse, fill your tempdb and thereby potentially crash your server (if you’ve disabled the maximum number of recursions).

More properties

There’s more that can be done on the DAG setup in this tutorial. You may want to order the nodes in the tree – this is probably best done in the relation table, because it allows you to order the nodes differently under each parent node. Also, because a node can have multiple parents, you may want a flag to designate which parent relation that counts as the “primary” one, from the child node’s perspective.

Credits

This post was inspired by the clever metadata model used by Hypergene. Do check out their brilliant BI product.

1 comment

Leave a comment

Your email address will not be published. Required fields are marked *