Automatically guessing foreign key constraints

With good naming and datatyping conventions, an automated script can help you with the process of creating foreign key constraints across your database, or actually, suggest table relations where you’ve forgotten to implement them.

However, the whole idea of automating foreign key constraint creation hinges on a few critical factors, which is why I’ve used the word “guessing” in the title of this post.

The exact definition of foreign key constraints comes down to domain knowledge of your database – sometimes, what looks like a foreign key candidate really isn’t one.
Fake NULLs: Do you use “n/a”, “0”, “-1” or “1900-01-01” when you actually mean NULL?
Legacy code is full of exceptions to good modelling practices. Outside the english-speaking world, for instance, we frequently see a mix of languages, depending on what developer was working on this particular piece of code.
This script won’t guess relations where the column name and datatype don’t match. For instance, there’s no magic way to match ManagerID to EmployeeID unless you try every row of every column in the database.

With those things aside, let’s take a look at creating a script.

DMVs we’ll need

SQL Server comes with a huge number of DMVs (dynamic management views, nothing to do with your driver’s licence). The dynamic management views we’ll be looking at today reflect the metadata of the database, i.e. tables, columns, datatypes, indexes and primary keys. Here are the main players:

sys.objects and sys.tables

The sys.objects DMV contains a list of each database object, including tables, views, stored procedures, sequences, primary keys, and so on. The type column describes what type of object it is, “U” for instance being a user table, which is what we’ll be looking for today’s purpose. sys.tables has the same columns as sys.objects, but only contains user table objects and contains a few extra columns that are specific only to tables.

sys.schemas

Like it says on the tin, this DMV contains a list of database schemas. Connects to sys.objects and sys.tables via the schema_id column.

sys.columns

A list of the columns in each table, view, table value function, etc. The primary key here is (object_id, column_id) and it connects to sys.objects using the object_id column.

sys.types

Contains datatypes and joins to sys.columns via the user_type_id column. We won’t actually be needing this, but I’m mentioning it just for good measure.

sys.indexes and sys.index_columns

sys.indexes contains indexes, primary keys and unique constraints (which, apart from the SQL syntax, are all the same physical thing under the hood). The is_unique column tells us if this is a unique index or constraint, which is interesting when we’re looking for primary keys. The primary key of sys.indexes is (object_id, index_id). We’re also going to be looking at the kind of index we’re dealing with: filtering indexes on type IN (1, 2) gets us just the clustered and non-clustered ones, not heaps, columnstore, spatial, xml indexes, etc.

sys.index_columns contains the keys of each index. It connects to sys.indexes using (object_id, index_id) and its primary key is (object_id, index_id, column_id), where (object_id, column_id) refers to sys.columns. The key_ordinal column is the “ordering number” of the index key column, where 1 is the first column in the index, 2 is the second, and so on. Note that key_ordinal=0 means that this column is only INCLUDEd, not actually indexed.

Putting it together

Here’s how they all fit together. First off, all the schemas, tables and columns:

SELECT s.[name] AS [Schema],
       t.[name] AS [Table],
       c.column_id,
       c.[name] AS [Column],
       dt.[name] AS Datatype
FROM sys.schemas AS s
INNER JOIN sys.tables AS t ON s.[schema_id]=t.[schema_id]
INNER JOIN sys.columns AS c ON t.[object_id]=c.[object_id]
INNER JOIN sys.types AS dt ON c.user_type_id=dt.user_type_id
ORDER BY s.[name], t.[name], c.column_id;

And here are all the indexes along with their key columns:

SELECT s.[name] AS [Schema],
       t.[name] AS [Table],
       i.index_id,
       i.[type_desc],
       i.[name] AS [Index],

       --- String concatenation trick
       SUBSTRING(CAST(
       (SELECT ', '+c.[name]
        FROM sys.index_columns AS ic
        INNER JOIN sys.columns AS c ON ic.[object_id]=c.[object_id] AND ic.column_id=c.column_id
        WHERE ic.[object_id]=i.[object_id] AND ic.index_id=i.index_id AND ic.key_ordinal>0
        ORDER BY ic.key_ordinal
        FOR XML PATH(''), TYPE) AS varchar(4000)), 3, 4000) AS [Columns],

       SUBSTRING(CAST(
       (SELECT ', '+c.[name]
        FROM sys.index_columns AS ic
        INNER JOIN sys.columns AS c ON ic.[object_id]=c.[object_id] AND ic.column_id=c.column_id
        WHERE ic.[object_id]=i.[object_id] AND ic.index_id=i.index_id AND ic.key_ordinal=0
        FOR XML PATH(''), TYPE) AS varchar(4000)), 3, 4000) AS [Include]

FROM sys.schemas AS s
INNER JOIN sys.tables AS t ON s.[schema_id]=t.[schema_id]
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id] AND i.[type] IN (1, 2)
ORDER BY s.[name], t.[name], i.index_id;

Identifying referenced tables

In a foreign key constraint, the two tables involved are the referencing and the referenced. The referencing table is the one to which you’ve attached the constraint, while the referenced table is the one to which the foreign key references. In this example,

ALTER TABLE dbo.Cars ADD
    CONSTRAINT FK_Cars_Color (ColorID)
    REFERENCES dbo.Colors (ColorID);

… the table dbo.Cars is the referencing table and dbo.Colors is the referenced table. You can tell because we reference dbo.Colors by its primary/unique key (ColorID), whereas ColorID is (hopefully) not the primary key of dbo.Cars.

Let’s create a table of possible referenceable tables, i.e. of all the unique indexes in the database:

DECLARE @referenced TABLE (
    [object_id]        int NOT NULL,
    index_id        int NOT NULL,
    column_name        sysname NOT NULL,
    user_type_id    int NOT NULL,
    PRIMARY KEY CLUSTERED ([object_id], index_id, column_name)
);

INSERT INTO @referenced ([object_id], index_id, column_name, user_type_id)
SELECT t.[object_id], i.index_id, c.[name] AS column_name, c.user_type_id
FROM sys.tables AS t
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id]
INNER JOIN sys.index_columns AS ic ON i.[object_id]=ic.[object_id] AND i.index_id=ic.index_id
INNER JOIN sys.columns AS c ON ic.[object_id]=c.[object_id] AND ic.column_id=c.column_id
WHERE c.is_nullable=0 AND i.[type] IN (1, 2) AND i.is_unique=1 AND ic.key_ordinal>0;

Rather than saving the fully qualified schema and table names, we’ll just save the object_id, which uniquely identifies every object in the database.

At first glance, finding referencing objects sounds simple enough: just match the column names and datatypes, right?

SELECT t.[object_id] AS referencing_tbl, c.[name] AS referencing_col,
       r.[object_id] AS referenced_tbl,  r.column_name AS referenced_col
FROM @referenced AS r
INNER JOIN sys.tables AS t ON
    r.[object_id]!=t.[object_id]
INNER JOIN sys.columns AS c ON
    t.[object_id]=c.[object_id] AND
    c.[name]=r.column_name AND
    c.user_type_id=r.user_type_id;

But this won’t work when you have a composite primary key on the referenced table, meaning that you have two or more columns that make up the primary key together. The problem is that you can’t make sure if all the columns match, or if you’ve only matched some of them.

To make sure that we can match all of the columns, we’re going to break out our CTE skills and count the columns on the referenced tables, as well as the number of columns that we’ve been able to match. So if your primary key on the referenced table is a composite key of three columns, you should be able to join on all of those three columns in order to be able to create a foreign key constraint. If you’ve joined two out of three columns, this isn’t a valid foreign key constraint.

Our CTE happens in three steps. First, the referenced tables, to which we’re adding a count of the key columns:

WITH referenced AS (
    SELECT *, COUNT(*) OVER (PARTITION BY [object_id], index_id) AS col_count
    FROM @referenced),

The col_count column returns the number of keys/columns in the primary key or unique index. If you want to understand the OVER () syntax, read up on windowed functions. Next, we’ll join the referencing and referenced sides like we did before, but we’ll also add a column count after the join:

     work AS (
    SELECT COUNT(*) OVER (PARTITION BY r.[object_id], r.index_id, t.[object_id]) AS referencing_count,
           r.col_count AS referenced_count, r.index_id,
           t.[object_id] AS referencing_tbl, c.[name] AS referencing_col,
           r.[object_id] AS referenced_tbl,  r.column_name AS referenced_col
    FROM referenced AS r
    INNER JOIN sys.tables AS t ON
        r.[object_id]!=t.[object_id]
    INNER JOIN sys.columns AS c ON
        t.[object_id]=c.[object_id] AND
        --- This is where the column naming logic
        --- can be customized:
        c.[name]=r.column_name AND
        c.user_type_id=r.user_type_id)

Finally, return the results, but only where @referenced.col_count is equal to work.referencing_count:

SELECT *
FROM work
WHERE referencing_count=referenced_count
ORDER BY referencing_tbl, referenced_tbl;

Prettifying the output

Finally, we’ll clean up the output by writing out the full syntax to build the constraint, as well as checking if there’s already an existing foreign key constraint that covers each relation that we’ve identified.

The end result looks like this:

DECLARE @referenced TABLE (
    [object_id]        int NOT NULL,
    index_id        int NOT NULL,
    column_name        sysname NOT NULL,
    user_type_id    int NOT NULL,
    PRIMARY KEY CLUSTERED ([object_id], index_id, column_name)
);

INSERT INTO @referenced ([object_id], index_id, column_name, user_type_id)
SELECT t.[object_id], i.index_id, c.[name] AS column_name, c.user_type_id
FROM sys.tables AS t
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id]
INNER JOIN sys.index_columns AS ic ON i.[object_id]=ic.[object_id] AND i.index_id=ic.index_id
INNER JOIN sys.columns AS c ON ic.[object_id]=c.[object_id] AND ic.column_id=c.column_id
WHERE c.is_nullable=0 AND i.[type] IN (1, 2) AND i.is_unique=1 AND ic.key_ordinal>0;

WITH referenced AS (
    SELECT *, COUNT(*) OVER (
               PARTITION BY [object_id], index_id) AS col_count
    FROM @referenced),

     work AS (
    SELECT COUNT(*) OVER (
               PARTITION BY r.[object_id], r.index_id, t.[object_id]) AS referencing_count,
           r.col_count AS referenced_count, r.index_id,
           t.[object_id] AS referencing_tbl, c.[name] AS referencing_col,
           r.[object_id] AS referenced_tbl,  r.column_name AS referenced_col
    FROM referenced AS r
    INNER JOIN sys.tables AS t ON
        r.[object_id]!=t.[object_id]
    INNER JOIN sys.columns AS c ON
        t.[object_id]=c.[object_id] AND
        --- This is where the column naming logic
        --- can be customized:
        c.[name]=r.column_name AND
        c.user_type_id=r.user_type_id)

SELECT fk.[name] AS [Existing FK],
       'ALTER TABLE '+ts.[name]+'.'+t.[name]+
           ' ADD CONSTRAINT '+
           ISNULL(fk.[name], 'FK_'+rs.[name]+'_'+r.[name]+'_'+ts.[name]+'_'+t.[name])+
           ' FOREIGN KEY ('+x.referencing_columns+')'+
           ' REFERENCES '+rs.[name]+'.'+r.[name]+' ('+x.referenced_columns+')' AS Syntax
FROM work
INNER JOIN sys.tables AS r ON work.referenced_tbl=r.[object_id]
INNER JOIN sys.schemas AS rs ON r.[schema_id]=rs.[schema_id]
INNER JOIN sys.tables AS t ON work.referencing_tbl=t.[object_id]
INNER JOIN sys.schemas AS ts ON t.[schema_id]=ts.[schema_id]
LEFT JOIN sys.foreign_keys AS fk ON
    work.referencing_tbl=fk.parent_object_id AND
    work.referenced_tbl=fk.referenced_object_id
CROSS APPLY (
    SELECT
        SUBSTRING(CAST((
            SELECT ', '+w.referencing_col
            FROM work AS w
            WHERE w.referencing_tbl=work.referencing_tbl AND
                  w.referenced_tbl=work.referenced_tbl AND
                  w.index_id=work.index_id
            ORDER BY w.referencing_col
            FOR XML PATH(''), TYPE) AS varchar(4000)), 3, 4000),
        SUBSTRING(CAST((
            SELECT ', '+w.referenced_col
            FROM work AS w
            WHERE w.referencing_tbl=work.referencing_tbl AND
                  w.referenced_tbl=work.referenced_tbl AND
                  w.index_id=work.index_id
            ORDER BY w.referencing_col
            FOR XML PATH(''), TYPE) AS varchar(4000)), 3, 4000)
    ) AS x(referencing_columns, referenced_columns)
WHERE work.referencing_count=work.referenced_count
GROUP BY ts.[name], t.[name], rs.[name], r.[name], x.referencing_columns, x.referenced_columns, fk.[name]
ORDER BY Syntax;

Now, before you go crazy with this stuff, remember, it’s not a magic bullet, but rather some automation help to save you some coding and to help you review your data model. The script doesn’t change the database, it only prints out its suggestions, and this is totally by design.

For this to work, you’ll obviously need proper primary keys or unique indexes on your referenced tables.
We’re working on the assumption that the referencing and referenced column names are the same. Go ahead and change the script to suit your naming standards (look for the comment in the CTE)
The script has no domain knowledge of your database, some of the suggestions are probably going to be downright silly.

I specifically used the old xml string concatenation trick for this to work on older SQL Server versions. If you’re on SQL Server 2016, by all means, you could refactor the script to use STRING_AGG().

Let me know how this works for you and if there’s anything you did to improve on the solution!

4 comments

yzan says:

2017-10-04 at 19:08

I wrote an application for that. The nice thing is that it is also using database statistics, which is commonly used for execution plan optimization, to produce more accurate estimates. The application is called Linkifier and can be found on GitHub.

1. Jeff Antonson says:
  
  2018-01-17 at 05:04
  
  Yzan – I am interested in reviewing your Linkifier application, however, I cannot fid it on GitHub. Can you drop back a direct ink. I would love an opportunity to review what you have created. Thanks, Jeff
  
TomXYZ says:

2018-03-23 at 15:01

Excellent tool. Thanks for explaining.

1. Daniel Hutmacher says:
  
  2018-03-23 at 15:06
  
  You’re welcome!