Execution order of non-deterministic functions

Here’s a strange insight that I gained when building a test case where I needed some randomized values. In order to generate random values, you can use the NEWID() function, which creates a uniqueidentifier value for each row. But NEWID() comes with a strange behaviour, that some (including me) will consider a bug, while others (including the SQL Server development team) consider it to be “by design”.

A short background on randomization

If you use RAND(), there are primarily two problems: RAND() is not strictly speaking deterministic, and this means that the “seed” used to create random numbers is recycled in a predictive manner. If you’ve set a seed manually, the code will continue “cycling” from there. Here’s an example:

SELECT RAND(123);
SELECT RAND();
SELECT RAND();
SELECT RAND();

I’m going to guess that if you run this code, you’ll get exactly the following values:

0.715865215706424
0.432967391490358
0.433199650308282
0.998945883228918

This is not a magic trick. The reason is simply that the first value is seeded manually with 123, which will always result in 0.7159[…] on any SQL Server you run it on. In subsequent statements, the seed is computed/iterated in a deterministic manner, which means that the same sequence of output values will be returned.

The other problem with RAND() is that it is calculated once for each recordset, which means that if you use RAND() with an entire recordset, the “random” value will be the same for each row.

A solution: Using NEWID()

The NEWID() function solves both these two problems, but it presents us with another one. The following query will return four rows with unique, randomized values:

WITH test (id)
AS (SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id)

SELECT id
FROM test;

So far, so good. But look what happens if we use the same column twice:

WITH test (id)
AS (SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id)

SELECT id, id
FROM test;

The two columns should be the same, right? Wrong. What happens here is that NEWID() is not actually computed in the common table expression, but at the end of the execution plan, which means that there are two “id” columns with NEWID(), which are calculated separately. The execution plan doesn’t really show this in this example. Because of the simplicity of the query, it just evaluates as a single constant scan:

Using NEWID(), simple execution plan

A closer look at the execution order

Let’s try adding a ROW_NUMBER() calculation on the “id” column:

WITH test (id)
AS (SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id),

     ord (id, ordinal)
AS (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS ordinal
    FROM test)

SELECT *
FROM ord;

Still nothing out of the ordinary:

Using NEWID(), with ROW_NUMBER()

So the constant scan in the beginning calculates the NEWID()s for the four rows, the data set is sorted, and in the “sequence project (compute scalar)” operator, the value of the ROW_NUMBER() is calculated.

Now, let’s make things a bit more complicated:

WITH test (id)
AS (SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id UNION ALL
    SELECT NEWID() AS id),

     ord (id, ordinal)
AS (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS ordinal
    FROM test)

SELECT *
FROM ord AS a
FULL JOIN ord AS b ON a.ordinal+1=b.ordinal;

In this query, I’m joining two “ord” CTE recordsets with each other using the ordinal number that we calculated with ROW_NUMBER(). You would expect the output recordset to look something like this, right?

id                       ordinal id                       ordinal
------------------------ ------- ------------------------ -------
NULL                     NULL    AAAAAAAA-...-...-...-... 1
AAAAAAAA-...-...-...-... 1       BBBBBBBB-...-...-...-... 2
BBBBBBBB-...-...-...-... 2       CCCCCCCC-...-...-...-... 3
CCCCCCCC-...-...-...-... 3       DDDDDDDD-...-...-...-... 4
DDDDDDDD-...-...-...-... 4       NULL                     NULL

Actually, the result set will look like this:

id                       ordinal id                       ordinal
------------------------ ------- ------------------------ -------
NULL                     NULL    EEEEEEEE-...-...-...-... 1
AAAAAAAA-...-...-...-... 1       FFFFFFFF-...-...-...-... 2
BBBBBBBB-...-...-...-... 2       GGGGGGGG-...-...-...-... 3
CCCCCCCC-...-...-...-... 3       HHHHHHHH-...-...-...-... 4
DDDDDDDD-...-...-...-... 4       NULL                     NULL

The reason for this is clearly visible if you examine the query plan.

Using NEWID(), with ROW_NUMBER() and multiple CTE references

The NEWID() is calculated individually in both CTE references (“a” and “b”), as is the ROW_NUMBER(). You can see them in the “constant scan” and “sequence project” operators respectively in the query plan.

Microsoft considers this to be a “by-design” feature, because NEWID() is an non-deterministic function – apparently even within the context of a single statement.

Workaround

The obvious workaround, if you still want to use the NEWID() function for a purpose like this one, is to first materialize the “test” CTE into a table. That way, the NEWID() is only evaluated once, and then remains static.

2 comments

  1. Actually, there is a workaround without materialization values into the table.

    What is needed – cross join to such CTE:

    with uniqueGroup as (
    select top 1
    group_id from (values (NEWID()), NEWID()) as t(group_id)
    order by t.group_id
    )

    The optimizer will have to calculate newid() values beforehand, because it is needed to take one based on the order.

Leave a comment

Your email address will not be published. Required fields are marked *