The 2014 Swedish SQL Server usergroup challenge

Sunday came early this week. In this longer-than-usual post, I’m going to show you the details of my contribution to the 2014 edition of the annual Swedish SQL Server usergroup challenge.

For the last few years, the Swedish SQL Server usergroup, a local PASS chapter, hosts an annual T-SQL challenge, in which participants need to solve some type of reasonably complicated task using T-SQL. Furthermore, all the correct solutions are then pitted against each other on a given benchmark, like number of reads or duration in order to find a winner of the challenge.

For the purpose of readability for my non-swedish speaking readers, I’ve translated the names of all the database objects to their english equivalents.

Foreword and congratulations

I didn’t win the challenge, although I was ranked #1 for some time on the preliminary charts. In the end, something about my solution failed in the final evaluation – I don’t know what and I honestly won’t invest any more hours in trying to figure it out. The winner of the challenge is Mikael Eriksson, who incidentally also beat my time by a few milliseconds, using an absolutely ingenious application of message queues to parallelize a native in-memory stored procedure with a cursor.

Respect.

The 2014 challenge

This edition of the challenge is based on business logic from an actual healthcare business database. This healthcare organization logs patient appointments in a table, Appointments, with details of business unit, patient ID, date and time of the visit, and a rate code which details the type of appointment the patient had. Each rate code has a different rate, as detailed in the price list. The basic rule is that you are only paid once for each business unit and patient and date, so if they patient has multiple appointments on the same day at the same business unit, you get paid only the highest of those appointment rates.

Some of the rates have exceptions to this main rule. Here are the ten different rates and how to calculate them:

  • A01 and A02 are standard appointments, following the basic rule.
  • B01 and B02 are exempt of the basic rule, meaning that they do not affect and are not affected by other appointments. These appointments are always paid.
  • K01, K02, K03 and K04 are special combination appointments: If you combine any of K01 and K02 with any or K03 and K04, you can get paid for both of them, provided that the sum of those two is higher than any other paid appointment that day, of course. You can still get paid for one of them individually, but not for K01 and K02 or K03 and K04 combined.
  • T01 and T02 are so-called “three-month appointments”, meaning that you can only get paid for this type of appointment if it’s been at least 90 days since the last appointment of this type.

In the end, the goal is to calculate the aggregate billable amount per business unit and date, and output that data into the reporting table Claims.

The tables provided are:

CREATE TABLE dbo.Rates (
    Rate              char(3) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    Price             int NOT NULL,
    CONSTRAINT PK_Rates PRIMARY KEY CLUSTERED (Rate)
);

CREATE TABLE dbo.Appointments (
    AppointmentID     int IDENTITY(1, 1) NOT NULL,
    PatientID         uniqueidentifier NOT NULL,
    Rate              char(3) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    DateAndTime       datetime2(0) NOT NULL,
    BusinessUnit      char(5) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    CONSTRAINT PK_Appointments PRIMARY KEY CLUSTERED (AppointmentID),
    CONSTRAINT FK_Appointments_Rate FOREIGN KEY (Rate)
        REFERENCES dbo.Rates (Rate)
);

CREATE UNIQUE NONCLUSTERED INDEX IX_Appointments
    ON dbo.Appointments (BusinessUnit, PatientID, DateAndTime)
    INCLUDE (Rate);

CREATE TABLE dbo.Claims (
    BusinessUnit      char(5) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    [Date]            date NOT NULL,
    Amount            int NOT NULL,
    CONSTRAINT PK_Claims PRIMARY KEY CLUSTERED (BusinessUnit, [Date])
);

For testing purposes, I generated a sample amount of about three million appointments, but the competition rules say nothing about how much data to optimize the solution for (or, more importantly, the distribution of that data).

The solution

First off, creating temp tables and stuff can induce recompiles in a stored procedure, so I place them all in the beginning of the procedure, in order to get it over with.

CREATE PROCEDURE dbo.sp_CalculateClaims
AS

SET NOCOUNT ON;

--- Temp table to collect rows that will be aggregated
--- into dbo.Claims in the end:
CREATE TABLE #claims (
    BusinessUnit    char(5) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    [Date]          date NOT NULL,
    Amount          int NOT NULL,
);

--- Work table for the calculation of the three-month appointments (T01 and T02).
--- This table does intentionally not contain any indexes at this stage in order
--- to optimize INSERT speed and to avoid sorts in the execution plan:
CREATE TABLE #appointments (
    BusinessUnit    char(5) COLLATE Finnish_Swedish_100_BIN2 NOT NULL,
    PatientID       uniqueidentifier NOT NULL,
    [Date]          date NOT NULL,
    Amount          int NOT NULL,
    NewAmount       int NOT NULL,
    T1              int NOT NULL,
    T2              int NOT NULL
);

--- Load rates into variables:
DECLARE @a01 int, @a02 int,
        @b01 int, @b02 int,
        @k01 int, @k02 int, @k03 int, @k04 int,
        @t01 int, @t02 int,
        @sql varchar(max);

SELECT @a01=(CASE WHEN Rate='A01' THEN Price ELSE @a01 END),
       @a02=(CASE WHEN Rate='A02' THEN Price ELSE @a02 END),
       @b01=(CASE WHEN Rate='B01' THEN Price ELSE @b01 END),
       @b02=(CASE WHEN Rate='B02' THEN Price ELSE @b02 END),
       @k01=(CASE WHEN Rate='K01' THEN Price ELSE @k01 END),
       @k02=(CASE WHEN Rate='K02' THEN Price ELSE @k02 END),
       @k03=(CASE WHEN Rate='K03' THEN Price ELSE @k03 END),
       @k04=(CASE WHEN Rate='K04' THEN Price ELSE @k04 END),
       @t01=(CASE WHEN Rate='T01' THEN Price ELSE @t01 END),
       @t02=(CASE WHEN Rate='T02' THEN Price ELSE @t02 END)
FROM dbo.Rates;

You’ll notice that because the rate codes are fixed, I’m choosing to load them into variables so I won’t have to join the dbo.Rates table at a later point. In practice, this makes a very minimal difference in performance because Rates only contains ten rows, but my trials still show a gain of 10-50 ms. In itself not much, but possibly something you could lose a competition over. 🙂

Moving on, here I’m using a table scan to extract just the A and K visits, and storing them in the #claims temp table.

INSERT INTO #claims WITH (TABLOCKX) (BusinessUnit, [Date], Amount)
SELECT BusinessUnit, [Date], SUM(Amount) AS Amount
FROM (
    SELECT BusinessUnit, PatientID, [Date], MAX((CASE
           WHEN A>K THEN A
           WHEN A<=K THEN K
           ELSE 0 END)) AS Amount
    FROM (
        SELECT DISTINCT BusinessUnit, PatientID
        FROM dbo.Appointments WITH (INDEX=IX_Appointments)) AS vp
    CROSS APPLY (
        SELECT CAST(DateAndTime AS date) AS [Date],
               MAX((CASE Rate WHEN 'A01' THEN @a01 WHEN 'A02' THEN @a02 ELSE 0 END)) AS A,
               MAX((CASE Rate WHEN 'K01' THEN @k01 WHEN 'K02' THEN @k02 ELSE 0 END))+
               MAX((CASE Rate WHEN 'K03' THEN @k03 WHEN 'K04' THEN @k04 ELSE 0 END)) AS K
        FROM dbo.Appointments WITH (INDEX=IX_Appointments)
        WHERE BusinessUnit=vp.BusinessUnit AND
              PatientID=vp.PatientID AND
              Rate NOT IN ('B01', 'B02', 'T01', 'T02')
        GROUP BY CAST(DateAndTime AS date)
        ) AS b
    GROUP BY BusinessUnit, PatientID, [Date]
    ) AS vp
GROUP BY BusinessUnit, [Date];

Here’s the query plan:

SQLUG2014_plan1_2

SQLUG2014_plan1_1

Aggregating the A and K appointments

I’m intentionally excluding B rates from this query because they’re very simple to aggregate in comparison, as we don’t need to consider duplicate appointments or patients. I’m also saving the T rates for later, because of their high complexity.

First off, we’re trying to find the highest-paid A rate as well as the highest-paid K rate for each business unit, patient and date. You can’t combine K01 with K02 or K03 with K04, but you can combine the highest-paying of K01 and K02 with the highest-paying of K03 and K04. In pseduo-code this is equivalent to:

MAX(K01, K02)+MAX(K03, K04)

.. so in the end, here’s what I’m doing in the CASE expression:

MAX(A01, A02, MAX(K01, K02)+MAX(K03, K04))

This is all stored in #claims. By then, we’ve already stripped away the PatientID and aggregated the data by (BusinessUnit, Date). With #claims being empty before this statement, my hope is that this will be a minimally logged insert, which should also speed up performance.

Using CROSS APPLY

I could have just built the query as a simple aggregate of dbo.Appointments, but I used a trick that I learned from Itzik Ben-Gan. When you use CROSS APPLY the way I’ve done above, it can facilitate parallelization of the query. Some trial-and-error in this case showed that I gained a few precious seconds by doing this.

Converting datetime2(0) to date and sorting

The non-clustered index that we’re using is ordered by (BusinessUnit, PatientID, DateAndTime), which in theory makes it ideal for what we want to do here – except, we need to truncate the datetime2(0) column to a regular date, to strip away hours, minutes and seconds. This data conversion, though it does not actually change the sort order of the stream, confuses the optimizer. Because the new Date column is now a calculated expression, the query optimizer now needs to either sort the stream again in order to use a Stream Aggregate, or to use a Hash match aggregate. This is one problem I fruitlessly spent days trying to solve.

Calculating the T01 and T02 rates

Decidedly, the trickiest part of the task is to calculate the T01 and T02 claims. You get paid for a T01/T02 appointment if at least 90 days have passed since the last paid appointment of the same type, provided that this T01/T02 appointment is the most well-paid appointment for this patient, business unit and date.

For this part, I’m using a query which is very similar to the previous one, except I’m keeping the PatientID column, and I’m selecting everything except the B rate appointments. All of this goes into the #appointments table:

INSERT INTO #appointments WITH (TABLOCKX)
       (BusinessUnit, PatientID, [Date], Amount, NewAmount, T1, T2)
SELECT BusinessUnit, PatientID, [Date], (CASE
       WHEN A>K THEN A
       WHEN A<=K THEN K
       ELSE 0 END) AS Amount, (CASE
       WHEN A>K THEN A
       WHEN A<=K THEN K
       ELSE 0 END) AS NewAmount,
       (CASE WHEN T1>A AND T1>K THEN T1 ELSE 0 END) AS T1,
       (CASE WHEN T2>A AND T2>K THEN T2 ELSE 0 END) AS T2
FROM (
    SELECT DISTINCT BusinessUnit, PatientID
    FROM dbo.Appointments WITH (INDEX=IX_Appointments)
    WHERE Rate IN ('T01', 'T02')) AS vp
CROSS APPLY (
    SELECT CAST(DateAndTime AS date) AS [Date],
           MAX((CASE Rate WHEN 'A01' THEN @a01 WHEN 'A02' THEN @a02 ELSE 0 END)) AS A,
           MAX((CASE Rate WHEN 'K01' THEN @k01 WHEN 'K02' THEN @k02 ELSE 0 END))+
           MAX((CASE Rate WHEN 'K03' THEN @k03 WHEN 'K04' THEN @k04 ELSE 0 END)) AS K,
           MAX((CASE Rate WHEN 'T01' THEN @t01 ELSE 0 END)) AS T1,
           MAX((CASE Rate WHEN 'T02' THEN @t02 ELSE 0 END)) AS T2
    FROM dbo.Appointments WITH (INDEX=IX_Appointments)
    WHERE BusinessUnit=vp.BusinessUnit AND PatientID=vp.PatientID AND Rate NOT IN ('B01', 'B02')
    GROUP BY CAST(DateAndTime AS date)
    ) AS b
WHERE (T1>A AND T1>K OR T2>A AND T2>K);

Again, a tablock and an empty destination table should hopefully render a minimally logged insert. The query plan is very similar, so I won’t bore you with it.

Note the WHERE clause, at the end: We’re only selecting rows where a T01 or T02 actually might be more well-paid than the A or K rates for that same patient, business unit and date. This further eliminates a lot of rows, making the insert smaller and faster.

Now, we have a table that contains one row for each patient, business unit and date that contain a T01 and/or T02 appointment that outprices any A or K appointment on the same row. Here’s a sample:

Bus.unit PatientID                               Date       Amount T1  T2
-------- --------------------------------------- ---------- ------ --- ---
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-01-19      0 600   0
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-02-12    350   0 300
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-03-08      0   0 300
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-04-27    650 600   0
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-05-03      0 600   0
AAAAB    13722274-9CD2-467E-A50E-000D46B2C499    2001-05-05      0 600   0

The first appointment is a T01 on january 19. No previous T01 appointments for this patient and business unit in the last 90 days, and nothing that day that has a higher rate means that we’ll get paid for the T01.

The second row has a T02 appointment and no prior T02 in the last 90 days, but there’s a more well-paid appointment (350 instead of 300) on the same day, so we’ll be paid 350 that day.

The third row has a T02 appointment again. Even though it’s been less than a month since the last T02 appointment, that appointment wasn’t paid – the other one had a higher rate, remember? So on march 8, we’re getting paid for the T02.

Fourth row, T01 for 600, but even though it’s been more than 90 days since the last T01, the “regular” payment is higher here.

Fifth row, the T01 is paid at 600.

Sixth row, less than 90 days since last T01, no payment.

.. and so on.

The inline update

This type of iteration is really easy to describe in human-readable form, but not really suitable for a declarative, set-based solution like T-SQL. You could probably use a cursor to accomplish the same result, but cursors can be incredibly slow. However, you can mimic the workings of a cursor, while still remaining in the set-based world by using a little-known (and unsupported) use of UPDATE that I wrote about a while ago.

The UPDATE statement can also assign variables; it does this in the order you write them in the SET clause, once for each row it “passes”, so with a bit of fiddling around with CASE expressions where there would be IF conditions, you can build this type of logic in a single UPDATE statement.

DECLARE -- Start values:
    @DateT01      date={d '1900-01-01'},
    @DateT02      date={d '1900-01-01'},
    @BusinessUnit char(5)='',
    @PatientID    uniqueidentifier=NEWID(),
    @t1           int,
    @t2           int;

UPDATE t
SET @DateT01   =(CASE WHEN @BusinessUnit!=t.BusinessUnit OR @PatientID!=t.PatientID
                      THEN {d '1900-01-01'} ELSE @DateT01 END),
    @DateT01   =(CASE WHEN t.T1>t.T2 AND t.T1>t.Amount AND t.[Date]>DATEADD(dd, 90, @DateT01)
                      THEN t.[Date] ELSE @DateT01 END),
    @DateT02   =(CASE WHEN @BusinessUnit!=t.BusinessUnit OR @PatientID!=t.PatientID
                      THEN {d '1900-01-01'} ELSE @DateT02 END),
    @DateT02   =(CASE WHEN @DateT01!=t.[Date] AND t.T2>t.Amount AND t.[Date]>DATEADD(dd, 90, @DateT02)
                      THEN t.[Date] ELSE @DateT02 END),
    @t1        =(CASE WHEN @DateT01=t.[Date] THEN t.T1 ELSE 0 END),
    @t2        =(CASE WHEN @DateT02=t.[Date] THEN t.T2 ELSE 0 END),
    @BusinessUnit=t.BusinessUnit,
    @PatientID  =t.PatientID,
    t.NewAmount=(CASE WHEN @t1+@t2>0 THEN @t1+@t2 ELSE t.Amount END)
FROM #appointments AS t WITH (INDEX=1)
OPTION (MAXDOP 1);

And here’s the query plan – in its simple elegance:

Inline UPDATE queryplan

You can actually read the SET clause top-down to figure out what it does:

  • If the business unit or patient is different from the previous row, reset @DateT01 to 1900-01-01.
  • If the T01 amount is greater than T02 and Amount and 90 days or more have passed since @dateT01, set @DateT01 to the current date.
  • If the business unit or patient is different from the previous row, reset @DateT02 to 1900-01-01.
  • If @DateT01 isn’t today’s date and the T02 is greater than Amount and 90 days or more have passed since @DateT02, set @DateT02 to the current date.
  • If we’ve set @DateT01 to today’s date, set @t1 to the T01 amount, otherwise set it to 0.
  • If we’ve set @DateT02 to today’s date, set @t2 to the T02 amount, otherwise set it to 0.
  • Finally, if @t1 or @t2 are non-zero (mutually exclusive, per definition), assign them to the NewAmount column.

Three important things:

  • Ordering is not guaranteed – in theory. The rows will appear in the order that they come from the chosen index, but this is not guaranteed. So don’t use this type of stuff on production machines.
  • I’ve provided a WITH (INDEX) hint to force the use of the clustered index (1) in order to try to give me the correct order of rows.
  • I’ve also set the maximum degree of parallelism to 1 in order to ensure a serial plan. This, however, is overkill – the optimizer will do that for you automatically when you run this type of UDPATE.

For some reason, running this UPDATE statement in the procedure itself yielded poor performance, so in the actual stored procedure, I encapsulated the DECLARE and UPDATE in a dynamic block of SQL, which fixed the problem.

Merging it all together

Now, all that remains is to scrape together all the totals and aggregating them into dbo.Claims.

In #appointments, we now have all the (BusinessUnit, PatientId, Date) where there was originally a T01 or T02 code. Some of those rows are not billable, because of the rules of the 90-day appointments, so we’ve updated the NewAmount column. For those rows, we’re adding the difference (i.e. the change) between NewAmount and Amount, to #claims. The aggregate result in #claims should be the correct billable amount:

INSERT INTO #claims WITH (TABLOCKX) (BusinessUnit, [Date], Amount)
SELECT BusinessUnit, [Date], SUM(NewAmount-Amount) AS Amount
FROM #appointments
WHERE NewAmount!=Amount
GROUP BY BusinessUnit, [Date];

Add the B appointments, which are always paid, no matter what:

INSERT INTO #claims WITH (TABLOCKX) (BusinessUnit, [Date], Amount)
SELECT BusinessUnit, CAST(DateAndTime AS date), SUM((CASE Rate WHEN 'B01' THEN @b01 WHEN 'B02' THEN @b02 END)) AS Amount
FROM dbo.Appointments
WHERE Rate IN ('B01', 'B02')
GROUP BY BusinessUnit, CAST(DateAndTime AS date);

And finally, clear out the dbo.Claims table and aggregate everything from #claims on (BusinessUnit, Date).

--- Empty dbo.Claims
TRUNCATE TABLE dbo.Claims;

--- Sum up the partial aggregates from #claims and store the
--- result in dbo.Claims.
INSERT INTO dbo.Claims WITH (TABLOCKX) (BusinessUnit, [Date], Amount)
SELECT BusinessUnit, [Date], SUM(Amount) AS Amount
FROM #claims
GROUP BY BusinessUnit, [Date];

--- Done, clean up temp tables.
DROP TABLE #appointments;
DROP TABLE #claims;

Conclusion

In the final evaluation with about a million appointments, my stored procedure ran for 1,415 seconds. The winning contribution (apart from producing a correct result, which my solution apparently didn’t) ran for 1,407 seconds. Remember, boys and girls, every millisecond counts.

See you in the 2015 challenge, guys!

4 comments

  1. Did your code work on test data? If yes, I may have an idea of why it didn’t work on a larger set of data.

  2. I generated a set of test data with about 2.7 million rows over 5 business units and 27 000 patients (100 appointments for each patient) and as far as I know, this turned out the correct results.
    Then I tried the official script to generate test data (1.1 million rows over 600 patients), and it still looks correct. Then again, who knows, my reconciliation script is probably flawed. 🙂

    I’d love to hear your thoughts on this one!

    1. I don’t have a ‘silver bullet’ but one of the reasons for different results may be the different logic used in finding “90 days gap” by DATEDIFF and DATEADD functions.

      For example:

      SELECT DATEADD(DAY, 90, ‘2014-01-01 00:00:00.000’)

      returns ‘2014-04-01 00:00:00.000’ meaning that ‘2014-04-01 01:00:00.000’ is beyond 90 days gap

      but

      SELECT DATEDIFF(DAY, ‘2014-01-01 00:00:00.000’, ‘2014-04-01 01:00:00.000′)

      Returns ’90’ meaning that it is still within the gap.

      Your solution uses DATEADD but on dates casted to DATE type. So it should work as DATEDIFF really.

      Can this be the reason for discrepancies?

      1. Like you said, it’s on a date column, so the effect should be the same.
        My best guess is that it’s something related to the “quirky update”, which isn’t really officially supported. It did work on two different multi-core machines of mine, and obviously in the first few tests that they ran on the competition machines..

Leave a comment

Your email address will not be published. Required fields are marked *