It’s been absolutely ages since I last wrote a blog post, mostly because I’ve been busy getting my shiny new own consultancy up to speed, but I’ll admit that writer’s block has also been a factor.
But here’s something to write home about. This year’s annual Swedish SQL Server usergroup challenge was as interesting as ever, and it marks my third stab at this prestigious competition. In this post, I’ll go through my contribution, highlighting some of the techniques that I’ve applied to make it go really fast.
Essentially, the task is to calculate the outcome of a large number of (sequential) games of Black Jack. The object of the game is to draw cards, adding up a score, and stopping as late as possible in order to get a high score without going over 21. In this challenge, the calculation is done from the perspective of the dealer (the house), and thus follows a strict set of rules. Those rules govern when the house draws and when it stands. It also tells us how the values of the cards are counted when adding up the score.
The rules in the challenge are the most common one, “dealer must stand on 17”, and in essence it breaks down like this:
- Whenever the score reaches 17 or higher, the dealer will stand (i.e. not take any more cards).
- A score of 21 is a winning score, and a score of 21 with exactly two cards drawn is known as a “black jack”.
- Jacks, queens and kings count as 10, aces count as 11 unless this means the score will go over 21, in which case the ace counts as 1.
- All other cards count as their face value.
The aim of the challenge is two-fold: For each “deciding” card (i.e. the last card for each deal), set a status code that reflects the outcome of the deal; “S” means Stand (score is at least 17, but not quite 21), “W” means Win and happens when the score is 21, but if the score is 21 and the number of cards is exactly two, the status is “B” for Black Jack. Any score over 21 is a loss, “L”.
The second part of the challenge is aggregating these status codes with a count and storing them in a separate table, so we can determine the number of wins, stands, black jacks and losses respectively.
All the cards are stored in a table, dbo.CardHistory, in the order they were dealt. Eight complete decks of 52 cards are shuffled over and over as many times as it takes.
- CardID is the identity column which gives the order that the cards were dealt. This column is also the clustering key for the table, which apart from that does not have any other indexes.
- Rank, tinyint, which is the face value of the card. Legal values are 1 through 13.
- Suit, char(1), can be “D” (diamonds), “H” (hearts), “S” (spades) or “C” (clubs).
- Status, char(1), is blank at the start of execution as must be filled with any of the four valid codes or left blank.
The aggregate is stored in the dbo.DealerStatus table.
Setting a benchmark
The very first thing I tried was making the operation as simple as theoretically possible, using a “quirky update”. This is incidentally the type of technique that saw me lose last year’s competition, because it’s unsupported and you can’t really depend on it.
But when it works, it’s fast.
DECLARE @val tinyint=0, @cards tinyint=0, @status char(1)='', @ace bit=0, @s int=0, @l int=0, @b int=0, @w int=0, @dt datetime2(7)=SYSDATETIME(); BEGIN TRANSACTION; UPDATE h SET @cards= (CASE WHEN @val>=17 OR @ace=1 AND @val BETWEEN 7 AND 11 THEN 0 ELSE @cards END)+1, @ace= (CASE WHEN @cards=1 THEN 0 ELSE @ace END), @val= (CASE WHEN @cards=1 THEN 0 ELSE @val END), @ace= (CASE WHEN h.[Rank]=1 THEN 1 ELSE @ace END), @val=@val+(CASE WHEN h.[Rank]<10 THEN h.[Rank] ELSE 10 END), @status= (CASE WHEN @ace=1 AND @val BETWEEN 7 AND 10 OR @val BETWEEN 17 AND 20 THEN 'S' WHEN @val>21 THEN 'L' WHEN @val=11 AND @ace=1 AND @cards=2 THEN 'B' WHEN @val=21 OR @val=11 AND @ace=1 THEN 'W' ELSE '' END), @s=@s+(CASE WHEN @status='S' THEN 1 ELSE 0 END), @l=@l+(CASE WHEN @status='L' THEN 1 ELSE 0 END), @b=@b+(CASE WHEN @status='B' THEN 1 ELSE 0 END), @w=@w+(CASE WHEN @status='W' THEN 1 ELSE 0 END), h.[Status]=@status FROM dbo.CardHistory AS h WITH (INDEX=1, FORCESEEK) WHERE h.CardID>=1 OPTION (MAXDOP 1); DELETE FROM dbo.DealerStatus WHERE [Status] IN ('B', 'L', 'S', 'W'); INSERT INTO dbo.DealerStatus VALUES ('B', @b), ('L', @l), ('S', @s), ('W', @w); COMMIT TRANSACTION WITH (DELAYED_DURABILITY=ON);
This solution comes in at about 3.4 seconds, which I think is pretty fast. The reason is that everything it does – reading, the calculations, and writing the status code – is done in a single, ordered scan of the table. So I used this as my benchmark for what could be the fastest theoretical solution. If I could reach 3.4 seconds with a reliable solution, that would probably be the best I could do.
Basic workflow of my solution
A number of challenges arise from this apparently unassuming task. First off, the (entire) process cannot be parallelized, because if you were to split the work into multiple streams, you wouldn’t know where to split them without first performing the calculation up to that point.
The workflow of my solution happens in three parts;
- Load the data into an in-memory structure
- Run a natively compiled stored procedure on the data, compute scores and totals.
- Retrieve the data from the in-memory table and re-apply the results to the original disk-based table.
Each of these steps present a few challenges.
Loading the in-memory table
Just loading a bunch of rows into an in-memory table variable (which is a tiny bit faster than a regular in-memory table) takes time. In-memory tables aren’t stored in ordered pages like disk-based tables. And they’re also versioned. All of this means, storing rows in-memory can take quite some time.
I found that the number of rows written to the in-memory table variable affects performance – fewer rows are faster. So I packed blocks of 8000 cards into char(8000) strings, which I then loaded into the table. Remember, all I need is the Rank column, and it fits nicely into a single byte, in human-readable plain text (ace is “1”, 10 and up are all “0”). Building this string by aggregating rows in a given order can be accomplished using a clever application of the FOR XML PATH construct.
Note that this string-based approach only works because the data is ordered and the identity column is dense (unbroken).
Also, because these char(8000) blocks are fixed-length, they lend themselves excellently to parallelism – you can start a new thread every 8000 rows. For this, I’m using CROSS APPLY, and just to make sure, I’ve also included trace flag 8649 that tells the optimizer to always use parallelism, even if it estimates a high overhead cost.
--- First step: Define all the batches. Because this is a recursive common --- table expression, it’ll disable parallelism for the entire query plan, --- so we get it over with here, separate from the main query. This takes --- very few milliseconds: WITH cte AS ( SELECT 1 AS CardID UNION ALL SELECT CardID+8000 FROM cte WHERE CardID+8000<@CardCount) INSERT INTO #batches (CardID) SELECT CardID FROM cte OPTION (MAXRECURSION 0); --- Again, in-memory structures will prevent parallelism, so we create our --- string blobs in a regular temp table first: INSERT INTO #temp (CardID, Cards) SELECT b.CardID, CAST(x.Cards AS char(8000)) AS Cards FROM #batches AS b CROSS APPLY ( --- This (seek) query runs once for each 8000 rows in dbo.CardHistory, --- i.e. once for each row in #batches: SELECT (CASE WHEN h2.[Rank]<10 THEN h2.[Rank] ELSE 0 END) FROM dbo.CardHistory AS h2 WHERE h2.CardID>=b.CardID AND h2.CardID<b.CardID+8000 ORDER BY h2.CardID FOR XML PATH(''), TYPE) AS x(Cards) --- Trace flag to enforce parallelism: OPTION (QUERYTRACEON 8649); DECLARE @tbl dbo.CardHistory_8k_type; --- ... and then, load everything into the in-memory table variable, serially: INSERT INTO @tbl (CardID, Cards) SELECT CardID, Cards FROM #temp;
By the way, I did experiment with varchar and varchar(max) instead of char(8000), hoping to perhaps build a single large variable with everything in it. But it appears (at least in my trials) that large datatypes (beyond 8000 bytes) and/or the dynamic memory reallocation (varchar) didn’t scale very well at all.
On a side note, I used non-clustered indexes rather than hash indexes on the in-memory tables to facilitate ordered seeks/scans and so I wouldn’t have to bother with bucket counts.
Loading the in-memory tables constitutes less than 10% of the total execution time.
The natively compiled stored procedure
This is, relatively speaking, the simplest part of the puzzle, mostly because natively compiled procedures are quite restricted in what you can do with them.
The challenge wasn’t really a performance one, but rather getting the logic right – in particular, it turned out, when the number of cards weren’t evenly divisible by 8000.
The string-based approach, however, was a huge performance help, because I didn’t have to build a cursor or do a lot of seeks on the table. Rather, the procedure reads 8000 bytes (cards) at a time, then loops through this char(8000) with the SUBSTRING() function, which takes next to no time at all. When all 8000 characters have been processed, a single char(8000) row with the resulting status codes is written to an in-memory table that holds the output of the procedure.
At about 3% of the total execution time, the execution time for the stored procedure is practically negligible.
Writing it all back to disk
My biggest headache with using in-memory tables was the fact that I would have to perform separate reads and updates on dbo.CardHistory – which would hamper performance compared to a single updating scan like the quirky update.
The final challenge, therefore, was to create a high-performance update of dbo.CardHistory. And, mind you, not every row needs to be updated, just the ones that will get a status code, which on average will be about one in about three.
Trying to parallelize this query turned out to involve a lot of overhead with nasty Sort and Repartition stream operators, which absolutely murdered performance, and prevented it from scaling in a nice, linear fashion.
Using the char(8000) output, the best solution I could think of was a similar approach as the one in the native stored procedure:
Read 8000 characters in a single string (the char(8000) variable stays in RAM and doesn’t need locks and latches the way a table does). Then update the output table with a seek, using this variable and SUBSTRING(). This looks like a nasty cursor, but performs exceptionally well, because it only executes a single UPDATE statement for each 8000 rows:
WHILE (1=1) BEGIN; --- Retrieve a batch of 8000 "rows" into a char(8000): SELECT @cards=Cards FROM dbo.CardHistory_8k WITH (SERIALIZABLE) WHERE CardID=@offset; IF (@@ROWCOUNT=0) BREAK; --- ... and UPDATE dbo.CardHistory: UPDATE dbo.CardHistory SET [Status]=SUBSTRING(@cards, CardID+1-@offset, 1) WHERE CardID>=@offset AND CardID<@offset+8000 AND SUBSTRING(@cards, CardID+1-@offset, 1)!='_'; SET @offset=@offset+8000; END;
Writing it all back to disk is by far the most time-consuming part. At 87% of the total execution time, this is the part I found the most frustrating to tune.
Updating disk-based tables is simply very expensive compared to the other tasks we’ve done, mostly because this workload pretty much ensures that every single page in the table is touched by the status update.
My solution completed in 1.8 seconds on my benchmark SQL Server (2014 SP 1 with SSD, two virtual cores, 8 GB), which is almost twice as fast as the quirky update. It landed me a third spot in the competition, which I think is very good, though I was shooting for a win.
Not that I’d be bitter about something like that. Right?
These are the key takeaways:
- Loading in-memory tables is faster when using fewer rows and larger blocks – less transactional overhead, I suppose.
- Working with fixed-width variables and datatypes can provide a small gain. Avoid the (max) datatypes all together when you’re counting milliseconds.
- FOR XML PATH() in itself cannot be parallelized, but it can run in multiple threads using CROSS APPLY.
- There’s a trace flag, 8649, that will force parallelism.
- Minimize the number of table interactions in the native procedure by using the char(8000) solution and SUBSTRING().
- Let the stored procedure count the aggregates and return them in output variables (for use in dbo.DealerStatus). No additional scans performed on dbo.CardHistory.
- Use the same char(8000) method to extract data from the in-memory structure and SUBSTRING() to write the status codes back to disk.
- Always add a WHERE clauses to your updates: UPDATE x SET y=1 WHERE y!=1 is actually faster than UPDATE x SET y=1.
- The solution is completely free of Sort, Hash and other blocking operators, giving the solution nearly linear scaling.
- Delayed durability tells the client “we’re done” before the data is hardened to the log, possibly saving a little bit of time for this type of operation. In theory, if the server loses power or the RAID card crashes at the exact millisecond after the commit, the client will think that the transaction is saved (“hardened”), but in effect it won’t have been saved once you restart the server.