How to fix rounding errors

A number of business processes require you to distribute a value over a date range. However, if your distribution keys and values don’t add up perfectly or aren’t perfectly divisible, it’s very easy to get rounding errors in your distributions.

Demo setup

Here are 65,000 example balances that we’ll try to distribute over 12 months:

CREATE TABLE #balances (
    id          int NOT NULL,
    balance     numeric(12, 2) NOT NULL,
    PRIMARY KEY CLUSTERED (id)
);

-- Create four examples
INSERT INTO #balances
VALUES (1,    1.00),
       (2,  100.00),
       (3,  240.00),
       (4, 1000.00);

-- Duplicate the results
WHILE (@@ROWCOUNT<30000)
    INSERT INTO #balances
    SELECT MAX(id) OVER ()+id, balance
    FROM #balances;

We’ll also create a table that contains the distribution key that we want to apply. This one’s a simple straight-line 12-month accrual, but you could put any distribution key in there, as long as the keys add up to 1.0:

CREATE TABLE #distribution (
    [period]    tinyint NOT NULL,
    [key]       numeric(6, 6) NOT NULL,
    PRIMARY KEY CLUSTERED ([period])
);

INSERT INTO #distribution
SELECT [period], 1./12 AS [key]
FROM (
    VALUES (1), (2), (3), (4), (5), (6),
           (7), (8), (9), (10), (11), (12)
    ) AS x([period]);
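As a side note (and this will matter later on in the post), numeric(6, 6) can’t hold the repeating decimal 1/12 exactly, so the stored keys won’t quite sum to 1.0. A quick check:

```sql
-- 1./12 is rounded to 0.083333 when stored as numeric(6, 6),
-- so the twelve keys fall just short of a full 100%:
SELECT SUM([key]) AS key_total    -- 0.999996
FROM #distribution;
```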

A plain distribution will create rounding errors

This query will distribute the balances over the 12 months, turning the 65,000 rows into roughly 786,000:

SELECT bal.id,
       bal.balance,
       d.[period],
       CAST(bal.balance*d.[key] AS numeric(12, 2)) AS distributed_balance
FROM #balances AS bal
CROSS JOIN #distribution AS d;

One row becomes 12. But the total isn’t really 1.00.

Because we’ve rounded the result of the distribution to the same precision as the balance, the resulting rows are 0.08 each. Add those up, and you get 0.96, not 1.00. Which may be fine if you’re working with geographical measurements, but not if you’re doing accounting.
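You can reproduce that arithmetic in isolation. Each period’s share of a 1.00 balance rounds down to 0.08 at two decimals, and twelve of those only add back up to 0.96:

```sql
-- Each period's share of a 1.00 balance, rounded to the balance's precision:
SELECT CAST(1.00*0.083333 AS numeric(12, 2)) AS per_period,       -- 0.08
       12*CAST(1.00*0.083333 AS numeric(12, 2)) AS total;         -- 0.96
```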

Trust me, accounting folks think of themselves as very precise people.

Ways to fix the difference

There are a number of ways to decide which rows you should correct in order to bring the distributed balance to 100% of the original balance, and none of them will inherently be perfect. For the sake of this post, I’m going to fix the rounding error by adding/subtracting to the last row of each balance’s distribution, but this is somewhat of an arbitrary choice.

We need two things to solve this problem:

Identifying the last row for each distribution

We can use a window function to find the maximum period (the latest) for each distributed balance:

MAX([period]) OVER (PARTITION BY id) AS last_period

In our example, the last period is always 12, but I’m designing the code as if each balance could potentially use a different distribution key. If you want to really go to town on this, I’ve written a whole blog post on identifying the last row in a set.
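If you prefer, the same thing can be expressed with LAST_VALUE instead of MAX, though note that it needs an explicit window frame in order to look at the entire partition:

```sql
-- Equivalent alternative, but requires spelling out the frame:
LAST_VALUE([period]) OVER (
    PARTITION BY id
    ORDER BY [period]
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_period
```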

Computing the rounding error

With the same kind of window function, we can also compute the total of the distribution:

SUM(distributed_balance) OVER (PARTITION BY id) AS net_distributed

Putting it in place

I’ve packaged the distribution logic in a common table expression:

WITH cte AS (
    SELECT bal.id,
           bal.balance,
           d.[period],
           CAST(bal.balance*d.[key] AS numeric(12, 2)) AS distributed_balance
    FROM #balances AS bal
    CROSS JOIN #distribution AS d
)

SELECT id, [period], distributed_balance,
       SUM(distributed_balance) OVER (PARTITION BY id) AS net_distributed,
       MAX([period]) OVER (PARTITION BY id) AS last_period
FROM cte;

The correction that we want to add is the balance (1.00 in the screenshot) minus the net_distributed total, 0.96, and we want to add this correction only to the last row of each distribution, i.e. where [period]=last_period.

Using an inline CASE construct, that looks something like this:

WITH cte AS (
    SELECT bal.id,
           bal.balance,
           d.[period],
           CAST(bal.balance*d.[key] AS numeric(12, 2)) AS distributed_balance
    FROM #balances AS bal
    CROSS JOIN #distribution AS d
)

SELECT id, [period],
       distributed_balance
           +(CASE WHEN [period]=MAX([period]) OVER (PARTITION BY id)
                  THEN balance-SUM(distributed_balance) OVER (PARTITION BY id)
                  ELSE 0.0 END) AS distributed_balance
FROM cte;

And here’s the result:

The last row contains the correction. Now the total comes to exactly 100%.
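If you want to verify the fix across all 65,000 balances rather than eyeballing a screenshot, here’s a simple sanity-check sketch that reuses the same CTE, groups the corrected rows, and looks for any id that doesn’t net back to its original balance:

```sql
WITH cte AS (
    SELECT bal.id,
           bal.balance,
           d.[period],
           CAST(bal.balance*d.[key] AS numeric(12, 2)) AS distributed_balance
    FROM #balances AS bal
    CROSS JOIN #distribution AS d
)

SELECT id
FROM (
    SELECT id, balance,
           distributed_balance
               +(CASE WHEN [period]=MAX([period]) OVER (PARTITION BY id)
                      THEN balance-SUM(distributed_balance) OVER (PARTITION BY id)
                      ELSE 0.0 END) AS distributed_balance
    FROM cte
    ) AS corrected
GROUP BY id, balance
HAVING SUM(distributed_balance)!=balance;   -- should return zero rows
```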

What about performance?

Obviously, the rounding error fix adds a lot of complexity to the query plan. Here’s the plan for the raw distribution:

135 000 logical reads, 500 ms CPU, 3300 ms execution time

When we apply the window functions required for our rounding fix, the plan grows:

2 million logical reads (1.8 million of which come from the table spool), 2700 ms CPU, 2900 ms execution time.

In the top-right corner, we still have the CROSS JOIN between #balances and #distribution. After that, we get a Segment operator which is used for our two window functions. The windowing itself happens in the following Table Spool and Nested Loops operators.

Neither of those plans had a memory grant, because there was no sorting and no hashing going on. In a scenario with different distribution models, the CROSS JOIN would have been an INNER JOIN, and the plan would look different.

Other strategies

The optimization strategy will vary wildly between these types of scenarios, including the choice of which row receives the rounding correction. Astute readers may for instance have noted that in this particular example, the distribution table doesn’t add up to 100%, so you could try making a correction there first.

UPDATE d
SET [key]=[key]+correction
FROM (
    SELECT [key], (CASE WHEN [period]=MAX([period]) OVER ()
                        THEN 1.0-SUM([key]) OVER ()
                        ELSE 0 END) AS correction
    FROM #distribution
    ) AS d
WHERE correction!=0;

This won’t solve all of your rounding errors, but maybe some of them.
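After that update, the last period’s key carries the extra 0.000004, so the keys should now sum to exactly 1.0. You can confirm with the same check as before:

```sql
SELECT SUM([key]) AS key_total    -- 1.000000 after the correction
FROM #distribution;
```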

How to run your CTE just once, and re-use the output

You may already know that common table expressions, like views, don’t behave like regular tables. They’re a way to make your query more readable by allowing you to write a complex SQL expression just once, rather than repeating it all over your statement or view. This makes reading, understanding and future refactoring of your code a little less painful.

But they’re no magic bullet, and you may end up with some unexpected execution plans.

Continue reading

Set your Azure VM to auto-shutdown and have it notify you on Slack

Virtual machines cost money when they’re powered on. Most servers obviously need to be on 24 hours a day. Others, like development machines, only have to be on when you’re using them. And if you forget to turn them off, they’ll empty out your Azure credits (or your credit card) before you know it.

Today, I’ll show you how to set an Auto-shutdown time to turn a VM off if you forget, as well as have Azure notify you on Slack 30 minutes ahead of time, so you have the option to postpone or cancel the shutdown.

Continue reading

Consolidating grouped transactions into evenly sized batches

I recently worked with a large set of accounting transactions. I needed to split those rows into multiple logical batches, but each batch had to be logically consistent – among other things, those batches had to be properly balanced, because accounting people are kind of fussy like that.

So I designed a little T-SQL logic that would split all of those transactions into evenly sized batches, without violating their logical groupings.

Safety glasses on. Let’s dive in.

Continue reading

Fun with VALUES

There’s more to the VALUES clause in T-SQL than meets the eye. We’ve all used the most basic INSERT syntax:

INSERT INTO #work (a, b, c)
VALUES (10, 20, 30);

But did you know that you can create multiple rows using that same VALUES clause, separated by commas?

INSERT INTO #work (a, b, c)
VALUES (10, 20, 30),
       (11, 21, 31),
       (12, 22, 32);

Note the commas at the end of each line, denoting where a new row begins. Because this runs as a single statement, the INSERT is an atomic operation: either all of the rows are inserted, or none at all (for instance, if there’s a syntax error or a constraint violation).
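To illustrate that atomicity, suppose for the sake of the example that #work has a CHECK constraint requiring c to be positive (a hypothetical setup, not part of the script above). If any single row in the VALUES list violates it, the whole statement fails:

```sql
-- Hypothetical: #work was created with CHECK (c > 0)
INSERT INTO #work (a, b, c)
VALUES (13, 23, 33),
       (14, 24, -1);   -- this row violates the constraint...
-- ...so the statement errors out, and (13, 23, 33) is not inserted either.
```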

I use this construct all the time to generate scripts to import data from various external sources, like Excel, or even a result set in Management Studio or Azure Data Studio.

Here’s something you can try:

  • Select a dataset from SSMS or Excel, copy it to the clipboard, and paste it into a new SSMS window.
  • Select a single tab character, then use the “find and replace” feature (Ctrl+H) in SSMS to replace all tabs with the text ', ' (including the apostrophes).
  • Now, add the text (' at the beginning of each line and '), at the end of each line. The last line obviously won’t need the trailing comma. If you’re handy with SSMS, you can add at least the leading text with a “box select”: hold down the Alt key as you make a zero-width selection over all the rows, then type the text.
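For example, two tab-separated rows like 1→Alice and 2→Bob would, after those replace steps, end up looking something like this (the table and column names here are made up for the illustration):

```sql
INSERT INTO #people (id, name)   -- hypothetical destination table
VALUES ('1', 'Alice'),
       ('2', 'Bob');
```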

If all of this sounds like a lot of work for you, you might want to try out a little web hack that I wrote. It allows you to paste a tab-delimited dataset, just like the ones you get from Excel or the result pane in SSMS or ADS, into a window and instantly convert it into a T-SQL INSERT statement with the click of a button.

Copying a dataset from SSMS into table.strd.co, then converting it to a multi-row INSERT statement.

Pro tip: in SQL Server Management Studio, use Ctrl+Shift+C to copy not only the results, but also the column names!

Options

  • First row has headers: instead of inserting the first row of the raw data, the script uses it to map the INSERTed values to the correct columns in the destination table.
  • Fix nulls: Particularly when exporting from SSMS, we lose the information about which values are actually NULL and which ones are the literal text “NULL”. When this option is unchecked, those values are treated as the text “NULL”; when checked, every value that consists entirely of the text “NULL” has its surrounding apostrophes removed, so it becomes an actual NULL value.
  • Pretty: adds some indenting spaces to the output code. This increases the script size by a few bytes, but improves readability.
  • Table name: Optional table name to put in the INSERT INTO header of the script.

And to make sure you sleep well at night, the entire process on table.strd.co happens in the browser – nothing is ever uploaded to the Internet.

SSMS: Search and replace across multiple objects in a database

Here’s a quick tip that touches on one of the powerful SSMS tricks in my “Management Studio Level-Up” presentation. Say you have a potentially large number of database objects (procedures, functions, views, what have you), and you need to make a search-and-replace kind of change to all of those objects.

You could, of course, put the database in source control and use a proper IDE to replace everything, then check your code back into source control and commit it to the database. That’s obviously the grown-up solution. Thanks for reading this post.

But let’s say for the sake of argument that you haven’t put your database in version control. What’s the lazy option here?

Continue reading

Break out of your comfort zone in 2020.

A few years ago, I was in a meeting at a client’s, and the discussion turned to teamwork. I jokingly said “teamwork is for people who cannot be trusted with complex tasks on their own”, at which point the whole room exploded in laughter (well, everyone laughed except the agile coach).

And even though my joke has since become a pinned tweet, it is in all fairness only true in a limited context. In most of our day-to-day work, we need the help of others, and they rely on ours.

Continue reading