A “shock absorber” pattern for high-performance data ingestion

It’s almost like a myth – one that I’ve heard people talk about, but never actually seen myself. The “shock absorber” is a pretty clever data flow design pattern to ingest data where a regular ETL process would choke on the throughput or spikes. The idea is to use a buffer table to capture incoming data, and then run an asynchronous process that loads that data in batches from the buffer into its intended target table.

While I’ve seen whitepapers and blog posts mention the concept loosely along with claims of “7x or 10x performance”, none of them go into technical detail on how it’s done, so I decided to try my hand at it.

I’ve compiled my findings, along with some pre-baked framework code if you want to try building something yourself. Professional driver on closed roads. It’s gonna get pretty technical.

Continue reading

How MERGE on two different rows can still deadlock you

I recently ran into a curious deadlock issue. I have a process that performs a lot of updates in a “state” table using multiple, concurrent connections. The business logic in the application guarantees that two connections won’t try to update the same item, so we shouldn’t ever run into any locking issues. And yet, we keep getting deadlocks.

What’s going on here? Hint: it has to do with isolation levels and range locks.

Continue reading

Is a sort faster when the data is already sorted?

Whenever SQL Server needs to sort a data stream, it will use the Sort operator to reorder the rows of the stream. Sorting data is an expensive operation because it entails loading part or all of the data into memory and shifting that data back and forth a couple of times. The only time SQL Server doesn’t sort the data is when it already knows the data to be ordered correctly, like when it has already passed a Sort operator or it’s reading from an appropriately sorted index.

But what happens if the data is ordered correctly, but SQL Server doesn’t know about it? Let’s find out.

Continue reading

How to run your CTE just once, and re-use the output

You may already know that common table expressions, like views, don’t behave like regular tables. They’re a way to make your query more readable by allowing you to write a complex SQL expression just once, rather than repeating it all over your statement or view. This makes reading, understanding and future refactoring of your code a little less painful.

But they’re no magic bullet, and you may end up with some unexpected execution plans.

Continue reading

Consolidating grouped transactions into evenly sized batches

I recently worked with a large set of accounting transactions. I needed to split those rows into multiple logical batches, but each batch had to be logically consistent – among other things, those batches had to be properly balanced, because accounting people are kind of fussy like that.

So I designed a little T-SQL logic that would split all of those transactions into evenly sized batches, without violating their logical groupings.

Safety glasses on. Let’s dive in.

Continue reading

Fun with VALUES

There’s more to the VALUES clause in T-SQL than meets the eye. We’ve all used the most basic INSERT syntax:

INSERT INTO #work (a, b, c)
VALUES (10, 20, 30);

But did you know that you can create multiple rows using that same VALUES clause, separated by commas?

INSERT INTO #work (a, b, c)
VALUES (10, 20, 30),
       (11, 21, 31),
       (12, 22, 32);

Note the commas at the end of each line, denoting that a new row begins here. Because this runs as a single statement, the INSERT runs as an atomic operation, meaning that all rows are inserted, or none at all (like if there’s a syntax issue or a constraint violation).

I use this construct all the time to generate scripts to import data from various external sources, like Excel, or even a result set in Management Studio or Azure Data Studio.

Here’s something you can try:

  • Select a dataset from SSMS or Excel, copy it to the clipboard, and paste it into a new SSMS window.
  • Select just one of the tabs, then use the “find and replace” feature (Ctrl+H) in SSMS to replace all tabs with the text ', ' (including the apostrophes).
  • Now, add the text (' at the beginning of each line and '), at the end of each line. The last line obviously won’t need the trailing comma. If you’re handy with SSMS, you can do at least the leading values with a “box select”: holding down the Alt key as you make a zero-width selection over all the rows, then typing the text.

If all of this sounds like a lot of work for you, you might want to try out a little web hack that I wrote. It allows you to paste a tab-delimited dataset, just like the ones you get from Excel or the result pane in SSMS or ADS, into a window and instantly convert it into a T-SQL INSERT statement with the click of a button.

Copying a dataset from SSMS into table.strd.co, then converting it to a multi-row INSERT statement.

Pro tip: in SQL Server Management Studio, use Ctrl+Shift+C to copy not only the results, but also the column names!

Options

  • First row has headers: instead of inserting the first row of the raw data, the script uses it to map the INSERTed values to the correct columns in the destination table.
  • Fix nulls: Particularly when exporting from SSMS, we’ll lose information about which values are actually NULL and which ones are actually the text “NULL”. When this option is unchecked, the values will be treated as the text “NULL”, when checked, all values that consist entirely of the text “NULL” will have the surrounding apostrophes removed, so they become actual NULL values.
  • Pretty: adds some indenting spaces to the output code. This increases the script size by a few bytes, but increases readability.
  • Table name: Option table name to put in the INSERT INTO header of the script.

And to make sure you sleep well at night, the entire process on table.strd.co happens in the browser – nothing is ever uploaded to the Internet.