The compelling case for using heaps

I’m an outspoken advocate of always using a clustered index on each and every table you create as a matter of best practice. But even I will agree that there’s a case for using the odd heap now and then.

The great thing with (clustered) indexes

There are numerous advantages to clustering your table over just using a heap,

  • If you put a little thought into the clustered index, your most common and/or heavy queries won’t have to scan the table. Instead, they’ll use the clustered index (or a suitable non-clustered index) to seek directly to the correct row or range.
  • Similarly, the data is already ordered in the table.
  • Indexes often help with joins as well.
  • Indexes come with statistics, which helps SQL Server build better query plans.
  • Inserting records into a heap (row by row – think SSIS transformation) may leave you with half-full pages because SQL Server stores just a rough figure of how much of the page is actually used, not the exact number of bytes.
  • Updating rows may cause page splits in a clustered index, but at least you won’t have forwarded records like you get in heaps.
  • Updates on heaps might also change the RID on the row (if it has to move to another page), which requires updating every non-clustered index on the table.
  • Deleting rows from a clustered table will clean up the “ghost” records, whereas it will leave big gaping holes in a heap.

What about heaps?

So, you might ask, when should I use heaps? The slightly simplified answer: when most of the conditions below are met:

  • if you want to insert most of the data in a single statement or using a table lock,
  • if you don’t want to have to order the data when it goes into the table,
  • if you don’t need the data ordered or aggregated (except scalar aggregates) in any particular way when it goes out,
  • if you’re going to read most of the data at once,
  • if you’re not going to join the table to another table of significant size (i.e. more than a few rows).
  • if you don’t have to share the table or even database with a significant number of other user sessions (to avoid PFS or IAM page contention).

You think the above sounds a bit like an edge case? It is. And with all this in mind, you might still get worse performance with heaps than with a clustered index. Your mileage will vary.

For every other situation, do yourself and the world a favor and add a clustered index to every table. It doesn’t even have to be unique if that’s a problem.

How do I look for heaps?

Here’s a quick way to identify all the heaps in a given database:

SELECT s.[name]+N'.'+t.[name] AS [Table],
       i.[type_desc] AS [Index/heap],
       (CASE WHEN MAX(p.partition_number) OVER (
                  PARTITION BY p.[object_id])>1
             THEN p.partition_number END) AS [Partition],
       p.data_compression_desc AS [Compression],
       p.[rows] AS [Row count]
FROM sys.schemas AS s
INNER JOIN sys.tables AS t ON s.[schema_id]=t.[schema_id]
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id] AND i.index_id IN (0, 1)
INNER JOIN sys.partitions AS p ON i.[object_id]=p.[object_id] AND i.index_id=p.index_id
ORDER BY s.[name], t.[name];

Now, go forth and cluster.

5 thoughts on “The compelling case for using heaps

  1. We have a few tables whose primary key is a uniqueidentifier. We made the decision not to use a clustered index on the primary key based upon the advice that uniqueidentifier indexes are space-consuming and cause index fragmentation easily. Performance has been just fine, even though those tables are HEAP. We didn’t bother creating a cluster index on a small field in the table if it didn’t make sense. So which “best practice” wins? not clustering uniqueidentifiers or always have a clustered index on the table?

    • Without knowing the specifics of your solution, I would say that using a heap or a clustered index mostly depends on your usage patterns, not so much the type of data you’re storing. Are you truncating and bulk inserting a pile of data (like an ETL process), is it a dimension or fact table in a datawarehouse application, or is it a transaction table that sees OLTP workloads?

      In the OLTP case, I would definitely try to find a “natural” primary key that makes sense (if you have one).

      I’ve seen customers make good use of a highly non-unique clustered index on a “timestamp”/”load date”/”batch number” column, and this facilitates ETL loads where you want to get all the rows from a specific ETL batch.

      • It is a live database of medical records (OLTP). For example, the patient visits are stored in a PatientContact table, whose primary key is a guid PatientContactID. Another field in the table is the PatientID guid which naturally links up to the patient table. When I ran your index/heap query, the PatientContact table came up as HEAP since we used a non-clustered primary key on the guid PatientContactID (and a non-clustered index on PatientID). We’ve never had any performance problems.

        The issue I was looking to discuss is when you have competing “best-practices”. There is a good school of thought that says your primary key should have no meaning, thus a guid or an auto-incrementing integer does the trick. You think a “natural” primary key makes sense presumably so it can be the clustered index. An auto-incrementing integer is small enough that it is often used as the clustering primary key because it also satisfies the “best practice” of having a primary key with no meaning. We ended up choosing a guid because it is easier to move records around between databases without worrying about primary key collisions.

        • Yeah, I see what you mean.

          I personally am certainly not a fan of guids, I would probably bite the bullet and cluster on the guid (with regular rebuilds). After all, I’m assuming you have a non-clustered index on your guid-columns? So you still get the same fragmentation problem. Plus the issues with using heaps.

          “Best practice” isn’t a universal truth, and as you’ve noticed, it can even be conflicting. :)

  2. I may be assuming here but if the PatientContact is a child of Patient then I would put the Clustered index on PatientId and maybe uniquefy it by adding PatientContactId. So when you query the visits for a patient it would do minimal page reads using a Clustered Index Seek (3 – 7 page reads depending on table size) instead of multiple RID lookups( 3-5 page reads per visit depending on table size) to return all the PatientContactIds. The more contacts/visits a patient has the more this will reduce reads for the query. Don’t forget to add FILLFACTOR of 70-80 to prevent page splits. Also always add FILLFACTOR when indexing GUIDS clustered or nonclustered to prevent table or index splitting.

Leave a Reply to Mike R Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.