The compelling case for using heaps

I’m an outspoken advocate of always using a clustered index on each and every table you create as a matter of best practice. But even I will agree that there’s a case for using the odd heap now and then.

The great thing with (clustered) indexes

There are numerous advantages to clustering your table over just using a heap,

  • If you put a little thought into the clustered index, your most common and/or heavy queries won’t have to scan the table. Instead, they’ll use the clustered index (or a suitable non-clustered index) to seek directly to the correct row or range.
  • Similarly, the data is already ordered in the table.
  • Indexes often help with joins as well.
  • Indexes come with statistics, which helps SQL Server build better query plans.
  • Inserting records into a heap (row by row – think SSIS transformation) may leave you with half-full pages because SQL Server stores just a rough figure of how much of the page is actually used, not the exact number of bytes.
  • Updating rows may cause page splits in a clustered index, but at least you won’t have forwarded records like you get in heaps.
  • Updates on heaps might also change the RID on the row (if it has to move to another page), which requires updating every non-clustered index on the table.
  • Deleting rows from a clustered table will clean up the “ghost” records, whereas it will leave big gaping holes in a heap.

What about heaps?

So, you might ask, when should I use heaps? The slightly simplified answer: when most of the conditions below are met:

  • if you want to insert most of the data in a single statement or using a table lock,
  • if you don’t want to have to order the data when it goes into the table,
  • if you don’t need the data ordered or aggregated (except scalar aggregates) in any particular way when it goes out,
  • if you’re going to read most of the data at once,
  • if you’re not going to join the table to another table of significant size (i.e. more than a few rows).
  • if you don’t have to share the table or even database with a significant number of other user sessions (to avoid PFS or IAM page contention).

You think the above sounds a bit like an edge case? It is. And with all this in mind, you might still get worse performance with heaps than with a clustered index. Your mileage will vary.

For every other situation, do yourself and the world a favor and add a clustered index to every table. It doesn’t even have to be unique if that’s a problem.

How do I look for heaps?

Here’s a quick way to identify all the heaps in a given database:

SELECT s.[name]+N'.'+t.[name] AS [Table],
       i.[type_desc] AS [Index/heap],
       (CASE WHEN MAX(p.partition_number) OVER (
                  PARTITION BY p.[object_id])>1
             THEN p.partition_number END) AS [Partition],
       p.data_compression_desc AS [Compression],
       p.[rows] AS [Row count]
FROM sys.schemas AS s
INNER JOIN sys.tables AS t ON s.[schema_id]=t.[schema_id]
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id] AND i.index_id IN (0, 1)
INNER JOIN sys.partitions AS p ON i.[object_id]=p.[object_id] AND i.index_id=p.index_id
ORDER BY s.[name], t.[name];

Now, go forth and cluster.

5 thoughts on “The compelling case for using heaps

  1. We have a few tables whose primary key is a uniqueidentifier. We made the decision not to use a clustered index on the primary key based upon the advice that uniqueidentifier indexes are space-consuming and cause index fragmentation easily. Performance has been just fine, even though those tables are HEAP. We didn’t bother creating a cluster index on a small field in the table if it didn’t make sense. So which “best practice” wins? not clustering uniqueidentifiers or always have a clustered index on the table?

    • Without knowing the specifics of your solution, I would say that using a heap or a clustered index mostly depends on your usage patterns, not so much the type of data you’re storing. Are you truncating and bulk inserting a pile of data (like an ETL process), is it a dimension or fact table in a datawarehouse application, or is it a transaction table that sees OLTP workloads?

      In the OLTP case, I would definitely try to find a “natural” primary key that makes sense (if you have one).

      I’ve seen customers make good use of a highly non-unique clustered index on a “timestamp”/”load date”/”batch number” column, and this facilitates ETL loads where you want to get all the rows from a specific ETL batch.

      • It is a live database of medical records (OLTP). For example, the patient visits are stored in a PatientContact table, whose primary key is a guid PatientContactID. Another field in the table is the PatientID guid which naturally links up to the patient table. When I ran your index/heap query, the PatientContact table came up as HEAP since we used a non-clustered primary key on the guid PatientContactID (and a non-clustered index on PatientID). We’ve never had any performance problems.

        The issue I was looking to discuss is when you have competing “best-practices”. There is a good school of thought that says your primary key should have no meaning, thus a guid or an auto-incrementing integer does the trick. You think a “natural” primary key makes sense presumably so it can be the clustered index. An auto-incrementing integer is small enough that it is often used as the clustering primary key because it also satisfies the “best practice” of having a primary key with no meaning. We ended up choosing a guid because it is easier to move records around between databases without worrying about primary key collisions.

      • Yeah, I see what you mean.

        I personally am certainly not a fan of guids, I would probably bite the bullet and cluster on the guid (with regular rebuilds). After all, I’m assuming you have a non-clustered index on your guid-columns? So you still get the same fragmentation problem. Plus the issues with using heaps.

        “Best practice” isn’t a universal truth, and as you’ve noticed, it can even be conflicting. :)

  2. I may be assuming here but if the PatientContact is a child of Patient then I would put the Clustered index on PatientId and maybe uniquefy it by adding PatientContactId. So when you query the visits for a patient it would do minimal page reads using a Clustered Index Seek (3 – 7 page reads depending on table size) instead of multiple RID lookups( 3-5 page reads per visit depending on table size) to return all the PatientContactIds. The more contacts/visits a patient has the more this will reduce reads for the query. Don’t forget to add FILLFACTOR of 70-80 to prevent page splits. Also always add FILLFACTOR when indexing GUIDS clustered or nonclustered to prevent table or index splitting.

Let me hear your thoughts!

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s