The compelling case for using heaps

I’m an outspoken advocate of always using a clustered index on each and every table you create as a matter of best practice. But even I will agree that there’s a case for using the odd heap now and then.

The great thing with (clustered) indexes

There are numerous advantages to clustering your table over just using a heap,

  • If you put a little thought into the clustered index, your most common and/or heavy queries won’t have to scan the table. Instead, they’ll use the clustered index (or a suitable non-clustered index) to seek directly to the correct row or range.
  • Similarly, the data is already ordered in the table.
  • Indexes often help with joins as well.
  • Indexes come with statistics, which helps SQL Server build better query plans.
  • Inserting records into a heap (row by row – think SSIS transformation) may leave you with half-full pages because SQL Server stores just a rough figure of how much of the page is actually used, not the exact number of bytes.
  • Updating rows may cause page splits in a clustered index, but at least you won’t have forwarded records like you get in heaps.
  • Updates on heaps might also change the RID on the row (if it has to move to another page), which requires updating every non-clustered index on the table.
  • Deleting rows from a clustered table will clean up the “ghost” records, whereas it will leave big gaping holes in a heap.

What about heaps?

So, you might ask, when should I use heaps? The slightly simplified answer: when most of the conditions below are met:

  • if you want to insert most of the data in a single statement or using a table lock,
  • if you don’t want to have to order the data when it goes into the table,
  • if you don’t need the data ordered or aggregated (except scalar aggregates) in any particular way when it goes out,
  • if you’re going to read most of the data at once,
  • if you’re not going to join the table to another table of significant size (i.e. more than a few rows).
  • if you don’t have to share the table or even database with a significant number of other user sessions (to avoid PFS or IAM page contention).

You think the above sounds a bit like an edge case? It is. And with all this in mind, you might still get worse performance with heaps than with a clustered index. Your mileage will vary.

For every other situation, do yourself and the world a favor and add a clustered index to every table. It doesn’t even have to be unique if that’s a problem.

How do I look for heaps?

Here’s a quick way to identify all the heaps in a given database:

SELECT s.[name]+N'.'+t.[name] AS [Table],
       i.[type_desc] AS [Index/heap],
       (CASE WHEN MAX(p.partition_number) OVER (
                  PARTITION BY p.[object_id])>1
             THEN p.partition_number END) AS [Partition],
       p.data_compression_desc AS [Compression],
       p.[rows] AS [Row count]
FROM sys.schemas AS s
INNER JOIN sys.tables AS t ON s.[schema_id]=t.[schema_id]
INNER JOIN sys.indexes AS i ON t.[object_id]=i.[object_id] AND i.index_id IN (0, 1)
INNER JOIN sys.partitions AS p ON i.[object_id]=p.[object_id] AND i.index_id=p.index_id
ORDER BY s.[name], t.[name];

Now, go forth and cluster.

10 thoughts on “The compelling case for using heaps

  1. We have a few tables whose primary key is a uniqueidentifier. We made the decision not to use a clustered index on the primary key based upon the advice that uniqueidentifier indexes are space-consuming and cause index fragmentation easily. Performance has been just fine, even though those tables are HEAP. We didn’t bother creating a cluster index on a small field in the table if it didn’t make sense. So which “best practice” wins? not clustering uniqueidentifiers or always have a clustered index on the table?

    • Without knowing the specifics of your solution, I would say that using a heap or a clustered index mostly depends on your usage patterns, not so much the type of data you’re storing. Are you truncating and bulk inserting a pile of data (like an ETL process), is it a dimension or fact table in a datawarehouse application, or is it a transaction table that sees OLTP workloads?

      In the OLTP case, I would definitely try to find a “natural” primary key that makes sense (if you have one).

      I’ve seen customers make good use of a highly non-unique clustered index on a “timestamp”/”load date”/”batch number” column, and this facilitates ETL loads where you want to get all the rows from a specific ETL batch.

      • It is a live database of medical records (OLTP). For example, the patient visits are stored in a PatientContact table, whose primary key is a guid PatientContactID. Another field in the table is the PatientID guid which naturally links up to the patient table. When I ran your index/heap query, the PatientContact table came up as HEAP since we used a non-clustered primary key on the guid PatientContactID (and a non-clustered index on PatientID). We’ve never had any performance problems.

        The issue I was looking to discuss is when you have competing “best-practices”. There is a good school of thought that says your primary key should have no meaning, thus a guid or an auto-incrementing integer does the trick. You think a “natural” primary key makes sense presumably so it can be the clustered index. An auto-incrementing integer is small enough that it is often used as the clustering primary key because it also satisfies the “best practice” of having a primary key with no meaning. We ended up choosing a guid because it is easier to move records around between databases without worrying about primary key collisions.

        • Yeah, I see what you mean.

          I personally am certainly not a fan of guids, I would probably bite the bullet and cluster on the guid (with regular rebuilds). After all, I’m assuming you have a non-clustered index on your guid-columns? So you still get the same fragmentation problem. Plus the issues with using heaps.

          “Best practice” isn’t a universal truth, and as you’ve noticed, it can even be conflicting. :)

  2. I may be assuming here but if the PatientContact is a child of Patient then I would put the Clustered index on PatientId and maybe uniquefy it by adding PatientContactId. So when you query the visits for a patient it would do minimal page reads using a Clustered Index Seek (3 – 7 page reads depending on table size) instead of multiple RID lookups( 3-5 page reads per visit depending on table size) to return all the PatientContactIds. The more contacts/visits a patient has the more this will reduce reads for the query. Don’t forget to add FILLFACTOR of 70-80 to prevent page splits. Also always add FILLFACTOR when indexing GUIDS clustered or nonclustered to prevent table or index splitting.

  3. Although this is a short blog entry, it contains several misconceptions and bad assumptions. The reader should be aware that these support the author’s bias towards clustered tables and effectively reverse the outcome of the recommendations.

    1. Updates on heaps might also change the RID on the row (if it has to move to another page), which requires updating every non-clustered index on the table.
    The RID always points to a page and is never updated in the indexes. If the row is updated and no longer fits the page, it is removed from the original page and an RID is left in its place. On an index lookup, this RID is seen to be a forwarding pointer and it is followed to the new data page. This mechanism exists precisely to avoid updating all the indexes.

    2. I disagree with all the conditions placed on heaps. Some unrealistic assumptions a built in. In most applications, the natural key, say IdNumber, is rarely something that you need to scan in order. What sense is there, for example, in wanting to list out all people in order by Social Security Number? Often the key is replaced by a surrogate key, such as a GUID or IDENTITY. Then it is even more unlikely that the table will need be scanned in order through the leaf nodes and returned in PK order. This removes one of the supposed benefits of having a clustered table in the first place. All accesses will, almost invariably, be performed by accessing the data by other indexes. Each of those indexes will terminate in a leaf node containing the full PK of the clustered row, and then the clustering index must also be traversed to find the appropriate leaf node that contains the data.

    In most cases I would suggest that the user consider using a heap table first, and carefully consider whether any of the purported advantages of clustered indexes are actually possible to realize in practice.

    The two things you want to be aware of in heaps are 1) try not to overwrite rows with longer data, to minimize forwarding pointers, and 2) be aware that the algorithms for finding free space for the next row tend to leave the data pages partially empty. However, as time goes on, this should be balanced out by deletions of old data and insertion of new data. Don’t freak out and think you need to be concerned about this “fragmentation”, but you may want to consider ways to reduce forwarding pointers. (For example, if your fields are CHAR instead of VARCHAR, then the size of the row can never grow.)

    • My bias towards clustered indexes is clear from the very first sentence of the post. I hardly think I’m fooling anyone in this regard.

      To address your first point, any update in a heap that requires the data to move to a new page will require a) adding a forwarding record, then b) writing the that new information in an available free space. This is bad for performance, for a number of reasons: It’s two writes instead of one, and you need to find (and possibly allocate) new space for the forwarded record, which hits the allocation map pages. In this respect, it’s pretty common to see contention in these allocation maps for transaction-intensive databases.

      What’s worse, if you have a large number of non-clustered indexs, the forwarding record is applied to each of those indexes. In a clustered scenario, only indexes affected by the update would need to be updated.

      To your second point: The usefulness of scanning an ordered (clustered) index doesn’t derive from ordering natural keys like social security numbers, but indeed synthetic keys like identity columns that you use when joining or aggregating tables. This is particularly notable in reporting applications. As an aside, using a guid as a clustering key is just terrible practice as it has most of the downsides that a heap has.

      The assertion that insertion of new data balances out the deletion of old data is only correct for heaps if you rebuild your table intermittently, like in a nightly maintenance job. This automatic balancing is one of the features of the b-tree structure in a clustered index, but not in heaps. Fragmentation is not really an issue in modern SQL Servers, particularly with SSD drives. But data growth and increasing IO operations are.

      In closing, avoiding updates that would increase the length of a row is just not practical in a real-world database – it would require some seriously ugly architecture, like using char datatypes for data that isn’t fixed in length, not using NULL values, etc.

  4. Pingback: SQL Staging Tables: Primary Key Clustered or Heap

  5. Pingback: SQL Staging Tables: Primary Key Clustered or Heap

  6. Pingback: How to start indexing a new data warehouse

Let me hear your thoughts!

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.