There are basically three main types of joins in SQL Server, each one of which has its own characteristics, advantages and drawbacks. You’ll find these three operators in query plans. Knowing how different types joins work will help you optimize your queries, indexing and how to store your data.
First off, a disclaimer: This water gets as deep as you want it to be. I’ll provide a medium-high level overview, but if you google the subject, you’ll find some very technical deep-dives into the subject.
The HASH JOIN
This join type is probably the most common one that you will encounter. It uses a hash table to aid in joining. The join works in two phases, the build phase and the probe phase. We will refer to the two tables to be joined as the build table (commonly the smaller of the two) and the probe table. For a hash join to work, at least one of the join conditions will need to be a equijoin, that is, two columns that are equal (=) to each other.
In the build phase, the server calculates hash values for the build table. Think of this as a temporary index on the join columns. This calculated hash table is then stored in memory or swapped out to physical storage in tempdb, depending on its size. This is the main potential drawback of the hash join – if your tables are really large, this join will consume quite a bit of memory.
When the build phase is complete, the server will enter the probe phase. This phase calculates hash values for each row in the probe table and tries to match those values to corresponding hash value in the build table, completing the join. If the entire build table fits into memory, the join is called an in-memory hash join. If it has to be split to disk, it’s referred to as a grace hash join. In some cases, where you have really large amounts of data, the server will create a number of recursive merge joins, called a recursive hash join.
Summary: The HASH JOIN uses hash values of the join keys as go-betweens in order to improve join performance, but this process requires memory for the build hash table. Check out Craig Freedman’s excellent blog post on MSDN for more details.
The LOOP JOIN
The nested loop join, or loop join for short, works a lot like the hash join, but without the hash tables. The name derives from how the join actually works as two loops.
The outer loop goes through every row in the outer table. For each row in the outer table, there’s a loop that tries to match rows in the inner table. This means that the cost of a loop join increases dramatically with volume. Because of this, nested loop joins are best used when one or prefereably both tables are fairly small.
Summary: The LOOP JOIN does not really sport excellent performance, but it’s the only join you can use if you don’t have any equijoin, and it’s the only way to execute a CROSS JOIN. Again, there’s a great article by Craig Freedman if you want to dig deeper into the subject.
The MERGE JOIN
It’s no secret that this is my favourite join, because it sports excellent performance and it scales beautifully with volumes, but it may be tricky to achieve with just any set of data. The merge join works by aligning the two input tables, and this requires that both tables are sorted by their respective join columns (either by using a clustered index or a covering index) and you need at least one equijoin. Also, the joining columns need to have exactly the same data types, or SQL Server will have to make an implicit conversion, at which point it’ll just probably choose a different join method instead.
Summary: The merge join is very, very fast, but the data needs to be arranged just in the right order.
Forcing a join type in T-SQL
You can manually provide a query hint if you want SQL Server to choose a specific join method, though I would strongly advise against it.
SELECT * FROM dbo.table_a AS a INNER HASH JOIN dbo.table_b AS b ON a.k=b.k SELECT * FROM dbo.table_a AS a INNER LOOP JOIN dbo.table_b AS b ON a.k=b.k SELECT * FROM dbo.table_a AS a INNER MERGE JOIN dbo.table_b AS b ON a.k=b.k
So far, I haven’t encountered a situation where you would need to manually specify the join method. The SQL Server optimizer is really, really clever at finding the right join strategy for a given situation. However, you can use the above join hints together with the query plan to try to find out why SQL Server does not automatically select the type of join you were hoping for. For example, if you force an INNER MERGE JOIN, and the operator right before the join is a “Sort”, this indicates that you haven’t set up the sort order of the indexes properly. Similarly, if you see a “Calculate scalar”, this could indicate that you’re performing an implicit data type conversion.
5 thoughts on “JOIN types and performance”
Pingback: About the spool operator | Sunday morning T-SQL
Pingback: HASH JOIN deep-dive « Sunday morning T-SQL
Pingback: Comparing nullable columns | Sunday morning T-SQL
Pingback: Video: three SQL Server join operators in three minutes | Sunday morning T-SQL
Pingback: Prioritizing rows in a union | sqlsunday.com