A tip about intersect

This article first appeared in the SQLServerGeeks Magazine.
Author: Erland Sommarskog
Subscribe to get your copy.

This article is based on something I originally learnt from a blog post from my MVP colleague Paul White. I also need to thank Diethard Michaelis who was kind to point me to Paul’s post.

In this month’s tip you will learn a compact way to express your query when you want to handle NULL like any other value in comparisons. Let’s start with setting up a demo table:

USE tempdb
go
DROP TABLE IF EXISTS DemoTable
CREATE TABLE DemoTable(a int NOT NULL,
                       b int NOT NULL,
                       c int NULL,
                       d int NULL,
                       CONSTRAINT pk_DemoTable PRIMARY KEY (a, b),
                       INDEX c_d_ix NONCLUSTERED (c, d)
)
go
INSERT DemoTable (a, b, c, d)
 SELECT object_id, column_id, 
          IIF(max_length = -1, NULL, abs(checksum(*) % 200)), 
          IIf(len(name) = 8, NULL, abs(checksum(*) % 100))
   	 FROM   sys.columns

This fills up DemoTable with a couple of hundreds of rows, exactly how many depends on your SQL Server version. You can run this SELECT to verify that some rows will have NULL in both of the columns c and d.

SELECT * FROM DemoTable WHERE c IS NULL AND d IS NULL

Here is a simple stored procedure to return data by the values in c and d:

CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM   DemoTable
WHERE  c = @c
  AND  d = @d

You can test the procedure with display of execution plan enabled:

EXEC GetDataByCD 64, 64

It returns a couple of rows, and you can see that the plan is an Index Seek.

Say now that the business rules for this table are such that it makes sense to also search for the rows where any of c or d has a NULL value. But if you try:

EXEC GetDataByCD NULL, NULL

You get no rows back, despite that the query above told us that there are matching rows. If we recall how NULL values work in SQL this is not surprising. NULL represents an unknown value, and when comparing a NULL with something else, even another NULL, we don’t know if they are equal or not. By chance, they could have could be the same unknown value, but then again, they may not. Therefore, the result is not TRUE or FALSE, but UNKNOWN by the three-valued logic of SQL. But rows are only returned if the condition in the WHERE clause evaluates to TRUE, and whence no rows are returned.

To solve this, we need to change the procedure. A popular way to solve the issue is to use isnull or coalesce. Here is an example with the former:

CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM   DemoTable
WHERE  isnull(c, -1) = isnull(@c, -1)
  	   AND  isnull(d, -1) = isnull(@d, -1)
go
EXEC GetDataByCD 64, 64
EXEC GetDataByCD NULL, NULL

Now we get back rows with the second call as well.

There are however two problems with this solution. One is that it relies on a “magic” value, that you assume is not present in the data. It is not uncommon that such assumptions are safe to make. For instance, we can see from how this table is filled up that there can be no negative values in c or d. However, casual use of this technique can lead to incorrect results when you are assumptions are wrong. The second problem is performance. If you check the execution plan, you will find that there is an Index Scan. The optimizer in SQL Server does not have any rule to untangle isnull or coalesce to permit an Index Seek.

Here is a better way to do this:

CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
SELECT a, b, c, d
FROM   DemoTable
WHERE  (c = @c OR c IS NULL AND @c IS NULL)
  	  AND  (d = @d OR d IS NULL AND @d IS NULL)
go
EXEC GetDataByCD NULL, NULL

Not only do we get the correct results. This time, we also get an Index Seek.

You may however find this a bit long-winded, and if you forget to place the parentheses correctly, you will get incorrect results.

The ANSI standard for SQL defines an operator that permits us to write this a little more compactly:

WHERE  c IS NOT DISTINCT FROM @c
  	  AND  d IS NOT DISTINCT FROM @d

This is analogous to the DISTINCT operator in SELECT, where all NULL values are handled as one and the same. This operator is implemented by for instance PostgreSQL, but it is not available in SQL Server. Maybe that is just as well, because as you may notice the operator is defined “backwards”; you have to put in a NOT when you are looking for equality. And it still is a bit verbose.

However, there is another alternative that permits for a more compact form, and that is available in SQL Server and that is the INTERSECT set operator. Before we apply it to this query, let’s first repeat the operator as such. Consider these tables:

CREATE TABLE #alpha(a int NULL)
INSERT #alpha(a) VALUES(1), (2), (3), (NULL)
CREATE TABLE #beta(a int NULL)
INSERT #beta(a) VALUES(3), (4), (5), (NULL)

We all know the UNION operator:

SELECT a FROM #alpha
UNION
SELECT a FROM #beta

This returns NULL, 1, 2, 3, 4, 5, that is, the union of the two sets. NULL and 3 that appear in both sets, occur only once in the result set, because UNION without ALL implies DISTINCT.

   
Note: The ANSI standard also defines INTERSECT ALL and EXCEPT ALL, analogous to UNION ALL,     but they are not implemented in SQL Server

Here I had only a single column to keep the examples simple. But as with UNION, you can have any number of columns in an INTERSECT operation. Therefore, we can write our stored procedure as:

SELECT a, b, c, d
FROM   DemoTable
WHERE  EXISTS CREATE OR ALTER PROCEDURE GetDataByCD @c int, @d int AS
(SELECT c, d INTERSECT SELECT @c, @d)
go
EXEC GetDataByCD NULL, NULL

The subquery with INTERSECT is logically evaluated for every row, and if the two sets have the same values, a row is returned, else not. And as discussed above, NULL is not a special case here, but works like any other value.

The query returns the correct result, but not only that: the execution plan uses an Index Seek.

What is interesting here is that if you have plenty of columns and there is one or more where you need to handle NULL as equal to NULL, you can use the above pattern for a very compact way of expressing  your WHERE clause. Or for that matter an ON condition in a join where you want to join on NULL values. At the same time, you can still expect the optimizer to use appropriate indexes.
Let’s now look at a situation where INTERSECT and EXCEPT really shine.

Here is one more demo table with similar, but not identical data. We also add one extra row to each table:

CREATE TABLE DemoTable2(a int NOT NULL,
                        b int NOT NULL,
                        c int NULL,
                        d int NULL,
                        CONSTRAINT pk_DemoTable2 PRIMARY KEY (a, b),
                        INDEX c_d_ix NONCLUSTERED (c, d)
)
go
INSERT DemoTable2 (a, b, c, d)
SELECT object_id, column_id, iif(len(name) = 8, NULL, abs(checksum(*) % 200)), 
          IIf(max_length = -1, NULL, abs(checksum(*) % 100))
   	FROM   sys.columns

INSERT DemoTable(a, b, c, d)
  		VALUES(10000, 1, 13, 13)
INSERT DemoTable2(a, b, c, d)
  		VALUES(10000, 2, 13, 13)
SELECT object_id, column_id, iif(len(name) = 8, NULL, abs(checksum(*) % 200)), 
          IIf(max_length = -1, NULL, abs(checksum(*) % 100))
   	FROM   sys.columns

INSERT DemoTable(a, b, c, d)
  		VALUES(10000, 1, 13, 13)
INSERT DemoTable2(a, b, c, d)
  		VALUES(10000, 2, 13, 13)

Say that these two tables are the result of two variations of a query or a stored procedure that you have rewritten to improve performance. You want to verify that the result is the same, because better performance is worth little if the result is wrong.

I touched at this problem in my article about full outer join last month, and I gave an outline for how to write a query to compare the tables. Applying this outline to this pair of tables we get:

SELECT *
FROM   DemoTable X
FULL OUTER JOIN DemoTable2 Y ON X.a = Y.a
                            AND X.b = Y.b
WHERE X.a IS NULL OR Y.a IS NULL OR
      		X.c <> X.c OR 
            		(X.c IS NULL AND Y.c IS NOT NULL) OR 
            		(X.c IS NOT NULL AND Y.c IS NULL) OR
      		X.d <> X.d OR 
            (X.d IS NULL AND Y.d IS NOT NULL) OR 
            (X.d IS NOT NULL AND Y.d IS NULL)

The first condition on one of the key columns is to find key values that are only in one of the tables. The remaining conditions serve to find difference in values, including differences related to NULL.

This is extremely tedious to write, and you can easily slip on one of the conditions. This is only two columns, but in a real-world case, you could easily have ten or twenty columns. It can be made shorter with isnull or coalesce, if you accept to rely on “magic” values. But even with these functions, the code will still be long-winded and error-prone.

However, we can make this a lot shorter with INTERSECT. And I mean a lot shorter:

SELECT *
FROM   DemoTable X
FULL OUTER JOIN DemoTable2 Y ON X.a = Y.a
                            AND X.b = Y.b
WHERE NOT EXISTS (SELECT X.* INTERSECT SELECT Y.*)

That’s not bad, is it?

Although, It is only for the special case when you want to compare each and every column that you can cut it this short with SELECT *. It is not uncommon, that you for one reason or another need to exempt a column or two from the comparison. (For instance, a datetime column holding the time when the row was created.) In that case you will need to list the columns twice, on both sides of the INTERSECT. But the query will still be a lot shorter compared to the pattern above.

Here I used NOT EXISTS + INTERSECT, but it also possible to write the query with EXISTS + EXCEPT:

SELECT *
FROM   DemoTable X
FULL OUTER JOIN DemoTable2 Y ON X.a = Y.a
                            AND X.b = Y.b
WHERE EXISTS (SELECT X.* EXCEPT SELECT Y.*)

Which form to use is a matter of taste.

When I first read Paul White’s blog post https://www.sql.kiwi/2011/06/undocumented-query-plans-equality-comparisons.html a few years back, I was amazed that I still could learn something about T-SQL that helped to enhance my technique for writing queries, and I have applied this method several times since then. I hope that you will also find use for the INTERSECT operator in your queries when you want to handle NULL values as equal.

This article first appeared in the SQLServerGeeks Magazine.
Author: Erland Sommarskog
Subscribe to get your copy.

   

Leave a Reply

Your email address will not be published. Required fields are marked *