
The Optimizer, Statistics, and Correlated Predicates

Sometimes, a query with two predicates (filters) executes slowly, but if you remove one of the predicates it executes faster. Let’s look at why that may be, and one possible solution for it.

When you submit a query to SQL Server, the optimizer's job is to produce a good plan that returns accurate results quickly. One of the most important inputs to that plan is the number of rows the optimizer expects from each operator, referred to as cardinality. Cardinality is often determined from statistics: when a predicate references a column that has statistics on it, the optimizer can use those statistics to estimate the number of rows. For example, type the following query in the AdventureWorks2016 database, and select Display Estimated Execution Plan (Ctrl-L):
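The original script was not reproduced in this copy of the article. A minimal query consistent with the text might look like the following (Person.Address and its City column are standard AdventureWorks2016 schema; the exact column list is an assumption):

```sql
USE AdventureWorks2016;
GO
-- Single predicate: the optimizer estimates the row count for
-- City = 'Clackamas' from the statistics histogram on the City column.
SELECT AddressID, AddressLine1, City
FROM   Person.Address
WHERE  City = N'Clackamas';
```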

If you select properties of the Index Scan operator, you should see this:

The optimizer looked at the statistics on the City column and estimated that almost three rows will be returned for a value of ‘Clackamas’. You can look at the statistics on the City column with the following:
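The original statement was not reproduced here, and the statistics object's name depends on your database (it may be an auto-created statistic or an index statistic). A sketch that first locates a statistics object whose leading column is City, then displays its histogram, might look like this; the placeholder name must be substituted with the name returned by the first query:

```sql
-- Find statistics on Person.Address whose leading column is City
SELECT s.name
FROM   sys.stats AS s
JOIN   sys.stats_columns AS sc
       ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN   sys.columns AS c
       ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE  s.object_id = OBJECT_ID(N'Person.Address')
  AND  c.name = N'City'
  AND  sc.stats_column_id = 1;

-- Inspect the histogram (substitute the statistics name returned above)
DBCC SHOW_STATISTICS (N'Person.Address', N'<statistics name here>') WITH HISTOGRAM;
```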

Here are the partial results, with the pertinent row highlighted:

Note that ‘Clackamas’ (the value in our predicate) sorts between the histogram steps ‘Chula Vista’ and ‘Cliffside’, and the statistics record that each distinct value falling between those two steps has an average of 2.857143 rows (the AVG_RANGE_ROWS column). That’s exactly where the Estimated Number of Rows came from.

However, what happens when there is more than one predicate? That depends on which compatibility level you’re running, because the cardinality estimator was rewritten for SQL Server 2014. In prior versions of SQL Server, the cardinality estimator assumed independence between predicates and, as a result, would often estimate too low. The new cardinality estimator assumes correlation, where the value in one predicate column is assumed to be somewhat dependent on the value in the other. As a result, the new estimator doesn’t estimate as low as the earlier one, but it’s still just a calculated estimate. For example, in the following two scripts, the city of Bellevue and the PostalCode of 98004 are closely related (correlated). In the first script, I’ve set the compatibility level to 110 (SQL Server 2012).
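The first script was not reproduced in this copy of the article; a sketch consistent with the text, forcing the legacy cardinality estimator via compatibility level 110, might look like this (the column list is an assumption):

```sql
ALTER DATABASE AdventureWorks2016 SET COMPATIBILITY_LEVEL = 110;
GO
-- Two correlated predicates: Bellevue addresses fall almost
-- entirely within PostalCode 98004.
SELECT AddressID, City, PostalCode
FROM   Person.Address
WHERE  City = N'Bellevue'
  AND  PostalCode = N'98004';
```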

Looking at the properties of the Index Scan, we see that the optimizer estimated one row, whereas 41 rows were actually returned.

That’s a significant difference and could result in the optimizer making a poor plan choice under some circumstances. If we execute the query again with the post-SQL Server 2014 cardinality estimator, we get a higher row estimate. Compatibility level 150 in the following script corresponds to SQL Server 2019, Azure SQL Database – single database, and Azure SQL Database – managed instance.
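The second script was also not reproduced here; it would be identical except for the compatibility level, something like:

```sql
ALTER DATABASE AdventureWorks2016 SET COMPATIBILITY_LEVEL = 150;
GO
-- Same correlated predicates, now estimated by the new
-- cardinality estimator.
SELECT AddressID, City, PostalCode
FROM   Person.Address
WHERE  City = N'Bellevue'
  AND  PostalCode = N'98004';
```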

Looking again at the properties of the Index Scan, we see that the optimizer estimated almost two rows.

Although this is a little better than the estimate from pre-SQL Server 2014, it’s still not very close to the actual count of 41. So, what can we do?

This is a perfect case for creating your own statistics covering both columns. Following Microsoft’s lead, I prefixed my statistics name with ‘_OR’ since I live in Oregon:
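The CREATE STATISTICS statement itself is missing from this copy; a sketch consistent with the text would be the following, where the name `_OR_Address_City_PostalCode` is illustrative, not the author's actual name:

```sql
-- Multi-column statistics capture the correlation between
-- City and PostalCode that single-column statistics cannot.
CREATE STATISTICS _OR_Address_City_PostalCode
ON Person.Address (City, PostalCode);
```

As the summary below notes, the order of the columns in the statistics definition can matter, so it is worth experimenting with both orderings.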

Now the cardinality estimator makes a much better guess (29.1441), with the same estimate in both 110 and 150 compatibility modes.

Statistics aren’t free, but the cost of a statistics object is very low. A statistics object only consumes one 8K page and requires occasional automatic updates. This is a small price to pay for improved performance.

In summary, I’ve shown how you can often get much better cardinality estimates for correlated predicates by creating statistics on the multiple columns involved. For best results, you may want to experiment with the order of the columns within the statistics definition.


Vern Rabe
About the author

Vern Rabe is an independent SQL Server consultant and contract trainer in Portland, OR, and leader of the OregonSQL user group. He has attained MCSE, MCITP (both Administration and Development), and MCT certifications, among others. Vern has been passionate about databases since 1992 and has worked with SQL Server since version 4.21a. He provides broad technical SQL Server knowledge gained from the mixture of academic and practical experiences acquired from his classroom instructing and varied consulting contracts. Vern was honored to have one of the Top 9 PASS Summit sessions in 2017.
