Let's say you are running SQL Server in a hosting environment where the instance is hosting hundreds of small databases, each belonging to a different client. Your hosting company is billing customers based on resource consumption, and IO is a critical resource. This means you need to find out how much IO is being consumed by each database. Sounds strange, right? As a database professional, you know that IOPs happen with workloads and queries, not by the database itself. But those workloads belong to a database, right? So if we sum up the IO consumption (reads & writes) of all the workloads and group them by the database, we can try to achieve "database-wise IO consumption". Hmm... this does not sound as meaningless as it was sounding a while ago. I know my audience. It is unlikely that you are working for a hosting company. So let us change the business a bit. Let's say your SQL Server instance is hosting multiple small databases for different applications running inside the IT infrastructure of your company serving different purposes. And you have the same requirement to find out which database is consuming the maximum reads & writes (overall IO consumption). So that you can focus your tuning efforts on workloads coming from those databases. Yeah, this sounds better. So in SQL Server, you can achieve this by using the DMV - sys.dm_io_virtual_file_stats. This DMV has important attributes like the total number of bytes read and written for each data file present in the database. Well, once you get to this, you can do the maths. Well, there is another DMV, sys.dm_exec_query_stats, which has IO numbers for each query but let's talk about that in another note. For this one, let's focus on sys.dm_io_virtual_file_stats.
Let's say you are running SQL Server in a hosting environment where the instance is hosting hundreds of small databases, each belonging to a different client. Your hosting company is billing customers based on the CPU consumption by their databases. This means you need to find out how much CPU is being consumed by each database. Sounds strange, right? As a database professional, you know that CPU cycles are consumed by workloads and queries, not by the database itself. But those workloads belong to a database, right? So if we sum up the CPU consumption of all the workloads and group them by the database, we can try to achieve "database-wise CPU consumption". Hmm... this does not sound as meaningless as it was sounding a while ago. I know my audience. It is unlikely that you are working for a hosting company. So let us change the business a bit. Let's say your SQL Server instance is hosting multiple small databases for different applications running inside the IT infrastructure of your company serving different purposes. And you have the same requirement to find out which database is consuming the maximum CPU cycles. So that you can focus your tuning efforts on workloads coming from those databases. Yeah, this sounds better. So in SQL Server, you can achieve this by using two DMV - sys.dm_exec_query_stats (QS) and sys.dm_exec_plan_attributes (PA). The QS DMV gives you the execution statistics for each query and the PA DMV gives you the value for each plan attribute, one of them being the database id. That's it. Now all that we need is to join them and extract the computed output for our requirement.
Page splits and fragmentation concepts were one of those early things we learned while learning about SQL Server architecture. Years later, with the advent of ultra-fast storage, the importance of these concepts has diminished but they are not irrelevant. Data in SQL Server is stored on 8 KB pages and sorted based on the clustering key/column(s). Assuming that the data is nicely arranged in an ordered fashion, fragmentation is zero or minimal. For example, when you create a new clustered index, SQL Server will re-arrange the pages and at that point, the pages are nicely laid down from start to end. Fragmentation is determined by the percentage of out-of-order pages vs the ordered pages. Page split is a phenomenon that leads to fragmentation and reading fragmented data is slower compared to defragmented data. How much slower? Is it a deal breaker? Well, it depends on a lot of factors. Let's come back to the page splits. When you try to update a record in such a fashion that its length increases and there is no space on the page to hold it anymore, SQL Server will add a new blank page at the end of the file and move approximately fifty percent of the data from the existing page (where the modification was happening) to the new page. This will make space for the current record to expand. This operation of adding a new page and records moving around is called a page split. And you can now see how the data starts to get unordered. The rows that were nicely sitting in the correct order are now relocated at the end of the file to a new page and are no more in the order, thus increasing fragmentation.
Life is just so easy when you have to deal with single-column indexes. They are lovable. In contrast, multi-column indexes bring so many complexities along with them. Let's dig a bit deeper. In a multi-column index (col1, col2), the data is sorted by the first column. So if you try to order by col2, the optimizer will deploy a SORT operator to sort the data. And of course, if you are ordering by Col1, that data is pre-sorted and the SORT operator is not needed. In a multi-column index, the order of columns is important. Col1, Col2 is not the same as Col2, Col1. Let's say the structure is Col1, Col2. Now if both Col1 & Col2 are used as predicates (they are in your WHERE clause), the engine can seek on Col2 only if it can seek on Col1. Yes, that's a condition and a rule of the optimizer, so to say. If you omit Col1 from your query, you cannot seek on Col2. The optimizer will resort to an Index Scan, not Index Seek. Also, the equality operator plays a critical role here. If both the columns are in the WHERE clause, the optimizer can seek on Col2 only if Col1 is using the equality comparison operator. So there is a concept of partial seek vs full seek. Unfortunately, the execution plan does not depict this clearly. So if the optimizer cannot seek on a column, it is called a residual predicate.
Implicit conversions are not a good thing for SQL Server. Your queries are silently suffering from performance issues and consuming extra CPU cycles while implicitly converting from one data type to another. For example, if you are trying to assign an INT to a VARCHAR variable, the INT value will automatically be converted to the VARCHAR type. When you are trying to assign a VARCHAR value to an INT type, the VARCHAR will get converted to an INT type if it is possible to do so. This means if the VARCHAR holds a numeric value, the conversion will work. And if it holds character data, SQL Server will throw an error. Therefore the resulting data type depends on the statement itself. For example, for comparison operators or other expressions, the resulting data type depends on the rules of data type precedence. Well, whatever the scenario be, the moot point is that there is a performance penalty that is paid when an implicit conversion happens as the engine has to determine the rules and take action. It is advisable you should use CAST or CONVERT function as appropriate and perform an explicit conversion.
The Buffer Pool is the largest memory consumer inside SQL Server. This memory is used to host the data and index pages. Each buffer is 8KB in size and the content of the data/index page is loaded in the buffer space. The buffer pool is mostly managed automatically and SQL Server does a great job there. You do not have too many knobs/switches to control the behavior other than setting the MIN and MAX memory at the instance level. Ok, I am excluding Resource Governor from the discussion here. The buffer pool expands as needed and thousands of pages keep moving in and out of the pool as the workload demands. You can track each buffer item using the DMV sys.dm_os_buffer_descriptors. Just like other DMVs, the output is quite overwhelming. If you look at the data items, you can figure out that using this DMV output you can group by the database and the object. In other words, you can figure out how much memory is consumed by a database and/or by the objects in that database. You can also figure out whether the page is clean (content unchanged) or dirty (content changed). You can track the total size of the buffer pool using the perf mon counter Total Server Memory under the Memory Manager object of the instance. With some applied mathematics, you can extract more memory numbers from the DMV sys.dm_os_process_memory. You can also track the buffer pool consumption using the memory clerk MEMORYCLERK_SQLBUFFERPOOL. You can clear the buffer pool memory using... no, for good reasons, I won't mention that 🙂
This is one among those thousands of architectural beauty inside SQL Server engine that make me fall in love with it. When you send your query to SQL Server, sometimes, it may need an extra memory grant for its execution. For example, let's say the query has an expensive sort to deal with. While the query optimizer is working on an efficient execution plan for your query, it knows that a lot of memory will be needed and just before execution it will want to reserve that memory upfront. So just before the execution begins, the SQLQERESERVATIONS Memory Clerk gets into action and reserves the required memory upfront. Now fathom this: your query might need GBs of memory grant. Does this mean that the clerk will be able to secure all the memory your query needs? In other words, will the engine grant all of it? No. There are other workloads too that need a slice of memory, CPU & IO. In such a case, your query will wait on the RESOURCE SEMAPHORE wait type for the memory grant.
A few days back I had written a note about CXPACKET wait type and SQL Server parallelism. You can find it here: https://bit.ly/ABSQLNotes. In today's note, I wish to pen down a few more thoughts about SQL Server parallelism and the CXPACKET wait type. Now, this is my observation in general and I am not 100% sure that irrespective of the stress SQL Server engine is under, it will always assign the number of threads equivalent to the number of virtual processors for a parallel query under default affinity. Let me elaborate. Let's say the CPU affinity is set to default, which means that SQL Server will use all the CPUs available to it. Now, let's say the engine is under heavy stress. The box has 8 virtual processors and there are more than 1000 user requests at any point in time. Overall CPU is choking at 100% since 8 processors have to serve so many threads. Clearly, the engine is under a heavy workload. Now, you fire a simple, inexpensive query for which the query optimizer decides a parallel plan, and during execution, it gets 8 threads even if the query would have just run fine with 4 threads. I mean, the engine is already reeling under heavy stress and it adds more to it. Do you see the same observation? Or have you witnessed that the execution engine takes an informed decision to assign an appropriate number of threads based on the current stressful condition of the server?
Really, cannot think about query tuning without these two DMVs - sys.dm_exec_query_stats and sys.dm_exec_procedure_stats. For the sake of simplicity, I will refer to them as QS and PS, respectively. Query Stats (QS) DMV will give you query-level metrics and Procedure Stats (PS) DMV will give you stored procedure-level metrics. DMV stands for Dynamic Management Views. Simply put, they give performance and internal metrics about the SQL Server engine. After a query completes execution, you can find how much CPU time it has taken, execution time, how much IO it has caused, etc, in QS DMV, along with many other metrics. Likewise, for a stored procedure, you can find similar metrics in PS DMV. Now let us say your SP has five SQL statements inside it. We are assuming five SELECT queries. For each SELECT statement, you will find its metrics in the QS DMV. And the overall metrics for the SP itself will be available in the PS DMV. So the QS DMV will have five records and the PS DMV will have one record. When an SP runs slow, essentially, a statement inside it might be running slow. So troubleshooting the SP performance is actually troubleshooting the statement/query performance. So that's the correlation between these two DMVs. It is important to extract the right information from them to be able to make sense if your query tuning efforts.
In SQL Server, you can create single-column and multi-column indexes. Life is beautiful and easy with single-column indexes and full of ups & downs with multi-column indexes. With single-column indexes, SQL Server has to deal with only one column - just go to a value in the column and start seeking from that point. With mult-column indexes, seeking happens from left to right. Let's say you have a three-column index on Customer table, columns being FirstName, LastName & EmailAddress. First things first, data is sorted by these three columns, exactly in the specified order. Therefore FirstName, LastName, EmailAddress is not the same as FirstName, EmailAddress, LastName - the structure and storage, both change. Next, let's talk about seeking. In a multi-column index, SQL Server will seek from left to right. This means if you wish to seek on LastName, you have to seek on FirstName too. That's the catch. SQL Server cannot directly jump to LastName bypassing FirstName. Likewsie, if it has to seek on EmailAddress, it has to seek on FirstName & LastName also. Remember, left to right. So what if you put only the LastName in the predicate. Let's say your query looks like SELECT * FROM CUSTOMER WHERE LastName = 'Bansal' - what will SQL Server do now? Can it seek? No it cannot. So? Yeah, you guess it right. It will scan the entire index.
CXPACKET stands for Class eXchange Packet. This is one of the wait types in SQL Server. It simply means that parallelism is happening inside SQL Server. Parallelism means that your query is using more than one thread. There is a server setting called "Cost Threshold of Parallelism" (CTP). If the cost of the query is more than then the value specified in CTP setting, SQL Server decides to use multiple threads to run your query. All these decisions are taken during the optimization phase of the query even before the query has started running. Let's say your server has 8 logical processors and SQL Server is configured to use all of them. Now when the query, which is marked to run in parallel, starts running, SQL Server execution engine will create 9 threads for this query, one for each logical processor. So why 9? The first thread, Thread0, will be the controller thread, and the remaining ones, Thread1 to Thread8 will be the workers that will do the job. Thread0 registers for the CXPACKET wait type. This is parallelism and CXPACKET wait type in action. Ok, so all good, but when does the CXPACKET wait type trigger? Well, in parallelism, the distribution of work is uneven among the threads. When one thread finishes its job earlier than the other, it waits on CXPACKET wait type, waiting for other threads to complete their work. There are other engine aspects too that register for the CXPACET wait type and now we are powered with another wait type called CXCONSUMER but all of this discussion is beyond the scope of this short note. CXPACKET wait time may or may not mean a performance problem - it is tricky business. A few other things need to be evaluated before you jump to a conclusion that the wait time of CXPACKET is very high and some optimization is required.
What is the difference between Processor % Processor Time and Process % Processor Time perfmon counters? Yes, it's confusing, but once we look at the nomenclature and understand the multi-processor architecture, it is easily understandable. First things first, both the counters are showing % Processor Time - the CPU consumption in percentage factor. The key difference is that the first counter shows it for the overall processor and the second one shows it for a specific process. For example: SQL Server, the executable, also runs as a process - sqlservr.exe. Now let us say that SQL Server is running on an 8-core box. For the sake of simplicity, let us call it a server with 8 logical processors. And if SQL Server is configured to consume all the processors (which is the default), the max value for Process % Processor Time perfmon counter can go up to 800 (for the sqlservr process). So that is 100 units for each processor. But what about Processor % Processor Time? Well, that is a simple metric of overall processor utilization in percentage and the max will not go beyond 100%. So lets say a CPU-intensive workload is running and the overall processor utilization is 80% (Processor % Processor Time), the value for Process % Processor Time (sqlservr) could be 480%. But the task manager for SQL Server process will show 60%. This is 480/800 * 100 = 60%.
When you are tuning queries, there are two fundamental concepts that you need to be well aware of - Density & Selectivity. In one of my previous notes, I talked about Selectivity. Today, let's talk about Density. In simple words, it measures the uniqueness of data in a column. Note that we are talking about per-column basis. For example, a column like EmailAddress will have more unique data compared to, say, Country. So it is about the uniqueness of the row taking into account the data present in the column. From the perspective of EmailAddress, your data is very unique, but from the perspective of Country, the data is less unique. High density means less unique data and low density means high unique data. Density ranges from 0 to 1.0. The formula is Density = 1/(Number of distinct values in a column). So what is this used for? Well, this computation helps in cardinality estimation. The cardinality estimator is a very complicated piece of code inside the SQL query optimizer that leverages the density factor for its computation. The density is pre-calculated and stored as part of statistics. Remember, the stats object is also based on column (column data).
When you are tuning queries, there are two fundamental concepts that you need to be well aware of - Density & Selectivity. Today, let's talk about Selectivity. In simple words, it means, how selective your predicates are. Low selectivity means your query is returning a high number of records and, high selectivity means your query is returning a low number of records. A lot of decisions that the optimizer takes depend on selectivity. A classic example could be Scan vs Seek. Typically, when a given predicate is returning a high number of records, the optimizer may deploy a Scan operator to read the data pages. Conversely, if the predicate returns a low number of records, the optimizer may deploy a Seek operator to read the data. Assumptions are being made here that there is a non-clustered index on the column used in the predicate. Given this simple example, if you were ever wondering why are you seeing a scan sometimes and seek other times, think Selectivity - it helps the optimizer take a cost-based decision.
Did you know that you can get multiple missing index hints in your execution plan? When you switch over to the execution plan tab, you can see one missing index hint on the top (in green color). When you right-click the hint -> Missing Index Details, you get the index definition script in a new query window. Most would think that that is the only missing index hint for your query. In reality, there could be multiple missing index hints at the same time for your given query/plan. To see all the missing index hints, right-click the SELECT operator, click Properties, scroll down to the Missing Index section in the properties window, and expand all the missing index hints. Not very intuitive, but still helpful.
What is an Index Range Scan? Let's say you have a clustered index on the OrderDate column. When you have a predicate that says WHERE OrderDate > 01/01/2022 ORDER BY OrderDate, the optimizer performs a range scan. It is simply scanning the logically ordered data from a specific point. Note that the data is always logically ordered by the clustering key using the linked list. The data may or may not be physically ordered, that is another story for another day. The linked list is used to perform the scan on the supposedly ordered data. But then why does the optimizer show an Index Seek iterator for range scans. Well, there is no separate iterator for range scans and the Index Scan iterator is not the same as range scan. Index seek iterator makes sense. The optimizer first seeks to the point 01/01/2022 in the traversal and then starts scanning from that point.
Continuing my efforts on penning down some notes on SQL Server IO, here is the most critical (probably) weapon in your arsenal - PAGEIOLATCH wait type - identifying workloads that are waiting on page read IO completion. There are many IO operations (and of different sizes) in SQL Server. One of the most common is reading data pages (8 KB) from the disk (when they are not already in the memory). This is physical IO. And as the data page is read, a latch is acquired to protect the physical integrity of the page. If your SELECT query is waiting to complete this operation due to IO bottleneck, then most likely, your query is waiting on PAGEIOLATCH wait type. In other words, PAGEIOLATCH wait type indicates a delay in reading data pages from the disk. Capturing this wait type can give you a lot of insights related to IO slowness. But yes, you need to know which queries are causing this wait type. The cumulative wait stats that you get from sys.dm_os_wait_stats might not be very helpful if you want to deeply diagnose the problem. You got to capture query-level wait stats using Extended Events.
Continuing my efforts on penning down some notes on SQL Server IO, here is another weapon in your arsenal - identifying workloads that are causing high IO (reads & writes). The DMV to be used here is dm_exec_query_stats. This DMV is overwhelming with just so many attributes to compute. Amongst many metrics, you have a bunch of around logical reads and writes & physical reads and writes & a few variations of them. A straightforward retrieval of these metrics may not be very helpful so as usual, we need to do a lot of mathematics and make the metrics meaningful. For example, if you see the attached image, we are computing the total physical reads caused by multiple executions of queries. Note the descending order to get the highest values first. In other words, who is causing the most physical IO. Hey, not logical IO, but the focus here is physical IO - reading from the disk. Likewise, you can compute physical writes, logical reads, logical writes, and a lot more.
In one of my previous #ABSQLNotes, I talked about computing latency via sys.dm_io_virtual_file_stats DMV - a critical metric to determine storage performance. Another way to measure latency is via the Average Disk Second Per Read perfmon counter. In simple words: how much time does it take to complete one read operation. And of course, this will be in milliseconds. The lower the better. The easiest way to compute this is to use the Performance Monitor counter: LogicalDiskAvg. Disk sec/Read. There are many guidelines on the internet talking about preferred values for this counter. It ideally should be less than 10 ms. With today's fast disks, even lesser. You need to create a baseline for this metric to get an average over time. And then slice it around for peak & non-peak hours. This value is computed for the entire disk and is not exclusive to SQL Server. But this is an important measurement of how fast the disks are able to complete their read operation.
For any database software, IO is so critical. The READS and the WRITES. And the amount of time it takes to get to the data - the latency factor. So in hindsight, the overall storage strategy plays such a critical role. You really want a very fast storage subsystem for your queries to go super fast. When I deal with IO metrics, latency comes on top of my mind. Simply put, latency is the time it takes for the IO request to be completed. Well, the storage custodians have many ways to calculate it and monitor it. As a SQL pro, how do you compute latency from inside SQL Server? Answer: leverage the sys.dm_io_virtual_file_stats DMV. This DMV has to two attributes - num_of_reads & io_stall_read_ms - for each data & log file in your SQL instance. So looking at the attributes, does it ring some bells? How can you compute latency now?
In one of my previous notes, I had talked about the histogram, which is one of the most critical things in the stats object. Histogram, as the name suggests, is a bucketing technique of how the column data is distributed - the highest value of a bucket, which defines a boundary, how many rows are between two boundary values, how many rows are equal to a specific boundary value, how many unique values are there between a range and, how many rows are there on average per distinct value. All of this helps the optimizer make the right estimates. Here is a sample histogram on TotalDue column of SalesOrderHeader table (not putting down all the columns for brevity):
RANGE_HI_KEY RANGE_ROWS EQ_ROWS
============= ============ =========
26.2769 40 142
30.1444 17 202
.
.
(more rows)
Now, let's we write a query:
SELECT * FROM Sales.SalesOrderHeader
WHERE TotalDue = 30.1444
The above query will return 202 rows. If check the cardinality estimation from the execution plan, you will observe the Estimated Number of Rows = 202 and the Actual Number of Rows=202. This is perfect and the best-case scenario for the optimizer where the estimate and the actual matched 100%.
How did this happen? Well, this is called Histogram Step Hit. The predicate value mentioned in the query (WHERE condition) has a step representation in the histogram and the optimizer does a perfect estimation with EQ_ROWS (the number of rows equal to the step value), which is 202.
The above is just a quick explanation of the ways how the optimizer leverages the histogram.
In the SQL Community, if you ask anybody about the Missing Index Hints of SSMS, people will give negative feedback and tell you that they are useless. Some frown upon those index hints. I will tell you otherwise. Yes, hints are hints. Don't follow them blindly, but look at them, evaluate and take a call. In most cases, those are covering indexes that are being recommended. Will you go ahead and create covering indexes just because SQL Server is telling you to do so. Not at all. Having a lot of covering indexes on a single table is a recipe for disaster (kind of). But sometimes those hints make complete sense and at other times, they are good candidates for a review. Listen, not everyone is an expert index tuner. So you need a starting point. Well, that hint is not a bad starting point. If I am you and am just getting started with index tuning, I will take a lot of those hints (as recommendations), copy/paste them into my workbook and look for common patterns based on my querying pattern. I am certain, that some decent index definitions may come out of it. Yeah, that could be a winner. So do not ignore hints completely. They are a good starting point.
You create indexes. SQL Server creates statistics. So many of these objects get created to support your query. Stats help in estimations. Indexes help the engine to get to the data faster. But are these objects really being used? Well, for indexes, you can track their usage using DMVs like sys.dm_db_index_usage_stats, sys.dm_db_index_operational_stats, etc. But for stats objects, it isn't so straightforward. You got to depend on a few trace flags like 9204, 9292, 2388, etc. These can tell you which STATS objects have been used by the optimizer for estimation purposes.
The SQL Optimizer is heavily dependent on Statistics for the estimates it makes and thereby a cheap execution plan. Remember, the optimizer is cost-based. Column statistics are automatically created when you create an index on a specific column. And even if you are not creating an index, yet you put the column in the predicate, SQL Server will first create stats for the column values (provided AUTO_CREATE_STATISTICS is ON which is the default). Either way, when you have single column stats or multi-column stats, the most critical thing is the histogram for the first column in your stats. Histogram, as the name suggests, is a bucketing technique of how the column data is distributed - the highest value of a bucket, which defines a boundary, how many rows are between two boundary values, how many rows are equal to a specific boundary value, how many unique values are there between a range and, how many rows are there on average per distinct value. All of this helps the optimizer make the right estimates and, of course, this needs to be up-to-date. If not, estimates can go wrong resulting in a poor plan choice.
You have just moved into a new residential society with 5 huge towers. Each tower has 50 floors and each floor has 2 apartments. That is a total of 500 families. You want to host a new year bash at your home and want to invite your neighbors spread across all the towers. But you are not sure what kind of cuisine they may prefer. You decide to go to each tower and randomly ask around their preferences. Each tower has 100 families/homes, so with your randomness, you land up talking to 30 families in each tower. Of course, talking to all the 500 families would have been a lot of work. Well, you did a great job in surveying 150 families. And now you have some data - people love all sorts of cuisines but what stands out is the Indian food 😉 followed by Italian and then Mexican. OK, you have some idea what to do. And you start your party preparations. On the day when all your neighbors are coming in, you are extremely pleased to see that the survey worked out so well and the food choices were just right. Lucky you. SQL Server does the same. When you want the data out of a table with a predicate (WHERE clause), it estimates the number of records that may be returned and generates a "possibly good" execution plan. This estimation is based on statistics the optimizer generated by surveying the column data on which you have the predicate. Exactly like you surveyed the families.
There are many types of memory allocations in SQL Server. One such memory grant is for query execution. The relational engine needs memory to execute your query. More memory is needed if you have expensive hash and sort operations. Let's take the example of the SORT operator. This is an expensive in-memory operation. And if enough memory is not available, the operation spills down to disk. The memory requirement increases and is directly proportionate to the size of the data it's sorting. Sorting a million rows will need more memory than sorting a thousand rows. Unfortunately, it has to sort all the data before it can output the first row. And that is why it is a blocking operator too. In the execution plan, when you take the cursor over the SELECT operator, you can see the memory grant for the query execution. You can also see the overall query cost. Keep an eye on the memory grant. Sometimes, it will run into GBs - one of the reasons for the sudden memory burst (requirement). The optimizer estimates the amount of memory needed based on the estimation of the data size it has to sort. This memory is reserved at the beginning of the query execution by a memory clerk called MEMORYCLERK_SQLQERESERVATIONS. You can track this clerk and many others using the DMV sys.dm_os_memory_clerks.
In one of my previous notes, I talked about Wait Types. Summary: When your query is waiting on resources, a specific wait type is assigned to it by SQL Server which you can track & capture. There are two DMVs (Dynamic Management Views) that show you real-time information about waits currently happening in the engine while the workload is running. sys.dm_os_waiting_tasks and sys.dm_exec_requests. While the former (os_waiting_tasks) is quite self-explanatory and shows a lot of details about the current waits happening, the latter (exec_requests) is also informative and has a column wait_type that shows the current wait encountered by the session. But there is a catch. Observe the nomenclature of the DMVs. "Task" vs "Session". Got it? No? sys.dm_os_waiting_tasks shows you task-level wait information and sys.dm_exec_requests shows you session-level wait information. What is the difference? Well, a request can be broken down into multiple tasks and thereby multiple threads, and each thread could be waiting on a different wait type (possible scenario). So sys.dm_os_waiting_tasks will show you one record for each task/thread and the specific wait type. But sys.dm_exec_requests will show you a record for the overall request and may mislead you by showing a generic wait type. Underlying the session/request, a lot is going on at the task/thread level. So when you are troubleshooting query slowness, focus on task-level waits, not session/request level.
Taking a very simplified approach, your query that you send to SQL Server, is either RUNNING or WAITING for resources. When it's waiting, it is still running, right? When you are on your way to office (think non-pandemic days 🙂 ), you wait for the elevator, you wait for the cab, etc. While you wait for many of these resources, you are still "on your way to office". It's the same with your query. In it's lifetime of "the run", it has to encounter many instances where it needs resources and it has to wait for them if those resources are not instantly available. SQL Server records these waits and you can capture them. This helps in understanding why was your query running "slow". Well, the query never runs "slow", it probably is just waiting for a resource. Let's say it wants access to a data page which is locked by another transaction. On a lighter note, you never reach late to the office, it’s the obstacles you face on your way ;). If you are able to track and capture what your query was waiting on, query tuning becomes a lot simpler.
There are many ways how you can troubleshoot a slow-running query. One technique is to capture the wait stats for the query. Waits? What are those? In most simplistic terms, either your query is running or waiting for resources. When the query is waiting for resources, SQL Server will assign one or more wait type(s) to the query depending on what it is waiting for. Maybe it's waiting for CPU, or memory, or IO, or waiting for a lock to be released - there are so many reasons why the execution can be stalled when the query has to wait for something. Thankfully, SQL Server captures that information but there is no easy way to extract that information 'only' for your given query. Extended Events can help here. You can set up an extended events session and capture wait stats 'only' for your query. You can apply a variety of filters to capture wait stats 'only' for your query. Cut down on the noise and only extract what you need. Extended Events (XE) is conceptually similar to the profiler tool but advanced and lightweight. While setting up the session, use the wait_info event class.
Just because you see an Index Seek operation, do not relax. All may not be well. Scan or seek, it all depends on the amount of work being done by SQL Server to fetch the data. Let's say you have a non-clustered index on a column and the optimizer seeks on the index to fetch the data. If the predicate is highly selective (less number of records are returned), then most likely, things are well. If it's low-selective (more number of records are returned), then even though it's an index seek operation, it might be touching a larger portion of the leaf level increasing the IO. Therefore, seeking performance is directly proportionate to the number of pages the storage engine is touching. On another note, multi-column indexes may be very tricky. Depending on the predicate operator, the optimizer may be seeking on the first column of the index, but not the second column. In this case, the first column is the SEEK predicate while the second column is the residual predicate. The execution plan is non-intuitive and such nuances are not clearly visible. So, the summary is: when you see an Index Seek operation, look for finer details like IO cost (reads, etc), cardinality estimation & predicates.
Many SQL folks believe that the execution plan they see in the plan cache is an Actual Execution Plan. Unfortunately, it is an Estimated Execution Plan. An Estimated Execution Plan does not have actual runtime statistics that may be helpful in query tuning. Actual Execution Plan has the runtime stats that you really need. Well, you can use the Query Store feature that can automatically persist the Actual Execution Plans. But remember, Query Store is a full-fledged feature that needs careful planning before you implement it. Are there other ways of getting Actual Execution Plans from the Plan Cache? Extended Events? Well, again, needs to be implemented, and capturing plans may turn out to be an expensive affair if you do not filter enough. What I mean is, try to capture only what you want out and cut down the noise. So what else? In SQL Server 2019, you can use the DMF sys.dm_exec_query_plan_stats to retrieve the equivalent of the last known actual execution plan for a previously cached query plan. Note that, this is an opt-in feature. If you want to enable this at the server level for all the databases, use trace flag 2451. To enable at the database level, turn ON the LAST_QUERY_PLAN_STATS database scoped config option.
A thread is the lowest unit of execution. Your SQL query, at the least, needs one thread to run. Depending on the SQL instance configurations and the eligibility of the query, SQL Server may decide to execute your query with multiple threads - this is called parallelism. If too many such workloads are parallelized and if they are long-running, your subsequent incoming queries might not even get one thread to run. In other words, they will be waiting for a thread to be assigned to them because SQL Server has a limited number of threads (by default) based on the number of processors on the hardware. This wait is called THREADPOOL. It is not a good situation to be in and may go unnoticed. If you are facing this wait type, consider reducing parallelism (one of the ways). Consider increasing the Cost Threshold of Parallelism (server config) so that not all inexpensive workloads go in parallel. Or change the MAXDOP value to a reasonable number, say 4. Especially on hardware with a lot of cores. Maybe you can set MAXDOP to the number of cores per NUMA node. This will reduce parallelism and thereby make more threads available in the thread pool.