Importance of Analyzing Waits in SQL Server

Importance of Analyzing Waits in SQL Server

Understanding and analyzing waits in SQL Server is crucial for diagnosing and improving performance bottlenecks.

www.sandroservino.com.br

Wait statistics indicate where SQL Server processes spend most of their time waiting, offering insights into resource contention, query optimization opportunities, and system bottlenecks. Each wait type provides valuable clues about the underlying issue, such as delays caused by CPU, memory, lock, disk I/O, or network resources. By identifying and addressing these waits, you can optimize your system for better performance and reliability.

Below is a SQL script to capture the main wait types currently affecting your environment.

This script is widely known as part of an approach for analyzing wait statistics (wait stats) in SQL Server and was created by Paul Randal, a renowned SQL Server expert and founder of SQLskills. He is one of the leading advocates for using wait statistics to diagnose performance issues in SQL Server.

The script is available on the SQLskills website, where Paul and his team publish educational materials and scripts to assist with SQL Server administration and optimization. The dynamically generated URL in the script (https://www.sqlskills.com/help/waits/) links directly to pages explaining the different types of waits in SQL Server, providing detailed information about their meaning and potential solutions.

Purpose of the Script

  • This script is used to identify major bottlenecks in SQL Server based on recorded wait types.
  • It prioritizes the most significant wait types, excluding those generally irrelevant for analysis.
  • It provides helpful links for additional information, facilitating understanding and troubleshooting.

If you are using this script, it is highly recommended to follow Paul Randal's materials and SQLskills resources to maximize your knowledge about wait stats analysis.

This can help pinpoint which resources or queries are contributing to performance issues:

WITH [Waits]

AS (

SELECT

[wait_type],

[wait_time_ms] / 1000.0 AS [WaitS],

( [wait_time_ms] - [signal_wait_time_ms] ) / 1000.0 AS [ResourceS],

[signal_wait_time_ms] / 1000.0 AS [SignalS],

[waiting_tasks_count] AS [WaitCount],

100.0 * [wait_time_ms] / SUM([wait_time_ms]) OVER () AS [Percentage],

ROW_NUMBER() OVER ( ORDER BY [wait_time_ms] DESC ) AS [RowNum]

FROM sys.dm_os_wait_stats

WHERE [wait_type] NOT IN (

-- Filtered waits types, not usually relevant for analysis

N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR',

N'CXCONSUMER', N'LAZYWRITER_SLEEP', N'SLEEP_TASK',

N'XE_TIMER_EVENT', -- Add more irrelevant waits here

N'SQLTRACE_BUFFER_FLUSH')

AND [waiting_tasks_count] > 0

)

SELECT

MAX([W1].[wait_type]) AS [WaitType],

CAST(MAX([W1].[WaitS]) AS DECIMAL(16, 2)) AS [Wait_S],

CAST(MAX([W1].[ResourceS]) AS DECIMAL(16, 2)) AS [Resource_S],

CAST(MAX([W1].[SignalS]) AS DECIMAL(16, 2)) AS [Signal_S],

MAX([W1].[WaitCount]) AS [WaitCount],

CAST(MAX([W1].[Percentage]) AS DECIMAL(5, 2)) AS [Percentage],

CAST(( MAX([W1].[WaitS]) / MAX([W1].[WaitCount])) AS DECIMAL(16, 4)) AS [AvgWait_S],

CAST(( MAX([W1].[ResourceS]) / MAX([W1].[WaitCount])) AS DECIMAL(16, 4)) AS [AvgRes_S],

CAST(( MAX([W1].[SignalS]) / MAX([W1].[WaitCount])) AS DECIMAL(16, 4)) AS [AvgSig_S],

CAST('https://www.sqlskills.com/help/waits/' + MAX([W1].[wait_type]) AS XML) AS [Help/Info URL]

FROM [Waits] AS [W1]

INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum]

GROUP BY [W1].[RowNum]

HAVING SUM([W2].[Percentage]) - MAX([W1].[Percentage]) < 95; -- Threshold for analysis


Some common wait events and their possible causes:

ASYNC_NETWORK_IO / NETWORKIO: The NETWORKIO (SQL 2000) and ASYNC_NETWORK_IO (SQL 2005+) wait events can point to network-related issues (rarely), but generally indicate that a client application is not processing SQL Server results quickly enough. This event often appears in cases of RBAR (Row-By-Agonizing-Row).

CXPACKET: This event is typically related to queries using parallel processing. It may indicate that one thread has completed its processing and is waiting for other threads in the process to finish. If this wait time is very high, it may be worth revisiting the MAXDOP configuration, evaluating whether the OPTION(MAXDOP x) hint could be helpful for certain queries, reassessing the indexes used by disk-heavy queries (likely those utilizing parallelism), and ensuring that sargable arguments are being used.

DTC: This wait event is not local. When using the Microsoft Distributed Transaction Coordinator (MS-DTC), a transaction is opened across multiple systems simultaneously and only completes once executed on all systems.

OLEDB: This wait event indicates that the process made a call to an OLEDB provider and is waiting for that call to process on the destination server and return the data. This event also occurs when running commands on other instances using Linked Servers (remote calls), BULK INSERT commands, and FULLTEXT-SEARCH queries. There's nothing to do on the local instance, but in cases of remote calls to other instances, you can analyze the destination instance to identify why processing and data return are delayed.

PAGEIOLATCH_*:** This event occurs when SQL Server is waiting to read disk/storage pages not currently in memory, causing disk contention. To reduce this I/O event, consider increasing disk speed (to reduce time), adding more memory (to allocate more pages), or tuning queries causing this I/O event. A very common cause of this event is the lack of optimal indexes for a given query, which can often be resolved by creating new indexes with the help of Missing Index DMVs to avoid Scan operations (check out the post "Understanding Indexes in SQL Server" for more details).

PAGELATCH_*:** The most common causes of this event are contention in TempDB on heavily loaded instances. The PAGELATCH_UP wait generally occurs when multiple threads are trying to access the same bitmap, while the PAGELATCH_EX event occurs when threads are trying to insert data into the same disk page, and PAGELATCH_SH indicates that a thread is attempting to read data from a page being modified.

IO_COMPLETION: This wait event occurs when SQL Server is waiting for I/O operations to complete processing. These are not index or data reads from the disk but rather reads of disk allocation bitmaps (GAM, SGAM, PFS), transaction logs, writes of sort buffers to disk, reads of VLF headers in transaction logs, reads/writes of merge join/eager spools on disk, etc. This event is normal, as it often appears just before starting operations requiring I/O, but it could be problematic if the wait time is very high and your top waits include ASYNC_IO_COMPLETION, LOGMGR, WRITELOG, or PAGEIOLATCH_*.

When there is an I/O issue on the instance, the SQL Server log often shows messages like "I/O requests are taking longer than X seconds to complete," which could occur due to slow disks or operations consuming high I/O, such as BACKUP, ALTER/CREATE DATABASE, and AutoGrowth events.

To investigate this issue, look for queries with high disk read/write times, including the following counters: Disk Avg. Read Time, Disk Avg. Write Time, Avg. Disk Queue Length, Buffer Cache Hit Ratio, Buffer Free Pages, Buffer Page Life Expectancy, and Machine: Memory Used.

SOS_SCHEDULER_YIELD: This event occurs when SqlOS (SOS) is waiting for more CPU resources to finish processing, which could indicate CPU overload, preventing the server from processing all tasks being requested. However, this doesn't always mean a general issue with the instance; it might indicate a specific query requiring more CPU.

If your query has parallelism inhibitors (e.g., HINT MAXDOP(1), serial UDF functions, system table queries, etc.), it might force processing to use only one CPU core, causing the SOS_SCHEDULER_YIELD event even if the server has multiple cores available and low CPU usage. To identify this, try locating queries consuming the most CPU or where CPU time is less than execution time (in parallelism, a query might consume 10s of CPU but execute in 2s, meaning it ran across several CPU cores).

WRITELOG: When a transaction is waiting on the WRITELOG event, it means the SQL Server is waiting for log cache data to be flushed to disk (transaction log). Since this involves disk writes, which are generally much slower than memory access or CPU processing, this type of event can be common on instances with slower disks. To reduce this wait, one option is to use the Delayed Durability feature available from SQL Server 2014 onwards.

LCK*:** A very common event not directly related to performance, this occurs when a transaction modifies an object and places a lock on it, preventing other transactions from modifying or accessing the data while processing. If another transaction tries to access this object, it must wait for the lock to be released before continuing, generating this wait event. This is often mistaken for a performance issue because the query will take as long as necessary for the lock to be released, giving users the impression of slowness, especially if it's a heavily used query in the system. In such cases, analyze the session generating the lock to determine why it’s taking so long to process and release the object.

Thanks

Sandro



Jorge Zacharias

Data Scientist | Data Analyst | Generative AI | Machine Learning | Cloud Computing | Artificial Intelligence | AWS Certified

2 个月

Very helpful

Malcolm P.

TSQL C# Python Javascript Typescript Node.js Powershell AWS Github Jenkins solution provider

2 个月

Absolutely agree, this is the dark matter of SQL as it is the lack of resource usage identifying the problem. The biggest gain I made from analysing waits is when the transfer of the data to the client took too long. This identified the correct area to focus on for optimisation work.

Sandro Rafael Hwizdaleck

5x AWS Black Belt | 5x AWS Certified | 4X Oracle | AWS Database Specialty | AWS Solutions Architect | AWS Developer | AWS SysOps | Oracle certified OCP | SQL Server | MySQL | Postgres | Aurora | Dynamo | Linux

2 个月

Dicas úteis

要查看或添加评论,请登录

Sandro Servino的更多文章

社区洞察

其他会员也浏览了