On the Journey to be a SQL Server Professional: Troubleshooting Blocking and Deadlocks

1. Is blocking is caused by application? If yes,

a. Error messages from the applications?

b. In which step does the error occur? What does it do? Remember the slow response and no response are not necessarily caused by SQL Server.

c. If it is a SQL Server error, blocking more likely raise the timeout errors. If something else, maybe blocking are not the reason.

d. If the problem is caused by blocking, what are the symptoms?

i. When does it happen

ii. How long?

iii. Related to workload?

iv. Related to certain tasks?

v. How is it gone, automatically or you have to restart SQL Server?

2. Are the problems are caused by SQL Server? If yes, what do we want to know for troubleshooting?

If it is a SQL Server error, blocking more likely raises the timeout errors. If something else, maybe blocking is not the reason.

--Let’s test it

--Window 1：

USE [AdventureWorks]

BEGIN TRAN

UPDATE [dbo].[Employee_Demo_Heap]

SET [Title]='aaa'

WHERE [EmployeeID]=70

UPDATE [dbo].[Employee_Demo_BTree]

SET [Title]='aaa'

WHERE [EmployeeID]=70

SELECT [EmployeeID],[LoginID],[Title] FROM [dbo].[Employee_Demo_Heap] WHERE [EmployeeID]=70

--Window 2. It should be blocked as it uses a table scan.

USE [AdventureWorks]

SELECT [EmployeeID],[LoginID],[Title] FROM [dbo].[Employee_Demo_Heap] WHERE [EmployeeID] IN (3,30,)

a. Any blocking? – check the blocked column from sysprocesses. If yes,

--Window 1. If blocked !=0

SELECT * FROM master.[sys].[sysprocesses]WHERE spid>50

51 is blocked by 54, while 54 is not blocked. 54 is the root lock.

kpid = thread is

spid kpid blocked

51 15348 54

52 0 0

53 0 0

54 15192 0

--Window 1：

--FIND THE DATABASE

SELECT name,dbid FROM sys.[sysdatabases] WHERE dbid=8

b. When? – checking the waittime column from sysprocesses

c. Which Databases, tables, or resources?

--FIND THE DATABASE

SELECT name,dbid FROM sys.[sysdatabases] WHERE dbid=8

--Find the table or other resources

--Option 1 - Find the object ID is locked. This script only works for a single object blocked

-- If more than one object, find the rsc_objid then use Select Object_Name(rsc_objid), one by one.

SELECT CONVERT(SMALLINT, [req_spid]) AS spid ,

[rsc_dbid] AS dbid ,

[rsc_objid] AS objid ,

o.name AS name,

[rsc_indid] AS indid ,

SUBSTRING(v.[name], 1, 4) AS TYPE ,

SUBSTRING([rsc_text], 1, 32) AS RESOURCE ,

SUBSTRING(u.[name], 1, 8) AS mode ,

SUBSTRING(x.[name], 1, 5) AS status

FROM master.[dbo].[syslockinfo] ,

master.[dbo].[spt_values] v ,

master.[dbo].[spt_values] x ,

master.[dbo].[spt_values] u,

sys.objects o

WHERE master.[dbo].[syslockinfo].[rsc_type] = v.[number]

AND v.[type] = 'LR'

AND master.[dbo].[syslockinfo].[req_status] = x.[number]

AND x.[type] = 'LS'

AND master.[dbo].[syslockinfo].[req_mode] + 1 = u.[number]

AND u.[type] = 'L'

AND SUBSTRING(x.[name], 1, 5) = 'WAIT'

AND o.object_id=rsc_objid

ORDER BY [SPID]

--Option 2: Find the object IDs are locked using sp_lock

--This option insert the returned results from sys.sp_lock into a temp table

--using OPENROWSET. Need to turn on the option first.

sp_configure 'Show Advanced Options', 1

RECONFIGURE

sp_configure 'Ad Hoc Distributed Queries', 1

RECONFIGURE

SELECT * INTO #TestTable FROM OPENROWSET('SQLNCLI', 'Server=localhost\SQL5Instance1;Trusted_Connection=yes;',

'EXEC sys.sp_lock')

--the blokced ones have the 'WAIT' status

SELECT *

FROM #TestTable

WHERE STATUS='WAIT'

--find the object name from object ID

SELECT OBJECT_NAME(1179)

--find the indexes of a table by using object_ID

SELECT * FROM sys.[indexes] WHERE [object_id]=1179

d. From whom? - check the hostname, program_name, loginname etc. from sysdatabases

e. How did the blocking occur? Why so?

i. Blocking may occur in the running statements of the current connection – use the following script to find all of the running connections including blocked or suspended ones. The idle/sleep connections will not be returned.

SELECT p.[session_id] ,

p.[request_id] ,

p.[start_time] ,

p.[status] ,

p.[command] ,

p.[blocking_session_id] ,

p.[wait_type] ,

p.[wait_time] ,

p.[wait_resource] ,

p.[total_elapsed_time] ,

p.[open_transaction_count] ,

p.[transaction_isolation_level] ,

SUBSTRING(qt.[text], p.[statement_start_offset] / 2,

( CASE WHEN p.[statement_end_offset] = -1

THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2

ELSE p.[statement_end_offset]

END - p.[statement_start_offset] ) / 2) AS 'sql statement' ,

p.[statement_end_offset] ,

batch = qt.[text]

FROM master.sys.[dm_exec_requests] p

CROSS APPLY [sys].[dm_exec_sql_text](p.[sql_handle]) AS qt

WHERE p.[session_id] > 50

ii. Blocking may occur before the running statements of the current connection – use SQL Trance before blocking occur. If SQL Trace did not start, we would not know. But we can find the last executed statement before it became idle using DBCC INPUTBUFFER.

运行DBCC INPUTBUFFER(SPID)可以获得从客户端发送到SQL实例的最后一个批处理语句。这句话的优点是不管连接是否正在运行，都会返回结果。缺点是他返回的是整个批处理语句 (All statements in a batch)，而不是当前正在执行的子句。所以对于正在运行的连接，第一种方法比较好。*/

DBCC INPUTBUFFER(@spid)

f. What is the current status of the root lock?

Check kpid and waittype in the sysprocesses.

If both are 0, the thread is idle or in sleeping.

If kpid !=0 and waittype=0, it is running or runnable.

If both are not 0, the thread is running or suspended/blocked waiting for resources.

i. Is it running or sleeping? If it is running/runnable, why not finishing so long?

Check the Open_tran column, if it is !=0, it has an open tran

ii. If it is sleeping, why not releasing the locks

If both are 0, the thread is idle or in sleeping. It should release the resources, if not, check the open_tran column. If it is >0, it indicates an open tran has not been committed. Then use DBCC INPUTBUFFER to find the statement.

g. What did the blocked connections try to do? Why did they also need to use the locked resources?

i. First, find the blocked connections

SELECT CONVERT(SMALLINT, [req_spid]) AS spid ,

[rsc_dbid] AS dbid ,

[rsc_objid] AS objid ,

o.name AS name,

[rsc_indid] AS indid ,

SUBSTRING(v.[name], 1, 4) AS TYPE ,

SUBSTRING([rsc_text], 1, 32) AS RESOURCE ,

SUBSTRING(u.[name], 1, 8) AS mode ,

SUBSTRING(x.[name], 1, 5) AS status

FROM master.[dbo].[syslockinfo] ,

master.[dbo].[spt_values] v ,

master.[dbo].[spt_values] x ,

master.[dbo].[spt_values] u,

sys.objects o

WHERE master.[dbo].[syslockinfo].[rsc_type] = v.[number]

AND v.[type] = 'LR'

AND master.[dbo].[syslockinfo].[req_status] = x.[number]

AND x.[type] = 'LS'

AND master.[dbo].[syslockinfo].[req_mode] + 1 = u.[number]

AND u.[type] = 'L'

AND SUBSTRING(x.[name], 1, 5) = 'WAIT'

AND o.object_id=rsc_objid

ORDER BY [SPID]

ii. Then compare the returned result with that from sp_lock to estimate if the number of locks requested is reasonable.

iii. If not, try to optimize the indexes and queries.

3. Five common types of lock problems and solutions

Type	waittype	open_tran	status	Auto-disappear	Reasons	Solutions
1	!=0	>=0	runnable	Yes, when the execution is done.	Just need to take time to finish. Waiting for system resources such as IO/CPU/Memory	1. Optimize database design and query 2. Scale up - Solve the hardware bottleneck 3. Scale out – separate OLAP from OLTP
2	0x0000	>0	sleeping	No. can kill it	1. The client has a command timeout (ADO.NET often sets 30s, if an execution is not finished within 30s, SQL Server stops the statement/batch execution, but keep the connection open for the client), or the client has cancelled the execution of the last statement, but have not rolled back the open tran. There will an Attention event in SQL Trace. 2. SET IMPLICIT_TRANSACTION ON (Some DB connection drivers or ActiveX use this mode, but application failed)	1. The application program can: (a) catch up the error and handle it such as “IF @@TranCount>0 Rollback Tran, or (b) SET XACT_ABORT ON to roll back the entire tran., or (c) use sp_reset_connection 2. If caused by implication transaction on, (a) let SQL Server rather than DB drivers or ActiveX to handle events if possible, (b) Set it off and use the auto-commit mode
3	0x0000 0x0800 0x0063	>=0	runnable	No, until you rollback/disconnect/kill (up to 30s)	The client has not received all of the results before the timeout. Althugh open_tran=0 and TIL is READ COMMITTED, the resource is still locked.	1. Return only what needed, in small chunks if possible 2. Separate OLAP and OLTP 3. Use Read Uncommitted
4	0x0000	>0	rollback	Yes	The blocking spid is always in rollback. This is because SQL Server need to roll back all of the uncommitted trans for data consistency when disconnected or a spid is killed. Some roll back may take hours. You may choose to shutdown SQL Server to stop the rollback, but the rollback still needs to be done after restart. You will see an Attention in SQL Trace, indicating the client has a timeout or is rolling back.	1. Make the transactions small 2. Do the large chunk operation at off-peak time. 3. If it has been in rollback for a long time, just let it go and wait for its completion.
5	various	>=0	runnable	No, until you rollback/disconnect/kill (up to 30s)	The application program causes deadlock, but it appears as a lock in SQL Server. The hostname for the blocking and blocked connections are the same.	Set a timeout in the application program, and catch and handle the command timeout error.

4. How to reduce blocking in application programs?

a. Blocking is often caused by application programs, rather than by SQL Server itself. Many factors in the application may influence blocking in SQL Server. Some of them are:

i. Use which transaction isolation level?

ii. When to start and end a transaction?

iii. How to connect and discharge the connection?

iv. How complicated of the commands?

v. The size of returned results

vi. How long of the execution time?

vii. How to catch and handle the timeout command?

viii. Does the application program use a database control component? If yes, understand it and use it carefully

1. How does it establish connection, begin a transaction, and return a result set?

2. Does it use implicit transaction?

3. Does it use cursors?

4. Does it use Unicode as the data type of passing parameters?

5. Which T. I. L. does it use?

b. Recommendations:

i. Choose the lowest T. I. L. that meets the business requirement

ii. Control the granularity and complexity of transactions properly

iii. Catch and handle timeout errors

iv. Good database and query design

v. Separate OLAP from OLTP

5. Connection Pooling and Blocking

a. ADO and ADO.NET uses connection pooling as the default

b. A connection pool can be used by multiple users. sp_reset_connection can be used to reset a connection if neccessary

c. Advantages

i. Reduce physical connections

ii. Increase response speed

iii. Lessen SQL Server workload

d. Disadvantages

i. Bring extra problems on blocking and others if the errors and exceptions are not handled well in the applications

ii. Hard to debug if problems occur in the application when the application codes handle the transaction

e. Recommendations

i. It’s better to let T-SQL handle the transactions. If the application codes have to be used, make sure to add extra codes to catch/handle errors, to make sure the tran is committed when the connection is closed.

6. Troubleshooting Deadlocks

a. Deadlocks can be on:

i. Locks (on row, page, meta-data, or application programs)

ii. Queued threads (one thread is sleeping, whereas other threads are waiting for it)

iii. Memory (on parallel queries)

iv. Deadlock on threads when concurrent execution of a query

b. Deadlock detection mechanism

i. Runs every 5s

ii. The victim is the one can be rolled back with less resources

c. Locking, Blocking, and Deadlock

i. LOCKING occurs when a connection needs access to a piece of data in database. It is necessary for SQL Server to manage concurrent connections.

ii. Blocking occurs when one SPID holds a lock on a specific resource and a second SPID attempts to acquire a conflicting lock type on the same resource. By default, it will wait forever unless you set Lock_Timeout or exceeds the timeout limit set up by application.

SET LOCK_TIMEOUT timeout_period - Specifies the number of milliseconds a statement waits for a lock to be released. A value of -1 (default) indicates no time-out period (that is, wait forever). A value of 0 means to not wait at all and return a message as soon as a lock is encountered. When a wait for a lock exceeds the time-out value, an error is returned.

iii. A deadlock occurs when two or more tasks permanently block each other by each task having a lock on a resource which the other tasks are trying to lock (i.e., mutual blocking)

d. Diagnosing

i. Trace Flag 1204(for SQL 2000) and Trace 1222(for SQL 2005+) – in Errorlog

DBCC TRACEON (1222,-1)

ii. SQL Server Profiler – Locks – Deadlock Graph (+ 1222 Flag or SQL Trace if necessary)

e. Best Practices

i. Visit objects in the same order

ii. Minimize the interdependency of transactions

iii. Keep transaction short and in batches (instead of multiple parallel execution threads. The batch approach will minimize the network transmission)

iv. Use lower transaction isolation level

v. Optimize the execution plans to reduce the locks

vi. In some cases of high concurrency, deadlock is unavoidable, one workaround is to increase the granularity to a higher level, making one spid is blocked first (i.e., use blocking to replace deadlock)

f. Case study

i. Run the codes in window 1 first and then that in window 2

--In windows 1

DBCC TRACEON (1222,-1) –record the deadlock info in errorlog

USE [AdventureWorks]

SET NOCOUNT ON

--Run this one first

WHILE 1 = 1

BEGIN

BEGIN TRAN

UPDATE [dbo].[Employee_Demo_Heap]

SET [BirthDate] = GETDATE()

WHERE [NationalIDNumber] = '480951955'

SELECT *

FROM [dbo].[Employee_Demo_Heap]

WHERE [NationalIDNumber] = '480951955'

COMMIT TRAN

END

--in Window 2

USE [AdventureWorks]

SET NOCOUNT ON

--then run the code below

WHILE 1 = 1

BEGIN

BEGIN TRAN

UPDATE dbo.[Employee_Demo_Heap]

SET [BirthDate] = GETDATE()

WHERE [NationalIDNumber] = '407505660'

SELECT *

FROM [dbo].[Employee_Demo_Heap]

WHERE [NationalIDNumber] = '407505660'

COMMIT TRAN

END

ii. Due to the table scan on the heap table, the two non-related queries will result in a deadlock. From the errors recorded by trace flag 1222 in error log, you will find the locked RIDs. Then use DBCC TRACEON (3604) DBCC (dbid, fileid, pagenum, 3) to find details about the page.

iii. Options:

1. Optimize the index – create a nonclustered index on NationalIDNumber.

USE [AdventureWorks]

CREATE NONCLUSTERED INDEX NationalIDNumber ON [Employee_Demo_Heap]([NationalIDNumber] ASC)

a. Pros: Increase the efficiency and eliminate the deadlock, highly recommended. If not feasible, try the other three “rude” options below.

b. Cons: the rows in the two connections not overlapping, the permission to add an index, may have other negative impacts.

2. Use ‘with (nolock)’ - let SELECT does not put S lock on the queried table

WHILE 1 = 1

BEGIN

BEGIN TRAN

UPDATE [dbo].[Employee_Demo_Heap]

SET [BirthDate] = GETDATE()

WHERE [NationalIDNumber] = '480951955'

SELECT *

FROM [dbo].[Employee_Demo_Heap] WITH ( NOLOCK )

WHERE [NationalIDNumber] = '480951955'

COMMIT TRAN

END

a. Pros: instant

b. Cons: possible dirty reads, only good for S locks, need the permission to change query in SQL. Developers may need to use the dynamic SQL in the application codes.

DECLARE @a NVARCHAR(4000)

SET @a='SELECT * FROM [dbo].[Employee_Demo_Heap] WITH (NOLOCK)

WHERE [NationalIDNumber]='+'480951955'

EXEC(@a)

3. Increase lock granularity such as using PAGLOCK instead of the RIDs on the same page – making a deadlock become a blocking

WHILE 1 = 1

BEGIN

BEGIN TRAN

UPDATE [dbo].[Employee_Demo_Heap] WITH ( PAGLOCK )

SET [BirthDate] = GETDATE()

WHERE [NationalIDNumber] = '480951955'

SELECT *

FROM [dbo].[Employee_Demo_Heap] WITH ( PAGLOCK )

WHERE [NationalIDNumber] = '480951955'

COMMIT TRAN

END

a. Pros: no need to change the T.I.L.

b. Cons: easily blocked

4. Use SNAPSHOT transaction isolation level before the transaction

SET TRANSACTION ISOLATION LEVEL SNAPSHOT