Troubleshooting I/O Issues

Summary: 

·      First, check if there are obvious memory bottleneck in Windows or SQL Server causing the paging problems

·      Second, analyze the disk system. If there are problems, what, when, where, and how long?

·      Was the problem caused by SQL Server? If yes, which operation (read/write/lazy write/checkpoints/log write) caused the problem?

·      Anyway to reduce the IO requests such as lessening the memory problem, re-design the database, re-write or optimize the SQL queries?

 

1.  Major symptoms

1)            Error 833 in ErrorLog

2)            No errors in error log, but many sessions have pageiolatch_* or writelog waittype in sys.sysprocesses

2.  When Disk I/O Occurs?

1)            For data not in the buffer pool

2)            Modifications->Log cache->Log File (at checkpoint) ->Data file

3)            Lazy writer: old data pages and plans in the buffer pool are wrote to disk when Buffer Pool has memory pressure

4)            Some routine tasks many need a large IO activity: DBCC CHECKDB, REINDEX, UPDATE STATISTICS, BACKUP DATABASE

3.  What affects IO?

1)            Recovery interval – default checkpoint is fine.

2)            Auto-growth and autoshrink – negative impact

3)            Fragmentation – negative

4)            Indexes on the tables – cluster index is good

5)            Data compression – good for IO, bad for CPU

6)            Data and log on the same disk? – Put log on the fast write RAID 10, whereas data file on the fast read RAID 5

7)            Have many files for a file group on different disks help data file IO, but no impact on log file.

4.  Does the Windows system have I/O problems?

1)            Check the object counters for the physical disks

A.     %idle time – 0%~100%, precise

B.      %disk time - % of time in reading and writing, not precise

C.      %disk read time

D.     %disk write time

E.      avg.disk sec/read – 10-20ms fair, <10ms very good.

F.       avg.disk sec/write

G.      avg.disk sec/transfer

H.     avg.disk bytes/transfer

I.         avg.disk queue length

J.        avg.disk read queue length

K.     avg.disk write queue length

L.       disk bytes/sec – read+write, 10-20 MB fair

M.    disk transfer/sec

N.     disk read/sec

O.     disk writes/sec

P.       disk read bytes/sec

Q.     disk write bytes/sec

R.      current disk queue length – should be >2 for a long time

2)            Error 833

A.     I/O request not done within 15 seconds.

B.      Disk problem or user request issue? – analyze the errorlog and performance monitor

5.  Analyze IO Problems in SQL Server

1)            Check waittype (pageiolatch_ex(write) or pageiolatch_sh(read)) in sys.sysprocesses or sys.dm_os_wait_stats

--Focus on the counts and time for 3 types: PageIOLatch_EX, PageIOLatch_SH, WriteLog

--High number indicates IO problems

 

SELECT wait_typewaiting_tasks_count,wait_time_ms

FROM sys.dm_os_wait_stats

WHERE wait_type LIKE pageiolatch% OR wait_type LIKE 'writelog'

ORDER BY wait_type

 

2)            Then find out which database, which file need these reads or writes?

--cumulative since startup

SELECT     *

FROM sys.dm_io_virtual_file_stats ({database_id|null}, {file_id|null})

 

--find which files need frequent readsnum_of_reads/num_of_bytes_read

--which ones need frequent writesnum_of_writes/num_of_bytes_written),

--which ones need frequent read/write waitio_stall_write_ms/io_stall_read_ms/io_stall

-- Also cumulative

SELECT

    db.name AS databasename,

    f.fileid AS fileid,

    f.filename AS FILENAME,

    i.num_of_reads AS numofread,

    i.num_of_bytes_read AS numofbyteread,

    i.io_stall_read_ms AS ioinstallread,

    i.num_of_writes AS numofwrite,

    i.num_of_bytes_written AS numofbytewrite,

    i.io_stall_write_ms AS iostallwrite,

    i.io_stall AS iostall,

    i.size_on_disk_bytes AS sizeondiskbyte

FROM sys.databases db

INNER JOIN sys.sysaltfiles f

    ON db.database_id = f.dbid

INNER JOIN sys.dm_io_virtual_file_stats(NULL, NULL) i

    ON i.database_id = f.dbid AND i.file_id = f.fileid

 

--if you want to find the stall IO at the moment. Try run it when problem occurs.

SELECT

    database_id AS dbid,

    FILE_ID AS fileid,

    io_stall AS iostall,

    io_pending_ms_ticks AS iopendingmstick,

    scheduler_address AS scheduleraddress

FROM       sys.dm_io_virtual_file_stats(NULL, NULL) AS t1,

           sys.dm_io_pending_io_requests AS t2

WHERE t1.file_handle = t2.io_handle

 

3)            Use object counters in performance monitor

A.     Buffer manager – reveals information about buffer pool-related I/O

·       page reads/sec and page writes/sec – caused by buffer pool

·       lazy writes/sec – happens when lazy writes need to empty the buffer pool

·       checkpoint writes/sec – dirty pages wrote to disk from buffer pool

·       Readahead pages/sec

B.      Access Methods – some auxiliary jobs also impact IO

·       Freespace Scans/sec – to find the available space in heap.

·       page splits/sec

·       page allocations/sec

·       workfiles/sec

·       worktables/sec

·       full scans/sec

·       index searches/sec

C.      Database Log Activities

·       Log flushes/sec – number of write to the log/sec

·       Log Bytes Flushed/sec

·       Log Flush Waits/sec – should be 0

D.     SQL Server Overall Performance Indicators (very roughly)

·       MSSQL:SQL Statistics -Batch Request/sec – number of batch request finished per second

·       MSSQL:Databases Active Transactions – open tran not committed yet

6.  A free IO Test tool – SQLIO – see Brent’s post for details (http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/) – I tried it on my laptop!