Summary:

· First, check if there are obvious memory bottleneck in Windows or SQL Server causing the paging problems

· Second, analyze the disk system. If there are problems, what, when, where, and how long?

· Was the problem caused by SQL Server? If yes, which operation (read/write/lazy write/checkpoints/log write) caused the problem?

· Anyway to reduce the IO requests such as lessening the memory problem, re-design the database, re-write or optimize the SQL queries?

1. Major symptoms

1) Error 833 in ErrorLog

2) No errors in error log, but many sessions have pageiolatch_* or writelog waittype in sys.sysprocesses

2. When Disk I/O Occurs?

1) For data not in the buffer pool

2) Modifications->Log cache->Log File (at checkpoint) ->Data file

3) Lazy writer: old data pages and plans in the buffer pool are wrote to disk when Buffer Pool has memory pressure

4) Some routine tasks many need a large IO activity: DBCC CHECKDB, REINDEX, UPDATE STATISTICS, BACKUP DATABASE

3. What affects IO?

1) Recovery interval – default checkpoint is fine.

2) Auto-growth and autoshrink – negative impact

3) Fragmentation – negative

4) Indexes on the tables – cluster index is good

5) Data compression – good for IO, bad for CPU

6) Data and log on the same disk? – Put log on the fast write RAID 10, whereas data file on the fast read RAID 5

7) Have many files for a file group on different disks help data file IO, but no impact on log file.

4. Does the Windows system have I/O problems?

1) Check the object counters for the physical disks

4.1.

A. %idle time – 0%~100%, precise

B. %disk time - % of time in reading and writing, not precise

C. %disk read time

D. %disk write time

E. avg.disk sec/read – 10-20ms fair, <10ms very good.

F. avg.disk sec/write

G. avg.disk sec/transfer

H. avg.disk bytes/transfer

I. avg.disk queue length

J. avg.disk read queue length

K. avg.disk write queue length

L. disk bytes/sec – read+write, 10-20 MB fair

M. disk transfer/sec

N. disk read/sec

O. disk writes/sec

P. disk read bytes/sec

Q. disk write bytes/sec

R. current disk queue length – should be >2 for a long time

2) Error 833

4.2.

A. I/O request not done within 15 seconds.

B. Disk problem or user request issue? – analyze the errorlog and performance monitor

5. Analyze IO Problems in SQL Server

1) Check waittype (pageiolatch_ex(write) or pageiolatch_sh(read)) in sys.sysprocesses or sys.dm_os_wait_stats

--Focus on the counts and time for 3 types: PageIOLatch_EX, PageIOLatch_SH, WriteLog

--High number indicates IO problems

SELECT wait_type, waiting_tasks_count,wait_time_ms

FROM sys.dm_os_wait_stats

WHERE wait_type LIKE ‘pageiolatch%’ OR wait_type LIKE 'writelog'

ORDER BY wait_type

2) Then find out which database, which file need these reads or writes?

--cumulative since startup

SELECT *

FROM sys.dm_io_virtual_file_stats ({database_id|null}, {file_id|null})

--find which files need frequent reads（num_of_reads/num_of_bytes_read）

--which ones need frequent writes（num_of_writes/num_of_bytes_written），

--which ones need frequent read/write wait（io_stall_write_ms/io_stall_read_ms/io_stall）

-- Also cumulative

SELECT

db.name AS databasename,

f.fileid AS fileid,

f.filename AS FILENAME,

i.num_of_reads AS numofread,

i.num_of_bytes_read AS numofbyteread,

i.io_stall_read_ms AS ioinstallread,

i.num_of_writes AS numofwrite,

i.num_of_bytes_written AS numofbytewrite,

i.io_stall_write_ms AS iostallwrite,

i.io_stall AS iostall,

i.size_on_disk_bytes AS sizeondiskbyte

FROM sys.databases db

INNER JOIN sys.sysaltfiles f

ON db.database_id = f.dbid

INNER JOIN sys.dm_io_virtual_file_stats(NULL, NULL) i

ON i.database_id = f.dbid AND i.file_id = f.fileid

--if you want to find the stall IO at the moment. Try run it when problem occurs.

SELECT

database_id AS dbid,

FILE_ID AS fileid,

io_stall AS iostall,

io_pending_ms_ticks AS iopendingmstick,

scheduler_address AS scheduleraddress

FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1,

sys.dm_io_pending_io_requests AS t2

WHERE t1.file_handle = t2.io_handle

3) Use object counters in performance monitor

5.1.

5.2.

5.3.

A. Buffer manager – reveals information about buffer pool-related I/O

· page reads/sec and page writes/sec – caused by buffer pool

· lazy writes/sec – happens when lazy writes need to empty the buffer pool

· checkpoint writes/sec – dirty pages wrote to disk from buffer pool

· Readahead pages/sec

B. Access Methods – some auxiliary jobs also impact IO

· Freespace Scans/sec – to find the available space in heap.

· page splits/sec

· page allocations/sec

· workfiles/sec

· worktables/sec

· full scans/sec

· index searches/sec

C. Database Log Activities

· Log flushes/sec – number of write to the log/sec

· Log Bytes Flushed/sec

· Log Flush Waits/sec – should be 0

D. SQL Server Overall Performance Indicators (very roughly)

· MSSQL:SQL Statistics -Batch Request/sec – number of batch request finished per second

· MSSQL:Databases Active Transactions – open tran not committed yet

6. A free IO Test tool – SQLIO – see Brent’s post for details (http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/) – I tried it on my laptop!

Troubleshooting I/O Issues