Summary:
· First, check if there are obvious
memory bottleneck in Windows or SQL Server causing the paging problems
· Second, analyze the disk system. If
there are problems, what, when, where, and how long?
· Was the problem caused by SQL Server? If
yes, which operation (read/write/lazy write/checkpoints/log write) caused the
problem?
· Anyway to reduce the IO requests such
as lessening the memory problem, re-design the database, re-write or optimize
the SQL queries?
1. Major symptoms
1) Error 833 in ErrorLog
2) No errors in error log, but many sessions have pageiolatch_* or writelog waittype in sys.sysprocesses
2. When Disk I/O Occurs?
1) For data not in the buffer pool
2) Modifications->Log cache->Log File (at checkpoint) ->Data file
3) Lazy writer: old data pages and plans in the buffer pool are wrote to disk when Buffer Pool has memory pressure
4) Some routine tasks many need a large IO activity: DBCC CHECKDB, REINDEX, UPDATE STATISTICS, BACKUP DATABASE
3. What affects IO?
1) Recovery interval – default checkpoint is fine.
2) Auto-growth and autoshrink – negative impact
3) Fragmentation – negative
4) Indexes on the tables – cluster index is good
5) Data compression – good for IO, bad for CPU
6) Data and log on the same disk? – Put log on the fast write RAID 10, whereas data file on the fast read RAID 5
7) Have many files for a file group on different disks help data file IO, but no impact on log file.
4. Does the Windows system have I/O problems?
1) Check the object counters for the physical disks
A. %idle time – 0%~100%, precise
B. %disk time - % of time in reading and writing, not precise
C. %disk read time
D. %disk write time
E. avg.disk sec/read – 10-20ms fair, <10ms very good.
F. avg.disk sec/write
G. avg.disk sec/transfer
H. avg.disk bytes/transfer
I. avg.disk queue length
J. avg.disk read queue length
K. avg.disk write queue length
L. disk bytes/sec – read+write, 10-20 MB fair
M. disk transfer/sec
N. disk read/sec
O. disk writes/sec
P. disk read bytes/sec
Q. disk write bytes/sec
R. current disk queue length – should be >2 for a long time
2) Error 833
A. I/O request not done within 15 seconds.
B. Disk problem or user request issue? – analyze the errorlog and performance monitor
5. Analyze IO Problems in SQL Server
1) Check waittype (pageiolatch_ex(write) or pageiolatch_sh(read)) in sys.sysprocesses or sys.dm_os_wait_stats
--Focus on the counts and time for 3
types: PageIOLatch_EX, PageIOLatch_SH,
WriteLog
--High number indicates IO problems
SELECT wait_type, waiting_tasks_count,wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE ‘pageiolatch%’ OR wait_type LIKE 'writelog'
ORDER BY wait_type
2) Then find out which database, which file need these reads or writes?
--cumulative since startup
SELECT *
FROM sys.dm_io_virtual_file_stats ({database_id|null}, {file_id|null})
--find which files need frequent reads(num_of_reads/num_of_bytes_read)
--which ones need frequent writes(num_of_writes/num_of_bytes_written),
--which ones need frequent read/write
wait(io_stall_write_ms/io_stall_read_ms/io_stall)
-- Also cumulative
SELECT
db.name AS databasename,
f.fileid AS fileid,
f.filename AS FILENAME,
i.num_of_reads AS numofread,
i.num_of_bytes_read AS numofbyteread,
i.io_stall_read_ms AS ioinstallread,
i.num_of_writes AS numofwrite,
i.num_of_bytes_written AS numofbytewrite,
i.io_stall_write_ms AS iostallwrite,
i.io_stall AS iostall,
i.size_on_disk_bytes AS sizeondiskbyte
FROM sys.databases db
INNER JOIN sys.sysaltfiles f
ON db.database_id = f.dbid
INNER JOIN sys.dm_io_virtual_file_stats(NULL, NULL) i
ON i.database_id = f.dbid AND i.file_id = f.fileid
--if you want to find the stall IO at
the moment. Try run it when problem occurs.
SELECT
database_id AS dbid,
FILE_ID AS fileid,
io_stall AS iostall,
io_pending_ms_ticks AS iopendingmstick,
scheduler_address AS scheduleraddress
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1,
sys.dm_io_pending_io_requests AS t2
WHERE t1.file_handle = t2.io_handle
3) Use object counters in performance monitor
A. Buffer manager – reveals information about buffer pool-related I/O
·
page
reads/sec and page writes/sec – caused by buffer pool
·
lazy
writes/sec – happens when lazy writes need to empty the buffer pool
·
checkpoint
writes/sec – dirty pages wrote to disk from buffer pool
·
Readahead pages/sec
B. Access Methods – some auxiliary jobs also impact IO
·
Freespace Scans/sec – to find the available space in heap.
·
page
splits/sec
·
page
allocations/sec
·
workfiles/sec
·
worktables/sec
·
full
scans/sec
·
index
searches/sec
C. Database Log Activities
·
Log
flushes/sec – number of write to the log/sec
·
Log
Bytes Flushed/sec
·
Log
Flush Waits/sec – should be 0
D. SQL Server Overall Performance Indicators (very roughly)
·
MSSQL:SQL
Statistics -Batch Request/sec – number of batch request finished per second
·
MSSQL:Databases Active Transactions – open tran
not committed yet