Understand the execution plan

Summary: We have three primary ways to get the execution plan:

1)   Display Estimated or Include Actual Execution Plan (omitted),

2)   SET Commands (SET SHOWPLAN_ALL ON, SET SHOWPLAN_XML ON, and SET STATISTICS PROFILE ON) and

3)   SQL Trace.

 

In understanding the execution plan, one key part is to understand the UNION option. SQL Server做联接的时候,会按照输入数据集所基于的表格的结构,衡量可能利用的索引,也根据统计信息,预估两个输入数据集的大小,选择使用三种Join方式其中的一种。如果选得不对,可能就会造成Join的速度非常慢。There are other operators such as aggregation, sort, union, and parallelism. SET STATISTICS IO ON and SET STATISTICS TIME ON are also useful in understanding the execution plan, especially for a single query.

 

1.  Two ways to get execution plan

1)  The SET command – USED FOR TESTING A SINGLE QUERY

                                               I.          SET SHOWPLAN_ALL ON - Causes Microsoft SQL Server not to execute Transact-SQL statements. Instead, SQL Server returns detailed information about how the statements are executed and provides estimates of the resource requirements for the statements. Return the results before executing the statements.

                                             II.          SET SHOWPLAN_XML ON - Causes SQL Server not to execute Transact-SQL statements. Instead, SQL Server returns detailed information about how the statements are going to be executed in the form of a well-defined XML document. Only this option can display the missing indexes and if any plans reused. Return the results before executing the statements as well.

                                            III.          SET STATISTICS PROFILE ON - Displays the profile information for a statement. STATISTICS PROFILE works for ad hoc queries, views, and stored procedures. When STATISTICS PROFILE is ON, each executed query returns its regular result set, followed by an additional result set that shows a profile of the query execution. The additional result set contains the SHOWPLAN_ALL columns for the query and these additional columns: Rows - Actual number of rows produced by each operator, and Executes - Number of times the operator has been executed. It return the results after executing the statements!

 

USE [AdventureWorks]

GO

 

SET SHOWPLAN_ALL ON

GO

 

--只返回执行计划,没有结果集返回

--执行计划里只有EsitmateRows,没有“Rows”(实际行数)

SELECT  a.[SalesOrderID] ,

        a.[OrderDate] ,

        a.[CustomerID] ,

        b.[SalesOrderDetailID] ,

        b.[ProductID] ,

        b.[OrderQty] ,

        b.[UnitPrice]

FROM    [dbo].[SalesOrderHeader_test] a

        INNER JOIN [dbo].[SalesOrderDetail_test] b ON a.[SalesOrderID] = b.[SalesOrderID]

WHERE   a.[SalesOrderID] = 43659

  GO

 

SET SHOWPLAN_ALL OFF

  GO

  

SET STATISTICS PROFILE ON

  GO

 

  --先返回结果集,再返回执行计划

  --执行计划里有EsitmateRows,也有“Rows”(实际行数)

SELECT  a.[SalesOrderID] ,

        a.[OrderDate] ,

        a.[CustomerID] ,

        b.[SalesOrderDetailID] ,

        b.[ProductID] ,

        b.[OrderQty] ,

        b.[UnitPrice]

FROM    [dbo].[SalesOrderHeader_test] a

        INNER JOIN [dbo].[SalesOrderDetail_test] b ON a.[SalesOrderID] = b.[SalesOrderID]

WHERE   a.[SalesOrderID] = 43659

   GO

  

SET STATISTICS PROFILE OFF

  GO

 

                                           IV.          Other helpful SET commands: SET STATISTICS IO ON, SET STATISTICS TIME ON

 

-- 清除buffer pool里的所有缓存的数据

DBCC DROPCLEANBUFFERS 

GO 

SET STATISTICS IO ON 

GO 

SELECT DISTINCT

        ProductID ,

        UnitPrice

FROM    dbo.SalesOrderDetail_test

WHERE   ProductID = 777

GO

 

-- 清除buffer pool里的所有缓存的数据

dbcc dropcleanbuffers 

 

-- 清除buffer pool里的所有缓存的执行计划

DBCC freeproccache 

go 

 

Set statistics time on 

go 

select distinct ProductIDUnitPrice 

from dbo.SalesOrderDetail_test 

where ProductID = 777

union 

select ProductIDUnitPrice 

from dbo.SalesOrderDetail_test 

where ProductID = 777

go 

Set statistics time off 

GO

 

2)  SQL Trace

                                               I.          PERFORMANCE: SHOWPLAN ALL (=SET SHOWPLAN_ALL ON)

                                             II.          PERFORMANCE: SHOWPLAN STATISTICS PROFILE (=SET STATISTICS PROFILE ON)

                                            III.          PERFORMANCE: SHOWPLAN XML STATISTICS PROFILE (=SET SHOWPLAN_XML ON)

 

2.  THREE TYPES OF JOIN

1)  Nested Loops Join

                                     I.          Logic

 

FOR each row r1 IN the OUTER TABLE

 begin

 FOR each row r2 IN the INNER TABLE

   IF r1 joins WITH r2

   OUTPUT(r1,r2)

 IF r1 did NOT JOIN

   OUTPUT(r1,NULL)

 END

 

                                   II.          Chart

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg03ifJ7ZQdvEmj6df4GXoxwYLKW0FvXr6TJUZ3piuYyZDP8TesDWSQLgJPJ6CtQI9qZeJw1NdXygA7XDvhO1DklhvLAJ9jBuHPyW9G2K-uw09xrFL8QZlot5_L-jiUujLaxqwn7eq22OE/s1600/nested+loops.jpg

 

                                  III.          Pros: no additional spaces are needed, tempdb is not needed, save memory as well

                                 IV.          Cons: not good for big tables, it will scan the table a*b times (a is the number of rows met criteria in the outer table)

                                   V.          When is it used?

·        Small tables or the number of row in the outer table met the conditions is small.

                                 VI.          How to improve?

·        Indexing the outer table

·        Indexing the searched column in the inner table as well

2)  Merge Join

                                     I.          Logic

GET FIRST row r1 FROM INPUT1

GET FIRST row r2 FROM INPUT2

 WHILE NOT at the END OF either INPUT

 begin

 IF r1 joins WITH r2

 begin

 OUTPUT(r1,r2)

 GET NEXT row r2 FROM INPUT 2

 END

 ELSE

 IF r1>r2 GET NEXT row r1 FROM INPUT 1

 ELSE GET NEXT row2 FROM INPUT 2

 END

 

 --也就是说,从两边的数据集里各取一个值,比较一下。如果相等,就把这两行联接起来返回。

 --如果不相等,那就把小的那个值丢掉,按顺序取下一个更大的。两边的数据集有一边遍历

 --结束,整个join的过程就结束。所以整个算法的复杂度最大就是大的那个数据集里的记录

 --数量,这个比起nested loops join两个数据集相乘的复杂度,的确是小了很多。所以

 --在数据集很大的情况下,merge join的优势是非常明显的

 

                                   II.          Chart

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLmOE-abdCdl139t_3rir_FHIwjcbaEHV7ybhMW_uVc1ZLs1ZHw1JdWriZIHAAgkHTe2vJwf3y0Mw_Tf3JQbtyHKDxjn4qnywWLoHaoAyIjQ_N0JJyzMJuVJ2nShhwnpOXznIF9gKF9zE/s1600/merge+join.jpg

                                  III.          Pros: not a*b, when one table is done, the join is done. Much less computation than nested loops for large tables.

                                 IV.          Cons: the tables need to be sorted before the join. If there are multiple rows for a value, SQL Server needs to create worktable in tempdb or memory. This Many-to-Many join could be very expensive.

                                   V.          When is it used?

·        For big tables.

                                 VI.          How to improve?

·        Add unique for the indexes if ever possible!!!

3)  Hash Join Build and Probe the hash table

                                     I.          Logic

for each row R1 in the build table

   begin

      calculate hash value on join key(s) of R1

      insert R1 into the appropriate hash bucket

   end

for each row R2 in the probe table

   begin

      calculate hash value on join key(s) of R2

      for each row R1 in the corresponding hash bucket

         if R1 joins with R2

            output (R1, R2)

   end

算法描述:

选择两个需要join的表中的一个a,对于a中的每一个记录R1,计算其联接列的hash值,然后根据hash值将R1插入到hash bucket当中。

 

选择另外一张表b,对于b中的每一条记录R2,我们也计算其联接列的hash值,然后去hash bucket上查找。如果hash bucket上有R1能够跟R2进行连接,那么久输出(R1,R2)的联接结果,可能有多个R1的记录。

 

                                   II.          Chart

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAurQLdgnjjc7Wwsv4sLP3xudctGYk1rjDZJKcAljmU4pG8Z4e6Rke4_5E_YcmjVzzMyRjrkEqQ5yfVgBUEfcIK21FMlenVEwT9s2d5O7VuQMN9SV4i0KGmuKK0zajoBGzvvGGESq2U3U/s1600/hashmap.png

(This is only the part on building the hash table)

                                  III.          Pros: only need to scan the two tables once, do not need to sort or have indexes, support multiple CPUs executions

                                 IV.          Cons: need to use CPU, memory and tempdb to build the hash table and probe the data.

                                   V.          When to use it?

·        Large tables without indexes

                                 VI.          How to improve?

·        Add indexes

3.  Comparisons of the three joins 

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9dhof4-xSOHSBjA9DaRGH_-zS9xz7hQ79JM1n4CHJT1NGvhN_FjQfAehSN439GnbGh3ucdj1jl0zpC4tgBbgzhg-spvYWWEDHEofh7GMliPa_nv3CAk_UwWAbTRfoTJwtp2s85B8jAGI/s1600/comparison+of+three+joins.png

 

4.  Other operations

1)  Aggregation

                                               I.          For aggregation calculations

                                             II.          Two types:

·        stream aggregation - stream first, then aggregate

·        hash aggregation – build the hash table first, then aggregate

2)  UNION

                                               I.          Union All

                                             II.          Union - more expensive

3)  PARALLELISM – if the cost is high and you have multiple CPUs.