Preparation of
Data
For this demo, I use the project sample (pp. 399-403) in
the TK 448 book. Essentially, the project needs to use the AdventureWorksDW2008
or AdventureWorksDW2008R2 database. Particularly, it needs two views in the
database: vAssocSeqOrders and vAssocSeqLineItems.
The vAssocSeqOrders has 4 columns and 21,255 rows, OrderNumber
is the Key.
/***************21255
Rows ********************************
OrderNumber CustomerKey Region IncomeGroup
SO51176 18239 Pacific High
SO51177 27873 Pacific Low
SO51178 11245 Europe High
***********************************************************/
SELECT
* FROM [dbo].[vAssocSeqOrders]
The vAssocSeqLineItems has 3 columns and 52,761 rows,
OrderNumber and Line are the Key.
/**************
52761 Rows **********************************
OrderNumber LineNumber Model
SO51176 1 Road-250
SO51176 2 Road Bottle Cage
SO51177 1 Touring-2000
SO51177 2 Sport-100
SO51178 1 Mountain-200
SO51178 2 Mountain Bottle Cage
SO51178 3 Water Bottle
**************************************************************/
SELECT
*
FROM
[dbo].[vAssocSeqLineItems]
The model
structure of the association rules model
Our goal is to look for
a model that predict other models. In other words, we are searching for models
that appear together on the same customer order. If they appear together many
times, the appearance of one model predicts the appearance of another model or
models.
In order to do so, we
need to use the vAssocSeqOrders view as the case table, and the vAssocSeqLineItems
as the nested table. In other words, the dbo.vAssocSeqLineItems view is used as
a column of the case table. In this way, we are searching for the existence of
rows of the nested table for each customer order; we do this through a row’s
key (i.e., Order Number). Because a model is unique in an order, we can use it
as a key. Also because we want to predict the models, we would use the Model column
as both input and predictable. The other columns are useless in this model.
Table below shows the
model structure or the column usage for the association rules model.
Case Table Column
|
Nested Table Column
|
Usage
|
Order Number
|
Key
|
|
vAssocSeqLineItems
|
||
Model
|
Key, input, predictable
|
Understand the
results
The results for an association rules model under mining
model viewer has three tabs: Rules, Itemsets,
and Dependency Network; and one output under the Microsoft Generic Content
Tree Viewer.
Let’s look at the contents under the ‘Rules’ tab first.
Rules – shows three columns: Probability, Importance, and Rule
Probability
It displays the probability for each rule. For
example, in the chart above, the first row has a probability of 1.000. It means
the rule “If Touring Tire = Existing, Water Bottle = Existing, then Touring
Tire Tube = Existing” is always true.
If the rule is If {A}, then {B}, it is calculated by the ratio
of cases with A and B over cases with A. For example, the first rule is if {Touring
Tire = Existing, Water Bottle = Existing}, then {Touring Tire Tube = Existing},
the probability is calculated as below, with a value 1, the same as that in the
chart above:
/*
AntecedentAAndYesCount AntecedentACount Probability
25 25 1
*/
declare
@AntecedentA_PositiveCount float
declare
@AntecedentACount FLOAT
--Get
the total count for the rule AntecedentA implies DidBuyAPersonalEquityPlan='YES'
set
@AntecedentA_PositiveCount=
(
select cast(count(*) as float)
from [dbo].[vAssocSeqLineItems] t1
JOIN [dbo].[vAssocSeqLineItems]
t2
ON t1.OrderNumber=t2.OrderNumber
JOIN [dbo].[vAssocSeqLineItems]
t3
ON t2.OrderNumber=t3.OrderNumber
where t1.Model='Touring Tire'
and t2.Model='Water Bottle'
and t3.Model='Touring Tire Tube'
)
--Get
the total count for all AntecedentA
set
@AntecedentACount=
(
select cast(count(*) as float)
from [dbo].[vAssocSeqLineItems] t1
JOIN [dbo].[vAssocSeqLineItems]
t2
ON t1.OrderNumber=t2.OrderNumber
where t1.Model='Touring Tire'
and t2.Model='Water Bottle'
)
select
@AntecedentA_PositiveCount as
AntecedentAAndYesCount,
@AntecedentACount as AntecedentACount,
round(@AntecedentA_PositiveCount/@AntecedentACount,3) as Probability
Importance
It displays the importance for each rule. It is
designed to measure the usefulness of a rule. In some cases, although the
probability that a rule will occur may be high, the usefulness of the rule may be
unimportant in itself. For example, if every itemset contains a specific state
of an attribute, a rule that predicts state is trivial, even though the
probability is very high.
Importance is also referred to as lift. For an association
rule, Lift Importance is
calculated by the log likelihood of the right-hand side of the rule, given the
left-hand side of the rule. For example, in the rule If {A} Then
{B}, Analysis Services calculates the ratio of cases with A and B over
cases with B but without A, and then normalizes that ratio by using a
logarithmic scale.
Another
source says, it needs 4 ratios to calculate the importance:
The count where the antecedent is A and the
consequent is positive.
The count where the antecedent is A and the
consequent is negative.
The count where the antecedent is NOT A and
the consequent is positive.
The count where the
antecedent is NOT A and the consequent is negative
In following the example, I have calculated the
importance for the 1st rule (e.g., if {Touring Tire = Existing,
Water Bottle = Existing}, then {Touring Tire Tube = Existing}) as below.
Unfortunately, I did not get the exact value for important as that in the chart:
DECLARE @YesA_And_YesB_YesC FLOAT;
DECLARE @NoA_Or_NoB_NoC FLOAT;
DECLARE @YesA_And_YesB_NoC FLOAT;
DECLARE @NoA_Or_NoB_YesC FLOAT;
DECLARE @Numerator FLOAT;
DECLARE @Denominator FLOAT;
-- The algorithm adds one to each count
to prevent divide by zero errors
-- The count where the antecedent is A
and the consequent is positive.
SET @YesA_And_YesB_YesC = ( SELECT CAST(COUNT(DISTINCT c.OrderNumber) + 1 AS FLOAT)
FROM [dbo].[vAssocSeqLineItems] c
JOIN ( SELECT a.OrderNumber
FROM [dbo].[vAssocSeqLineItems] a
JOIN [dbo].[vAssocSeqLineItems]
b ON a.OrderNumber
= b.OrderNumber
WHERE a.Model = 'Touring Tire'
AND b.Model
= 'Water Bottle'
) d ON c.OrderNumber = d.OrderNumber
WHERE c.Model = 'Touring Tire Tube'
);
SET @NoA_Or_NoB_NoC = ( SELECT COUNT(DISTINCT c.OrderNumber) + 1
FROM [dbo].[vAssocSeqLineItems] c
JOIN ( SELECT DISTINCT
c.OrderNumber
FROM [dbo].[vAssocSeqLineItems] c
LEFT JOIN ( SELECT DISTINCT
a.OrderNumber
FROM
[dbo].[vAssocSeqLineItems] a
JOIN
[dbo].[vAssocSeqLineItems] b ON a.OrderNumber = b.OrderNumber
WHERE
a.Model =
'Touring Tire'
AND b.Model = 'Water Bottle'
) d ON c.OrderNumber = d.OrderNumber
WHERE d.OrderNumber IS NULL
) d ON c.OrderNumber = d.OrderNumber
WHERE c.Model != 'Touring Tire Tube'
);
SET @YesA_And_YesB_NoC = ( SELECT COUNT(DISTINCT OrderNumber)
+ 1
FROM [dbo].[vAssocSeqLineItems]
WHERE
OrderNumber NOT IN (
SELECT
OrderNumber
FROM [dbo].[vAssocSeqLineItems]
WHERE Model = 'Touring Tire Tube' )
AND OrderNumber IN (
SELECT
OrderNumber
FROM [dbo].[vAssocSeqLineItems]
WHERE Model = 'Touring Tire'
AND OrderNumber IN (
SELECT
OrderNumber
FROM [dbo].[vAssocSeqLineItems]
WHERE Model = 'Water Bottle' ) )
);
SET @NoA_Or_NoB_YesC = ( SELECT COUNT(DISTINCT OrderNumber) + 1
FROM [dbo].[vAssocSeqLineItems]
WHERE model = 'Touring Tire Tube' AND (OrderNumber NOT IN
(SELECT OrderNumber FROM [dbo].[vAssocSeqLineItems]
WHERE model='Touring Tire') or OrderNumber NOT
IN
(SELECT OrderNumber FROM [dbo].[vAssocSeqLineItems]
WHERE model='Water Bottle')
)
/*
FROM [dbo].[vAssocSeqLineItems] c
JOIN (
SELECT DISTINCT
c.OrderNumber
FROM [dbo].[vAssocSeqLineItems]
c
LEFT JOIN ( SELECT DISTINCT
a.OrderNumber
FROM
[dbo].[vAssocSeqLineItems]
a
JOIN [dbo].[vAssocSeqLineItems] b ON a.OrderNumber = b.OrderNumber
WHERE
a.Model = 'Touring Tire'
AND b.Model = 'Water Bottle'
) d ON c.OrderNumber = d.OrderNumber
WHERE d.OrderNumber IS NULL
) d ON
c.OrderNumber = d.OrderNumber
WHERE
c.Model = 'Touring Tire Tube'
*/
);
--echo the values
SELECT
@YesA_And_YesB_YesC AS
YesA_And_YesB_YesC ,
@NoA_Or_NoB_NoC AS NoA_Or_NoB_NoC ,
@YesA_And_YesB_NoC AS YesA_And_YesB_NoC ,
@NoA_Or_NoB_YesC AS NoA_Or_NoB_YesC;
SET @Numerator = ( SELECT
@YesA_And_YesB_YesC / (
@YesA_And_YesB_YesC + @YesA_And_YesB_NoC )
);
SET @Denominator = ( SELECT @NoA_Or_NoB_YesC / ( @NoA_Or_NoB_YesC + @NoA_Or_NoB_NoC )
);
SELECT @Numerator AS Numerator ,
@Denominator AS Denominator ,
@Numerator / @Denominator AS Quotient;
SELECT ROUND(LOG10(@Numerator / @Denominator), 3) AS Importance;
It’s simply expressed as if {A} then {B}. Please note
that A could be more than one item, although B is always one item in our case.
Menu Options
It’s also worthwhile to denote some of the menu options.
Minimum Probability – the value .40 is the default value for minimum probability. We can change it through model parameter.
Minimum Importance
– The value .14 is calculated and displayed automatically by
SSAS.
Filter Rule – a searching string. For instance, if you
enter ‘Water’, then the rules must contain ‘Water’
Maximum Rows –
This value is also calculated and displayed automatically. The actual maximum
rows from the dataset is 1623. If you put a value greater than 1623, it would
be the same effect as 1623. If you set a value 10, then the displayed rules
will have a maximum row of 10.
Also please note the number of rules at the bottom of the screen!
Itemsets
The Itemsets tab displays the list of itemsets that
the model identified as frequently found together. The tab has three columns: Support, Size, and Itemset.
Support - lists the number of cases supporting the itemset.
Size - displays the number of items in the itemset.
Itemset - displays the actual itemset that the model discovered.
Support - lists the number of cases supporting the itemset.
Size - displays the number of items in the itemset.
Itemset - displays the actual itemset that the model discovered.
Menu Options
Minimum
support – the minimum number of rows to set the itemsets. The value 21 is
calculated from the dataset and displayed automatically.
Minimum
itemset size – we are often interested in 2 or more item sizes. Itemsize =1
means customers buy the item alone. Of course, the same item can appear
together with other items as well if customers buy them together.
Maximum
rows – the maximum of itemsets displayed in the pane. In our dataset, the total
itemsets is 643. Any number greater than 643 will have the same effect as that
for 643. If you enter 10, then only 10 itemsets will be displayed.
Filter
Itemset – the same as in Rules – a search string.
Also please note the size of itemsets meeting the criteria
is displayed at the bottom of the screen.
Dependency
Network
Each node in the Dependency Network viewer represents an item. The arrow between nodes
represents the association between items. The direction of the arrow dictates
the association between the items according to the rules that the algorithm
discovered. For example, if the viewer contains three items, A, B, and C, and C
is predicted by A and B, if you select node C, two arrows point toward node C —
A to C and B to C.
The
slider at the left of the viewer acts as a filter that is tied to the
probability of the rules. Lowering the slider shows only the strongest links.
In our case, the strongest link is: Touring Tire Tube predicts Touring Tube.
Please note, not all items
are displayed in viewer, only those with possible strong links are displayed.