1.
Preparation of Data
For this demo, I also use
the project sample (pp. 399-403) in the TK 448 book. The project needs to use
the AdventureWorksDW2008 or AdventureWorksDW2008R2 database. Particularly, it
needs two views in the database: vAssocSeqOrders and vAssocSeqLineItems.
The vAssocSeqOrders has
4 columns and 21,255 rows, OrderNumber is the Key.
/***************21255
Rows ********************************
OrderNumber CustomerKey Region IncomeGroup
SO51176 18239 Pacific High
SO51177 27873 Pacific Low
SO51178 11245 Europe High
***********************************************************/
SELECT * FROM [dbo].[vAssocSeqOrders]
The vAssocSeqLineItems
has 3 columns and 52,761 rows, OrderNumber and Line are the Key.
/**************
52761 Rows **********************************
OrderNumber LineNumber Model
SO51176 1 Road-250
SO51176 2 Road
Bottle Cage
SO51177 1 Touring-2000
SO51177 2 Sport-100
SO51178 1 Mountain-200
SO51178 2 Mountain
Bottle Cage
SO51178 3 Water
Bottle
**************************************************************/
SELECT *
FROM [dbo].[vAssocSeqLineItems]
2.
The model structure of the sequence clustering model
Our goal is to use the Line Number attribute in
the dbo.vAssocSeqLineItems view to define the sequence of purchases of product models
inside a single customer order. With multiple orders, we can try to find rules
in sequences.
In order to do so, we need to use the
vAssocSeqOrders view as the case table, and the vAssocSeqLineItems as the
nested table. We set up the mining structure as below. In this way, we use Line
Number as the Key Sequence column, indicating that it has a sequence in it.
Because the sequence predicts a model, we use this column as input as well. The
Model column is predictable, but because we want to find not just sequences,
but sequences of models, we have to use it as input as well.
Essentially, the sequence cluster model asks
two questions: 1. what are the clusters? In other words, when customers
purchase multiple products, which products are likely to be purchased together? 2. What are the sequences when they purchase products?
Case Table Column
|
Nested Table Column
|
Usage
|
Order Number
|
Key
|
|
vAssocSeqLineItems
|
||
Line Number
|
Key sequence, Input
|
|
Model
|
Input, predictable
|
3.
Understand the results
The results for an
association rules model under mining model viewer has five tabs: Cluster Diagram, Cluster Profiles, Cluster
Characteristics, Cluster Discrimination and State Transitions; and one
output under the Microsoft Generic
Content Tree Viewer.
Before we look at the
viewers, we need to understand some basic concepts about sequence clustering
first.
The most fundamental
concept is transition or transition sequence. What is it? Let’s look at the data.
First, let’s see how many product models there are in the view.
/******* 37 Models in the view********
All-Purpose Bike Stand
Bike Wash
Classic Vest
Cycling Cap
Fender Set - Mountain
Half-Finger Gloves
Hitch Rack - 4-Bike
HL Mountain Tire
HL Road Tire
Hydration Pack
LL Mountain Tire
LL Road Tire
Long-Sleeve Logo Jersey
ML Mountain Tire
ML Road Tire
Mountain Bottle Cage
Mountain Tire Tube
Mountain-200
Mountain-400-W
Mountain-500
Patch kit
Racing Socks
Road Bottle Cage
Road Tire Tube
Road-250
Road-350-W
Road-550-W
Road-750
Short-Sleeve Classic Jersey
Sport-100
Touring Tire
Touring Tire Tube
Touring-1000
Touring-2000
Touring-3000
Water Bottle
Women's Mountain Shorts
***********************************/
SELECT DISTINCT
model FROM vAssocSeqLineItems
ORDER BY model
In sequence clustering,
these 37 models are also called states or
transition sequence states. Now, let
us understand the concept ‘transition’
or ‘transition sequence’. Let’s pick up an order - one with the most line
items. SO58845 is one of them.
/*
OrderNumber NumberOfModels
SO58845 8
SO70714 8
SO72656 8
SO51555 7
SO54042 7
*/
USE AdventureWorksDW2008R2
SELECT TOP 5
OrderNumber, Count(*) AS NumberOfModels
FROM vAssocSeqLineItems
GROUP BY
OrderNumber
ORDER BY Count(*) DESC
Let’s see what’s on the
order. From the result below, we can see that on this order, Mountain-500 is the 1st
model, followed by LL Mountain Tire. The flow from Mountain-500 to LL Mountain
Tire is a transition sequence or
simply a transition. From the 2nd line item to the 3rd
line item is another transition sequence.
Thus, there are 7 transition sequences on this order.
/*
OrderNumber LineNumber Model
SO58845 1 Mountain-500
SO58845 2 LL Mountain Tire
SO58845 3 Mountain Tire Tube
SO58845 4 Fender Set - Mountain
SO58845 5 Mountain Bottle Cage
SO58845 6 Water Bottle
SO58845 7 Sport-100
SO58845 8 Long-Sleeve Logo Jersey
*/
SELECT * FROM vAssocSeqLineItems
WHERE OrderNumber='SO58845'
But on other orders,
Mountain-500 may be followed by other models. SSAS has figured out there are
total 19,219 transition sequences on all of the orders.
The next important concept is cluster. But before looking at the clusters, let’s examine the
population. There are total 52,761 items in the vAssocSeqLineItems view. The frequency and ratio of each product
model are listed below:
/*
Meaning(Model)
|
Frequencies
|
Distribution(Probability)
|
Sport-100
|
6171
|
0.117
|
Water Bottle
|
4076
|
0.077
|
Patch kit
|
3010
|
0.057
|
Mountain Tire
Tube
|
2908
|
0.055
|
Mountain-200
|
2477
|
0.047
|
Road Tire
Tube
|
2216
|
0.042
|
Cycling Cap
|
2095
|
0.04
|
Fender Set -
Mountain
|
2014
|
0.038
|
Mountain
Bottle Cage
|
1941
|
0.037
|
Road Bottle
Cage
|
1702
|
0.032
|
Long-Sleeve
Logo Jersey
|
1642
|
0.031
|
Short-Sleeve
Classic Jersey
|
1537
|
0.029
|
Road-750
|
1443
|
0.027
|
Half-Finger
Gloves
|
1363
|
0.026
|
Touring Tire
Tube
|
1397
|
0.026
|
HL Mountain
Tire
|
1331
|
0.025
|
Touring-1000
|
1255
|
0.024
|
ML Mountain
Tire
|
1083
|
0.021
|
Road-550-W
|
1033
|
0.02
|
LL Road Tire
|
980
|
0.019
|
Road-350-W
|
929
|
0.018
|
Women's
Mountain Shorts
|
958
|
0.018
|
Touring Tire
|
881
|
0.017
|
ML Road Tire
|
890
|
0.017
|
Bike Wash
|
864
|
0.016
|
HL Road Tire
|
804
|
0.015
|
LL Mountain
Tire
|
799
|
0.015
|
Hydration
Pack
|
706
|
0.013
|
Road-250
|
706
|
0.013
|
Racing Socks
|
543
|
0.01
|
Classic Vest
|
527
|
0.01
|
Touring-3000
|
540
|
0.01
|
Mountain-400-W
|
543
|
0.01
|
Mountain-500
|
479
|
0.009
|
Touring-2000
|
372
|
0.007
|
Hitch Rack -
4-Bike
|
308
|
0.006
|
All-Purpose
Bike Stand
|
238
|
0.005
|
*/
DECLARE @TotalItem decimal(8,2)
SELECT
@TotalItem = COUNT (*) FROM vAssocSeqLineItems
SELECT Model AS
[Meaning(Model)], COUNT(*) AS Frequencies, CAST (COUNT(*)/@TotalItem
AS DECIMAL (8,3)) AS
[Distribution(Probability)] FROM
vAssocSeqLineItems
GROUP BY model
ORDER BY
[Distribution(Probability)] DESC
This is exactly the same as the population profile in the chart below. Please note that the ratio of each product in the legend is not based on transition sequences, rather than based on the frequencies on the orders. For example, Sport-100 has been bought 6171 times, and the total line items in the view is 52,761, so the distribution of the model ‘Sport-100’ is 6171/52761=11.7%. It’s nothing to do with the total 19,129 transition sequences.
SSAS has identified 15
clusters from the dataset as below.
Each cluster contains two or more models. Let’s see
some examples. For instance, Cluster 12 has only two models: Water Bottle and
Cycling Cap.
89.6% of the cluster are Water Bottle, whereas the remaining 10.4%
are cycling cap. This is much clear from the detailed information from the
Generic Content Tree Viewer on Cluster 12 as shown in table below. Cluster 12 contains a frequency of about
32 on Cycling Cap (about 10.4%), and the remaining 89.6% are 274 Water bottle.
Please note this number is not related to Node_Support 294 in the cluster.
ATTRIBUTE_NAME
|
ATTRIBUTE_VALUE
|
SUPPORT
|
PROBABILITY
|
Model
|
Missing
|
0
|
1.00E-14
|
Model
|
All-Purpose
Bike Stand
|
0
|
1.00E-14
|
Model
|
Bike Wash
|
0
|
1.00E-14
|
Model
|
Classic Vest
|
0
|
1.00E-14
|
Model
|
Cycling Cap
|
31.74247782
|
0.103975535
|
Model
|
Fender Set -
Mountain
|
0
|
1.00E-14
|
Model
|
Half-Finger
Gloves
|
0
|
1.00E-14
|
Model
|
Hitch Rack -
4-Bike
|
0
|
1.00E-14
|
Model
|
HL Mountain
Tire
|
0
|
1.00E-14
|
Model
|
HL Road Tire
|
0
|
1.00E-14
|
Model
|
Hydration Pack
|
0
|
1.00E-14
|
Model
|
LL Mountain
Tire
|
0
|
1.00E-14
|
Model
|
LL Road Tire
|
0
|
1.00E-14
|
Model
|
Long-Sleeve
Logo Jersey
|
0
|
1.00E-14
|
Model
|
ML Mountain
Tire
|
0
|
1.00E-14
|
Model
|
ML Road Tire
|
0
|
1.00E-14
|
Model
|
Mountain Bottle
Cage
|
0
|
1.00E-14
|
Model
|
Mountain Tire
Tube
|
0
|
1.00E-14
|
Model
|
Mountain-200
|
0
|
1.00E-14
|
Model
|
Mountain-400-W
|
0
|
1.00E-14
|
Model
|
Mountain-500
|
0
|
1.00E-14
|
Model
|
Patch kit
|
0
|
1.00E-14
|
Model
|
Racing Socks
|
0
|
1.00E-14
|
Model
|
Road Bottle
Cage
|
0
|
1.00E-14
|
Model
|
Road Tire Tube
|
0
|
1.00E-14
|
Model
|
Road-250
|
0
|
1.00E-14
|
Model
|
Road-350-W
|
0
|
1.00E-14
|
Model
|
Road-550-W
|
0
|
1.00E-14
|
Model
|
Road-750
|
0
|
1.00E-14
|
Model
|
Short-Sleeve
Classic Jersey
|
0
|
1.00E-14
|
Model
|
Sport-100
|
0
|
1.00E-14
|
Model
|
Touring Tire
|
0
|
1.00E-14
|
Model
|
Touring Tire
Tube
|
0
|
1.00E-14
|
Model
|
Touring-1000
|
0
|
1.00E-14
|
Model
|
Touring-2000
|
0
|
1.00E-14
|
Model
|
Touring-3000
|
0
|
1.00E-14
|
Model
|
Water Bottle
|
273.5454706
|
0.896024465
|
Model
|
Women's
Mountain Shorts
|
0
|
1.00E-14
|
Cluster 2 is much more
complicated with many models in it. The composition of Cluster 2 is shown
below. Cluster 13 also have a simple composition: 48.6% Mountain Tire Tube and
51.4% ML Mountain Tire. For some reason I do not know yet, SSAS has found
Cluster 2 and Cluster 13 has the strongest link among the 15 clusters as
demonstrated later.
With this background, let’s
examine the diagrams.
a. Cluster Diagram
This viewer displays all the
clusters that are found in a mining model as shown above. By default, the shade represents the
population. The darker the shading, the greater the transition sequence is for a
cluster. Conversely, the distribution decreases as the shading gets lighter. In
our case, there are 15 clusters found in the mining models. Cluster 1 has the
largest distribution (2678/19219=14%), where 2678 is the total number of transition
sequences for Cluster 1 and 19,219 is the total number of transitions sequences
in the population.
The shading of the line
that connects one cluster to another represents the strength of the similarity
of the clusters. As the line becomes darker, the similarity of the links
becomes stronger. We can adjust how many lines the viewer shows by adjusting
the slider to the right of the clusters. Lowering the slider shows only the
strongest links. In our case, Cluster 2 and Cluster 13 appear to have the
strongest link in the population. This strongest link population shows on all
of the cluster diagrams at the state level, as stated below.
We can also select
Model for ShadingVariable
and
any particular model for State to
see the cluster diagram for a particular state. For instance, we select Model
for ShadingVariable and Sport-100 for State, we obtained the
following cluster diagram. It indicates the distribution of Sport-100 in each
of the 15 clusters. Among them, cluster 8 has the largest ratio of 21%.
Similarly, if we choose the product model ‘Women’s Mountain Shoes’, we will
see the cluster diagram as below, showing the distribution of ‘Women’s Mountain
Shoes’ in all of the 15 clusters. Cluster 14 has the largest ratio of ‘Women’s
Mountain Shoes’, 66%.
The ratios of ‘Women’s Mountain Shoes’ in other clusters are only Cluster 1
(1%), Cluster 2 (2%), Cluster 4 (2%), Cluster 7(1%), Cluster 8 (1%) and Cluster
9 (1%). Please note the total of those percentages is not 100%, actually it is
66%+1%+2%+2%+1%+1%+1% = 73%. What’s to be 100% is the composition of states in
each cluster. For instance, for Cluster 14, we see the following state composition
(from the cluster profile):
Racing Socks (6.4%) +
Cycling Cap (10.9%) + Long Sleeve Logo Jersey (17%) + Women’s Mountain Shorts (65.7%) = 100%
The Cluster Profile tab provides an overall view of the
clusters that the algorithm in your model creates. Each column that follows the Population column in the grid represents a
cluster that the model discovered. The <attribute> row describes all the
items that the cluster contains and their overall distribution, and the
<attribute>.samples row represents different sequences of data that exist
in the cluster. For instance, the 1st chart below for the population
displays the all the items it contains (e.g., Sport-100 =11.7%, Water
Bottle=7.7%, Patch Kit = 5.7% etc.). The 2nd chart displays all
different transition sequences in the population such as Mountain Tire Tube
followed by LL Mountain Tire followed by Cycling Cap and followed by Sport-100;
HL Road Tire followed by Road Tire Tube and followed by Sport-100; etc.
The Histogram
bars option
controls the number of bars that are visible in the histogram. If more bars
exist than you choose to display, the bars of highest importance are retained,
and the remaining bars are grouped together into a gray bucket.
You can change the
default names of the clusters, to make the names more descriptive. Rename a
cluster by right-clicking its column heading and selecting Rename
cluster. You can hide clusters by selecting Hide column, and you
can also drag columns to reorder them in the viewer.
c. Cluster Characteristics
To use the Cluster Characteristics tab, select a cluster from the Cluster list. After you select a cluster, you
can examine the characteristics that make up that specific cluster. The
attributes that the cluster contains are listed in the Variables columns, and the state of the listed
attribute is listed in the Values column. Attribute states are listed in
order of importance, described by the probability that they will appear in the
cluster. The probability is shown in the Probability
column.
Let’s pick up a simple
one: Cluster 12.
We also can get the same
picture from the DMX query:
CALL
System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','12',0.0005);
The result is:
Attributes
|
Values
|
Frequency
|
Support
|
Model
|
Water Bottle
|
89.6%
|
274
|
Model.Transitions
|
89.6%
|
270
|
|
Model.Transitions
|
[Start] -> Water Bottle
|
88.4%
|
270
|
Model.Transitions
|
[Start] -> Cycling Cap
|
11.6%
|
35
|
Model
|
Cycling Cap
|
10.4%
|
32
|
Model.Transitions
|
Cycling Cap,Water Bottle
|
10.4%
|
35
|
In this cluster, there
are only two product models: Cycling Cap is 10.4% and Water Bottle is 89.6%.
For the transition sequences, 11.6% of the sequences originate from Cycling Cap
and the remaining 88.4% originate from Water Bottle.
Now the hard part – the
‘missing’ transitions. It is paired with the multiple item sets. In this case,
it is paired with the model transition – Cycling Cap, Water Bottle. It’s
nothing to do with the modes in the cluster or the model transition start with
[Start]. The model transition ‘Cycling Cap, Water Bottle’ means customers buy
these two models together and in that exact sequence. The figure above indicates
such a collective purchase is 10.4%. The ratio of ‘Missing’ is defined as: (Support of Missing) / (Support of Missing + Support of First State). The support of missing is defined as the model is the first state and 'Missing' as the second state. The Support of First State is the frequency of the model as the 1st state. In our case, we have starting models: cycling cap and water bottle (as we have two starting models). Let's retrieve the missing supports - that is, cycling cap or water bottle is the first state and 'Missing' is the second state using the following DMX:
SELECT FLATTENED
(SELECT
ATTRIBUTE_VALUE AS Product2,
[Support] AS
[P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing')
AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '819192' or
NODE_UNIQUE_NAME = '819224';
It returns:
t.Product2
|
t.P2 Support
|
t.P2 Probability
|
Missing
|
0
|
0.0%
|
Missing
|
269.8620432
|
100.0%
|
Thus the ratio of missing for Cycling Cap is 0.
For another starting point, Water Bottle, the support of missing is about 270,
the support of first state is 32 (the frequency for cycling cap). Thus,
(Support of Missing) / (Support of
Missing + Support of First State) =
269.862/ (269.862+32) =89.4%
Please note the numbers appear to be slightly off due to the way the Expectation Maximization algorithm works.
Let’s look at another example,
Cluster 14. The following table can be obtained by running a DMX query:
CALL
System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','14',0.0005);
Attributes
|
Values
|
Frequency
|
Support
|
Model
|
Women's
Mountain Shorts
|
65.7%
|
294
|
Model
|
Long-Sleeve
Logo Jersey
|
17.0%
|
76
|
Model
|
Cycling Cap
|
10.9%
|
49
|
Model
|
Racing Socks
|
6.4%
|
28
|
Model.Transitions
|
[Start] ->
Women's Mountain Shorts
|
100.0%
|
447
|
Model.Transitions
|
34.3%
|
233
|
|
Model.Transitions
|
Women's
Mountain Shorts,Long-Sleeve Logo Jersey
|
16.7%
|
114
|
Model.Transitions
|
Women's
Mountain Shorts,Cycling Cap
|
8.8%
|
60
|
Model.Transitions
|
Women's
Mountain Shorts,Racing Socks
|
6.0%
|
41
|
First, four models in
the cluster adds up to 100%. Second, all sequences start with Women’s Mountain
Shorts. Third, the likelihood of customers buying Women's Mountain Shorts
followed by Long-Sleeve Logo Jersey is 16.7%. Similarly, the probabilities of
the purchase sequences of Women's Mountain Shorts followed by Cycling Cap and Women's
Mountain Shorts followed by Racing Socks are 8.8%, and 6.0%, respectively. Not
surprisingly, the ratio for ‘Missing’ is not 61.5% (100%-16.7%-8.8%-6.0%),
rather than another number: 34.3%, because customers may buy products in
different ways such as buying Women's Mountain Shorts alone or three products
together. Let’s figure out the ratio for missing:
SELECT FLATTENED
(SELECT
ATTRIBUTE_VALUE AS Product2,
[Support] AS
[P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing')
AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '950295'; -- Women’s
Mountain Shorts
t.Product2 t.P2 Support t.P2 Probability
Missing 232.729883335169 0.521212121212121
(Support of Missing) / (Support of
Missing + Support of First State) = 232.73/ (232.73+447) =34.2%
d. Cluster Discrimination – easy one!
You can use the Cluster Discrimination tab to compare attributes between two
clusters, to determine how items in a sequence favor one cluster over another.
Use the Cluster 1 and Cluster
2 lists to select the
clusters to compare. The viewer determines the most important differences
between the clusters, and displays the attribute states that are associated
with the differences, in order of importance. A bar to the right of the
attribute shows which cluster the state favors, and the size of the bar shows
how strongly the state favors the cluster.
e. Cluster Transitions
By selecting a cluster
on the Cluster
Transitions tab, you can
browse the transitions between sequence states in the selected cluster. Each
node in the viewer represents a state of the sequence column. An arrow
represents a transition between two states and the probability that is
associated with the transition. If a transition returns back to the originating
node, an arrow can point back to the originating node (bidirectional).
An arrow that
originates from a dot represents the probability that the node is the start of
a sequence. An ending edge that leads to a null represents the probability that
the node is the end of the sequence. You can filter the edge of the nodes by
using the slider at the left of the tab.
For instance, for Cluster 12, 12% (11.6% actually) of the
sequences originate from Cycling Cap and 88% of the sequences originate from Water
Bottle. Of those originating from Cycling Cap are followed by Water Bottle at
100%. Water bottle is the only end of the sequences, 88% by itself, and another
12% from cycling cap.