1.
Preparation of Data
For this demo, I also use
the project sample (pp. 399403) in the TK 448 book. The project needs to use
the AdventureWorksDW2008 or AdventureWorksDW2008R2 database. Particularly, it
needs two views in the database: vAssocSeqOrders and vAssocSeqLineItems.
The vAssocSeqOrders has
4 columns and 21,255 rows, OrderNumber is the Key.
/***************21255
Rows ********************************
OrderNumber CustomerKey Region IncomeGroup
SO51176 18239 Pacific High
SO51177 27873 Pacific Low
SO51178 11245 Europe High
***********************************************************/
SELECT * FROM [dbo].[vAssocSeqOrders]
The vAssocSeqLineItems
has 3 columns and 52,761 rows, OrderNumber and Line are the Key.
/**************
52761 Rows **********************************
OrderNumber LineNumber Model
SO51176 1 Road250
SO51176 2 Road
Bottle Cage
SO51177 1 Touring2000
SO51177 2 Sport100
SO51178 1 Mountain200
SO51178 2 Mountain
Bottle Cage
SO51178 3 Water
Bottle
**************************************************************/
SELECT *
FROM [dbo].[vAssocSeqLineItems]
2.
The model structure of the sequence clustering model
Our goal is to use the Line Number attribute in
the dbo.vAssocSeqLineItems view to define the sequence of purchases of product models
inside a single customer order. With multiple orders, we can try to find rules
in sequences.
In order to do so, we need to use the
vAssocSeqOrders view as the case table, and the vAssocSeqLineItems as the
nested table. We set up the mining structure as below. In this way, we use Line
Number as the Key Sequence column, indicating that it has a sequence in it.
Because the sequence predicts a model, we use this column as input as well. The
Model column is predictable, but because we want to find not just sequences,
but sequences of models, we have to use it as input as well.
Essentially, the sequence cluster model asks
two questions: 1. what are the clusters? In other words, when customers
purchase multiple products, which products are likely to be purchased together? 2. What are the sequences when they purchase products?
Case Table Column

Nested Table Column

Usage

Order Number

Key


vAssocSeqLineItems


Line Number

Key sequence, Input


Model

Input, predictable

3.
Understand the results
The results for an
association rules model under mining model viewer has five tabs: Cluster Diagram, Cluster Profiles, Cluster
Characteristics, Cluster Discrimination and State Transitions; and one
output under the Microsoft Generic
Content Tree Viewer.
Before we look at the
viewers, we need to understand some basic concepts about sequence clustering
first.
The most fundamental
concept is transition or transition sequence. What is it? Let’s look at the data.
First, let’s see how many product models there are in the view.
/******* 37 Models in the view********
AllPurpose Bike Stand
Bike Wash
Classic Vest
Cycling Cap
Fender Set  Mountain
HalfFinger Gloves
Hitch Rack  4Bike
HL Mountain Tire
HL Road Tire
Hydration Pack
LL Mountain Tire
LL Road Tire
LongSleeve Logo Jersey
ML Mountain Tire
ML Road Tire
Mountain Bottle Cage
Mountain Tire Tube
Mountain200
Mountain400W
Mountain500
Patch kit
Racing Socks
Road Bottle Cage
Road Tire Tube
Road250
Road350W
Road550W
Road750
ShortSleeve Classic Jersey
Sport100
Touring Tire
Touring Tire Tube
Touring1000
Touring2000
Touring3000
Water Bottle
Women's Mountain Shorts
***********************************/
SELECT DISTINCT
model FROM vAssocSeqLineItems
ORDER BY model
In sequence clustering,
these 37 models are also called states or
transition sequence states. Now, let
us understand the concept ‘transition’
or ‘transition sequence’. Let’s pick up an order  one with the most line
items. SO58845 is one of them.
/*
OrderNumber NumberOfModels
SO58845 8
SO70714 8
SO72656 8
SO51555 7
SO54042 7
*/
USE AdventureWorksDW2008R2
SELECT TOP 5
OrderNumber, Count(*) AS NumberOfModels
FROM vAssocSeqLineItems
GROUP BY
OrderNumber
ORDER BY Count(*) DESC
Let’s see what’s on the
order. From the result below, we can see that on this order, Mountain500 is the 1^{st}
model, followed by LL Mountain Tire. The flow from Mountain500 to LL Mountain
Tire is a transition sequence or
simply a transition. From the 2^{nd} line item to the 3^{rd}
line item is another transition sequence.
Thus, there are 7 transition sequences on this order.
/*
OrderNumber LineNumber Model
SO58845 1 Mountain500
SO58845 2 LL Mountain Tire
SO58845 3 Mountain Tire Tube
SO58845 4 Fender Set  Mountain
SO58845 5 Mountain Bottle Cage
SO58845 6 Water Bottle
SO58845 7 Sport100
SO58845 8 LongSleeve Logo Jersey
*/
SELECT * FROM vAssocSeqLineItems
WHERE OrderNumber='SO58845'
But on other orders,
Mountain500 may be followed by other models. SSAS has figured out there are
total 19,219 transition sequences on all of the orders.
The next important concept is cluster. But before looking at the clusters, let’s examine the
population. There are total 52,761 items in the vAssocSeqLineItems view. The frequency and ratio of each product
model are listed below:
/*
Meaning(Model)

Frequencies

Distribution(Probability)

Sport100

6171

0.117

Water Bottle

4076

0.077

Patch kit

3010

0.057

Mountain Tire
Tube

2908

0.055

Mountain200

2477

0.047

Road Tire
Tube

2216

0.042

Cycling Cap

2095

0.04

Fender Set 
Mountain

2014

0.038

Mountain
Bottle Cage

1941

0.037

Road Bottle
Cage

1702

0.032

LongSleeve
Logo Jersey

1642

0.031

ShortSleeve
Classic Jersey

1537

0.029

Road750

1443

0.027

HalfFinger
Gloves

1363

0.026

Touring Tire
Tube

1397

0.026

HL Mountain
Tire

1331

0.025

Touring1000

1255

0.024

ML Mountain
Tire

1083

0.021

Road550W

1033

0.02

LL Road Tire

980

0.019

Road350W

929

0.018

Women's
Mountain Shorts

958

0.018

Touring Tire

881

0.017

ML Road Tire

890

0.017

Bike Wash

864

0.016

HL Road Tire

804

0.015

LL Mountain
Tire

799

0.015

Hydration
Pack

706

0.013

Road250

706

0.013

Racing Socks

543

0.01

Classic Vest

527

0.01

Touring3000

540

0.01

Mountain400W

543

0.01

Mountain500

479

0.009

Touring2000

372

0.007

Hitch Rack 
4Bike

308

0.006

AllPurpose
Bike Stand

238

0.005

*/
DECLARE @TotalItem decimal(8,2)
SELECT
@TotalItem = COUNT (*) FROM vAssocSeqLineItems
SELECT Model AS
[Meaning(Model)], COUNT(*) AS Frequencies, CAST (COUNT(*)/@TotalItem
AS DECIMAL (8,3)) AS
[Distribution(Probability)] FROM
vAssocSeqLineItems
GROUP BY model
ORDER BY
[Distribution(Probability)] DESC
This is exactly the same as the population profile in the chart below. Please note that the ratio of each product in the legend is not based on transition sequences, rather than based on the frequencies on the orders. For example, Sport100 has been bought 6171 times, and the total line items in the view is 52,761, so the distribution of the model ‘Sport100’ is 6171/52761=11.7%. It’s nothing to do with the total 19,129 transition sequences.
SSAS has identified 15
clusters from the dataset as below.
Each cluster contains two or more models. Let’s see
some examples. For instance, Cluster 12 has only two models: Water Bottle and
Cycling Cap.
89.6% of the cluster are Water Bottle, whereas the remaining 10.4%
are cycling cap. This is much clear from the detailed information from the
Generic Content Tree Viewer on Cluster 12 as shown in table below. Cluster 12 contains a frequency of about
32 on Cycling Cap (about 10.4%), and the remaining 89.6% are 274 Water bottle.
Please note this number is not related to Node_Support 294 in the cluster.
ATTRIBUTE_NAME

ATTRIBUTE_VALUE

SUPPORT

PROBABILITY

Model

Missing

0

1.00E14

Model

AllPurpose
Bike Stand

0

1.00E14

Model

Bike Wash

0

1.00E14

Model

Classic Vest

0

1.00E14

Model

Cycling Cap

31.74247782

0.103975535

Model

Fender Set 
Mountain

0

1.00E14

Model

HalfFinger
Gloves

0

1.00E14

Model

Hitch Rack 
4Bike

0

1.00E14

Model

HL Mountain
Tire

0

1.00E14

Model

HL Road Tire

0

1.00E14

Model

Hydration Pack

0

1.00E14

Model

LL Mountain
Tire

0

1.00E14

Model

LL Road Tire

0

1.00E14

Model

LongSleeve
Logo Jersey

0

1.00E14

Model

ML Mountain
Tire

0

1.00E14

Model

ML Road Tire

0

1.00E14

Model

Mountain Bottle
Cage

0

1.00E14

Model

Mountain Tire
Tube

0

1.00E14

Model

Mountain200

0

1.00E14

Model

Mountain400W

0

1.00E14

Model

Mountain500

0

1.00E14

Model

Patch kit

0

1.00E14

Model

Racing Socks

0

1.00E14

Model

Road Bottle
Cage

0

1.00E14

Model

Road Tire Tube

0

1.00E14

Model

Road250

0

1.00E14

Model

Road350W

0

1.00E14

Model

Road550W

0

1.00E14

Model

Road750

0

1.00E14

Model

ShortSleeve
Classic Jersey

0

1.00E14

Model

Sport100

0

1.00E14

Model

Touring Tire

0

1.00E14

Model

Touring Tire
Tube

0

1.00E14

Model

Touring1000

0

1.00E14

Model

Touring2000

0

1.00E14

Model

Touring3000

0

1.00E14

Model

Water Bottle

273.5454706

0.896024465

Model

Women's
Mountain Shorts

0

1.00E14

Cluster 2 is much more
complicated with many models in it. The composition of Cluster 2 is shown
below. Cluster 13 also have a simple composition: 48.6% Mountain Tire Tube and
51.4% ML Mountain Tire. For some reason I do not know yet, SSAS has found
Cluster 2 and Cluster 13 has the strongest link among the 15 clusters as
demonstrated later.
With this background, let’s
examine the diagrams.
a. Cluster Diagram
This viewer displays all the
clusters that are found in a mining model as shown above. By default, the shade represents the
population. The darker the shading, the greater the transition sequence is for a
cluster. Conversely, the distribution decreases as the shading gets lighter. In
our case, there are 15 clusters found in the mining models. Cluster 1 has the
largest distribution (2678/19219=14%), where 2678 is the total number of transition
sequences for Cluster 1 and 19,219 is the total number of transitions sequences
in the population.
The shading of the line
that connects one cluster to another represents the strength of the similarity
of the clusters. As the line becomes darker, the similarity of the links
becomes stronger. We can adjust how many lines the viewer shows by adjusting
the slider to the right of the clusters. Lowering the slider shows only the
strongest links. In our case, Cluster 2 and Cluster 13 appear to have the
strongest link in the population. This strongest link population shows on all
of the cluster diagrams at the state level, as stated below.
We can also select
Model for ShadingVariable
and
any particular model for State to
see the cluster diagram for a particular state. For instance, we select Model
for ShadingVariable and Sport100 for State, we obtained the
following cluster diagram. It indicates the distribution of Sport100 in each
of the 15 clusters. Among them, cluster 8 has the largest ratio of 21%.
Similarly, if we choose the product model ‘Women’s Mountain Shoes’, we will
see the cluster diagram as below, showing the distribution of ‘Women’s Mountain
Shoes’ in all of the 15 clusters. Cluster 14 has the largest ratio of ‘Women’s
Mountain Shoes’, 66%.
The ratios of ‘Women’s Mountain Shoes’ in other clusters are only Cluster 1
(1%), Cluster 2 (2%), Cluster 4 (2%), Cluster 7(1%), Cluster 8 (1%) and Cluster
9 (1%). Please note the total of those percentages is not 100%, actually it is
66%+1%+2%+2%+1%+1%+1% = 73%. What’s to be 100% is the composition of states in
each cluster. For instance, for Cluster 14, we see the following state composition
(from the cluster profile):
Racing Socks (6.4%) +
Cycling Cap (10.9%) + Long Sleeve Logo Jersey (17%) + Women’s Mountain Shorts (65.7%) = 100%
The Cluster Profile tab provides an overall view of the
clusters that the algorithm in your model creates. Each column that follows the Population column in the grid represents a
cluster that the model discovered. The <attribute> row describes all the
items that the cluster contains and their overall distribution, and the
<attribute>.samples row represents different sequences of data that exist
in the cluster. For instance, the 1^{st} chart below for the population
displays the all the items it contains (e.g., Sport100 =11.7%, Water
Bottle=7.7%, Patch Kit = 5.7% etc.). The 2^{nd} chart displays all
different transition sequences in the population such as Mountain Tire Tube
followed by LL Mountain Tire followed by Cycling Cap and followed by Sport100;
HL Road Tire followed by Road Tire Tube and followed by Sport100; etc.
The Histogram
bars option
controls the number of bars that are visible in the histogram. If more bars
exist than you choose to display, the bars of highest importance are retained,
and the remaining bars are grouped together into a gray bucket.
You can change the
default names of the clusters, to make the names more descriptive. Rename a
cluster by rightclicking its column heading and selecting Rename
cluster. You can hide clusters by selecting Hide column, and you
can also drag columns to reorder them in the viewer.
c. Cluster Characteristics
To use the Cluster Characteristics tab, select a cluster from the Cluster list. After you select a cluster, you
can examine the characteristics that make up that specific cluster. The
attributes that the cluster contains are listed in the Variables columns, and the state of the listed
attribute is listed in the Values column. Attribute states are listed in
order of importance, described by the probability that they will appear in the
cluster. The probability is shown in the Probability
column.
Let’s pick up a simple
one: Cluster 12.
We also can get the same
picture from the DMX query:
CALL
System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','12',0.0005);
The result is:
Attributes

Values

Frequency

Support

Model

Water Bottle

89.6%

274

Model.Transitions

89.6%

270


Model.Transitions

[Start] > Water Bottle

88.4%

270

Model.Transitions

[Start] > Cycling Cap

11.6%

35

Model

Cycling Cap

10.4%

32

Model.Transitions

Cycling Cap,Water Bottle

10.4%

35

In this cluster, there
are only two product models: Cycling Cap is 10.4% and Water Bottle is 89.6%.
For the transition sequences, 11.6% of the sequences originate from Cycling Cap
and the remaining 88.4% originate from Water Bottle.
Now the hard part – the
‘missing’ transitions. It is paired with the multiple item sets. In this case,
it is paired with the model transition – Cycling Cap, Water Bottle. It’s
nothing to do with the modes in the cluster or the model transition start with
[Start]. The model transition ‘Cycling Cap, Water Bottle’ means customers buy
these two models together and in that exact sequence. The figure above indicates
such a collective purchase is 10.4%. The ratio of ‘Missing’ is defined as: (Support of Missing) / (Support of Missing + Support of First State). The support of missing is defined as the model is the first state and 'Missing' as the second state. The Support of First State is the frequency of the model as the 1st state. In our case, we have starting models: cycling cap and water bottle (as we have two starting models). Let's retrieve the missing supports  that is, cycling cap or water bottle is the first state and 'Missing' is the second state using the following DMX:
SELECT FLATTENED
(SELECT
ATTRIBUTE_VALUE AS Product2,
[Support] AS
[P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing')
AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '819192' or
NODE_UNIQUE_NAME = '819224';
It returns:
t.Product2

t.P2 Support

t.P2 Probability

Missing

0

0.0%

Missing

269.8620432

100.0%

Thus the ratio of missing for Cycling Cap is 0.
For another starting point, Water Bottle, the support of missing is about 270,
the support of first state is 32 (the frequency for cycling cap). Thus,
(Support of Missing) / (Support of
Missing + Support of First State) =
269.862/ (269.862+32) =89.4%
Please note the numbers appear to be slightly off due to the way the Expectation Maximization algorithm works.
Let’s look at another example,
Cluster 14. The following table can be obtained by running a DMX query:
CALL
System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','14',0.0005);
Attributes

Values

Frequency

Support

Model

Women's
Mountain Shorts

65.7%

294

Model

LongSleeve
Logo Jersey

17.0%

76

Model

Cycling Cap

10.9%

49

Model

Racing Socks

6.4%

28

Model.Transitions

[Start] >
Women's Mountain Shorts

100.0%

447

Model.Transitions

34.3%

233


Model.Transitions

Women's
Mountain Shorts,LongSleeve Logo Jersey

16.7%

114

Model.Transitions

Women's
Mountain Shorts,Cycling Cap

8.8%

60

Model.Transitions

Women's
Mountain Shorts,Racing Socks

6.0%

41

First, four models in
the cluster adds up to 100%. Second, all sequences start with Women’s Mountain
Shorts. Third, the likelihood of customers buying Women's Mountain Shorts
followed by LongSleeve Logo Jersey is 16.7%. Similarly, the probabilities of
the purchase sequences of Women's Mountain Shorts followed by Cycling Cap and Women's
Mountain Shorts followed by Racing Socks are 8.8%, and 6.0%, respectively. Not
surprisingly, the ratio for ‘Missing’ is not 61.5% (100%16.7%8.8%6.0%),
rather than another number: 34.3%, because customers may buy products in
different ways such as buying Women's Mountain Shorts alone or three products
together. Let’s figure out the ratio for missing:
SELECT FLATTENED
(SELECT
ATTRIBUTE_VALUE AS Product2,
[Support] AS
[P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing')
AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '950295';  Women’s
Mountain Shorts
t.Product2 t.P2 Support t.P2 Probability
Missing 232.729883335169 0.521212121212121
(Support of Missing) / (Support of
Missing + Support of First State) = 232.73/ (232.73+447) =34.2%
d. Cluster Discrimination – easy one!
You can use the Cluster Discrimination tab to compare attributes between two
clusters, to determine how items in a sequence favor one cluster over another.
Use the Cluster 1 and Cluster
2 lists to select the
clusters to compare. The viewer determines the most important differences
between the clusters, and displays the attribute states that are associated
with the differences, in order of importance. A bar to the right of the
attribute shows which cluster the state favors, and the size of the bar shows
how strongly the state favors the cluster.
e. Cluster Transitions
By selecting a cluster
on the Cluster
Transitions tab, you can
browse the transitions between sequence states in the selected cluster. Each
node in the viewer represents a state of the sequence column. An arrow
represents a transition between two states and the probability that is
associated with the transition. If a transition returns back to the originating
node, an arrow can point back to the originating node (bidirectional).
An arrow that
originates from a dot represents the probability that the node is the start of
a sequence. An ending edge that leads to a null represents the probability that
the node is the end of the sequence. You can filter the edge of the nodes by
using the slider at the left of the tab.
For instance, for Cluster 12, 12% (11.6% actually) of the
sequences originate from Cycling Cap and 88% of the sequences originate from Water
Bottle. Of those originating from Cycling Cap are followed by Water Bottle at
100%. Water bottle is the only end of the sequences, 88% by itself, and another
12% from cycling cap.