How to Interpret the Results from a Sequence Clustering Model?

1.   Preparation of Data

For this demo, I also use the project sample (pp. 399-403) in the TK 448 book. The project needs to use the AdventureWorksDW2008 or AdventureWorksDW2008R2 database. Particularly, it needs two views in the database: vAssocSeqOrders and vAssocSeqLineItems.
The vAssocSeqOrders has 4 columns and 21,255 rows, OrderNumber is the Key.
/***************21255 Rows ********************************
OrderNumber        CustomerKey       Region                   IncomeGroup
SO51176               18239                    Pacific                    High
SO51177               27873                    Pacific                    Low
SO51178               11245                    Europe                   High
***********************************************************/
SELECT * FROM [dbo].[vAssocSeqOrders]
The vAssocSeqLineItems has 3 columns and 52,761 rows, OrderNumber and Line are the Key.
/************** 52761 Rows **********************************
OrderNumber        LineNumber          Model
SO51176               1                            Road-250
SO51176               2                            Road Bottle Cage
SO51177               1                            Touring-2000
SO51177               2                            Sport-100
SO51178               1                            Mountain-200
SO51178               2                            Mountain Bottle Cage
SO51178               3                            Water Bottle
**************************************************************/
SELECT *
FROM [dbo].[vAssocSeqLineItems]

2.   The model structure of the sequence clustering model

Our goal is to use the Line Number attribute in the dbo.vAssocSeqLineItems view to define the sequence of purchases of product models inside a single customer order. With multiple orders, we can try to find rules in sequences.
In order to do so, we need to use the vAssocSeqOrders view as the case table, and the vAssocSeqLineItems as the nested table. We set up the mining structure as below. In this way, we use Line Number as the Key Sequence column, indicating that it has a sequence in it. Because the sequence predicts a model, we use this column as input as well. The Model column is predictable, but because we want to find not just sequences, but sequences of models, we have to use it as input as well.
Essentially, the sequence cluster model asks two questions: 1. what are the clusters? In other words, when customers purchase multiple products, which products are likely to be purchased together? 2.  What are the sequences when they purchase products?
Case Table Column
Nested Table Column
Usage
Order Number

Key
Customer Key


Region


Income Group


vAssocSeqLineItems



Order Number


Line Number
Key sequence, Input

Model
Input, predictable

3.   Understand the results

The results for an association rules model under mining model viewer has five tabs: Cluster Diagram, Cluster Profiles, Cluster Characteristics, Cluster Discrimination and State Transitions; and one output under the Microsoft Generic Content Tree Viewer.
Before we look at the viewers, we need to understand some basic concepts about sequence clustering first.
The most fundamental concept is transition or transition sequence. What is it? Let’s look at the data. First, let’s see how many product models there are in the view.
/******* 37 Models in the view********
All-Purpose Bike Stand
Bike Wash
Classic Vest
Cycling Cap
Fender Set - Mountain
Half-Finger Gloves
Hitch Rack - 4-Bike
HL Mountain Tire
HL Road Tire
Hydration Pack
LL Mountain Tire
LL Road Tire
Long-Sleeve Logo Jersey
ML Mountain Tire
ML Road Tire
Mountain Bottle Cage
Mountain Tire Tube
Mountain-200
Mountain-400-W
Mountain-500
Patch kit
Racing Socks
Road Bottle Cage
Road Tire Tube
Road-250
Road-350-W
Road-550-W
Road-750
Short-Sleeve Classic Jersey
Sport-100
Touring Tire
Touring Tire Tube
Touring-1000
Touring-2000
Touring-3000
Water Bottle
Women's Mountain Shorts
***********************************/
SELECT DISTINCT model FROM vAssocSeqLineItems
ORDER BY model
In sequence clustering, these 37 models are also called states or transition sequence states. Now, let us understand the concept ‘transition’ or ‘transition sequence’.  Let’s pick up an order - one with the most line items. SO58845 is one of them.
/*
OrderNumber        NumberOfModels
SO58845               8
SO70714               8
SO72656               8
SO51555               7
SO54042               7
*/
USE AdventureWorksDW2008R2
SELECT TOP 5 OrderNumber, Count(*) AS NumberOfModels
FROM vAssocSeqLineItems
GROUP BY OrderNumber
ORDER BY Count(*) DESC
Let’s see what’s on the order. From the result below, we can see that on this order, Mountain-500 is the 1st model, followed by LL Mountain Tire. The flow from Mountain-500 to LL Mountain Tire is a transition sequence or simply a transition. From the 2nd line item to the 3rd line item is another transition sequence. Thus, there are 7 transition sequences on this order.
/*
OrderNumber        LineNumber Model
SO58845               1        Mountain-500
SO58845               2        LL Mountain Tire
SO58845               3        Mountain Tire Tube
SO58845               4        Fender Set - Mountain
SO58845               5        Mountain Bottle Cage
SO58845               6        Water Bottle
SO58845               7        Sport-100
SO58845               8        Long-Sleeve Logo Jersey
*/
SELECT * FROM vAssocSeqLineItems
WHERE OrderNumber='SO58845'
But on other orders, Mountain-500 may be followed by other models. SSAS has figured out there are total 19,219 transition sequences on all of the orders.

The next important concept is cluster. But before looking at the clusters, let’s examine the population. There are total 52,761 items in the vAssocSeqLineItems view. The frequency and ratio of each product model are listed below:
/*
Meaning(Model)
Frequencies
Distribution(Probability)
Sport-100
6171
0.117
Water Bottle
4076
0.077
Patch kit
3010
0.057
Mountain Tire Tube
2908
0.055
Mountain-200
2477
0.047
Road Tire Tube
2216
0.042
Cycling Cap
2095
0.04
Fender Set - Mountain
2014
0.038
Mountain Bottle Cage
1941
0.037
Road Bottle Cage
1702
0.032
Long-Sleeve Logo Jersey
1642
0.031
Short-Sleeve Classic Jersey
1537
0.029
Road-750
1443
0.027
Half-Finger Gloves
1363
0.026
Touring Tire Tube
1397
0.026
HL Mountain Tire
1331
0.025
Touring-1000
1255
0.024
ML Mountain Tire
1083
0.021
Road-550-W
1033
0.02
LL Road Tire
980
0.019
Road-350-W
929
0.018
Women's Mountain Shorts
958
0.018
Touring Tire
881
0.017
ML Road Tire
890
0.017
Bike Wash
864
0.016
HL Road Tire
804
0.015
LL Mountain Tire
799
0.015
Hydration Pack
706
0.013
Road-250
706
0.013
Racing Socks
543
0.01
Classic Vest
527
0.01
Touring-3000
540
0.01
Mountain-400-W
543
0.01
Mountain-500
479
0.009
Touring-2000
372
0.007
Hitch Rack - 4-Bike
308
0.006
All-Purpose Bike Stand
238
0.005
*/
DECLARE @TotalItem decimal(8,2)
SELECT  @TotalItem = COUNT (*) FROM vAssocSeqLineItems
SELECT Model AS [Meaning(Model)], COUNT(*) AS Frequencies, CAST (COUNT(*)/@TotalItem AS DECIMAL (8,3)) AS [Distribution(Probability)] FROM vAssocSeqLineItems
GROUP BY model
ORDER BY [Distribution(Probability)]  DESC

  This is exactly the same as the population profile in the chart below. Please note that the ratio of each product in the legend is not based on transition sequences, rather than based on the frequencies on the orders. For example, Sport-100 has been bought 6171 times, and the total line items in the view is 52,761, so the distribution of the model ‘Sport-100’ is 6171/52761=11.7%. It’s nothing to do with the total 19,129 transition sequences.


SSAS has identified 15 clusters from the dataset as below. 

Each cluster contains two or more models. Let’s see some examples. For instance, Cluster 12 has only two models: Water Bottle and Cycling Cap.  

89.6% of the cluster are Water Bottle, whereas the remaining 10.4% are cycling cap. This is much clear from the detailed information from the Generic Content Tree Viewer on Cluster 12 as shown in table below. Cluster 12 contains a frequency of about 32 on Cycling Cap (about 10.4%), and the remaining 89.6% are 274 Water bottle. Please note this number is not related to Node_Support 294 in the cluster.
ATTRIBUTE_NAME
ATTRIBUTE_VALUE
SUPPORT
PROBABILITY
Model
Missing
0
1.00E-14
Model
All-Purpose Bike Stand
0
1.00E-14
Model
Bike Wash
0
1.00E-14
Model
Classic Vest
0
1.00E-14
Model
Cycling Cap
31.74247782
0.103975535
Model
Fender Set - Mountain
0
1.00E-14
Model
Half-Finger Gloves
0
1.00E-14
Model
Hitch Rack - 4-Bike
0
1.00E-14
Model
HL Mountain Tire
0
1.00E-14
Model
HL Road Tire
0
1.00E-14
Model
Hydration Pack
0
1.00E-14
Model
LL Mountain Tire
0
1.00E-14
Model
LL Road Tire
0
1.00E-14
Model
Long-Sleeve Logo Jersey
0
1.00E-14
Model
ML Mountain Tire
0
1.00E-14
Model
ML Road Tire
0
1.00E-14
Model
Mountain Bottle Cage
0
1.00E-14
Model
Mountain Tire Tube
0
1.00E-14
Model
Mountain-200
0
1.00E-14
Model
Mountain-400-W
0
1.00E-14
Model
Mountain-500
0
1.00E-14
Model
Patch kit
0
1.00E-14
Model
Racing Socks
0
1.00E-14
Model
Road Bottle Cage
0
1.00E-14
Model
Road Tire Tube
0
1.00E-14
Model
Road-250
0
1.00E-14
Model
Road-350-W
0
1.00E-14
Model
Road-550-W
0
1.00E-14
Model
Road-750
0
1.00E-14
Model
Short-Sleeve Classic Jersey
0
1.00E-14
Model
Sport-100
0
1.00E-14
Model
Touring Tire
0
1.00E-14
Model
Touring Tire Tube
0
1.00E-14
Model
Touring-1000
0
1.00E-14
Model
Touring-2000
0
1.00E-14
Model
Touring-3000
0
1.00E-14
Model
Water Bottle
273.5454706
0.896024465
Model
Women's Mountain Shorts
0
1.00E-14
Cluster 2 is much more complicated with many models in it. The composition of Cluster 2 is shown below. Cluster 13 also have a simple composition: 48.6% Mountain Tire Tube and 51.4% ML Mountain Tire. For some reason I do not know yet, SSAS has found Cluster 2 and Cluster 13 has the strongest link among the 15 clusters as demonstrated later.



With this background, let’s examine the diagrams.
a.     Cluster Diagram 
This viewer displays all the clusters that are found in a mining model as shown above. By default, the shade represents the population. The darker the shading, the greater the transition sequence is for a cluster. Conversely, the distribution decreases as the shading gets lighter. In our case, there are 15 clusters found in the mining models. Cluster 1 has the largest distribution (2678/19219=14%), where 2678 is the total number of transition sequences for Cluster 1 and 19,219 is the total number of transitions sequences in the population.
The shading of the line that connects one cluster to another represents the strength of the similarity of the clusters. As the line becomes darker, the similarity of the links becomes stronger. We can adjust how many lines the viewer shows by adjusting the slider to the right of the clusters. Lowering the slider shows only the strongest links. In our case, Cluster 2 and Cluster 13 appear to have the strongest link in the population. This strongest link population shows on all of the cluster diagrams at the state level, as stated below.
We can also select Model for ShadingVariable and any particular model for State to see the cluster diagram for a particular state. For instance, we select Model for ShadingVariable and Sport-100 for State, we obtained the following cluster diagram. It indicates the distribution of Sport-100 in each of the 15 clusters. Among them, cluster 8 has the largest ratio of 21%.
Similarly, if we choose the product model ‘Women’s Mountain Shoes’, we will see the cluster diagram as below, showing the distribution of ‘Women’s Mountain Shoes’ in all of the 15 clusters. Cluster 14 has the largest ratio of ‘Women’s Mountain Shoes’, 66%. The ratios of ‘Women’s Mountain Shoes’ in other clusters are only Cluster 1 (1%), Cluster 2 (2%), Cluster 4 (2%), Cluster 7(1%), Cluster 8 (1%) and Cluster 9 (1%). Please note the total of those percentages is not 100%, actually it is 66%+1%+2%+2%+1%+1%+1% = 73%. What’s to be 100% is the composition of states in each cluster. For instance, for Cluster 14, we see the following state composition (from the cluster profile):
Racing Socks (6.4%) + Cycling Cap (10.9%) + Long Sleeve Logo Jersey (17%) + Women’s Mountain Shorts (65.7%) = 100%
b.    Cluster Profiles
The Cluster Profile tab provides an overall view of the clusters that the algorithm in your model creates. Each column that follows the Population column in the grid represents a cluster that the model discovered. The <attribute> row describes all the items that the cluster contains and their overall distribution, and the <attribute>.samples row represents different sequences of data that exist in the cluster. For instance, the 1st chart below for the population displays the all the items it contains (e.g., Sport-100 =11.7%, Water Bottle=7.7%, Patch Kit = 5.7% etc.). The 2nd chart displays all different transition sequences in the population such as Mountain Tire Tube followed by LL Mountain Tire followed by Cycling Cap and followed by Sport-100; HL Road Tire followed by Road Tire Tube and followed by Sport-100; etc.



 The Histogram bars option controls the number of bars that are visible in the histogram. If more bars exist than you choose to display, the bars of highest importance are retained, and the remaining bars are grouped together into a gray bucket.
You can change the default names of the clusters, to make the names more descriptive. Rename a cluster by right-clicking its column heading and selecting Rename cluster. You can hide clusters by selecting Hide column, and you can also drag columns to reorder them in the viewer.
c.     Cluster Characteristics
To use the Cluster Characteristics tab, select a cluster from the Cluster list. After you select a cluster, you can examine the characteristics that make up that specific cluster. The attributes that the cluster contains are listed in the Variables columns, and the state of the listed attribute is listed in the Values column. Attribute states are listed in order of importance, described by the probability that they will appear in the cluster. The probability is shown in the Probability column.
Let’s pick up a simple one: Cluster 12.
We also can get the same picture from the DMX query:
CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','12',0.0005);
The result is:
Attributes
Values
Frequency
Support
Model
Water Bottle
89.6%
274
Model.Transitions
89.6%
270
Model.Transitions
[Start] -> Water Bottle
88.4%
270
Model.Transitions
[Start] -> Cycling Cap
11.6%
35
Model
Cycling Cap
10.4%
32
Model.Transitions
Cycling Cap,Water Bottle
10.4%
35
In this cluster, there are only two product models: Cycling Cap is 10.4% and Water Bottle is 89.6%. For the transition sequences, 11.6% of the sequences originate from Cycling Cap and the remaining 88.4% originate from Water Bottle.
Now the hard part – the ‘missing’ transitions. It is paired with the multiple item sets. In this case, it is paired with the model transition – Cycling Cap, Water Bottle. It’s nothing to do with the modes in the cluster or the model transition start with [Start]. The model transition ‘Cycling Cap, Water Bottle’ means customers buy these two models together and in that exact sequence. The figure above indicates such a collective purchase is 10.4%. The ratio of ‘Missing’ is defined as: (Support of Missing) / (Support of Missing + Support of First State). The support of missing is defined as the model is the first state and 'Missing' as the second state. The Support of First State is the frequency of the model as the 1st state. In our case, we have starting models: cycling cap and water bottle (as we have two starting models). Let's retrieve the missing supports - that is, cycling cap or water bottle is the first state and 'Missing' is the second state using the following DMX:
SELECT FLATTENED
(SELECT ATTRIBUTE_VALUE AS Product2,
[Support] AS [P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '819192' or NODE_UNIQUE_NAME = '819224';
It returns:
t.Product2
t.P2 Support
t.P2 Probability
Missing
0
0.0%
Missing
269.8620432
100.0%

Thus the ratio of missing for Cycling Cap is 0. For another starting point, Water Bottle, the support of missing is about 270, the support of first state is 32 (the frequency for cycling cap). Thus,
(Support of Missing) / (Support of Missing + Support of First State) =
269.862/ (269.862+32) =89.4%
Please note the numbers appear to be slightly off due to the way the Expectation Maximization algorithm works.
Let’s look at another example, Cluster 14. The following table can be obtained by running a DMX query:
CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','14',0.0005);

Attributes
Values
Frequency
Support
Model
Women's Mountain Shorts
65.7%
294
Model
Long-Sleeve Logo Jersey
17.0%
76
Model
Cycling Cap
10.9%
49
Model
Racing Socks
6.4%
28
Model.Transitions
[Start] -> Women's Mountain Shorts
100.0%
447
Model.Transitions
34.3%
233
Model.Transitions
Women's Mountain Shorts,Long-Sleeve Logo Jersey
16.7%
114
Model.Transitions
Women's Mountain Shorts,Cycling Cap
8.8%
60
Model.Transitions
Women's Mountain Shorts,Racing Socks
6.0%
41

First, four models in the cluster adds up to 100%. Second, all sequences start with Women’s Mountain Shorts. Third, the likelihood of customers buying Women's Mountain Shorts followed by Long-Sleeve Logo Jersey is 16.7%. Similarly, the probabilities of the purchase sequences of Women's Mountain Shorts followed by Cycling Cap and Women's Mountain Shorts followed by Racing Socks are 8.8%, and 6.0%, respectively. Not surprisingly, the ratio for ‘Missing’ is not 61.5% (100%-16.7%-8.8%-6.0%), rather than another number: 34.3%, because customers may buy products in different ways such as buying Women's Mountain Shorts alone or three products together. Let’s figure out the ratio for missing:
SELECT FLATTENED
(SELECT ATTRIBUTE_VALUE AS Product2,
[Support] AS [P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS t
FROM [TK448 Ch09 Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '950295'; -- Women’s Mountain Shorts

t.Product2    t.P2 Support           t.P2 Probability
Missing                  232.729883335169         0.521212121212121
(Support of Missing) / (Support of Missing + Support of First State) = 232.73/ (232.73+447) =34.2% 
d.    Cluster Discrimination – easy one!
You can use the Cluster Discrimination tab to compare attributes between two clusters, to determine how items in a sequence favor one cluster over another. Use the Cluster 1 and Cluster 2 lists to select the clusters to compare. The viewer determines the most important differences between the clusters, and displays the attribute states that are associated with the differences, in order of importance. A bar to the right of the attribute shows which cluster the state favors, and the size of the bar shows how strongly the state favors the cluster.
e.     Cluster Transitions
By selecting a cluster on the Cluster Transitions tab, you can browse the transitions between sequence states in the selected cluster. Each node in the viewer represents a state of the sequence column. An arrow represents a transition between two states and the probability that is associated with the transition. If a transition returns back to the originating node, an arrow can point back to the originating node (bidirectional).
An arrow that originates from a dot represents the probability that the node is the start of a sequence. An ending edge that leads to a null represents the probability that the node is the end of the sequence. You can filter the edge of the nodes by using the slider at the left of the tab.
          For instance, for Cluster 12, 12% (11.6% actually) of the sequences originate from Cycling Cap and 88% of the sequences originate from Water Bottle. Of those originating from Cycling Cap are followed by Water Bottle at 100%. Water bottle is the only end of the sequences, 88% by itself, and another 12% from cycling cap.