1. Preparation of Data

For this demo, I also use the project sample (pp. 399-403) in the TK 448 book. The project needs to use the AdventureWorksDW2008 or AdventureWorksDW2008R2 database. Particularly, it needs two views in the database: vAssocSeqOrders and vAssocSeqLineItems.

The vAssocSeqOrders has 4 columns and 21,255 rows, OrderNumber is the Key.

/***************21255 Rows ********************************

OrderNumber CustomerKey Region IncomeGroup

SO51176 18239 Pacific High

SO51177 27873 Pacific Low

SO51178 11245 Europe High

***********************************************************/

SELECT * FROM [dbo].[vAssocSeqOrders]

The vAssocSeqLineItems has 3 columns and 52,761 rows, OrderNumber and Line are the Key.

/************** 52761 Rows **********************************

OrderNumber LineNumber Model

SO51176 1 Road-250

SO51176 2 Road Bottle Cage

SO51177 1 Touring-2000

SO51177 2 Sport-100

SO51178 1 Mountain-200

SO51178 2 Mountain Bottle Cage

SO51178 3 Water Bottle

**************************************************************/

SELECT *

FROM [dbo].[vAssocSeqLineItems]

2. The model structure of the sequence clustering model

Our goal is to use the Line Number attribute in the dbo.vAssocSeqLineItems view to define the sequence of purchases of product models inside a single customer order. With multiple orders, we can try to find rules in sequences.

In order to do so, we need to use the vAssocSeqOrders view as the case table, and the vAssocSeqLineItems as the nested table. We set up the mining structure as below. In this way, we use Line Number as the Key Sequence column, indicating that it has a sequence in it. Because the sequence predicts a model, we use this column as input as well. The Model column is predictable, but because we want to find not just sequences, but sequences of models, we have to use it as input as well.

Essentially, the sequence cluster model asks two questions: 1. what are the clusters? In other words, when customers purchase multiple products, which products are likely to be purchased together? 2. What are the sequences when they purchase products?

Case Table Column	Nested Table Column	Usage
Order Number		Key
~~Customer Key~~
~~Region~~
~~Income Group~~
vAssocSeqLineItems
	~~Order Number~~
	Line Number	Key sequence, Input
	Model	Input, predictable

3. Understand the results

The results for an association rules model under mining model viewer has five tabs: Cluster Diagram, Cluster Profiles, Cluster Characteristics, Cluster Discrimination and State Transitions; and one output under the Microsoft Generic Content Tree Viewer.

Before we look at the viewers, we need to understand some basic concepts about sequence clustering first.

The most fundamental concept is transition or transition sequence. What is it? Let’s look at the data. First, let’s see how many product models there are in the view.

/******* 37 Models in the view********

All-Purpose Bike Stand

Bike Wash

Classic Vest

Cycling Cap

Fender Set - Mountain

Half-Finger Gloves

Hitch Rack - 4-Bike

HL Mountain Tire

HL Road Tire

Hydration Pack

LL Mountain Tire

LL Road Tire

Long-Sleeve Logo Jersey

ML Mountain Tire

ML Road Tire

Mountain Bottle Cage

Mountain Tire Tube

Mountain-200

Mountain-400-W

Mountain-500

Patch kit

Racing Socks

Road Bottle Cage

Road Tire Tube

Road-250

Road-350-W

Road-550-W

Road-750

Short-Sleeve Classic Jersey

Sport-100

Touring Tire

Touring Tire Tube

Touring-1000

Touring-2000

Touring-3000

Water Bottle

Women's Mountain Shorts

***********************************/

SELECT DISTINCT model FROM vAssocSeqLineItems

ORDER BY model

In sequence clustering, these 37 models are also called states or transition sequence states. Now, let us understand the concept ‘transition’ or ‘transition sequence’. Let’s pick up an order - one with the most line items. SO58845 is one of them.

OrderNumber NumberOfModels

SO58845 8

SO70714 8

SO72656 8

SO51555 7

SO54042 7

USE AdventureWorksDW2008R2

SELECT TOP 5 OrderNumber, Count(*) AS NumberOfModels

FROM vAssocSeqLineItems

GROUP BY OrderNumber

ORDER BY Count(*) DESC

Let’s see what’s on the order. From the result below, we can see that on this order, Mountain-500 is the 1^st model, followed by LL Mountain Tire. The flow from Mountain-500 to LL Mountain Tire is a transition sequence or simply a transition. From the 2^nd line item to the 3^rd line item is another transition sequence. Thus, there are 7 transition sequences on this order.

OrderNumber LineNumber Model

SO58845 1 Mountain-500

SO58845 2 LL Mountain Tire

SO58845 3 Mountain Tire Tube

SO58845 4 Fender Set - Mountain

SO58845 5 Mountain Bottle Cage

SO58845 6 Water Bottle

SO58845 7 Sport-100

SO58845 8 Long-Sleeve Logo Jersey

SELECT * FROM vAssocSeqLineItems

WHERE OrderNumber='SO58845'

But on other orders, Mountain-500 may be followed by other models. SSAS has figured out there are total 19,219 transition sequences on all of the orders.

The next important concept is cluster. But before looking at the clusters, let’s examine the population. There are total 52,761 items in the vAssocSeqLineItems view. The frequency and ratio of each product model are listed below:

Meaning(Model)	Frequencies	Distribution(Probability)
Sport-100	6171	0.117
Water Bottle	4076	0.077
Patch kit	3010	0.057
Mountain Tire Tube	2908	0.055
Mountain-200	2477	0.047
Road Tire Tube	2216	0.042
Cycling Cap	2095	0.04
Fender Set - Mountain	2014	0.038
Mountain Bottle Cage	1941	0.037
Road Bottle Cage	1702	0.032
Long-Sleeve Logo Jersey	1642	0.031
Short-Sleeve Classic Jersey	1537	0.029
Road-750	1443	0.027
Half-Finger Gloves	1363	0.026
Touring Tire Tube	1397	0.026
HL Mountain Tire	1331	0.025
Touring-1000	1255	0.024
ML Mountain Tire	1083	0.021
Road-550-W	1033	0.02
LL Road Tire	980	0.019
Road-350-W	929	0.018
Women's Mountain Shorts	958	0.018
Touring Tire	881	0.017
ML Road Tire	890	0.017
Bike Wash	864	0.016
HL Road Tire	804	0.015
LL Mountain Tire	799	0.015
Hydration Pack	706	0.013
Road-250	706	0.013
Racing Socks	543	0.01
Classic Vest	527	0.01
Touring-3000	540	0.01
Mountain-400-W	543	0.01
Mountain-500	479	0.009
Touring-2000	372	0.007
Hitch Rack - 4-Bike	308	0.006
All-Purpose Bike Stand	238	0.005

DECLARE @TotalItem decimal(8,2)

SELECT @TotalItem = COUNT (*) FROM vAssocSeqLineItems

SELECT Model AS [Meaning(Model)], COUNT(*) AS Frequencies, CAST (COUNT(*)/@TotalItem AS DECIMAL (8,3)) AS [Distribution(Probability)] FROM vAssocSeqLineItems

GROUP BY model

ORDER BY [Distribution(Probability)] DESC

This is exactly the same as the population profile in the chart below. Please note that the ratio of each product in the legend is not based on transition sequences, rather than based on the frequencies on the orders. For example, Sport-100 has been bought 6171 times, and the total line items in the view is 52,761, so the distribution of the model ‘Sport-100’ is 6171/52761=11.7%. It’s nothing to do with the total 19,129 transition sequences.

SSAS has identified 15 clusters from the dataset as below.

Each cluster contains two or more models. Let’s see some examples. For instance, Cluster 12 has only two models: Water Bottle and Cycling Cap.

89.6% of the cluster are Water Bottle, whereas the remaining 10.4% are cycling cap. This is much clear from the detailed information from the Generic Content Tree Viewer on Cluster 12 as shown in table below. Cluster 12 contains a frequency of about 32 on Cycling Cap (about 10.4%), and the remaining 89.6% are 274 Water bottle. Please note this number is not related to Node_Support 294 in the cluster.

ATTRIBUTE_NAME	ATTRIBUTE_VALUE	SUPPORT	PROBABILITY
Model	Missing	0	1.00E-14
Model	All-Purpose Bike Stand	0	1.00E-14
Model	Bike Wash	0	1.00E-14
Model	Classic Vest	0	1.00E-14
Model	Cycling Cap	31.74247782	0.103975535
Model	Fender Set - Mountain	0	1.00E-14
Model	Half-Finger Gloves	0	1.00E-14
Model	Hitch Rack - 4-Bike	0	1.00E-14
Model	HL Mountain Tire	0	1.00E-14
Model	HL Road Tire	0	1.00E-14
Model	Hydration Pack	0	1.00E-14
Model	LL Mountain Tire	0	1.00E-14
Model	LL Road Tire	0	1.00E-14
Model	Long-Sleeve Logo Jersey	0	1.00E-14
Model	ML Mountain Tire	0	1.00E-14
Model	ML Road Tire	0	1.00E-14
Model	Mountain Bottle Cage	0	1.00E-14
Model	Mountain Tire Tube	0	1.00E-14
Model	Mountain-200	0	1.00E-14
Model	Mountain-400-W	0	1.00E-14
Model	Mountain-500	0	1.00E-14
Model	Patch kit	0	1.00E-14
Model	Racing Socks	0	1.00E-14
Model	Road Bottle Cage	0	1.00E-14
Model	Road Tire Tube	0	1.00E-14
Model	Road-250	0	1.00E-14
Model	Road-350-W	0	1.00E-14
Model	Road-550-W	0	1.00E-14
Model	Road-750	0	1.00E-14
Model	Short-Sleeve Classic Jersey	0	1.00E-14
Model	Sport-100	0	1.00E-14
Model	Touring Tire	0	1.00E-14
Model	Touring Tire Tube	0	1.00E-14
Model	Touring-1000	0	1.00E-14
Model	Touring-2000	0	1.00E-14
Model	Touring-3000	0	1.00E-14
Model	Water Bottle	273.5454706	0.896024465
Model	Women's Mountain Shorts	0	1.00E-14

Cluster 2 is much more complicated with many models in it. The composition of Cluster 2 is shown below. Cluster 13 also have a simple composition: 48.6% Mountain Tire Tube and 51.4% ML Mountain Tire. For some reason I do not know yet, SSAS has found Cluster 2 and Cluster 13 has the strongest link among the 15 clusters as demonstrated later.

With this background, let’s examine the diagrams.

a. Cluster Diagram

This viewer displays all the clusters that are found in a mining model as shown above. By default, the shade represents the population. The darker the shading, the greater the transition sequence is for a cluster. Conversely, the distribution decreases as the shading gets lighter. In our case, there are 15 clusters found in the mining models. Cluster 1 has the largest distribution (2678/19219=14%), where 2678 is the total number of transition sequences for Cluster 1 and 19,219 is the total number of transitions sequences in the population.

The shading of the line that connects one cluster to another represents the strength of the similarity of the clusters. As the line becomes darker, the similarity of the links becomes stronger. We can adjust how many lines the viewer shows by adjusting the slider to the right of the clusters. Lowering the slider shows only the strongest links. In our case, Cluster 2 and Cluster 13 appear to have the strongest link in the population. This strongest link population shows on all of the cluster diagrams at the state level, as stated below.

We can also select Model for ShadingVariable and any particular model for State to see the cluster diagram for a particular state. For instance, we select Model for ShadingVariable and Sport-100 for State, we obtained the following cluster diagram. It indicates the distribution of Sport-100 in each of the 15 clusters. Among them, cluster 8 has the largest ratio of 21%.

Similarly, if we choose the product model ‘Women’s Mountain Shoes’, we will see the cluster diagram as below, showing the distribution of ‘Women’s Mountain Shoes’ in all of the 15 clusters. Cluster 14 has the largest ratio of ‘Women’s Mountain Shoes’, 66%. The ratios of ‘Women’s Mountain Shoes’ in other clusters are only Cluster 1 (1%), Cluster 2 (2%), Cluster 4 (2%), Cluster 7(1%), Cluster 8 (1%) and Cluster 9 (1%). Please note the total of those percentages is not 100%, actually it is 66%+1%+2%+2%+1%+1%+1% = 73%. What’s to be 100% is the composition of states in each cluster. For instance, for Cluster 14, we see the following state composition (from the cluster profile):

Racing Socks (6.4%) + Cycling Cap (10.9%) + Long Sleeve Logo Jersey (17%) + Women’s Mountain Shorts (65.7%) = 100%

b. Cluster Profiles

The Cluster Profile tab provides an overall view of the clusters that the algorithm in your model creates. Each column that follows the Population column in the grid represents a cluster that the model discovered. The <attribute> row describes all the items that the cluster contains and their overall distribution, and the <attribute>.samples row represents different sequences of data that exist in the cluster. For instance, the 1^st chart below for the population displays the all the items it contains (e.g., Sport-100 =11.7%, Water Bottle=7.7%, Patch Kit = 5.7% etc.). The 2^nd chart displays all different transition sequences in the population such as Mountain Tire Tube followed by LL Mountain Tire followed by Cycling Cap and followed by Sport-100; HL Road Tire followed by Road Tire Tube and followed by Sport-100; etc.

The Histogram bars option controls the number of bars that are visible in the histogram. If more bars exist than you choose to display, the bars of highest importance are retained, and the remaining bars are grouped together into a gray bucket.

You can change the default names of the clusters, to make the names more descriptive. Rename a cluster by right-clicking its column heading and selecting Rename cluster. You can hide clusters by selecting Hide column, and you can also drag columns to reorder them in the viewer.

c. Cluster Characteristics

To use the Cluster Characteristics tab, select a cluster from the Cluster list. After you select a cluster, you can examine the characteristics that make up that specific cluster. The attributes that the cluster contains are listed in the Variables columns, and the state of the listed attribute is listed in the Values column. Attribute states are listed in order of importance, described by the probability that they will appear in the cluster. The probability is shown in the Probability column.

Let’s pick up a simple one: Cluster 12.

We also can get the same picture from the DMX query:

CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','12',0.0005);

The result is:

Attributes	Values	Frequency	Support
Model	Water Bottle	89.6%	274
Model.Transitions		89.6%	270
Model.Transitions	[Start] -> Water Bottle	88.4%	270
Model.Transitions	[Start] -> Cycling Cap	11.6%	35
Model	Cycling Cap	10.4%	32
Model.Transitions	Cycling Cap,Water Bottle	10.4%	35

In this cluster, there are only two product models: Cycling Cap is 10.4% and Water Bottle is 89.6%. For the transition sequences, 11.6% of the sequences originate from Cycling Cap and the remaining 88.4% originate from Water Bottle.

Now the hard part – the ‘missing’ transitions. It is paired with the multiple item sets. In this case, it is paired with the model transition – Cycling Cap, Water Bottle. It’s nothing to do with the modes in the cluster or the model transition start with [Start]. The model transition ‘Cycling Cap, Water Bottle’ means customers buy these two models together and in that exact sequence. The figure above indicates such a collective purchase is 10.4%. The ratio of ‘Missing’ is defined as: (Support of Missing) / (Support of Missing + Support of First State). The support of missing is defined as the model is the first state and 'Missing' as the second state. The Support of First State is the frequency of the model as the 1st state. In our case, we have starting models: cycling cap and water bottle (as we have two starting models). Let's retrieve the missing supports - that is, cycling cap or water bottle is the first state and 'Missing' is the second state using the following DMX:

SELECT FLATTENED

(SELECT ATTRIBUTE_VALUE AS Product2,

[Support] AS [P2 Support],

[Probability] AS [P2 Probability]

FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS t

FROM [TK448 Ch09 Sequence Clustering].CONTENT

WHERE NODE_UNIQUE_NAME = '819192' or NODE_UNIQUE_NAME = '819224';

It returns:

t.Product2	t.P2 Support	t.P2 Probability
Missing	0	0.0%
Missing	269.8620432	100.0%

Thus the ratio of missing for Cycling Cap is 0. For another starting point, Water Bottle, the support of missing is about 270, the support of first state is 32 (the frequency for cycling cap). Thus,

(Support of Missing) / (Support of Missing + Support of First State) =

269.862/ (269.862+32) =89.4%

Please note the numbers appear to be slightly off due to the way the Expectation Maximization algorithm works.

Let’s look at another example, Cluster 14. The following table can be obtained by running a DMX query:

CALL System.Microsoft.AnalysisServices.System.DataMining.Clustering.GetClusterCharacteristics('[TK448 Ch09 Sequence Clustering]','14',0.0005);

Attributes	Values	Frequency	Support
Model	Women's Mountain Shorts	65.7%	294
Model	Long-Sleeve Logo Jersey	17.0%	76
Model	Cycling Cap	10.9%	49
Model	Racing Socks	6.4%	28
Model.Transitions	[Start] -> Women's Mountain Shorts	100.0%	447
Model.Transitions		34.3%	233
Model.Transitions	Women's Mountain Shorts,Long-Sleeve Logo Jersey	16.7%	114
Model.Transitions	Women's Mountain Shorts,Cycling Cap	8.8%	60
Model.Transitions	Women's Mountain Shorts,Racing Socks	6.0%	41

First, four models in the cluster adds up to 100%. Second, all sequences start with Women’s Mountain Shorts. Third, the likelihood of customers buying Women's Mountain Shorts followed by Long-Sleeve Logo Jersey is 16.7%. Similarly, the probabilities of the purchase sequences of Women's Mountain Shorts followed by Cycling Cap and Women's Mountain Shorts followed by Racing Socks are 8.8%, and 6.0%, respectively. Not surprisingly, the ratio for ‘Missing’ is not 61.5% (100%-16.7%-8.8%-6.0%), rather than another number: 34.3%, because customers may buy products in different ways such as buying Women's Mountain Shorts alone or three products together. Let’s figure out the ratio for missing:

SELECT FLATTENED

(SELECT ATTRIBUTE_VALUE AS Product2,

[Support] AS [P2 Support],

[Probability] AS [P2 Probability]

FROM NODE_DISTRIBUTION WHERE ATTRIBUTE_VALUE = 'Missing') AS t

FROM [TK448 Ch09 Sequence Clustering].CONTENT

WHERE NODE_UNIQUE_NAME = '950295'; -- Women’s Mountain Shorts

t.Product2 t.P2 Support t.P2 Probability

Missing 232.729883335169 0.521212121212121

(Support of Missing) / (Support of Missing + Support of First State) = 232.73/ (232.73+447) =34.2%

d. Cluster Discrimination – easy one!

You can use the Cluster Discrimination tab to compare attributes between two clusters, to determine how items in a sequence favor one cluster over another. Use the Cluster 1 and Cluster 2 lists to select the clusters to compare. The viewer determines the most important differences between the clusters, and displays the attribute states that are associated with the differences, in order of importance. A bar to the right of the attribute shows which cluster the state favors, and the size of the bar shows how strongly the state favors the cluster.

e. Cluster Transitions

By selecting a cluster on the Cluster Transitions tab, you can browse the transitions between sequence states in the selected cluster. Each node in the viewer represents a state of the sequence column. An arrow represents a transition between two states and the probability that is associated with the transition. If a transition returns back to the originating node, an arrow can point back to the originating node (bidirectional).

An arrow that originates from a dot represents the probability that the node is the start of a sequence. An ending edge that leads to a null represents the probability that the node is the end of the sequence. You can filter the edge of the nodes by using the slider at the left of the tab.

For instance, for Cluster 12, 12% (11.6% actually) of the sequences originate from Cycling Cap and 88% of the sequences originate from Water Bottle. Of those originating from Cycling Cap are followed by Water Bottle at 100%. Water bottle is the only end of the sequences, 88% by itself, and another 12% from cycling cap.

On the Journey to be a SQL Server Professional

How to Interpret the Results from a Sequence Clustering Model?

1. Preparation of Data

2. The model structure of the sequence clustering model

3. Understand the results