How to Interpret the Results from a Clustering Model in SSAS?

I use the same project exercise from the TK 70-448 book as in the previous post. In the project, 13 variables are used to find the salient predictors for the predictable variable, BikeBuyer (yes or no).

CustomerKey is the key column.

BikeBuyer is the predictable column, and an input as well. 

The other 12 input columns are Age, CommuteDistance, EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, TotalChildren, and YearlyIncome.

Now let us look at the results from the Clustering Model.

There are four result tabs: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination.

Let’s start with the easy one.

Cluster Profiles – It shows each attribute, together with the distribution of the attribute in each cluster. 



Our training set, or population, has 12,939 customers. The 13 input columns and their respective state values, or subgroups, are listed below (ignoring the 'Other' or 'Missing' subgroup, as we have no missing data):

Yearly Income – 5 groups
Region – 3
English Occupation – 5
Commute Distance – 5
Number Cars Owned – 5
Total Children – 6
English Education – 5
Bike Buyer – 2
Number Children At Home – 6
Marital Status – 2
House Owner Flag – 2
Age – 5
Gender – 2

Thus, there are 53 subgroups, or attribute states, in total across these 13 input columns.

This viewer simply shows the frequency distribution of the 53 subgroups of the 13 input variables in each cluster.

SSAS has divided the 12,939 customers in the entire population, or training set, into 10 clusters based on similarity of characteristics. These 10 clusters are mutually exclusive; in other words, one customer can belong to only one cluster.
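
Since cluster membership is exclusive, we can ask the model directly which cluster a given case falls into by running a singleton DMX prediction query with the Cluster() and ClusterProbability() functions. Below is a minimal sketch: the model name 'TK448 Ch09 Clustering' and the input column names are my assumptions based on the project, so adjust them to your own mining structure. Any input not supplied in the singleton row is treated as missing.

-- A minimal sketch; model and column names are assumptions.
SELECT
  Cluster() AS [Assigned Cluster],        -- the one cluster this case belongs to
  ClusterProbability() AS [Probability]   -- how strongly the case fits that cluster
FROM [TK448 Ch09 Clustering]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age],
        '0-1 Miles' AS [Commute Distance],
        'Europe' AS [Region]) AS t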

Cluster 1 has 1,842 customers. The main characteristic of this cluster is Yearly Income < 39221.4065840128, as all customers in this cluster meet this criterion. Please note: not all customers with yearly income < 39221.4065840128 are in Cluster 1; other clusters can also contain such customers. The other characteristics in this cluster do not have this all-or-none feature. For instance, the 5 subgroups on Age have the following distribution: 33% of the customers in Cluster 1 are aged 42 to 52, 27.7% are younger than 42, 26.5% are between 52 and 62, and so on.

Age        Percentage
42 - 52    0.330
< 42       0.277
52 - 62    0.265
62 - 71    0.105
>= 71      0.023

Similarly, on Bike Buyer, 41.4% of these 1,842 customers did not buy a bike, whereas 58.6% did.

Bike Buyer      Percentage
0 = non-buyer   0.414
1 = buyer       0.586

The distributions on the other 10 input columns in Cluster 1 can be read from the legend in a similar way and are omitted here for brevity.
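
If you want the full legend data rather than reading it off the viewer, a DMX content query can flatten the NODE_DISTRIBUTION nested table for a cluster node. This follows the standard clustering content-query pattern; the model name 'TK448 Ch09 Clustering' is again my assumption.

-- Lists every attribute state in Cluster 1 with its support and probability.
SELECT FLATTENED NODE_CAPTION,
    (SELECT ATTRIBUTE_NAME, ATTRIBUTE_VALUE, [SUPPORT], [PROBABILITY]
     FROM NODE_DISTRIBUTION) AS t
FROM [TK448 Ch09 Clustering].CONTENT
WHERE NODE_CAPTION = 'Cluster 1'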

The bottom of the legend for the cluster also displays some attributes. But this is just a brief description, not an exhaustive list. In other words, not every subgroup is listed there; some subgroups are omitted from the description area. In our case, only 21 of the 53 subgroups are listed for Cluster 1 (see the table below); the other 32 are omitted. Why? There are several possible reasons. Reason 1: the subgroup has no value, or a value of 0, for the cluster. For instance, since all customers in this cluster have yearly income < 39221.40, there is no need to list the other yearly-income subgroups. Reason 2: the omitted subgroup is the complement of a listed one. For example, the subgroup 'Bike Buyer = 1' has a value of 0.586, so the attribute 'Bike Buyer = 0' must have a value of 0.414. Reason 3: the omitted subgroup may characterize other clusters more heavily. For instance, on Age, although the subgroup 42 - 52 has a value of 0.33, higher than the two listed ones (0.277 for age < 42 and 0.265 for age 52 - 62), it is NOT listed under Cluster 1 because it is a main characteristic of Cluster 3 and Cluster 7.

Attribute                              Probability
Age < 42                               0.277
Age = 52 - 62                          0.265
Bike Buyer = 1                         0.586
Commute Distance = 0-1 Miles           0.642
Commute Distance = 2-5 Miles           0.215
English Education = High School        0.214
English Education = Partial College    0.397
English Occupation = Clerical          0.448
English Occupation = Manual            0.506
Gender = M                             0.521
House Owner Flag = 0                   0.381
Marital Status = S                     0.528
Number Cars Owned = 0                  0.468
Number Cars Owned = 1                  0.393
Number Children At Home = 0            0.688
Number Children At Home = 1            0.183
Number Children At Home = 2            0.107
Region = Europe                        0.822
Total Children = 1                     0.398
Total Children = 2                     0.220
Yearly Income < 39221.4065840128       1.000

In describing the clusters, we should focus on the main characteristics with high percentages. For example, Cluster 1 is primarily characterized by Yearly Income < 39221.4065840128, Region = Europe, and Number Children At Home = 0.

The other nine clusters can be described similarly; their sizes are listed below, followed by a content query that verifies the counts.

Cluster 2 (n = 2001)
Cluster 3 (n = 1690)
Cluster 4 (n = 1722)
Cluster 5 (n = 1169)
Cluster 6 (n = 958)
Cluster 7 (n = 981)
Cluster 8 (n = 874)
Cluster 9 (n = 1000)
Cluster 10 (n = 702)
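
Here is that content query: in a clustering model, NODE_TYPE = 5 identifies the cluster nodes, and NODE_SUPPORT holds the number of training cases in each. (A sketch; the model name is my assumption.)

-- Returns the 10 clusters and their sizes, e.g. Cluster 1 / 1842.
SELECT NODE_CAPTION, NODE_SUPPORT
FROM [TK448 Ch09 Clustering].CONTENT
WHERE NODE_TYPE = 5   -- 5 = cluster node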

Cluster Characteristics – This viewer displays the characteristics that make up the selected cluster. We can further arrange the characteristics in descending order for easy comprehension. For instance, the top three characteristics of Cluster 1 are: Yearly Income < 39221.4065840128 (100%), Region = Europe (82.2%), and Number Children At Home = 0 (68.794%).
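
Under the hood, the viewer fetches this ordered list from a system stored procedure, much like the GetAttributeDiscrimination call dissected in the Naive Bayes post below. Treat the following as a sketch only: the procedure name System.GetClusterCharacteristics, the parameter pattern (model name, the cluster's Node Unique Name, a threshold), and the node ID '001' for Cluster 1 are all assumptions based on what a Profiler trace of the viewer typically shows; verify them on your own server.

-- Assumed signature and node ID; check the Generic Content Tree View
-- for the actual Node Unique Names of the clusters.
CALL System.GetClusterCharacteristics
   ('TK448 Ch09 Clustering',
    '001',      -- assumed Node Unique Name of Cluster 1
    0.0005)     -- threshold to filter out weak characteristics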



Cluster Discrimination – SSAS calculates a score for each attribute on the two competing clusters to determine which cluster wins the attribute. It then arranges the attributes in descending order of the standardized scores. For instance, when comparing Cluster 1 with the complement of Cluster 1, that is, comparing the 1,842 customers in Cluster 1 with the remaining 11,097 (12939 - 1842) customers in the population, the attribute 'yearly income < 39221.4065840128' favors Cluster 1 with a standardized score of 100. The attribute 'yearly income 39221.4065840128 - 71010.0501921792' has the next highest standardized score, 47.387, and favors the complement of Cluster 1, and so on. The method of calculating the standardized score and setting a threshold to rule out insignificant attributes is the same as that explained in an earlier post. By the way, we can rename Cluster 1 as Income<39221 (living in Europe, if necessary to differentiate the clusters).
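
The discrimination scores, too, come from a companion stored procedure. Again, this is a sketch only: the name System.GetClusterDiscrimination and the parameter pattern (model, first cluster node, second cluster node or an empty string for the complement, threshold, normalization flag) are assumptions taken from how the viewer appears to call the server; confirm with a Profiler trace before relying on it.

-- Assumed signature; the empty second node ID is assumed to mean
-- "complement of the first cluster".
CALL System.GetClusterDiscrimination
   ('TK448 Ch09 Clustering',
    '001',      -- assumed Node Unique Name of Cluster 1
    '',         -- assumed: empty = complement of Cluster 1
    0.0005,     -- threshold
    true)       -- normalize scores to a maximum absolute value of 100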


Cluster Diagram – It displays all the clusters that are in a mining model and how close these clusters are to one another.

In our case, the clustering model has found 10 clusters. By default, the shading represents the population of the cluster. But we can select any shading variable from the drop-down list box (containing Population and the 13 input variables) and a corresponding state value for the selected variable. The 'density' bar will then show the value range, from minimum to maximum, for the selected shading variable and state. The darker a cluster, the higher the percentage of its customers matching the shading variable/state combination. For instance, in the default chart below, Population is the shading variable. The State box is grayed out, meaning the state is not bound to a particular value; in other words, all attributes are considered. The density ranges from 'None' to 15% in Cluster 2 (2001/12939 = 15.46%).



The shading of the line that connects one cluster to another represents the strength of the similarity between the clusters. If the shading is light or nonexistent, the clusters are not very similar; as the line becomes darker, the similarity becomes stronger. Additionally, we can adjust how many lines the viewer shows with the slider to the left of the clusters. Lowering the slider shows only the strongest links.

On the default chart above, there is a link between Cluster 4 and Cluster 7. It means that, with all attributes considered, these two clusters are the most similar in the population. It is interesting to note that this link exists in every subsequent cluster diagram as well. If we reveal more links by moving the slider on the left, we find that the next strongest similarity is between Cluster 2 and Cluster 7, and the link between Cluster 6 and Cluster 10 ranks third.


Let's look at one more: shading variable Bike Buyer with state = 1. The density ranges from 'None' to 72% in Cluster 7. The three strongest links are again between C4 and C7, between C2 and C7, and between C6 and C10, exactly the same conclusion as for the population. This is not by accident: the default diagram for the population actually measures the attribute similarities on the predictable attribute, which is Bike Buyer = 1, so the two diagrams should reveal the same findings.




How to Interpret the Results from a Naive Bayes Model in SSAS?

I use the same project exercise from the TK 70-448 book as in the previous posts. In the project, 13 variables are used to find the salient predictors for the predictable variable, BikeBuyer (yes or no).
CustomerKey is the key column.

BikeBuyer is the predictable column, and an input as well. 

The other 12 input columns are Age, CommuteDistance, EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, TotalChildren, and YearlyIncome.

Now let us look at the results for the Naive Bayes Model.

There are four result tabs: Dependency Network, Attribute Profiles, Attribute Characteristics, and Attribute Discrimination.

Dependency Network – It serves the same purpose as in the decision tree model. That is, it shows us which input variables are significant predictors of bike purchase, and which one is the most critical, which is second in importance, third in importance, and so on. But the results are usually not the same as those from the decision tree model. Here it shows 8 inputs, not 11 as in the decision tree model. The 8 salient input variables, in order of importance, are: number of cars owned, commute distance, English education, total children, age, number of children at home, region, and marital status.
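
As a quick cross-check, the input attributes the Naive Bayes model actually kept can be listed with a DMX content query; in the documented Naive Bayes content layout, NODE_TYPE = 10 marks the input-attribute nodes. (The model name is the one used later in this post.)

-- Should return the 8 salient input attributes named above.
SELECT NODE_CAPTION
FROM [TK448 Ch09 Prediction Naïve Bayes].CONTENT
WHERE NODE_TYPE = 10   -- 10 = input attribute node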


Attribute Profiles – It simply shows the frequency distribution of the two values of the predictable variable for each group of the input variables. The numbers highlighted are further elaborated in the next paragraph.


Attribute Characteristics – The dependent variable, or predictable attribute, has two values: 1 for buyer, 0 for non-buyer. There are 8 significant input variables, each with two or more groups (Age: 5 groups, Commute Distance: 5, English Education: 5, Marital Status: 3, Number Cars Owned: 5, Number Children At Home: 5, Region: 4, Total Children: 5; thus 37 groups in total). The Attribute Characteristics chart simply lists the groups in descending order of their probabilities for the selected value of the predictable variable. For instance, if I select the value 1 for the predictable attribute (meaning bike buyer), the group of customers with no children at home has the highest percentage among the 37 groups, with a value of 62.979%. This is consistent with the value on the Attribute Profiles chart above. The married group has the second highest, at 51.365% (not displayed on the chart below), and so on. Reportedly, this chart shows up to the top 50 groups.
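
The viewer populates this list through a system stored procedure that parallels the GetAttributeDiscrimination call dissected below. A sketch only: I am assuming the name System.GetAttributeCharacteristics and a parameter pattern of (model, predictable Node Unique Name, state, state-type flag, threshold); verify the exact signature on your server.

-- Assumed signature; parameters mirror GetAttributeDiscrimination,
-- but with a single state instead of two competing ones.
CALL System.GetAttributeCharacteristics
   ('TK448 Ch09 Prediction Naïve Bayes',
    '100000001',   -- Node Unique Name of the predictable attribute
    '1',           -- the state to characterize: Bike Buyer = 1
    1,             -- 1 = the value above is an actual state
    0.0005)        -- threshold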


Attribute Discrimination – In short, this viewer designates which state of the predictable attribute is favored by each input group, by calculating the differences in the input attributes across the states of the output attribute. The message on the chart is easy to understand, but it is not intuitive how the values for the bar indicators are derived, such as why the second group (Number Cars Owned = 2) has a score of 79.396.


It is very tempting to combine the four counts in the legend to derive the value. Unfortunately, it doesn't work this way, at least not directly. Let's break it down the right way.

First, we need to select a value for the predictable variable. In our case, we have two values: 1 for buyers, 0 for non-buyers. Let's say we select 1 (i.e., buyer) as value 1; therefore, value 2 is the value 0, or all other states.

Next, the Naive Bayes model, under the hood, calculates a score for each of the 37 groups of the eight significant input columns on the predictable attribute. If the score is positive, the bar indicator is shown under the '1' column; if negative, the bar is under the 'All other states' column. The score is actually calculated using the following stored procedure.

GetAttributeDiscrimination(
string strModel,
string strPredictableNodeUniqueID,
string strValue1,
int iVal1Type,
string strValue2,
int iVal2Type,
double dThreshold,
bool in_bRescaled)

The parameters are:

strModel – The name of the model.

strPredictableNodeUniqueID – The Node Unique Name of the predictable attribute. You can find it in the Microsoft Generic Content Tree View, or get the list of predictable attributes and their Node Unique Names by calling another stored procedure, System.GetPredictableAttributes, which returns two columns: one for the attribute name and one for the Node Unique Name.
    CALL System.GetPredictableAttributes('TK448 Ch09 Prediction Naïve Bayes')

    /******* Return ****************

    ATTRIBUTE_NAME  NODE_UNIQUE_NAME
    Bike Buyer      100000001

    *********************************/

strValue1 – The name of the value you want to compare on the left hand side.  The usage of this parameter depends on the value of the next parameter.

iVal1Type – This parameter indicates how to treat strValue1.  It can have the values 0, 1, or 2.  If this parameter is 1, the value in strValue1 is the actual state of the attribute.  However, if this parameter is 0 or 2, the value in strValue1 is ignored.  If the value is 0, the left-hand value is considered to be the "missing state".  If the value is 2, the left-hand value is considered to be "all other states."  In the example below, "All other states" is specified only because it looks nice (and it's easier to just drop the combo box value into the function call even if it will be ignored).

strValue2 – Like strValue1, but for the right hand side.

iVal2Type – Like iVal1Type, but for the right hand side.

dThreshold – A threshold value used to filter results, so that small correlations don't come back in the results.  Usually you set it to a very small number, like the 0.0005 in the example below.

in_bRescaled – Whether or not the result is normalized.  If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of -100 to 100.  All this does is take the largest absolute value in the result, divide it into 100, and multiply all the other numbers by that factor.  If set to false, the numbers are whatever they are and you can interpret them yourself.  The NB viewer always sets this to true by default.

After I deploy the project, I use the following DMX:

CALL System.GetAttributeDiscrimination
  ('TK448 Ch09 Prediction Naïve Bayes', -- Model Name
   '100000001',                         -- Node_Unique_Name for the predictable attribute; found in the Generic Content Tree View
   '1',                                 -- The name of the value you want to compare on the left hand side. It depends on the next value.
    1,                                  -- If 1, the value above is the actual state of the attribute.
                                        -- If 0, the left-hand value is considered to be the “missing state”. 
                                        -- If 2, the left hand value is considered to be “all other states.” 
    'All other states',                 -- see below
    2,                                  -- If 2, the right hand value is considered to be “all other states.” 
    0.0005,                             -- A threshold value used to filter results, such that small correlations don’t come back in the results.
    true                                -- Whether or not the result is normalized.  If true, the results are normalized to a maximum absolute value of 100.
)
The results are:
Attributes                Values               Score     InState1  InState2  OutState1  OutState2
Number Cars Owned         0                    100.000   1842      1070      4530       5497
Number Cars Owned         2                    -79.396   1810      2713      4562       3854
English Education         Partial High School  -62.066   329       774       6043       5793
Age                       >= 71                -51.177   129       414       6243       6153
Total Children            1                    47.292    1512      1009      4860       5558
Commute Distance          10+ Miles            -45.468   638       1122      5734       5445
Commute Distance          0-1 Miles            45.355    2491      1921      3881       4646
Region                    Pacific              44.493    1523      1033      4849       5534
Total Children            5                    -43.573   310       668       6062       5899
Region                    North America        -32.846   2940      3617      3432       2950
English Education         Bachelors            29.086    2110      1670      4262       4897
Total Children            4                    -27.761   618       993       5754       5574
Number Children At Home   3                    -27.503   276       544       6096       6023
Number Children At Home   4                    -26.870   254       510       6118       6057
Commute Distance          5-10 Miles           -22.959   913       1316      5459       5251
Number Children At Home   2                    20.979    700       451       5672       6116
Age                       42 - 52              20.538    2334      1958      4038       4609
Number Cars Owned         1                    18.864    1912      1565      4460       5002
Age                       62 - 71              -18.848   549       847       5823       5720
Commute Distance          2-5 Miles            18.433    1298      993       5074       5574
English Education         High School          -18.379   945       1316      5427       5251
Number Cars Owned         4                    -15.440   323       539       6049       6028
Marital Status            S                    11.446    3099      2823      3273       3744
Marital Status            M                    -11.446   3273      3744      3099       2823
Number Cars Owned         3                    -7.774    485       680       5887       5887
Number Children At Home   0                    7.254     4013      3830      2359       2737
English Education         Graduate Degree      3.239     1192      1047      5180       5520
Commute Distance          1-2 Miles            -1.417    1032      1215      5340       5352
Number Children At Home   5                    -0.628    286       369       6086       6198

That's why the first group has a score of 100 and the second group a score of 79.396 (shown as -79.396 in the table because it favors 'All other states'), and so on.

Please note that groups whose absolute scores fall below the threshold are ruled out of the chart.
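
Finally, once we know which states favor buyers, we can turn the model around and ask it for an actual prediction. Below is a minimal singleton DMX sketch; the input column names are my assumptions based on the project structure, and any input left out of the singleton row is treated as missing.

-- Predicts the most likely Bike Buyer state and the probability of state 1.
SELECT
  Predict([Bike Buyer]) AS [Predicted],                       -- most likely state: 1 or 0
  PredictProbability([Bike Buyer], 1) AS [BuyerProbability]   -- probability of Bike Buyer = 1
FROM [TK448 Ch09 Prediction Naïve Bayes]
NATURAL PREDICTION JOIN
(SELECT 0 AS [Number Cars Owned],
        '0-1 Miles' AS [Commute Distance],
        1 AS [Total Children]) AS t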