I use the same project exercises from the TK 70-448 book
as in the previous post. In the project, 13 variables
are used to find the salient predictors for the predictable variable, BikeBuyer
(yes or no).
CustomerKey is the key column.
BikeBuyer is the predictable column, and an input as
well.
The other 12 input columns are Age, CommuteDistance,
EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus,
NumberCarsOwned, NumberChildrenAtHome, Region, TotalChildren, and YearlyIncome.
Now let us look at the results from the Clustering Model.
There are four result tabs: Cluster Diagram, Cluster Profiles,
Cluster Characteristics, and Cluster Discrimination.
Let’s start with the easy one.
Cluster Profiles – This viewer shows each attribute, together with the distribution of that attribute
in each cluster.
Our training set, or population, has 12,939 customers. The 13
input columns and their respective state values, or subgroups, are listed below
(ignoring the ‘Other’ and ‘Missing’ subgroups, as we have no missing data):
Yearly income – 5 groups
Region – 3
English Occupation – 5
Commute Distance – 5
Number Cars Owned – 5
Total Children – 6
English Education – 5
Bike buyer – 2
Number of Children at Home – 6
Marital Status – 2
House Owner Flag – 2
Age – 5
Gender – 2
Thus, there are 53 subgroups, or attributes, in total across these 13
input columns.
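As a quick sanity check, the subgroup counts listed above can be tallied in a few lines of Python (the dictionary simply mirrors the list; it is not generated by SSAS):

```python
# Number of discrete states (subgroups) per input column, as listed above.
subgroups = {
    "Yearly Income": 5, "Region": 3, "English Occupation": 5,
    "Commute Distance": 5, "Number Cars Owned": 5, "Total Children": 6,
    "English Education": 5, "Bike Buyer": 2, "Number Children At Home": 6,
    "Marital Status": 2, "House Owner Flag": 2, "Age": 5, "Gender": 2,
}

# 13 columns contribute 53 subgroups in total.
print(len(subgroups), sum(subgroups.values()))  # 13 53
```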
This viewer simply shows the frequency distribution of these 53
subgroups of the 13 input variables within each cluster.
SSAS has divided the 12,939 customers in the entire population, or
the training set, into 10 clusters based on similarity of characteristics.
These 10 clusters are mutually exclusive; in other words, each customer
belongs to exactly one cluster.
Cluster 1 has 1,842 customers. The main characteristic of this
cluster is Yearly Income < 39221.4065840128, as all
customers in this cluster meet this criterion. Please note: not all customers
with yearly income < 39221.4065840128 are in Cluster 1; other clusters can
also contain customers with yearly income < 39221.4065840128. Other characteristics
in this cluster do not have such an all-or-none feature. For instance, the 5
subgroups on Age have the following distribution: 33% of the customers in
Cluster 1 are aged 42-52, 27.7% are younger than 42, 26.5% are between 52 and 62,
and so on.
Age        Percentage
42 - 52    0.33
< 42       0.277
52 - 62    0.265
62 - 71    0.105
>= 71      0.023
Similarly, for Bike Buyer, 41.4% of these
1,842 customers did not buy a bike, whereas 58.6% did.
Bike Buyer     Percentage
0 = nonbuyer   0.414
1 = buyer      0.586
The distributions of the other 11 input columns in Cluster 1 can be
read from the legend in the same way, and are omitted here for the sake of brevity.
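Conceptually, what the Cluster Profiles viewer shows is just a frequency distribution of each attribute's states within each cluster. Here is a minimal sketch of that computation on a few toy records (invented data, not the AdventureWorks set):

```python
from collections import Counter, defaultdict

# Toy (cluster label, age band) pairs standing in for the cluster
# assignments SSAS produces; the real model has 10 clusters.
records = [
    (1, "42 - 52"), (1, "< 42"), (1, "42 - 52"), (1, "52 - 62"),
    (2, "< 42"), (2, "< 42"), (2, "62 - 71"),
]

# Count states per cluster, then normalize counts into percentages,
# which is exactly what each column of the profile chart displays.
counts = defaultdict(Counter)
for cluster, age_band in records:
    counts[cluster][age_band] += 1

profiles = {
    cluster: {state: n / sum(c.values()) for state, n in c.items()}
    for cluster, c in counts.items()
}
print(profiles[1]["42 - 52"])  # 0.5 (2 of the 4 Cluster-1 records)
```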
The bottom of the legend for the cluster also displays some attributes.
But this is just a brief description, not an inclusive list; not
every subgroup is listed there. In our case, only 21 of the 53 subgroups are
listed for Cluster 1 (see table below), and the other 32 are omitted. Why?
There are several possible reasons. Reason 1: the attribute has no value, or a
value of 0, in this cluster. For instance, as all customers in this cluster
have yearly income < 39221.40, it is not necessary to list the other
yearly-income subgroups. Reason 2: the omitted subgroup is the complement of a
listed subgroup. For example, the subgroup ‘Bike Buyer = 1’ has a value of
.586, so the attribute ‘Bike Buyer = 0’ must have a value of .414. Reason 3:
the omitted subgroup may heavily characterize other clusters. For instance, on
Age, although the subgroup 42 - 52 has a value of .33, higher than the two
listed ones (.277 for age < 42 and .265 for age 52 - 62), it is NOT listed
under Cluster 1 because it is the main characteristic of Cluster 3 and Cluster 7.
Attribute                              Value
Age < 42                               0.277
Age = 52 - 62                          0.265
Bike Buyer = 1                         0.586
Commute Distance = 0-1 Miles           0.642
Commute Distance = 2-5 Miles           0.215
English Education = High School        0.214
English Education = Partial College    0.397
English Occupation = Clerical          0.448
English Occupation = Manual            0.506
Gender = M                             0.521
House Owner Flag = 0                   0.381
Marital Status = S                     0.528
Number Cars Owned = 0                  0.468
Number Cars Owned = 1                  0.393
Number Children At Home = 0            0.688
Number Children At Home = 1            0.183
Number Children At Home = 2            0.107
Region = Europe                        0.822
Total Children = 1                     0.398
Total Children = 2                     0.22
Yearly Income < 39221.4065840128       1.000
In describing the clusters, we should look at the main
characteristics with the highest percentages. For example, Cluster 1 is primarily characterized
by Yearly Income < 39221.4065840128, Region = Europe, and Number Children At Home = 0.
The other 9 clusters can be described similarly.
Cluster 2 (n = 2,001)
Cluster 3 (n = 1,690)
Cluster 4 (n = 1,722)
Cluster 5 (n = 1,169)
Cluster 6 (n = 958)
Cluster 7 (n = 981)
Cluster 8 (n = 874)
Cluster 9 (n = 1,000)
Cluster 10 (n = 702)
Cluster Characteristics – This viewer displays the characteristics that make up the selected cluster. We can further arrange
the characteristics in descending order for easy comprehension. For instance, the
top three characteristics of Cluster 1 are: Yearly Income < 39221.4065840128 (100%);
Region = Europe (82.2%); and Number Children At Home = 0 (68.794%).
Cluster Discrimination – SSAS calculates a
score for each attribute on the two competing clusters to determine which cluster
wins the attribute. It then arranges the attributes in descending order of the calculated standardized scores.
For instance, when comparing Cluster 1 with the complement of Cluster 1, that is,
comparing the 1,842 customers in Cluster 1 with the remaining 11,097 (i.e., 12,939 - 1,842)
customers in the population, the attribute ‘yearly income < 39221.4065840128’
favors Cluster 1 with a standardized score of 100. The attribute ‘yearly income 39221.4065840128
- 71010.0501921792’ has
the next highest standardized score, 47.387, and favors the complement of Cluster
1, and so on. The method of calculating the standardized score and setting up a
threshold to rule out insignificant attributes is the same as that explained
in an earlier post. By the way, we can rename Cluster 1 as Income<39221 (living in Europe, if necessary to differentiate the clusters).
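To make the idea of a discrimination score concrete, here is an illustrative sketch using a log-ratio of an attribute's frequency in the cluster versus its complement. The formula and the toy frequencies are assumptions for illustration only; they are not necessarily the exact score SSAS computes:

```python
import math

# Illustrative discrimination score: log-ratio of an attribute's
# frequency in Cluster 1 vs. its complement. A positive score favors
# the cluster; a negative score favors the complement. This is a
# hypothetical stand-in, not the documented SSAS formula.
def discrimination(p_cluster, p_complement, eps=1e-6):
    return math.log((p_cluster + eps) / (p_complement + eps))

# Toy (cluster, complement) frequencies for two attributes.
attrs = {
    "Yearly Income < 39221": (1.000, 0.18),
    "Yearly Income 39221 - 71010": (0.000, 0.55),
}

scores = {name: discrimination(p, q) for name, (p, q) in attrs.items()}
# Rank attributes by the magnitude of their score, strongest first,
# mirroring the viewer's descending arrangement.
ranked = sorted(scores, key=lambda a: abs(scores[a]), reverse=True)
```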
Cluster Diagram – This viewer displays all the clusters in a mining model and how close these
clusters are to one another.
In our case, the clustering model has found 10 clusters. By default, the shade represents the population of the cluster, but we can select any shading variable from
the drop-down list box (containing Population and the 13 input variables), together with a corresponding state value for the selected variable. The ‘density’ bar will then show
the value range, from minimum to maximum, for the selected shading
variable/state. The darker a cluster, the higher the percentage of its customers
matching the shading-variable/state combination. For instance, in the chart
below for the default, Population is the shading variable. The State box is grayed out, meaning that the state is not bound to a particular value; in other
words, all attributes are considered. The density ranges from 'None' to 15% in
Cluster 2 (2001/12939 = 15.46%).
The shading of the line
that connects one cluster to another represents the strength of the similarity
between the clusters. If the shading is light or nonexistent, the clusters are not
very similar; as the line becomes darker, the similarity becomes
stronger. Additionally, we can adjust how many lines the viewer shows with
the slider to the left of the clusters. Lowering the slider shows
only the strongest links.
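One way to picture what the link shading encodes is a similarity measure between two clusters' attribute distributions. The sketch below uses 1 minus the total variation distance as an assumed proxy; SSAS's internal similarity measure may well differ:

```python
# Illustrative cluster similarity for the diagram's link shading:
# 1.0 means identical attribute distributions, 0.0 means disjoint.
# This is an assumed proxy, not the documented SSAS measure.
def similarity(profile_a, profile_b):
    states = set(profile_a) | set(profile_b)
    tv = 0.5 * sum(abs(profile_a.get(s, 0.0) - profile_b.get(s, 0.0))
                   for s in states)
    return 1.0 - tv

# Toy Bike Buyer profiles for two clusters (invented numbers).
c4 = {"Bike Buyer=1": 0.60, "Bike Buyer=0": 0.40}
c7 = {"Bike Buyer=1": 0.72, "Bike Buyer=0": 0.28}
print(round(similarity(c4, c7), 2))  # 0.88 -> a relatively dark link
```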
On the chart above for
the default, there is a link between Cluster 4 and Cluster 7. It means that in
the population, with all attributes considered, these two clusters have the
strongest link. It is interesting to note that this link for the population exists
in every subsequent cluster diagram. If we look for more links in the population by
sliding the vertical bar on the left, we find that the next strongest similarity is
between Cluster 2 and Cluster 7, and the link between Cluster 6 and Cluster 10
ranks 3rd.
Let’s look at one more: Bike Buyer with state = 1. The density ranges from 'None' to 72% in Cluster 7. The three
strongest links are again between C4 and C7, between C2 and C7, and between C6
and C10, the exact conclusion as that for the population. This is not by
accident: the default diagram for the population actually searches for
attribute similarities on the predictable attribute, which is Bike Buyer = 1, so they should reveal the same findings.