I use the same project exercises from the TK 70-448 book as in the previous post. In the project, 13 variables are used to find the salient predictors for the predictable variable, BikeBuyer (yes or no). CustomerKey is the key column. BikeBuyer is the predictable column, and an input as well. The other 12 input columns are Age, CommuteDistance, EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, TotalChildren, and YearlyIncome.
Now let us look at the results for the Naive Bayes model. There are four result tabs: Dependency Network, Attribute Profiles, Attribute Characteristics, and Attribute Discrimination.
Dependency Network - serves the same purpose as in the decision tree model: it shows which input variables are significant predictors of bike purchase, and ranks them from most important on down. But the results are usually not the same as those in the decision tree model. Here it shows 8 inputs, not 11 as in the decision tree model. The 8 salient input variables, in order of importance, are: number of cars owned, commute distance, English education, total children, age, number of children at home, region, and marital status.
Attribute Profiles - simply shows the frequency distribution of the two values of the predictable variable for each group of the input variables. The numbers highlighted are further elaborated in the next paragraph.
Attribute Characteristics - The dependent variable, or predictable attribute, has two values: 1 for buyer, 0 for non-buyer. There are 8 significant input variables, each with two or more groups (Age: 5 groups, commute distance: 5, English education: 5, marital status: 3, number of cars owned: 5, number of children at home: 5, region: 4, total children: 5; 37 groups in total). The Attribute Characteristics chart simply lists the groups in descending order of their probabilities on the selected value of the predictable variable. For instance, if I select the value 1 for the predictable attribute (meaning bike buyer), the group of customers with no children at home has the highest percentage among the 37 groups, at 62.979%. This is consistent with the value on the Attribute Profiles chart above. The married group has the 2nd highest, at 51.365% (not displayed on the chart below), and so on. This chart shows up to the top 50 groups.
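As a sanity check, these two percentages can be reproduced from the per-group buyer counts that appear in the stored-procedure results later in this post (InState1 is the number of buyers inside the group, OutState1 the number of buyers outside it). A small sketch:

```python
# Recompute the Attribute Characteristics percentages from the buyer counts
# in the GetAttributeDiscrimination results later in this post.
# For any row, InState1 + OutState1 = total buyers = 6372.
total_buyers = 4013 + 2359            # buyers with / without children at home

# P(group | Bike Buyer = 1), as shown on the Attribute Characteristics chart
pct_no_children_at_home = 4013 / total_buyers * 100   # InState1 for that group
pct_married = 3273 / total_buyers * 100               # InState1 for Marital Status = M

print(round(pct_no_children_at_home, 3))  # 62.979
print(round(pct_married, 3))              # 51.365
```

So the characteristics chart is simply ranking groups by their share of the selected state's cases.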
Attribute Discrimination - In short, this viewer shows which state of the predictable attribute is favored by each input group, by calculating the differences in the input attributes across the states of the output attribute. The message on the chart is easy to understand. What is not intuitive is how the values behind the bar indicators are computed, such as why the 2nd group (number of cars owned = 2) has a score of 79.396.
It is tempting to combine the four counts in the legend to derive the value. Unfortunately, it doesn't work that way, at least not directly. Let's dismantle it properly.
First, we need to select a value for the predictable variable. In our case, we have two values: 1 for buyers, 0 for non-buyers. Let's say we select 1 (i.e., buyer) as value 1; value 2 is then 0, or all other states.
Next, the Naive Bayes model, under the hood, calculates a score for each of the 37 groups of the eight significant input columns on the predictable attribute. If the score is positive, the bar indicator is shown under the '1' column; if negative, the bar is under the "All other states" column. The score is actually calculated by the following stored procedure.
GetAttributeDiscrimination(
string strModel,
string strPredictableNodeUniqueID,
string strValue1,
int iVal1Type,
string strValue2,
int iVal2Type,
double dThreshold,
bool in_bRescaled)
The parameters are:
strModel – The name of the model.
strPredictableNodeUniqueID – The Node Unique Name of the predictable attribute. You can find it in the Microsoft Generic Content Tree View, or get the list of predictable attributes and their Node Unique Names by calling another stored procedure: CALL System.GetPredictableAttributes('ModelName'). This stored procedure returns two columns, one for the attribute name and one for the Node Unique Name.
CALL System.GetPredictableAttributes('TK448 Ch09 Prediction Naïve Bayes')
/******* Return ****************
ATTRIBUTE_NAME    NODE_UNIQUE_NAME
Bike Buyer        100000001
*********************************/
strValue1 – The name of the value you want to compare on the left hand
side. The usage of this parameter depends on the value of the next
parameter.
iVal1Type – This parameter indicates how to treat strValue1. It can be 0, 1, or 2. If it is 1, the value in strValue1 is the actual state of the attribute. If it is 0 or 2, the value in strValue1 is ignored: 0 means the left-hand value is the "missing state", and 2 means the left-hand value is "all other states." In the DMX example below, "All other states" is specified only because it looks nice (and it's easier to just drop the combo-box value into the function call even though it will be ignored).
strValue2 – Like strValue1, but for the right hand side.
iVal2Type – Like iVal1Type, but for the right hand side.
dThreshold – A threshold value used to filter results, so that small correlations don't come back in the results. Usually you set it to a really small number, like 0.0005 in the example below.
in_bRescaled – Whether or not the result is rescaled (normalized). If true, the results are normalized to a maximum absolute value of 100, giving a possible range of -100 to 100. All this does is divide 100 by the largest absolute value in the result, then multiply all the other numbers by that factor. If false, the numbers are whatever they are and you can interpret them yourself; the NB viewer always sets this to true by default.
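The rescaling step can be sketched in a few lines; the raw scores here are made-up values purely for illustration, not actual model output:

```python
# A minimal sketch of the rescaling step (the last parameter): divide 100 by
# the largest absolute value, then multiply every score by that factor,
# preserving signs, so the range becomes -100 to 100.
raw_scores = [0.0421, -0.0334, 0.0080]   # hypothetical raw values

scale = 100 / max(abs(s) for s in raw_scores)
normalized = [s * scale for s in raw_scores]

print([round(v, 3) for v in normalized])  # largest |value| becomes 100
```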
After I deploy the project, I use the following DMX:

CALL System.GetAttributeDiscrimination(
    'TK448 Ch09 Prediction Naïve Bayes',  -- model name
    '100000001',        -- Node Unique Name of the predictable attribute;
                        -- it can be found in the Generic Content Tree View
    '1',                -- the name of the value to compare on the left hand
                        -- side; its treatment depends on the next value
    1,                  -- 1: the value above is the actual state of the attribute
                        -- 0: the left-hand value is the "missing state"
                        -- 2: the left-hand value is "all other states"
    'All other states', -- see below
    2,                  -- 2: the right-hand value is "all other states"
    0.0005,             -- threshold used to filter out small correlations
    true                -- normalize the results to a maximum absolute value of 100
)
The results are:
Attributes              | Values              | Score   | InState1 | InState2 | OutState1 | OutState2
------------------------|---------------------|---------|----------|----------|-----------|----------
Number Cars Owned       | 0                   | 100.000 |     1842 |     1070 |      4530 |      5497
Number Cars Owned       | 2                   | -79.396 |     1810 |     2713 |      4562 |      3854
English Education       | Partial High School | -62.066 |      329 |      774 |      6043 |      5793
Age                     | >= 71               | -51.177 |      129 |      414 |      6243 |      6153
Total Children          | 1                   |  47.292 |     1512 |     1009 |      4860 |      5558
Commute Distance        | 10+ Miles           | -45.468 |      638 |     1122 |      5734 |      5445
Commute Distance        | 0-1 Miles           |  45.355 |     2491 |     1921 |      3881 |      4646
Region                  | Pacific             |  44.493 |     1523 |     1033 |      4849 |      5534
Total Children          | 5                   | -43.573 |      310 |      668 |      6062 |      5899
Region                  | North America       | -32.846 |     2940 |     3617 |      3432 |      2950
English Education       | Bachelors           |  29.086 |     2110 |     1670 |      4262 |      4897
Total Children          | 4                   | -27.761 |      618 |      993 |      5754 |      5574
Number Children At Home | 3                   | -27.503 |      276 |      544 |      6096 |      6023
Number Children At Home | 4                   | -26.870 |      254 |      510 |      6118 |      6057
Commute Distance        | 5-10 Miles          | -22.959 |      913 |     1316 |      5459 |      5251
Number Children At Home | 2                   |  20.979 |      700 |      451 |      5672 |      6116
Age                     | 42 - 52             |  20.538 |     2334 |     1958 |      4038 |      4609
Number Cars Owned       | 1                   |  18.864 |     1912 |     1565 |      4460 |      5002
Age                     | 62 - 71             | -18.848 |      549 |      847 |      5823 |      5720
Commute Distance        | 2-5 Miles           |  18.433 |     1298 |      993 |      5074 |      5574
English Education       | High School         | -18.379 |      945 |     1316 |      5427 |      5251
Number Cars Owned       | 4                   | -15.440 |      323 |      539 |      6049 |      6028
Marital Status          | S                   |  11.446 |     3099 |     2823 |      3273 |      3744
Marital Status          | M                   | -11.446 |     3273 |     3744 |      3099 |      2823
Number Cars Owned       | 3                   |  -7.774 |      485 |      680 |      5887 |      5887
Number Children At Home | 0                   |   7.254 |     4013 |     3830 |      2359 |      2737
English Education       | Graduate Degree     |   3.239 |     1192 |     1047 |      5180 |      5520
Commute Distance        | 1-2 Miles           |  -1.417 |     1032 |     1215 |      5340 |      5352
Number Children At Home | 5                   |  -0.628 |      286 |      369 |      6086 |      6198
That's why the first group has a score of 100, the 2nd group a score of 79.396 (returned as -79.396, so its bar falls under the "All other states" column), and so on. Please note that some groups with scores below the threshold are excluded from the chart.
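The sign convention described earlier can be sketched with a hand-copied subset of the rows above; this is only an illustration of how the viewer places the bars, not part of the stored procedure:

```python
# A positive score favors value 1 (buyer); a negative score favors
# "All other states". Rows are (attribute, value, score) copied from
# the results table above.
rows = [
    ("Number Cars Owned", "0", 100.000),
    ("Number Cars Owned", "2", -79.396),
    ("Total Children",    "1",  47.292),
]

for attribute, value, score in rows:
    column = "1" if score >= 0 else "All other states"
    print(f"{attribute} = {value}: bar under '{column}'")
```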