CustomerKey is the key column.
BikeBuyer is the predictable column and is also used as an input.
The other 12 input columns are Age, CommuteDistance, EnglishEducation, EnglishOccupation,
Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region,
TotalChildren, and YearlyIncome.
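If you prefer to follow along in code, here is a minimal sketch of the same column roles in Python with pandas. The file name vTargetMail.csv and the flat-file layout are assumptions made for illustration; they are not part of the SSAS project.

```python
import pandas as pd

# Hypothetical flat-file export of the training cases (file name assumed).
df = pd.read_csv("vTargetMail.csv")

KEY_COLUMN = "CustomerKey"       # identifies each case; not used for learning
TARGET_COLUMN = "BikeBuyer"      # predictable column (also an input in the SSAS model)
INPUT_COLUMNS = [
    "Age", "CommuteDistance", "EnglishEducation", "EnglishOccupation",
    "Gender", "HouseOwnerFlag", "MaritalStatus", "NumberCarsOwned",
    "NumberChildrenAtHome", "Region", "TotalChildren", "YearlyIncome",
]

X = df[INPUT_COLUMNS]            # inputs
y = df[TARGET_COLUMN]            # value to predict
```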
Now let us look at the results for the decision tree model.
Let's examine the dependency network first. This view is useful because it shows at a glance which input is the most important, which is the second most important, which is the third, and so on, as well as which inputs are not important at all. The chart shows that NumberCarsOwned is the No. 1 predictor.
Region is the No. 2 predictor:
TotalChildren is the No. 3 predictor:
Similarly, I found the order of importance of the 11 significant input variables for predicting BikeBuyer to be: NumberCarsOwned > Region > TotalChildren > YearlyIncome > CommuteDistance > NumberChildrenAtHome > Age > MaritalStatus > EnglishOccupation > EnglishEducation > HouseOwnerFlag.
Note that the other two input variables, Gender and BikeBuyer, turn out not to be important when competing with the other 11 inputs. It was a big surprise to me that previous purchasing history (i.e., BikeBuyer) does not significantly predict a future purchase; I would have expected it to be significant, and otherwise the finding is hard to interpret. One possible reason is that this is a hypothetical dataset rather than a real one. But anyway, this is what the data show.
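As a rough code analog of the dependency network, the sketch below (continuing from the pandas setup above) fits a scikit-learn decision tree and ranks the inputs by impurity-based importance. scikit-learn is not the SSAS Microsoft Decision Trees algorithm, so the ordering it produces may not match the one above exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# One-hot encode the categorical inputs so a scikit-learn tree can use them.
X_encoded = pd.get_dummies(X, columns=[
    "CommuteDistance", "EnglishEducation", "EnglishOccupation",
    "Gender", "MaritalStatus", "Region",
])

clf = DecisionTreeClassifier(max_depth=9, random_state=0).fit(X_encoded, y)

# Rank inputs by importance, a rough analog of the dependency-network ordering.
importance = (
    pd.Series(clf.feature_importances_, index=X_encoded.columns)
      .sort_values(ascending=False)
)
print(importance)
```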
Now, let's examine the results in the tree view. First, I will explain the menu options, since they confused me at first.
Tree - Bike Buyer. The dropdown box lists all of the predictable variables. Since there is only one predictable variable in this project, the only value shown is Bike Buyer.
Default Expansion - Provides a few convenient presets. In the chart below, it offers 3 levels, 4 levels, 5 levels, and all levels. Of course, you can also view the decision tree at other depths, such as 1, 2, or 6 levels, using the option below it, Show Level.
Show Level - Ranges from 1 to 9. For example, in the chart below I choose level 7, and the figure shows 7 levels of the tree accordingly. Note that the highest level is 9, not 11 as found in the dependency network. This is because the two views examine the prediction from different perspectives. In the dependency network, the question is which factors, taken into account collectively, are salient predictors of bike buyer. In the decision tree chart, a path stops growing, and a level stops expanding, as soon as the tree can make a definite decision no matter what the values of the other attributes are.
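The scikit-learn analog of Show Level is the tree's depth: you can cap it when fitting, and the fitted depth can come out lower than the cap because a branch stops splitting once its node can already make a definite decision. This is just a sketch continuing from the code above.

```python
from sklearn.tree import DecisionTreeClassifier

# Analog of "Show Level": cap the depth at fitting time, then read back the
# depth actually reached; pure nodes stop splitting before the cap is hit.
shallow = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_encoded, y)
print(shallow.get_depth())   # at most 7, possibly less
```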
Background - In this training dataset (70% of the entire dataset; the other 30% is reserved for testing and validation), there are 12,939 customers. Of them, 6,567 did not buy a bike and 6,372 did, with no missing data. So there are four possible background choices: all customers, the buyers, the non-buyers, and the missing-value group (which is empty in this dataset).
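In the scikit-learn analog, the 70/30 split and the class counts would be obtained roughly like this; the stratification and random seed are assumptions, and the exact counts quoted above come from the SSAS project.

```python
from sklearn.model_selection import train_test_split

# Roughly the 70/30 split described above (exact counts will differ).
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.30, stratify=y, random_state=0)

print(len(y_train))              # number of training cases
print(y_train.value_counts())    # buyers vs. non-buyers in the training portion
```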
In the chart below, I select 0 (non-buyer) as the background. The darker a node, the higher its percentage of non-buyers. So the group with 4 cars has the highest percentage of customers who did not buy a bike, because its color is the darkest, whereas the group with 0 cars is at the opposite end.
If I select 1 (the buyer) as the background, then the darker the node, the higher the percentage of customers buying bikes, and the lighter the node, the lower that percentage. So the group with 0 cars has the highest percentage of buyers, whereas the group with 4 cars is at the opposite end. These are the same conclusions as before, just presented differently.
If I select All as the background, Level 1 contains all the customers, so it is the darkest. For Level 2 (number of cars owned), because the background is all customers, the percentage for each group is calculated against all customers. At this level, the group with 2 cars has the highest percentage of customers, whereas the group with 4 cars has the lowest. Note that we no longer care about buyers versus non-buyers, since the background is all customers. Other levels are interpreted similarly.
Histogram - Controls how many colored bars are shown in the horizontal strip at the bottom of each customer group. If I select 1, only one bar is shown for each group and the rest of the strip is grey; the bar shown is the one with the higher percentage. For instance, in the chart below, at Level 1 with all customers, the percentage of non-buyers is higher than that of buyers (50.75% vs. 49.25%), so the histogram shows the non-buyers as a blue bar. But at Level 2, for customers with 1 car, the ratio of buyers to non-buyers is 54.99% vs. 45.01%, so the bar shown is pink.
In this particular project we really only need two bars, indicating the percentages of buyers versus non-buyers; since there is no missing data, a third bar is not needed. Values from 4 to 9 can technically be selected, but they do not make any sense for this project. Thus, we should set Histogram to 2 here.
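The percentages behind the node shading and the histogram bars can also be read directly off a fitted tree. In the scikit-learn analog, each node stores its class distribution (raw counts or fractions, depending on the library version), which the sketch below converts to buyer and non-buyer percentages.

```python
# Analog of the background shading and the two-segment histogram:
# per-node class distribution converted to percentages.
dist = clf.tree_.value[:, 0, :]                 # one row per node
percent = dist / dist.sum(axis=1, keepdims=True) * 100
labels = clf.classes_                           # typically [0, 1]: non-buyer, buyer
for node in range(clf.tree_.node_count):
    shares = ", ".join(f"{lab}: {p:.2f}%" for lab, p in zip(labels, percent[node]))
    print(f"node {node} -> {shares}")
```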
Now, finally, let's look at the entire decision tree. In the chart below, we select all levels, the buyer as the background, and Histogram = 2 for buyers versus non-buyers. Level 1 is all customers. Level 2 splits Level 1 the most, so the input NumberCarsOwned is the most significant predictor of bike purchase. Among its groups, the one with 0 cars has the highest percentage of buyers. For that particular node, the best predictor at the next level is Region, and Pacific customers have the highest percentage. This is the final conclusion: no matter what values the other input variables take, customers who do not own a car and are from the Pacific region are the most likely buyers.
Other paths or tree branches can be interpreted in a similar way, until each reaches a definite conclusion.
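In the scikit-learn analog, the same kind of rule can be read off by dumping the fitted tree as text: each root-to-leaf path is a rule, and a branch involving the dummy column Region_Pacific (a name that assumes the one-hot encoding used in the earlier sketch) would correspond to "no car and from the Pacific region".

```python
from sklearn.tree import export_text

# Print the fitted tree as indented text rules, truncated to the top levels.
print(export_text(clf, feature_names=list(X_encoded.columns), max_depth=4))
```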