How to evaluate prediction models in SSAS?

1.  Introduction

To solve a business problem, we may need to build a number of data mining models with different algorithms and different algorithm parameters. We then have to find the best model and deploy it into production. For this purpose, we need to measure each model's accuracy, reliability, and usefulness.

Accuracy determines how well an outcome from a model correlates with real data. The standard methods to measure accuracy include lift charts, profit charts, and classification matrices.

Reliability assesses how well a data mining model performs on different datasets. This is often checked with cross validation. With cross validation, we partition the training dataset into a number of smaller sections. SSAS then builds and trains a model for each combination of partitions, using one partition at a time as test data and the remaining partitions as training data, and computes the accuracy measures for each partition. If the measures differ widely across partitions, the model is not robust to different training and test set combinations.

Usefulness measures how helpful the information gathered with data mining is. Usefulness is typically measured through the perception of business users, using questionnaires and similar means.

2.  Measuring Accuracy

a.    Lift Charts

A lift chart is the most popular way to show the performance of predictive models. The figure below shows a lift chart for the predictive models, with the predicted variable (Bike Buyer) set to the value 1 (buyers).

The chart shows four curved lines and two straight lines. The four curves show the predictive models (Decision Trees, Naïve Bayes, Neural Network, and Clustering), and the two straight lines represent the Ideal Model (top) and the Random Guess (bottom). The x-axis represents the percentage of population (all cases), and the y-axis represents the percentage of the target population (bike buyers).

From the Ideal Model line, you can see that approximately 48 percent of Adventure Works customers buy bikes. That means that, IDEALLY, if you could predict with 100 percent probability which customers will buy a bike and which will not, you would need to target only 48 percent of the population. On the other hand, the Random Guess line indicates that if you were to pick cases out of the population randomly, you would need 100 percent of the cases for 100 percent of bike buyers. Likewise, with 80 percent of the population, you would get 80 percent of all bike buyers, with 60 percent of the population 60 percent of bike buyers, and so on.

Data mining models give better results, in terms of the percentage of bike buyers captured, than the Random Guess line, but worse results than the Ideal Model line. From the lift chart, we can measure the lift of a data mining model over the Random Guess line. For example, if we take the highest curve, the one directly below the Ideal Model line, we can see that selecting 70 percent of the population based on this model captures nearly 88 percent of the bike buyers. From the Mining Legend window, we can see that the exact figure is 87.49% for the Decision Trees model.
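
One common way to express this lift numerically (a back-of-the-envelope reading of the chart, not a figure SSAS reports directly) is the ratio of the share of buyers captured to the share of the population targeted:

\[
\text{lift at 70\% of the population} = \frac{87.49\%}{70\%} \approx 1.25
\]

that is, within this slice the Decision Trees model finds roughly 1.25 times as many buyers as random selection would.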

The value for Predict probability represents the threshold required to include a customer among the "likely to buy" cases. For example, to identify the likely buyers according to the decision tree model, we would use a query that retrieves cases with a Predict probability of at least 32.54% (assuming we select 70 percent of the population). To get the customers targeted by the last model, the clustering model, we would create a query that retrieves cases with a Predict probability of at least 41.50%. A sketch of such a query is shown below.
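
For illustration only, a minimal DMX prediction query along these lines could return the targeted customers from the decision tree model. The data source name ([Adventure Works DW]), the input table (dbo.ProspectiveBuyer), and the CustomerKey column are assumptions and must be adapted to your own environment; the mining model name is the one used in the cross-validation section later in this article.

    // Return customers whose predicted probability of [Bike Buyer] = 1
    // from the decision tree model is at least 32.54%.
    SELECT
        t.[CustomerKey],
        PredictProbability([Bike Buyer], 1) AS [Buy Probability]
    FROM
        [TK448 Ch09 Prediction Decision Trees]
    NATURAL PREDICTION JOIN
        OPENQUERY([Adventure Works DW],
                  'SELECT * FROM dbo.ProspectiveBuyer') AS t
    WHERE
        PredictProbability([Bike Buyer], 1) >= 0.3254

Swapping in the clustering model's name and the threshold 0.4150 would give the second list described above.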

It is interesting to compare the models. Still assuming we select 70 percent of the population, the decision tree model captures more potential customers, but because we target customers down to a prediction probability of 32.5%, we also accept up to a 67.5% chance of sending a mailing to someone who will not buy a bike. The clustering model, on the other hand, captures fewer potential customers, but the chance of sending a mailing to someone who will not buy a bike is lower (58.5% = 1 − 41.5%). Therefore, if we were deciding which model is better, we would want to balance the larger number of buyers reached by the decision tree model against the greater precision of the clustering model. This trade-off is better addressed with the profit chart discussed below.

The value for Score helps us compare models by calculating the effectiveness of the model across a normalized population. A higher score is better, so in this case we might decide that the decision tree model is the most effective strategy, despite its lower prediction probability threshold. The Neural Network algorithm generates the second-best score, the Naïve Bayes algorithm the third-best, and the Clustering algorithm the worst. Interestingly, the ranking of the four models by score is consistent with their ranking by captured target population.

b.   Profit Chart

A profit chart answers the question: what percentage of the population should we target in order to maximize profit? The following is a profit chart for the same four models.

The figure above is based on the following parameter settings:

·       Predict value = 1: [Bike Buyer] = 1, meaning customers who are likely to buy a bike.

·       Population = 50,000: Sets the value for the total target population. Your database might contain many customers, but to save on mailing expenses you might choose to target only the 50,000 customers who are most likely to respond. You can get this list by running a prediction query and sorting by the probability output by the predictive model.

·       Fixed cost = 5,000: The one-time cost of setting up a targeted mailing campaign for 50,000 people. This might include printing or the cost of setting up an e-mail campaign. We enter $5,000.

·       Individual cost = 3: The per-unit cost of the targeted mailing campaign; we enter $3. This amount will be multiplied by a number equal to or less than 50,000, depending on how many customers the model predicts are good prospects.

·       Revenue per individual = 15: The amount of profit or income that can be expected from a successful result. In this case, we assume that mailing a catalog results in the purchase of accessories or bikes averaging $15. This amount is used to project the total profit associated with high-probability cases.

There appear to be two peak points for the decision tree model on the chart: one at about 80% of the population (the chart does not offer exactly 80%; the closest point is 79.21%) and another at about 90%. The results are below:

Target population    Profit         Predict Probability (Response Rate)
79.21%               $216,671.30    24.78%
90.10%               $218,025.30    15.39%

Thus, to maximize profit, we should target about 90% of the population. If we do not want to target such a high percentage, we can target about 80% of the population instead, at a slightly lower profit, and so on.

How the profit is calculated is not documented in the Microsoft sources I have seen, but according to another source, the profit at the selected population is calculated as:

Profit = (True Positives × Revenue per individual) − Fixed cost − Individual cost × (True Positives + False Positives)
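
As a quick sanity check with the parameters above and purely hypothetical counts (say the selected slice of the 50,000 customers produces 12,000 true positives and 18,000 false positives, so 30,000 catalogs are mailed), the formula gives:

\[
12{,}000 \times \$15 \;-\; \$5{,}000 \;-\; \$3 \times (12{,}000 + 18{,}000) = \$180{,}000 - \$5{,}000 - \$90{,}000 = \$85{,}000
\]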

Sometimes we may need the second form of the lift chart, which measures the quality of global predictions. That is, we measure the predictions for all states of the target variable. For example, the chart below measures the quality of predictions for both states of Bike Buyer, buyers and non-buyers. You can see that the Decision Trees algorithm predicts correctly in approximately 70 percent of cases, and the Clustering model in about 60 percent (see the legend).

c.    Classification Matrices

A classification matrix shows actual values compared to predicted values. The figure below shows the classification matrices for the predictive models.

For the Decision Trees algorithm, for example, we can calculate that the algorithm predicted 2,575 buyers (741 + 1,834). Of these predictions, 1,834 were correct (71.2%), while 741 were false, meaning that those customers in the test set did not actually purchase a bike. Of the 2,970 predicted non-buyers (2,102 + 868), 2,102 were correct (70.8%), while 868 were false. The other three models can be interpreted in the same way. Overall, the Decision Trees model appears to make the most correct predictions, followed by Neural Network, then Naïve Bayes, and finally Clustering.
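
The percentages quoted above are simply the precision of each prediction, i.e., the share of predictions of a given class that turn out to be correct. For the Decision Trees model:

\[
\text{precision}_{\text{buyers}} = \frac{1{,}834}{1{,}834 + 741} \approx 71.2\%, \qquad
\text{precision}_{\text{non-buyers}} = \frac{2{,}102}{2{,}102 + 868} \approx 70.8\%
\]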

3.  Measuring Reliability - Cross Validation

The figure below shows the settings for cross validation and the cross-validation results for the predictive models.

First, we define the cross-validation settings as follows:

·       Fold count - defines how many partitions to create in the training data. In the figure below, three partitions are created: when partition 1 is used as the test data, the model is trained on partitions 2 and 3; when partition 2 is used as the test data, the model is trained on partitions 1 and 3; and when partition 3 is used as the test data, the model is trained on partitions 1 and 2.

·       Max cases - defines the maximum number of cases to use for cross validation. Cases are taken randomly from each partition. This example uses 9,000 cases, which means that each partition will hold 3,000 cases.

·       Target attribute - This is the variable that you are predicting.

·       Target state - You can check overall predictions if you leave this field empty or check predictions for a single state that you are specifically interested in. In this example, you are interested in bike buyers (state 1).

·       Target threshold - sets the accuracy bar for the predictions. If the prediction probability exceeds the threshold, the prediction is counted as correct; if not, it is counted as incorrect. The same settings can also be supplied to the cross-validation stored procedure in DMX, as sketched after this list.
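
For reference, a DMX call along the following lines should produce the same kind of results as the designer. The mining structure name and the 0.5 threshold are assumptions here (the threshold actually used for the figure is not stated above); substitute your own names and values.

    // A sketch of running cross validation from DMX rather than the designer.
    CALL SystemGetCrossValidationResults(
        [TK448 Ch09 Prediction],                 // mining structure (assumed name)
        [TK448 Ch09 Prediction Decision Trees],  // mining model(s) to validate
        3,                                       // fold count
        9000,                                    // max cases
        'Bike Buyer',                            // target attribute
        1,                                       // target state (buyers)
        0.5                                      // target threshold (assumed)
    )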

The full version of the result tables for the four models is as follows:

TK448 Ch09 Prediction Decision Trees (each partition contains 3,000 cases)

Test            Measure                 Partition 1  Partition 2  Partition 3     Average     Std Dev
Classification  True Positive                   904         1115         1097   1038.6667     95.5068
Classification  False Positive                  332          556          593    493.6667    115.3092
Classification  True Negative                  1172          949          911   1010.6667    115.1299
Classification  False Negative                  592          380          399         457      95.774
Likelihood      Log Score                   -0.5825      -0.5965      -0.6039     -0.5943      0.0088
Likelihood      Lift                         0.1106       0.0967       0.0893      0.0988      0.0088
Likelihood      Root Mean Square Error       0.3294       0.3293       0.3265      0.3284      0.0013

TK448 Ch09 Prediction Naive Bayes (each partition contains 3,000 cases)

Test            Measure                 Partition 1  Partition 2  Partition 3     Average     Std Dev
Classification  True Positive                   916          958          971    948.3333      23.471
Classification  False Positive                  539          555          566    553.3333     11.0855
Classification  True Negative                   965          950          938         951     11.0454
Classification  False Negative                  580          537          525    547.3333     23.6126
Likelihood      Log Score                   -0.6767      -0.6681      -0.6773      -0.674      0.0042
Likelihood      Lift                         0.0164        0.025       0.0158      0.0191      0.0042
Likelihood      Root Mean Square Error       0.2963       0.2977       0.2966      0.2968      0.0006

TK448 Ch09 Prediction Neural Network (each partition contains 3,000 cases)

Test            Measure                 Partition 1  Partition 2  Partition 3     Average     Std Dev
Classification  True Positive                   796          833          807         812     15.5134
Classification  False Positive                  538          587          502    542.3333     34.8361
Classification  True Negative                   966          918         1002         962     34.4093
Classification  False Negative                  700          662          689    683.6667     15.9652
Likelihood      Log Score                   -0.6665       -0.659      -0.6694      -0.665      0.0044
Likelihood      Lift                         0.0267       0.0341       0.0237      0.0282      0.0044
Likelihood      Root Mean Square Error       0.3802       0.3805       0.3736      0.3781      0.0032

TK448 Ch09 Prediction Clustering (each partition contains 3,000 cases)

Test            Measure                 Partition 1  Partition 2  Partition 3     Average     Std Dev
Classification  True Positive                   754          577          792    707.6667     93.6886
Classification  False Positive                  433          287          471         397     79.3137
Classification  True Negative                  1071         1218         1033   1107.3333     79.7761
Classification  False Negative                  742          918          704         788     93.2237
Likelihood      Log Score                   -0.6594      -0.6658      -0.6644     -0.6632      0.0027
Likelihood      Lift                         0.0337       0.0273       0.0288      0.0299      0.0027
Likelihood      Root Mean Square Error        0.389       0.4111       0.3903      0.3968      0.0101

The tables above show that the True Positive classification for Decision Trees does not give consistent results across partitions: the standard deviation of this measure is quite high, 95.5. If you check the Neural Network model, you will see that it is much more consistent for the True Positive classification, with a standard deviation of only 15.5, which means that this model is more robust across different datasets than the Decision Trees model. From the cross-validation results, it seems that you should deploy the Neural Network model in production: although its accuracy is slightly worse than the accuracy of Decision Trees, its reliability is much higher. Of course, in production you should perform many additional accuracy and reliability tests before you decide which model to deploy.
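
Since the models have different average counts, a simple way to put the spread on a common scale (not something SSAS reports directly) is the coefficient of variation, the standard deviation divided by the average of the measure:

\[
\text{CV}_{\text{Decision Trees, TP}} = \frac{95.5068}{1038.6667} \approx 9.2\%, \qquad
\text{CV}_{\text{Neural Network, TP}} = \frac{15.5134}{812} \approx 1.9\%
\]

which makes the gap in reliability between the two models easy to see.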