Model Validation Measures
Forecast performance scores and graphics can be obtained for the cross-validated forecasts; if the retroactive forecast option was selected, results for the retroactive forecasts are also available. Using the Tools ~ Validation menu item, select whether the cross-validated or the retroactive forecasts are to be verified, and then whether performance statistics, bootstrap confidence intervals and permutation significance tests, contingency tables, or scatter and residual plots should be provided for an individual series, or a map/bar chart (depending on whether the Y data are in gridded/station format) for all series. A validation window will open.
Performance Measures
The Performance Measures window for an individual series provides a variety of forecast performance scores, divided into those based on continuous measures and those based on measures in which the observations, and in some cases the forecasts as well, are divided into three categories. The continuous forecast measures calculated are as follows (a computational sketch is given after the list):
- Pearson's product moment correlation coefficient, which describes the strength of the linear association between the forecasts and the observations;
- Spearman's rank order correlation coefficient, which describes the strength of the monotonic association between the forecasts and the observations;
- 2AFC score (continuous), which indicates the probability of correctly discriminating a higher from a lower value (e.g., the wetter or warmer of two observations);
- Mean squared error, which is the average of the squared differences between the forecasts and the observations;
- Root mean squared error, which is the square root of the mean squared error;
- Mean absolute error, which is the average of the absolute differences between the forecasts and the observations;
- Bias, which is the difference between the mean of the forecasts and the mean of the observations;
- Variance Ratio, which is the variance of the forecasts divided by the variance of the observations.
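As a concrete illustration, the sketch below computes these continuous measures in Python. The arrays `fcst` and `obs`, and their values, are hypothetical stand-ins for a series of cross-validated forecasts and the matching observations; the code illustrates the definitions of the scores, not the program's internal implementation.

```python
import numpy as np
from itertools import combinations

fcst = np.array([22.1, 19.8, 25.3, 21.0, 23.7, 18.9])  # hypothetical forecasts
obs  = np.array([21.5, 20.2, 24.8, 19.9, 25.1, 19.4])  # matching observations

# Pearson: linear association between forecasts and observations
pearson = np.corrcoef(fcst, obs)[0, 1]

# Spearman: Pearson correlation of the ranks (no ties in this toy data)
def ranks(x):
    return np.argsort(np.argsort(x)) + 1.0

spearman = np.corrcoef(ranks(fcst), ranks(obs))[0, 1]

# 2AFC (continuous): probability that the forecasts correctly order a pair
# of distinct observations; tied forecasts count as half-correct
correct, valid = 0.0, 0
for i, j in combinations(range(len(obs)), 2):
    if obs[i] == obs[j]:
        continue
    valid += 1
    hi, lo = (i, j) if obs[i] > obs[j] else (j, i)
    if fcst[hi] > fcst[lo]:
        correct += 1.0
    elif fcst[hi] == fcst[lo]:
        correct += 0.5
afc2 = correct / valid

mse  = np.mean((fcst - obs) ** 2)       # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error
mae  = np.mean(np.abs(fcst - obs))      # mean absolute error
bias = fcst.mean() - obs.mean()         # mean forecast minus mean observation
var_ratio = fcst.var(ddof=1) / obs.var(ddof=1)  # sample-variance ratio
```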
The categorical forecast measures are as follows (the hit and hit skill scores are sketched in code after the list):
- Hit score, which gives the percentage of times the forecast category corresponds with the observed category;
- Hit skill score, which gives the percentage of times, beyond that expected by chance, that the forecast category corresponds with the observed category;
- LEPS score, which is defined using a scoring table that assigns different scores to hits and misses depending on the observed category and on the prior probabilities of the categories;
- Gerrity score, which is defined using an alternative scoring table to that for the LEPS score;
- 2AFC (forecast categories), which indicates the probability of correctly discriminating an observation in a higher category from one in a lower (e.g., an "above-normal" observation from a "normal" observation) given the forecasts expressed in categorical form ("above-normal", "normal", or "below-normal");
- 2AFC (continuous forecasts), which indicates the probability of correctly discriminating an observation in a higher category from one in a lower (e.g., an "above-normal" observation from a "normal" observation) given the forecasts expressed in deterministic form (i.e., the forecast values shown in the accompanying graph);
- ROC area (below-normal), which is the area beneath the ROC curve for forecasts of the below-normal category, and gives the proportion of times that below-normal conditions can be distinguished successfully from the other categories;
- ROC area (above-normal), which is the area beneath the ROC curve for forecasts of the above-normal category, and gives the proportion of times that above-normal conditions can be distinguished successfully from the other categories.
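The hit and hit skill scores lend themselves to a short sketch. The category codes and counts below are hypothetical, and the chance correction assumes three equiprobable (tercile-based) categories; the LEPS and Gerrity scores depend on their scoring tables and are not reproduced, while the ROC construction is sketched under the ROC graphs below.

```python
import numpy as np

# Hypothetical forecast and observed categories:
# 0 = below-normal (B), 1 = normal (N), 2 = above-normal (A)
fcst_cat = np.array([0, 1, 2, 2, 0, 1, 1, 2, 0])
obs_cat  = np.array([0, 1, 1, 2, 1, 1, 0, 2, 0])

n = len(obs_cat)
hits = np.sum(fcst_cat == obs_cat)
hit_score = 100.0 * hits / n            # percentage of correct categories

# With three equiprobable categories, chance scores one hit in three; the
# skill score rescales so that chance = 0% and a perfect forecast = 100%
expected = n / 3.0
hit_skill_score = 100.0 * (hits - expected) / (n - expected)
```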
Beneath the continuous measures is a graph showing the forecasts (green line) and observations (red line). The graph is divided vertically into three categories. The definition of these categories is explained later in Customising the Results.
Beneath the categorical measures are relative operating characteristic (ROC) graphs for the above-normal (red line) and below-normal (blue line) categories. The observations are categorised using cross-validated category definitions (see Contingency Tables for further details on the definitions of the categories), but the forecasts are considered on the continuous scale. The forecasts are ranked, and the forecast with the highest value is taken as the most confident forecast for above-normal conditions, while that with the lowest value is taken as the least confident forecast. For forecasts of below-normal conditions this ranking is inverted, so that the forecast with the lowest value is taken as the most confident forecast for below-normal conditions, and that with the highest value is taken as the least confident forecast. The areas beneath the curves are given under the categorical skill measures above the ROC graph.
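The ranking procedure just described can be sketched as follows. The arrays are hypothetical: `fcst` holds the continuous forecast values and `above` flags which observations fell in the above-normal category. The curve is built by stepping through the forecasts from most to least confident, and the area is accumulated with the trapezium rule; the program's own ROC construction may differ in detail.

```python
import numpy as np

fcst  = np.array([25.3, 23.7, 22.1, 21.0, 19.8, 18.9])      # hypothetical
above = np.array([True, True, False, True, False, False])   # hypothetical

order = np.argsort(-fcst)   # highest value first: most confident for above-normal
hit_rate = np.concatenate(([0.0], np.cumsum(above[order]) / above.sum()))
far_rate = np.concatenate(([0.0], np.cumsum(~above[order]) / (~above).sum()))

# Trapezium rule for the area beneath the curve
roc_area = np.sum(np.diff(far_rate) * (hit_rate[1:] + hit_rate[:-1]) / 2)

# For the below-normal category the ranking is inverted: use np.argsort(fcst)
# so that the lowest forecast value is the most confident forecast
```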
Scores and graphs are shown for one series at a time. Information for the desired series can be shown by setting the appropriate number at the top left of the validation window. A series that has been omitted in the calculations is skipped when cycling through the series using the arrows.
Bootstrap Results
The Bootstrap window provides confidence limits and significance tests for a variety of forecast performance scores. The confidence limits are calculated using bootstrap resampling, and provide an indication of the sampling errors in each performance measure. The bootstrap confidence level used is indicated, and can be adjusted using the Options ~ Resampling Settings menu item. The actual sample scores are indicated, and are the same as those provided by Performance Measures.
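A bootstrap confidence interval of this kind can be sketched as follows, here for Pearson's correlation. The data, the number of resamples, and the 95% level are hypothetical choices; the program's resampling settings control the actual values used.

```python
import numpy as np

rng  = np.random.default_rng(42)
fcst = np.array([22.1, 19.8, 25.3, 21.0, 23.7, 18.9, 24.2, 20.5])  # hypothetical
obs  = np.array([21.5, 20.2, 24.8, 19.9, 25.1, 19.4, 23.0, 21.1])  # hypothetical

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

n = len(obs)
boot = np.empty(1000)
for b in range(boot.size):
    idx = rng.integers(0, n, size=n)        # resample pairs with replacement
    boot[b] = pearson(fcst[idx], obs[idx])

lower, upper = np.percentile(boot, [2.5, 97.5])   # 95% confidence limits
```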
As well as providing confidence limits, significance levels are also provided. The p-value indicates the probability that the sample score would be equalled or bettered by chance. Permutation procedures are used to calculate the p-values. The accuracy of the p-values depends upon the number of permutations, which can be set using the Options ~ Resampling Settings menu item. It is recommended that at least 200 permutations be used, and more if computation time permits.
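A permutation test of this kind can be sketched as follows, again for Pearson's correlation with hypothetical data. Shuffling the observations destroys any real forecast-observation association, so the p-value is the fraction of permuted scores that equal or better the actual sample score.

```python
import numpy as np

rng  = np.random.default_rng(0)
fcst = np.array([22.1, 19.8, 25.3, 21.0, 23.7, 18.9, 24.2, 20.5])  # hypothetical
obs  = np.array([21.5, 20.2, 24.8, 19.9, 25.1, 19.4, 23.0, 21.1])  # hypothetical

sample = np.corrcoef(fcst, obs)[0, 1]
n_perm = 200                                # the recommended minimum
perm = np.array([np.corrcoef(fcst, rng.permutation(obs))[0, 1]
                 for _ in range(n_perm)])
p_value = np.mean(perm >= sample)
```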
Skill Maps
If the Skill Map option is chosen, a window opens showing a map (if the Y data are in gridded/station format) or a bar chart (otherwise) for all series. The score shown can be selected by checking the button next to the desired score.
Scatter Plots
The Scatter Plots option shows a graph of the forecast residuals (differences between the forecasts and the observations), as well as a scatter plot of the observations against the forecasts. The scatter plot includes horizontal and vertical divisions that indicate the three categories. In both cases the divisions are defined by the terciles of the observations using all the cases. A best fit linear regression line is shown on the scatter plot, but only over the range of the forecasts.
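The ingredients of these plots can be sketched as follows, with hypothetical `fcst` and `obs` arrays: the residuals, the tercile divisions taken from all the observations, and a least-squares line evaluated only over the range of the forecasts.

```python
import numpy as np

fcst = np.array([22.1, 19.8, 25.3, 21.0, 23.7, 18.9, 24.2, 20.5])  # hypothetical
obs  = np.array([21.5, 20.2, 24.8, 19.9, 25.1, 19.4, 23.0, 21.1])  # hypothetical

residuals = fcst - obs                              # forecast minus observation
terciles  = np.percentile(obs, [100 / 3, 200 / 3])  # category divisions

slope, intercept = np.polyfit(fcst, obs, 1)         # best-fit regression line
x_line = np.array([fcst.min(), fcst.max()])         # only over forecast range
y_line = slope * x_line + intercept
```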
Contingency Tables
The Contingency Table window provides frequency and contingency tables for the forecasts. The frequency tables give the counts of forecasts for each of three categories, marked below-normal (B), normal (N), and above-normal (A), and the number of times each of the three categories verified. Column totals are given showing the total number of times each of the three categories was forecast/observed. The total number of forecasts is also provided, and for cross-validated forecasts should equal the total number of cases available, as specified on the Input Window. The contingency tables indicate the percentage of times that each of the three categories verified given the forecast category, and can be obtained from the frequency tables by dividing each element in the table by the respective column total. The row totals on the contingency tables indicate the percentage of times that the observations were in each category, and should be identical on the variance-adjusted and unadjusted tables. The column totals on the contingency tables indicate the relative frequency with which the best-guess forecast was in each category.
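The relationship between the two tables can be sketched as follows. The counts are hypothetical; rows are observed categories and columns are forecast categories, ordered B, N, A.

```python
import numpy as np

freq = np.array([[8, 4, 2],    # observed B, given forecasts of B, N, A
                 [5, 7, 4],    # observed N
                 [2, 4, 9]])   # observed A (hypothetical counts)

col_totals = freq.sum(axis=0)              # times each category was forecast
contingency = 100.0 * freq / col_totals    # % verified, given forecast category

total = freq.sum()                         # total number of forecasts
obs_freq  = 100.0 * freq.sum(axis=1) / total   # % observed in each category
fcst_freq = 100.0 * col_totals / total         # % forecast in each category
```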
Note that the contingency tables are calculated based on the category that the best-guess forecast is in rather than on the most probable category. It is often incorrectly assumed that if the best-guess forecast is in the middle ("normal") category then that category is the most likely to occur. For low-skill forecasts, the best-guess forecast will be in the normal category most of the time, but the normal category will rarely be the category with the highest probability. Given these interpretation problems, it is recommended that the contingency tables be used with caution.