Input Data

Two datasets are required by CPT:

X variables
The X variables are sometimes called "predictors", "independent variables", or "explanatory variables". In MOS applications, the X variables will normally be a GCM output field, such as precipitation or geopotential heights, while in a more traditional model they will typically be something like a set of sea-surface temperature data or an ENSO index. The X variables are used to predict the variables in the second dataset, the Y variables.
Y variables
The Y variables are sometimes called "predictands", "dependent variables", or "response variables". Most frequently, the Y dataset contains a set of station rainfall totals or temperature averages.

It is important to consider which dates to include in the input datasets since the analyses that CPT will perform depend upon how the data are structured. CPT is designed to be used for:

  • Seasonal forecasting: in which there is typically one X value and one Y value per year for each predictor/predictand, and the sample size (i.e., the length of the training period) is the number of years. If there is more than one value per year, the additional values are interpreted as "lagged fields" (defined below), and are treated as separate variables rather than as additional samples. For example, if you are trying to predict rainfall at one location and the Y file contains 30 years of data, but two months of data for each year, the sample size will be 30, and the number of variables will be two instead of one (i.e., separate forecasts will be made for each month). Similarly, the X file should typically contain one value per year for each variable; if there is more than one value, the values are treated as lagged fields and therefore used as additional predictors. However, if the X and/or Y files contain all twelve months per year, the data are treated differently, as discussed below.
  • Sub-seasonal forecasting: in which there are typically one or more X values per year, but for only a limited time of the year. Unlike in seasonal forecasting, each date is treated as an additional sample and so lengthens the training period. The X data are typically daily, pentadal, weekly, two-weekly, or dekadal data, and the Y dates should match the X dates, or should be continuous daily data (see notes below on daily Y input data). Lagged fields are not implicitly recognised, and so any desired lags would have to be included as separate fields. However, using lagged fields would be unconventional for sub-seasonal forecasting, and this capability has not been tested in CPT.
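The difference between the two cases can be illustrated with a minimal Python sketch. It is not CPT code: the function names seasonal_shape and subseasonal_shape and the (year, month) date representation are assumptions made purely for illustration, and the years used are hypothetical. The sketch only counts how many samples and how many fields result from a set of dates under each interpretation described above.

```python
from collections import defaultdict


def seasonal_shape(dates):
    """Count samples and lagged fields under the seasonal interpretation.

    dates: iterable of (year, month) tuples for a single variable.
    Returns (n_samples, n_lagged_fields): one sample per year, and one
    lagged field per value within a year.
    """
    months_by_year = defaultdict(set)
    for year, month in dates:
        months_by_year[year].add(month)
    n_samples = len(months_by_year)
    n_fields = max((len(m) for m in months_by_year.values()), default=0)
    return n_samples, n_fields


def subseasonal_shape(dates):
    """Count samples under the sub-seasonal interpretation.

    Every distinct date is a sample; no lagged fields are inferred.
    """
    return len(set(dates)), 1


# Hypothetical example: 30 years with two months of data per year.
dates = [(year, month) for year in range(1991, 2021) for month in (1, 2)]
print(seasonal_shape(dates))     # (30, 2)
print(subseasonal_shape(dates))  # (60, 1)
```

Under the seasonal interpretation the two months become two variables with a sample size of 30, matching the rainfall example above, whereas counting every date as a sample would give 60.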

Monthly or daily Y data: If the Y input file contains monthly data for all months of the year or daily data for all days of the year, CPT will attempt to use these data to calculate a (sub-)seasonal average, total, count or occurrence depending on the dates included in the X file. CPT will identify the appropriate (sub-)season automatically if the X input file contains cpt:S and cpt:T tags or when using the Probabilistic Forecast Verification (PFV) option. In both cases the Y season is automatically set to match the cpt:T dates in the X file. The cpt:S tag is typically present in a data file of GCM outputs. The automatically set season can be overwritten using Edit ~ Target Season. If the cpt:S tag is not present, you will be prompted for season settings. If you want the X and the Y dates to be the same (or to overlap), you should switch on Options ~ Data ~ Synchronous Predictors first.
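As a rough illustration of what calculating a seasonal total or average from monthly Y data involves, the following Python sketch aggregates monthly values into one value per year. It is a simplified assumption, not CPT's implementation: the function name seasonal_values, the dictionary layout, and the restriction to seasons that fall within a single calendar year are all hypothetical choices for the example.

```python
def seasonal_values(monthly, season_months, how="total"):
    """Aggregate monthly values into one seasonal value per year.

    monthly: dict mapping (year, month) -> value.
    season_months: months of the target season, all within one calendar
        year (a simplification made for this sketch).
    how: "total" or "average".
    Years with any missing month in the season are skipped.
    """
    if how not in ("total", "average"):
        raise ValueError("how must be 'total' or 'average'")
    result = {}
    for year in sorted({y for y, _ in monthly}):
        keys = [(year, m) for m in season_months]
        if not all(k in monthly for k in keys):
            continue  # incomplete season: no value for this year
        total = sum(monthly[k] for k in keys)
        result[year] = total if how == "total" else total / len(keys)
    return result


# Hypothetical data: two complete years of monthly values.
monthly = {(2000, m): float(m) for m in range(1, 13)}
monthly.update({(2001, m): 2.0 * m for m in range(1, 13)})

print(seasonal_values(monthly, (6, 7, 8)))             # {2000: 21.0, 2001: 42.0}
print(seasonal_values(monthly, (6, 7, 8), "average"))  # {2000: 7.0, 2001: 14.0}
```

The result has one value per year, which is the form the seasonal analysis expects for the Y data.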

For the X file, if all twelve months are present, CPT will read the data as 12 lagged fields rather than as 12 samples per year. It would almost certainly be inappropriate to treat the 12 months in the X file as separate samples, in part because skill levels are likely to be inflated if the seasonal cycle represents a large proportion of the variance. However, it is possible to allow monthly samples by switching on Options ~ Data ~ Permit Monthly Analysis.
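The inflation of skill by the seasonal cycle can be demonstrated with a small synthetic experiment. The Python sketch below is illustrative only and makes several assumptions: the data are randomly generated, the seasonal cycle is a sine wave with an arbitrary amplitude, and the "forecast" shares the cycle with the observations but has no real year-to-year skill. It compares the correlation obtained when all months are pooled as samples with the correlation obtained month by month.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(0)
n_years = 30
# Assumed seasonal cycle: large compared with the year-to-year anomalies.
cycle = [10.0 * math.sin(2 * math.pi * m / 12) for m in range(12)]

# Observations and forecasts share the cycle, but their anomalies are
# independent, so the forecast has no genuine interannual skill.
obs = [[cycle[m] + random.gauss(0, 1) for m in range(12)] for _ in range(n_years)]
fcst = [[cycle[m] + random.gauss(0, 1) for m in range(12)] for _ in range(n_years)]

# All 12 months pooled as separate samples (360 samples).
pooled = pearson([v for row in obs for v in row],
                 [v for row in fcst for v in row])

# Each calendar month assessed separately (30 samples each).
by_month = [
    pearson([obs[y][m] for y in range(n_years)],
            [fcst[y][m] for y in range(n_years)])
    for m in range(12)
]
mean_by_month = sum(by_month) / len(by_month)

print(f"pooled: {pooled:.2f}, mean per-month: {mean_by_month:.2f}")
```

In this setup the pooled correlation is high because both series follow the same seasonal cycle, while the per-month correlations stay close to zero, which is the kind of misleading skill the default handling of 12-month X files avoids.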

To open a dataset in CPT, locate the file using the "browse" buttons. After you select a file, CPT will automatically try to identify the structure of the dataset and the amount of data in the file. CPT requires each input file to follow strictly one of three structures (gridded, station, and unreferenced or index), each of which is described on the following pages. Currently, the input files for each of these structures must be in ASCII (or text) format, although other formats are being developed and will be implemented in later releases of the software.
