FORECAST VERIFICATION: A PRACTITIONER'S GUIDE IN ATMOSPHERIC SCIENCE Ian T. Jolliffe and David B. Stephenson, Eds., 2003, 254 pp., $95.00, hardbound, John Wiley & Sons, ISBN 0-471-49759-2
Forecast Verification: A Practitioner's Guide in Atmospheric Science, edited by Ian T. Jolliffe and David B. Stephenson, is a fairly tightly coupled set of chapters written by generally well-known experts (in some cases, perhaps better known in Europe than in North America) in verification and, especially, in the subjects of their particular chapters. Of a book whose sections or chapters are written by different authors, one asks two questions: 1) how well do the individual chapters read and present the material logically, accurately, and comprehensively? and 2) how well do the chapters relate to one another and cover the full subject of the book? On the first question, the book gets high marks for most chapters. On the second, it is better than many, although in some instances the editors have not suppressed individuality enough for it to read like a fully cohesive book. This book is a voluminously referenced and well-indexed survey of what is known about, and a historical account of, verification and the related topic of evaluation as they exist in the meteorological literature. The editors have put much emphasis on standardizing mathematical notation throughout, and were quite successful, an achievement in itself. While the methods presented can be applied to almost any forecasting problem, the discussion and examples are tied to weather and climate forecasting, as acknowledged in the book's preface, which hardly translates into the full scope of "atmospheric science."
In chapter 1, the editors reiterate Brier and Allen's (1951) reasons for verification; use their terms "economic," "administrative," and "scientific"; and note that a common theme of these is that any verification scheme be informative (p. 4). They note it is highly desirable that the verification system be objective, and examine various scores according to the attributes of reliability, resolution, discrimination, and sharpness [as suggested by Murphy and Winkler (1987)], and for "goodness," of which Murphy (1993) identified three types: consistency, quality (accuracy or skill), and value (utility). They also note that in order to quantify the value of a forecast, a baseline is needed, and that persistence, climatology, and randomization are common baselines.
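For readers new to the idea, the baseline enters through the generic skill-score construction, which scales a score so the baseline earns 0 and a perfect forecast earns 1. A minimal sketch in Python (the function name and the temperature data are mine, not the book's):

```python
import numpy as np

def skill_score(score_fcst, score_base, score_perfect):
    """Generic skill score: the fraction of the possible improvement
    over a baseline that the forecasts achieve (baseline -> 0, perfect -> 1)."""
    return (score_fcst - score_base) / (score_perfect - score_base)

# Illustration with mean absolute error (perfect score 0) against a
# climatological-mean baseline; the data here are invented.
obs  = np.array([12.0, 15.0,  9.0, 11.0, 14.0])
fcst = np.array([11.0, 14.0, 10.0, 12.0, 15.0])
clim = np.full_like(obs, obs.mean())   # climatology as the baseline forecast

mae_fcst = np.abs(fcst - obs).mean()
mae_clim = np.abs(clim - obs).mean()
print(skill_score(mae_fcst, mae_clim, 0.0))  # positive: beats climatology
```

The same construction works with persistence or a random forecast in place of climatology; only the baseline score changes.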
While development of objective forecasting systems is not the subject of the book and gets little direct treatment (other than ensembles), the authors do note that artificial skill is a danger in developing a forecasting system, and emphasize cross-validation and separate training and test datasets.
Chapter 2, written by J. M. Potts, deals with basic concepts and covers topics such as types of predictands: continuous and categorical (ordinal and nominal), of which binary is the simplest. Potts implies that all forecasts are made by "forecasting systems" (p. 13) and calls the variable for which the forecasts are formulated the "predictand." Strictly speaking, that may be correct, but most forecasts, other than those made by NWP, are made subjectively by forecasters, and the variable they are forecasting is generally not thought of as a predictand. The term comes from statistical objective systems, dating back to 1949 or before. In the AMS Glossary of Meteorology, predictand is defined only in terms of regression (AMS 2000; pp. 594, 641).
Potts states that a predictand (again, her use of the term) can be forecast either deterministically or probabilistically. She says on page 14: "A deterministic forecast is really just a special case of a probabilistic forecast in which a probability of unity is assigned to one of the categories and zero to the others." However, the editors state in chapter 9 (p. 192): ". . . deterministic point forecasts are not perfectly sharp forecasts with probabilities equal to 1 and 0, but instead they should be assigned unknown sharpness." Contradiction? The two statements are brought together only in the glossary.
Potts introduces statistical concepts such as distributions and the first four moments: mean, variance, skewness, and kurtosis. She also introduces the Murphy-Winkler framework, Bayes' theorem, and verification as a regression problem. The definition of "equitable" is presented as requiring that all constant forecasts and all random forecasts receive the same value of the score.
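For concreteness, a sketch of those four summary quantities under one common convention (conventions vary; some authors report excess kurtosis, which subtracts 3; the function name is mine):

```python
import numpy as np

def first_four_moments(x):
    """Sample mean, variance, and the standardized third and fourth
    central moments (skewness and kurtosis, one common convention)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    var = x.var()                                # second central moment
    skew = ((x - mean) ** 3).mean() / var ** 1.5
    kurt = ((x - mean) ** 4).mean() / var ** 2   # equals 3.0 for a Gaussian
    return mean, var, skew, kurt
```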
The next several chapters are nicely divided into verification of binary events, multicategory events, continuous variables, and spatial fields, all specific cases of different "types" of predictands. One of my biggest disappointments with the book is its lumping of the verification of probability forecasts together with ensemble forecasts, but that is what was done (more about that later).
In chapter 3, Ian Mason goes into great detail in dealing with the two-category event, and continues a theme of the book in discussing this situation in terms of Finley's tornado forecasts. The terminology brought from the relative operating characteristic (ROC) methodology, namely "base rate," "hit rate," and "false alarm rate," is predominantly used, and most scores are in, or are put into, those terms. If these terms prevail, it will be because of the ROC influence.
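A minimal sketch of those three quantities, using the 2 x 2 contingency-table counts commonly quoted for Finley's 1884 tornado forecasts (a = hits, b = false alarms, c = misses, d = correct rejections; the variable names are mine):

```python
# Commonly quoted counts for Finley's tornado forecasts.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

base_rate = (a + c) / n         # how often the event actually occurred
hit_rate = a / (a + c)          # fraction of occurrences that were forecast
false_alarm_rate = b / (b + d)  # fraction of non-occurrences forecast as events

print(f"base rate {base_rate:.4f}, H {hit_rate:.3f}, F {false_alarm_rate:.3f}")
# base rate ~0.018, H ~0.549, F ~0.026: rare events, modest hit rate
```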
I find it curious that there are several places (pp. 50, 53, 55, 70, 73) where negative skill (a set of binary forecasts that does worse than the baseline) seems to be no problem to the authors: just reverse the labels, and the negative skill becomes positive. Well, so! But usually we do not have the luxury of changing the forecasts we are verifying. Evidently, this statement comes from the perspective of developing an objective system and from the influence of the ROC, but the book is about verification, which involves determining the correspondence between the forecasts and the "observations," not about switching labels at the end. Belaboring this point is not useful to a "practitioner" of verification.
ROC is given good treatment, covering its relationship to Type 1 and Type 2 errors in hypothesis testing. The parametric (or modeled) area under the ROC curve is described, along with the associated discrimination distance.
Chapter 4, written by Robert Livezey, is devoted to categorical events, actually multicategory events, since two-category events are covered in chapter 3. Departing somewhat from the previous chapters, in which methods are presented but none is particularly recommended above the others, Livezey makes recommendations. For instance, he states on page 78: "In this chapter, the exclusive use of sample probabilities (observed frequencies) of categories of the forecast/observation set being verified is recommended, rather than the use of historical data. The only exception to this is for the case where statistics are stationary and very well estimated." With most verifications, the purpose comes into play. If one is comparing a set of subjective temperature forecasts with the baseline available to the forecaster when the forecasts were being made, the baseline is the historical record, not the mean of the time series yet to be observed, regardless of the stationarity of the time series. (Extreme nonstationarity would indicate that climatic forecasts were inappropriate as a baseline, but this is usually not known when the forecasts are being made, so that is the baseline available and used.) In any case, the usual "skill" scores computed on multidimensional tables do generally base skill on the sample.
Livezey reviews various scores, but soon notes that most are deficient when compared to the Gandin and Murphy "equitable" family of scores (basically a recommendation). Although the Heidke and Peirce skill scores are both equitable, they have the undesirable properties of depending on the forecast distribution and of not utilizing off-diagonal elements of the contingency table. The relatively new LEPSCAT score and the sampling variability of contingency tables and skill scores are discussed, but a major thrust of the chapter leads to the family of Gandin and Murphy scores and how they can be constructed; the Gerrity score (GS), one of the family, is recommended as the preferred one.
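The Gerrity score takes more machinery than fits in a review, but the Heidke and Peirce scores against which the chapter measures it are short. A sketch of the standard formulas (function and variable names are mine):

```python
import numpy as np

def heidke_peirce(table):
    """Heidke and Peirce skill scores from a K x K contingency table,
    where table[i, j] counts (forecast category i, observed category j)."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()              # joint relative frequencies
    pc = np.trace(p)             # proportion correct (diagonal cells only)
    pf = p.sum(axis=1)           # forecast marginal distribution
    po = p.sum(axis=0)           # observed marginal distribution
    e = (pf * po).sum()          # agreement expected by chance
    heidke = (pc - e) / (1.0 - e)
    peirce = (pc - e) / (1.0 - (po ** 2).sum())
    return heidke, peirce
```

Note that both use only the diagonal and the marginals, which is exactly the "not utilizing off-diagonal elements" deficiency Livezey cites.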
Chapter 5, written by Michel Déqué, deals with continuous variables; here the term "variable" is used instead of the statistical developer's term "predictand" preferred by Potts (p. 13).
Some (but minimal) treatment is given to the topics of sampling error, artificial skill, and significance testing. These are very important topics and deserve more discussion. "Prediction interval" is contrasted with "confidence interval" (p. 105), but no definitive explanation of the difference is given. Verification as a regression problem is mentioned in chapter 2, and the discussion of the Pearson product moment correlation coefficient here (p. 106) provides an excellent opportunity to demonstrate the difference.
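For the record, the textbook simple-regression case makes the distinction plain: the prediction interval carries an extra unit of residual variance that the confidence interval for the mean response does not. A sketch of the standard formulas:

```latex
% Simple linear regression, fitted value \hat{y}_0 at a new x_0,
% residual standard error s, sample size n.
%
% Confidence interval: for the MEAN response at x_0
\hat{y}_0 \;\pm\; t_{\alpha/2,\,n-2}\, s
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
%
% Prediction interval: for a single NEW observation at x_0
% (note the extra 1 under the radical for the residual scatter)
\hat{y}_0 \;\pm\; t_{\alpha/2,\,n-2}\, s
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```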
One can agree with Déqué's statement, ". . . it is desirable that the overall distribution of forecasts is similar to that of the observations, irrespective of their case-to-case relationship with the observations" (p. 113). However, the statement, "Before the forecasts are delivered to unsuspecting users, it is important to rescale (inflate) them" (p. 114), can be questioned. A definition of "inflate" is not given; in many instances the term has come to mean the rescaling defined for regression estimates by Klein et al. (1959), and it may or may not be desirable. The mean square error skill score (p. 104) for inflated, unbiased forecasts will be negative if the (Pearson product moment) correlation coefficient between the noninflated forecasts and the observations is < 0.5 (Glahn and Allen 1966). That is, in developing the regression equation, if the reduction of variance is < 0.25, the inflated forecasts will have a larger mean square error than the sample mean. An unsuspecting user, having been given inflated forecasts, might expect them to be skillful!
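The arithmetic behind the Glahn and Allen (1966) result is short. Assuming unbiased forecasts inflated to the observed variance, with correlation r against the observations:

```latex
% Unbiased forecasts f, inflated to the observed variance \sigma^2,
% with correlation r against observations o:
\mathrm{MSE} \;=\; \operatorname{var}(f) + \operatorname{var}(o)
  - 2\operatorname{cov}(f, o)
  \;=\; \sigma^2 + \sigma^2 - 2r\sigma^2 \;=\; 2\sigma^2(1 - r)
% Skill relative to the climatological mean, whose MSE is \sigma^2:
\mathrm{SS} \;=\; 1 - \frac{2\sigma^2(1 - r)}{\sigma^2} \;=\; 2r - 1
  \;<\; 0 \quad \text{whenever } r < 0.5 .
```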
Chapter 6, a relatively short chapter written by Wasyl Drosdowsky and Huqiang Zhang, deals with the difficult problem of verifying spatial fields. The mean square error introduced previously can be averaged over a field. The various anomaly correlation coefficients in the literature and the S1 score are defined, and principal component analysis is introduced as a method of reducing dimensionality. Spatial rainfall forecasts are singled out as being especially challenging to verify.
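A sketch of one common (uncentred) variant of the anomaly correlation, assuming forecast, observed, and climatological fields on the same grid (the function name is mine; centred variants subtract the mean anomaly first, which is part of why several definitions coexist):

```python
import numpy as np

def anomaly_correlation(fcst, obs, clim):
    """Uncentred anomaly correlation of a forecast field with an
    observed field: correlate departures from climatology gridpoint
    by gridpoint."""
    fa = np.ravel(fcst - clim)   # forecast anomalies
    oa = np.ravel(obs - clim)    # observed anomalies
    return (fa * oa).sum() / np.sqrt((fa ** 2).sum() * (oa ** 2).sum())
```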
Chapter 7, by multiple authors, displays their bias toward ensembles and climate forecasts early on when they mention statistical techniques but give no reference to U.S. or Canadian work in the short range (0-10 days), although both countries have been primary centers of postprocessing activity for many years. They nail it down with their statement (p. 155), "Ensemble forecasting is now one of the most commonly used methods for generating probability forecasts that can take account of uncertainty in initial and final conditions." While ensemble forecasting is in its ascendancy, and the statement is true in terms of the uncertainty of the initial conditions estimated by data assimilation, precious little operational work has been done with ensembles in a postprocessing probabilistic sense, except for the occurrence of precipitation, which, being binary, lends itself well to direct relative frequency treatment. Ensembles are not the most commonly used method of making probability forecasts; rather, statistical postprocessing of single model runs has produced a plethora of probabilistic guidance forecasts for many weather elements over many years. In addition, probability of precipitation forecasts have been produced as official forecasts by the NWS since 1966. The book's editors should have enforced a more balanced view.
With that said, there is good information in the chapter, and it generally reviews the literature on the subject. The authors state (p. 138) that ". . . the two most important attributes of probability forecasts (are) referred to as reliability and resolution," not news to the reader at this point. Later, they say resolution is the most important attribute of a forecast system (p. 142). While these two statements are not quite contradictory, editing could have provided a clearer picture of the authors' views and more clearly differentiated a set of probability forecasts from a system that could produce reliable forecasts by recalibration. The idea seems to be that forecasts (or forecast systems) do not really have to be reliable; one can just calibrate them so that they are. This is emphasized later (p. 163) and contributes to the perception that the book is oriented toward a developer of systems, not toward one who is going to verify or evaluate an actual, unchangeable set of forecasts. Both purposes of verification are important, but the book never clearly makes the distinction.
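Readers wanting a quantitative handle on these two attributes may recall Murphy's decomposition of the Brier score, which separates exactly these terms. A sketch (binning by distinct issued probability; the function name is mine):

```python
import numpy as np

def brier_decomposition(probs, outcomes):
    """Murphy's decomposition of the Brier score:
    BS = reliability - resolution + uncertainty."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # 0/1 event indicators
    n, obar = len(probs), outcomes.mean()
    rel = res = 0.0
    for p in np.unique(probs):                # one bin per issued probability
        mask = probs == p
        nk, ok = mask.sum(), outcomes[mask].mean()
        rel += nk * (p - ok) ** 2 / n         # penalty for miscalibration
        res += nk * (ok - obar) ** 2 / n      # reward for sorting outcomes
    unc = obar * (1.0 - obar)                 # property of the events alone
    return rel, res, unc
```

Recalibration drives the reliability term toward zero without touching resolution, which is precisely the developer's perspective the chapter seems to take for granted.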
Chapter 8, written by D. S. Richardson, discusses the third type of goodness identified by Murphy, value or utility. Other measures associated with the correspondence of forecasts and observations (e.g., skill and accuracy) are not directly measures of the usefulness of forecasts to a user, although they are certainly related. In keeping with a theme of the book, hit rate and false alarm rate are brought into play, and the usefulness of the Peirce skill score and the Clayton skill score in this context is discussed.
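Richardson's treatment rests on the simple cost-loss model. A sketch of the relative value score as I read that model (the function and variable names are mine; expenses are expressed in units of the loss L):

```python
def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss):
    """Relative economic value in the simple cost-loss model: a user pays
    cost C to protect against loss L (cost_loss = C/L). Value is scaled
    so that climatology scores 0 and perfect forecasts score 1."""
    s, a = base_rate, cost_loss
    e_clim = min(a, s)                         # always/never protect, whichever is cheaper
    e_perf = s * a                             # protect exactly when the event occurs
    e_fcst = (false_alarm_rate * (1 - s) * a   # needless protection: pay C
              + hit_rate * s * a               # protection that pays off: pay C
              + (1 - hit_rate) * s)            # misses: suffer the full loss L
    return (e_clim - e_fcst) / (e_clim - e_perf)
```

The hit rate and false alarm rate enter directly, which is how the chapter ties value back to the 2 x 2 verification measures.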
The ROC is again addressed and, in contrast to chapter 3, where only the modeled area under the curve Az is discussed, Richardson goes to some length to discuss the actual area A under the curve when points are plotted on the hit rate/false alarm rate axes. He also defines a ROC skill score, ROCSS = 2A - 1, which ranges from 0 for no skill to 1 for perfect forecasts. In whichever chapter the ROC is described, both the actual area (from plotted points) and the modeled area should be discussed (and in that order), not sequestered in separate chapters for the reader to attempt to coalesce. The chapter ends with a good summary.
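A sketch of the empirical calculation, assuming the plotted points come from nested decision thresholds so that hit rate and false alarm rate increase together as the threshold loosens (the function name is mine):

```python
import numpy as np

def roc_area_skill(hit_rates, false_alarm_rates):
    """Area A under an empirical ROC curve by the trapezoidal rule over
    the plotted (F, H) points, anchored at (0, 0) and (1, 1), together
    with the skill score ROCSS = 2A - 1."""
    f = np.concatenate(([0.0], np.sort(false_alarm_rates), [1.0]))
    h = np.concatenate(([0.0], np.sort(hit_rates), [1.0]))
    area = np.trapz(h, f)          # integrate H against F along the curve
    return area, 2.0 * area - 1.0
```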
Chapter 9, by the editors, is a review of some key concepts in the book, a look at forecast evaluation in disciplines other than atmospheric science (statistics, finance and economics, environmental and earth sciences, and medical and clinical studies), and directions where forecast verification could benefit from more attention.
This book will provide a good reference, and I recommend it especially for developers and evaluators of statistical forecast systems.
-BOB GLAHN
REFERENCES
AMS, 2000: Glossary of Meteorology. 2d ed. Amer. Meteor. Soc., 855 pp.
_____, 2002: Enhancing weather information with probability forecasts. Bull. Amer. Meteor. Soc., 83, 450-452.
Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. Compendium of Meteorology, T. F. Malone, Ed., Amer. Meteor. Soc., 841-855.
Glahn, H. R., and R. A. Allen, 1966: A note concerning the "inflation" of regression forecasts. J. Appl. Meteor., 5, 124-126.
Klein, W. H., B. M. Lewis, and I. Enger, 1959: Objective prediction of five-day mean temperature during winter. J. Meteor., 16, 672-682.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.
_____, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Bob Glahn is director of the Meteorological Development Laboratory, Office of Science and Technology, National Weather Service, National Oceanic and Atmospheric Administration, U.S. Department of Commerce.