Research Article
Corresponding author: Yu-Pin Lin (yplin@ntu.edu.tw). Academic editor: Petr Keil
© 2019 Rainer Ferdinand Wunderlich, Yu-Pin Lin, Johnathen Anthony, Joy R. Petway.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Wunderlich RF, Lin Y-P, Anthony J, Petway JR (2019) Two alternative evaluation metrics to replace the true skill statistic in the assessment of species distribution models. Nature Conservation 35: 97-116. https://doi.org/10.3897/natureconservation.35.33918
Model evaluation metrics play a critical role in the selection of adequate species distribution models for conservation and for any application of species distribution modelling (SDM) in general. The responses of these metrics to modelling conditions, however, are rarely taken into account. This leads to inadequate model selection, flawed downstream analyses and uninformed decisions. To aid modellers in critically assessing modelling conditions when choosing and interpreting model evaluation metrics, we analysed the responses of the True Skill Statistic (TSS) under a variety of presence-background modelling conditions using purely theoretical scenarios. We then compared these responses with those of two evaluation metrics commonly applied in the field of meteorology which have potential for use in SDM: the Odds Ratio Skill Score (ORSS) and the Symmetric Extremal Dependence Index (SEDI). We demonstrate that (1) large cell number totals in the confusion matrix, which is strongly biased towards ‘true’ absences in presence-background SDM and (2) low prevalence both compromise model evaluation with TSS. This is because (1) TSS fails to differentiate useful from random models at extreme prevalence levels once the confusion matrix cell number total exceeds ~30,000 cells and (2) TSS converges to hit rate (sensitivity) when prevalence is lower than ~2.5%. We conclude that SEDI is optimal for most presence-background SDM initiatives. Further, ORSS may provide a better alternative if absence data are available or if strictly equal error weighting is required.
Species distribution modelling, True Skill Statistic, evaluation, presence-background
Species Distribution Modelling (SDM) relates independent environmental variables to species occurrence data and, in turn, predicts a dependent variable such as the probability or relative likelihood of occurrence (
Confusion matrix with cell designation as defined by the agreement of predictions (rows) and observations (columns).
Predicted | Observed: Present | Observed: Absent
---|---|---
Present | True (a) | False (b)
Absent | False (c) | True (d)
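For readers who prefer a concrete illustration, the following minimal R sketch tallies such a confusion matrix from binary observations and thresholded predictions; the example vectors and the 0.5 threshold are hypothetical.

```r
# Minimal sketch (hypothetical data): tallying the confusion matrix above from
# binary observations and thresholded predictions.
obs  <- c(1, 1, 0, 0, 1, 0, 0, 0)                           # 1 = observed present, 0 = observed absent
pred <- c(0.9, 0.4, 0.7, 0.1, 0.8, 0.2, 0.6, 0.05) >= 0.5   # predicted presence at a 0.5 threshold

cm <- table(
  Predicted = factor(pred, levels = c(TRUE, FALSE), labels = c("Present", "Absent")),
  Observed  = factor(obs,  levels = c(1, 0),        labels = c("Present", "Absent"))
)
cm   # cm[1, 1] = a, cm[1, 2] = b, cm[2, 1] = c, cm[2, 2] = d
```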
Observed absences in presence-absence datasets can be either true, i.e. the species does not occur, or false, i.e. the species does occur but remains undetected (
The capability to distinguish observed true and false absences may also dictate the applicability of model evaluation metrics, many of which differ in weights assigned to each of the four categories in the confusion matrix (Table
While it is possible to estimate or model the probability of detection by repeated surveys and, hence, to discern observed true and false absences (
When using presence-background occurrence data, the measured performance of modelling techniques, such as Generalised Linear Models (
The issue of confounding effects on model performance is particularly important in conservation planning and reserve area selection, since both regularly take SDM predictions into account (
In summary, the interpretability of measured model performance garnered from presence-background data is limited (
In this paper, we use purely theoretical scenarios to compare the responses of three evaluation metrics, the True Skill Statistic (TSS;
Detailed definitions of some of the more technical terms and a comparison of the mathematical properties of the analysed evaluation metrics are found in Table
Comparison of selected properties of the binary evaluation metrics considered in this article. ‘Consistent at maxSSS’ refers to the threshold maximising the sum of sensitivity and specificity (SSS) suggested by
Property | Definition | TSS | ORSS | SEDI |
---|---|---|---|---|
Asymptotically equitable | Random predictions yield a score of zero | + | + | + |
Prevalence independent | Same result when prevalence changes if both H and F remain unchanged | + | + | + |
Complement symmetric | Same result when switching a and c with d and b | + | + | + |
Consistent at maxSSS | Maximising SSS maximises the evaluation metric | + | – | – |
Fixed range | Minimum/maximum possible values do not depend on prevalence | + | + | + |
Hard to hedge | Monotonic increase with H and monotonic decrease with F | + | + | + |
Non-degenerate | Meaningful results when prevalence approaches zero | – | – | + |
Regular | Isopleths of the evaluation metric pass through the origin | – | + | + |
Transpose symmetric | Same result when swapping b and c | – | + | – |
Below are the equations of four simple evaluation measures (variables a, b, c and d according to Table
H = a/(a + c) (1)
F = b/(b + d) (2)
bias = (a + b)/(a + c) (3)
prevalence = (a + c)/(a + b + c + d) (4)
TSS measures the difference between H and F and was first developed as Peirce’s skill score in meteorology (
TSS = H – F (5)
ORSS measures skill compared to a random prediction, is a synonym of Yule’s Q (1900) and was introduced to meteorology by
ORSS = (ad – bc)/(ad + bc) (6)
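A minimal R sketch of these measures and metrics follows. The SEDI equation is not reproduced in the text above; the expression used below is the standard formulation introduced to meteorology by Ferro and Stephenson (2011), written in terms of H and F.

```r
# Minimal sketch: the evaluation measures and metrics above as R functions,
# taking the confusion matrix cells a, b, c and d as arguments.
hit_rate   <- function(a, b, c, d) a / (a + c)                                    # Eq. 1 (H)
false_rate <- function(a, b, c, d) b / (b + d)                                    # Eq. 2 (F)
tss        <- function(a, b, c, d) hit_rate(a, b, c, d) - false_rate(a, b, c, d)  # Eq. 5
orss       <- function(a, b, c, d) (a * d - b * c) / (a * d + b * c)              # Eq. 6

# SEDI, written in terms of H and F (standard formulation; see lead-in above).
sedi <- function(a, b, c, d) {
  h <- hit_rate(a, b, c, d)
  f <- false_rate(a, b, c, d)
  (log(f) - log(h) - log(1 - f) + log(1 - h)) /
    (log(f) + log(h) + log(1 - f) + log(1 - h))
}

# A random prediction (all four cells equal) scores zero on all three metrics.
tss(50, 50, 50, 50); orss(50, 50, 50, 50); sedi(50, 50, 50, 50)
```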
Using two types of extreme prediction settings (“Optimistic” and “Pessimistic”) and two types of typical species prevalence settings (“Differential Bias” and “Changing Bias”), we investigated the response of TSS, ORSS and SEDI to increasing cell number total in the confusion matrix, varying commission error rates, omission error rates and species prevalence. These settings were each divided into two theoretical scenarios: “Incorrectly Optimistic” (IO) and “Correctly Optimistic” (CO), “Correctly Pessimistic” (CP) and “Incorrectly Pessimistic” (IP), “Commission Bias” (CB) and “Omission Bias” (OB) and “Low Commission Rate” (LC) and “High Commission Rate” (HC). For each of the eight scenarios (Table
Although not mutually exclusive, each scenario is designed to reflect patterns that could have arisen from either biological signals or artefacts, thereby revealing how susceptible model evaluation metrics are to conflating the two. Below, we briefly describe all scenarios and Fig.
Potential spatial distributions of confusion matrix categories corresponding to values of the True Skill Statistic (TSS), the Odds Ratio Skill Score (ORSS) and the Symmetric Extremal Dependence Index (SEDI) for selected scenario cases.
Description of the theoretical scenarios, where IO, CO, CP, IP, CB, OB, LC and HC are abbreviations for scenarios “Incorrectly Optimistic”, “Correctly Optimistic”, “Correctly Pessimistic”, “Incorrectly Pessimistic”, “Commission Bias”, “Omission Bias”, “Low Commission Rate” and “High Commission Rate”. True positives, false positives, false negatives and true negatives are represented by a, b, c, and d, respectively. Total lists the sum of all four cells in the confusion matrix. The formulations are provided in pseudo-R-code, i.e. square brackets (“[” and “]”) indicate vectors and colons (“:”) indicate a series. For example, “[x:y]” represents a vector of integers ranging from x to y. “...” are used to indicate repeating the same number, and n is the case number.
Scenario | a | b | c | d | Total
---|---|---|---|---|---
IO | increasing: 0.005*([1:n]^1.25*1000+248) | increasing: [1:n]^1.25*1000+248-[a] | constant: [1...1] | constant: [1...1] | min: 1250 max: 42545
CO | increasing: 0.995*([1:n]^1.25*1000+248) | increasing: [1:n]^1.25*1000+248-[a] | constant: [1...1] | constant: [1...1] | min: 1250 max: 42545
CP | constant: [1...1] | constant: [1...1] | increasing: [1:n]^1.25*1000+248-[d] | increasing: 0.995*([1:n]^1.25*1000+248) | min: 1250 max: 42545
IP | constant: [1...1] | constant: [1...1] | increasing: [1:n]^1.25*1000+248-[d] | increasing: 0.005*([1:n]^1.25*1000+248) | min: 1250 max: 42545
CB | constant: [200...200] | constant: [30...30] | constant: [20...20] | increasing: [1:n]^1.25*1000 | min: 1250 max: 42545
OB | constant: [200...200] | constant: [20...20] | constant: [30...30] | increasing: [1:n]^1.25*1000 | min: 1250 max: 42545
LC | ‘logistic’: [175:189,190...190] | increasing: [cn:c1] | decreasing: 200-[a] | increasing: [1:n]^1.25*1000 | min: 1210 max: 42520
HC | ‘logistic’: [175:189,190...190] | increasing: 3*[cn:c1] | decreasing: 200-[a] | increasing: [1:n]^1.25*1000 | min: 1230 max: 42570
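To make the table concrete, a short R sketch of scenario IO is given below; the case number n = 20 is our assumption, chosen because it reproduces the listed totals of roughly 1,250 to 42,545.

```r
# Sketch of scenario IO ("Incorrectly Optimistic") following the pseudo-R
# formulations above; n = 20 is an assumption consistent with the listed totals.
n    <- 20
size <- (1:n)^1.25 * 1000 + 248      # per-case number of observed occurrences

io <- data.frame(
  a = 0.005 * size,                  # true presences (true prevalence fixed near 0.5%)
  b = size - 0.005 * size,           # false presences (commission errors)
  c = 1,                             # false absences, constant
  d = 1                              # true absences, constant
)
io$total <- rowSums(io)
range(io$total)                      # approximately 1250 ... 42545
```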
Scenarios IO, CO, CP and IP were designed to demonstrate how evaluation metrics at essentially constant extreme levels of prevalence react to an increasing cell number total in the confusion matrix. The biological component of these scenarios is analogous to specialist or generalist species that have a constant prevalence of 0.5% or 99.5% of the study area. The artefactual component is related to the implications of study area increases for the number of background points and total number of cells and their effect on the calculation of evaluation metrics. Scenario IO was characterised by large numbers of commission errors, as it evaluated an extreme incorrectly optimistic modelling prediction (over-prediction) under increasing background size when true species prevalence was equal to 0.5%, reflecting extreme specialisation or rarity. Scenario CO was identical to scenario IO in its extreme prediction. However, as true species prevalence was equal to 99.5% (reflecting extremely low specialisation), it no longer resembled an over-prediction and was consequently dominated by true presences. Scenario CP evaluated an extreme correctly pessimistic prediction under increasing background size when true species prevalence was equal to 0.5% (reflecting a high degree of ecological specialisation), with species presence predicted for only a small proportion of the study area. This scenario was characterised by large numbers of true absences. Scenario IP was identical to scenario CP in its extreme prediction but dominated by false absences since true species prevalence was now equal to 99.5%, turning it into a gross under-prediction.
Scenarios CB and OB were designed to reveal the effect of bias on evaluation metrics under decreasing prevalence (~17% to ~0.5%) as the study area increased. In these scenarios, evaluation metrics should consistently penalise model predictions according to the degree of their bias, across the whole range of prevalence. Scenario CB was more optimistic (more commission errors and more predicted presences) than scenario OB (more omissions and fewer predicted presences). Therefore, the two scenarios together can be seen as a test of transpose symmetry.
Scenarios LC and HC examined the response of evaluation metrics to changes in bias while model fit (i.e. the number of true positives) and the total number of cells increased as prevalence decreased (~17% to ~0.5%). More specifically, the number of observations was held constant in both scenarios, while the numbers of true positives and omissions increased and decreased, respectively. However, at the same time, commission errors became more frequent. In other words, the bias of the model changed together with prevalence and the size of the study area. Scenarios LC and HC differed only in their rate of commission errors which was three times higher in scenario HC than in scenario LC. The biological component could represent increasing specialisation of a given species as the study extent increases; whereas the artefactual component could represent resultant increases in model fit as increasing specialisation makes for easier characterisation (
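A corresponding R sketch of scenarios LC and HC (again assuming n = 20 cases) makes the changing bias explicit; the variable names are ours.

```r
# Sketch of scenarios LC and HC: observations are fixed at 200, true presences
# follow the 'logistic' vector and commission errors mirror the omissions.
n    <- 20
a    <- c(175:189, rep(190, n - 15))   # true presences: rise to 190, then constant
c_om <- 200 - a                        # false absences (omissions)
d    <- (1:n)^1.25 * 1000              # true absences grow with the study area

b_lc <- rev(c_om)                      # LC: commission errors mirror the omissions
b_hc <- 3 * rev(c_om)                  # HC: commission rate three times higher than LC

range(a + b_lc + c_om + d)             # approximately 1210 ... 42520
range(a + b_hc + c_om + d)             # approximately 1230 ... 42570
```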
Our results are summarised in Table
Evaluation scores (rounded to four digits) for all evaluation measures and metrics considered across all scenarios. IO, CO, CP, IP, CB, OB, LC and HC are abbreviations for scenarios “Incorrectly Optimistic”, “Correctly Optimistic”, “Correctly Pessimistic”, “Incorrectly Pessimistic”, “Commission Bias”, “Omission Bias”, “Low Commission Rate” and “High Commission Rate”. Cell Total lists the sum of all four cells in the confusion matrix. H, F, TSS, ORSS and SEDI list evaluation metric values for hit rate, false positive rate, True Skill Statistic, Odds Ratio Skill Score and Symmetric Extremal Dependence Index, respectively. Cases #6 and #7 closely resemble typical presence-background modelling conditions in MAXENT.
Cell Total | Scenario | H | F | TSS | ORSS | SEDI
---|---|---|---|---|---|---
ca. 9,000 – 10,000 (Case #6) | IO | 0.9796 | 0.9999 | -0.0203 | -0.9900 | -0.4050
 | CO | 0.9999 | 0.9796 | 0.0203 | 0.9900 | 0.4050
 | CP | 0.0204 | 0.0001 | 0.0203 | 0.9900 | 0.4050
 | IP | 0.0001 | 0.0204 | -0.0203 | -0.9900 | -0.4050
 | CB | 0.9091 | 0.0032 | 0.9059 | 0.9994 | 0.9761
 | OB | 0.8696 | 0.0021 | 0.8674 | 0.9994 | 0.9659
 | LC | 0.9000 | 0.0012 | 0.8988 | 0.9997 | 0.9767
 | HC | 0.9000 | 0.0035 | 0.8965 | 0.9992 | 0.9730
ca. 11,000 – 12,000 (Case #7) | IO | 0.9831 | 0.9999 | -0.0169 | -0.9900 | -0.3937
 | CO | 0.9999 | 0.9831 | 0.0169 | 0.9900 | 0.3937
 | CP | 0.0169 | 0.0001 | 0.0169 | 0.9900 | 0.3937
 | IP | 0.0001 | 0.0169 | -0.0169 | -0.9900 | -0.3937
 | CB | 0.9091 | 0.0026 | 0.9065 | 0.9995 | 0.9768
 | OB | 0.8696 | 0.0018 | 0.8678 | 0.9995 | 0.9668
 | LC | 0.9050 | 0.0011 | 0.9039 | 0.9998 | 0.9783
 | HC | 0.9050 | 0.0032 | 0.9018 | 0.9993 | 0.9749
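As a rough consistency check, the values for scenario IO at case #6 can be recomputed directly from the scenario formulations, assuming as above that case #6 corresponds to i = 6; minor discrepancies with the table may arise from rounding of cell counts.

```r
# Recomputing scenario IO at case #6 (cell number total ~9,640) from its formulation.
size <- 6^1.25 * 1000 + 248
a <- 0.005 * size; b <- size - a; c_ <- 1; d <- 1

h <- a / (a + c_)                                  # ~0.98   (hit rate, H)
f <- b / (b + d)                                   # ~0.9999 (false positive rate, F)
h - f                                              # TSS  ~ -0.02
(a * d - b * c_) / (a * d + b * c_)                # ORSS ~ -0.99
(log(f) - log(h) - log(1 - f) + log(1 - h)) /
  (log(f) + log(h) + log(1 - f) + log(1 - h))      # SEDI ~ -0.40
```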
TSS shows a strong response to increased study area size and, hence, to confusion matrix cell number totals, and rapidly converges to zero, rendering useful and random models indistinguishable once the confusion matrix cell number total exceeds approximately 30,000 cells. SEDI shows only a moderate response and converges to zero much later. Of note, H fails completely in this respect, since incorrectly optimistic predictions (IO) and correctly pessimistic predictions (CP) converge to one and zero, respectively, and yield scores very similar to those of their correct (CO) and incorrect (IP) counterparts. Finally, SEDI has stronger discriminatory power than TSS at intermediate study area sizes, yet only ORSS is expected to correctly assess model performance as study area size converges to infinity (Fig.
Plots of the values of hit rate (H), the True Skill Statistic (TSS), the Odds Ratio Skill Score (ORSS) and the Symmetric Extremal Dependence Index (SEDI) for all eight scenarios. Panels a, b, c, d display scenarios “Incorrectly Optimistic” and “Correctly Optimistic”, “Correctly Pessimistic” and “Incorrectly Pessimistic”, “Commission Bias” and “Omission Bias” and “Low Commission Rate” and “High Commission Rate”. In panels a and b, the x-axis denotes the log of the total number of cells, i.e. the size of the study area, whereas in panels c and d, the x-axis denotes prevalence (%).
TSS quickly converges with H and always favours over-predictions over under-predictions. Moreover, the degree to which over-predictions are favoured increases as prevalence decreases. Although SEDI also favours over-predictions, it does so to a much smaller degree and is not significantly affected by prevalence. Just as in scenarios CO and CP, ORSS rapidly converges to one (Fig.
Proportional difference between scenarios “Low Commission Rate” (LC) and “High Commission Rate” (HC) for (in dark green) the Symmetric Extremal Dependence Index (SEDI) vs. the True Skill Statistic (TSS), (in orange) the Odds Ratio Skill Score (ORSS) vs. TSS and (in blue) SEDI vs. ORSS. The black, horizontal, dashed line represents equal differentiation. As there are slight differences in prevalence between scenarios LC and HC, the x-axis shows the mean prevalence for given cases across both scenarios.
Using eight theoretical scenarios, we have shown that TSS, ORSS and SEDI, as well as their underlying evaluation measures (H and F, see F in Table
Our analysis confirmed a very problematic property of TSS. That is, a very large number found in any of the four cells of the confusion matrix (Table
By grossly over- or under-predicting the distribution of a hypothetical target species, we observed the response of evaluation metrics to extreme biases in less realistic scenarios. These extreme scenarios, however, have also shown that discernment of strongly and weakly performing models greatly differs amongst evaluation metrics and modelling conditions. While these scenario results support the use of ORSS for large study extents, because of its rapid convergence to one, even for imperfect predictions (
These scenarios have been designed to reflect common modelling conditions in order to observe the response of evaluation measures to differential (CB and OB) and changing (LC and HC) biases, under decreasing prevalence as the size of the study area and, hence, the number of background points increased. Analysis of these scenarios revealed very distinct responses to the differing modelling conditions. Results for scenarios CB and OB and LC and HC suggest the use of SEDI since: 1) TSS encourages over-predictions due to its strongly biased treatment of errors which increases as prevalence decreases; 2) TSS quickly loses the discriminatory power to differentiate between models, differing only in their commission rate as it always converges to H; and 3) ORSS converges to one so rapidly (
Our analysis reaffirms the importance of selecting model evaluation metrics corresponding with modelling questions and conditions (
Our results suggested a limited capacity of TSS to provide consistent performance comparisons across varying modelling conditions. This is worrying because TSS may yield misleading estimates of model fidelity, which can lead to the selection of inadequate models. Although it may be tempting to assume that researchers would recognise anomalous conditions where TSS scores are misleading (such as those presented here), this is not necessarily the case – as demonstrated by the broad and seemingly uncritical application of TSS in presence-background SDM over the last decade. Complications, owing to the relative inability of TSS to provide information on commission errors as prevalence approaches zero, are more nuanced. Ultimately, such complications are only problematic in as much as commission errors matter, which depends on bias > 1, the question, the available data and the biology of the species. In fact, it is widely acknowledged that even absence data from professional surveys have greater degrees of both sampling and ecological uncertainty than presence data (
While the above reasoning is persuasive, simply ignoring commission errors in presence-background data by limiting evaluation to H is not a viable option under all but a small subset of questions, modelling conditions and biological assumptions. More specifically, doing so would be incongruent with the biological circumstances, sampling realities and the intents of most modelling initiatives. Further, evaluation scores would become more vulnerable to artificial inflation. From a biological perspective, model evaluation metrics that ignore commission errors are equivalent to assuming that all background points are locations where the species is present but unobserved. That is, assuming that observed presence locations may represent the subset of relatively high-use or occupied conditions within local settings (
Furthermore, even when the above biological assumptions and survey prerequisites are valid, explicitly choosing to ignore commission errors further assumes that unobserved locations are irrelevant, an assumption that seldom holds since these locations may correspond to low population density areas (
This study demonstrated that more consistent commission error weighting (as with SEDI) also circumvents a number of potentially artefactual signals as prevalence approaches zero. We also discussed the relative inability of TSS to compare performance across modelling conditions. For these reasons, whereas maximising TSS may be instrumental when presence-absence thresholds are required (
The use of similarity measures as an alternative to TSS has recently been suggested by
In our study, we focused on the importance of model evaluation in the context of ecology and conservation. The problems discussed are particularly relevant in systematic conservation planning (
Our results indicate that ORSS is a suitable evaluation metric for high-confidence presence-absence data, high prevalence situations or if strictly equal error weighting is required. SEDI and, to a lesser degree, TSS are suitable evaluation metrics for presence-background SDM initiatives, since their error weighting better reflects low-confidence pseudo-absence data. However, since SEDI provides more consistent performance scores and weighting of commission errors over a wide range of study extents (and background point totals) and prevalence levels, it is better suited for presence-background SDM, which is applied over a wide range of modelling conditions (i.e. to common or rare species and across single protected areas or whole continents). Finally, we strongly recommend abstaining from the use of TSS whenever prevalence is lower than approximately 2.5% or when a large number of background points drives the total number of cells in the confusion matrix above roughly 30,000 cells, since TSS will then not distinguish between low and high commission error rates or between useful and random models.
We are grateful to our reviewers Eduardo Arle and Paul Holloway and Petr Keil for their helpful comments and suggestions. We also thank P.A. Château for his comments on an earlier version of this manuscript.