Choosing an appropriate figure of merit

4.12 How much finished: 0%

WARNING: Usage of FOM = HrSe or FOM = HrSp is strongly discouraged. Consider, for example, comparing two readers or two treatments using either of these FOMs. The rating is a subjective ordered label; it need not be used consistently between readers and treatments. A reader using a strict reporting criterion, who only marks a lesion when he is very confident, will have smaller HrSe and larger HrSp than a reader who adopts a laxer criterion, even though true performance, as measured by ROC AUC or by percentage correct in a 2AFC task, is identical. This is ROC-101: ROC AUC was recommended by Metz, ca. 1978, in preference to sensitivity or specificity precisely because it is independent of the reporting threshold.
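The threshold dependence described above can be demonstrated with a small simulation (everything below is hypothetical and merely illustrative: the equal-variance binormal ratings, the sample sizes and the two thresholds):

```r
set.seed(1)
# hypothetical latent decision variables: non-diseased ~ N(0,1), diseased ~ N(1.5,1)
zk1 <- rnorm(50)       # non-diseased case ratings
zk2 <- rnorm(50, 1.5)  # diseased case ratings

# sensitivity and specificity at a reporting threshold zeta
se <- function(zeta) mean(zk2 >= zeta)
sp <- function(zeta) mean(zk1 < zeta)

c(se(1.5), sp(1.5))  # strict reader: lower sensitivity, higher specificity
c(se(0.0), sp(0.0))  # lax reader: higher sensitivity, lower specificity

# the empirical (Wilcoxon) AUC involves no threshold at all
mean(outer(zk2, zk1, ">") + 0.5 * outer(zk2, zk1, "=="))
```

Both simulated readers sample the same underlying performance; only their reporting thresholds differ, yet their sensitivity/specificity pairs differ markedly while the AUC is common to both.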

4.13 Introduction

Assuming the study has been properly conducted, e.g., as an ROC or FROC study, probably the most important step before beginning to analyze the dataset is to choose an appropriate figure of merit (FOM, i.e., performance metric).

4.14 ROC dataset

In the ROC paradigm every modality-reader-case combination yields a single rating. The appropriate FOM is the Wilcoxon statistic, which is identical to the AUC under the empirical ROC curve.
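The Wilcoxon statistic itself is simple to compute; a minimal sketch (the ratings below are made up, not from a real dataset):

```r
# Wilcoxon statistic: over all (non-diseased, diseased) case pairings, score 1
# when the diseased case is rated higher, 0.5 for a tie, 0 otherwise; the
# average of these scores equals the area under the empirical ROC curve
wilcoxon <- function(zk1, zk2) {
  mean(outer(zk2, zk1, function(x, y) (x > y) + 0.5 * (x == y)))
}

zk1 <- c(1, 2, 3, 3)  # non-diseased ratings (toy values)
zk2 <- c(2, 3, 4, 5)  # diseased ratings (toy values)
wilcoxon(zk1, zk2)    # 0.78125
```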

4.15 FROC dataset

In the FROC paradigm every modality-reader-case combination yields a random number (zero or more) of mark-rating pairs.

4.15.1 FOM = wAFROC

For most FROC datasets the appropriate FOM is the AUC under the weighted AFROC (wAFROC) plot, as illustrated next for dataset05, which has two modalities and nine readers.

fom_wAFROC <- UtilFigureOfMerit(dataset = dataset05, FOM = "wAFROC")
fom <- as.matrix(fom_wAFROC)
# generic treatment/reader labels, for display only
dimnames(fom) <- list(c("trt1", "trt2"), paste0("rdr", 1:9))
print(fom, digits = 4)
##        rdr1   rdr2   rdr3   rdr4   rdr5   rdr6   rdr7   rdr8   rdr9
## trt1 0.7245 0.8810 0.8096 0.6133 0.5773 0.8493 0.8928 0.8026 0.7640
## trt2 0.8024 0.9686 0.8460 0.7514 0.7209 0.8719 0.9370 0.8995 0.8190

4.15.2 FOM = HrSe

Recall that the concepts of sensitivity and specificity are reserved for ROC data, i.e., one rating per case. To compute these from FROC data one needs a method for inferring a single rating from the possibly multiple (zero or more) ratings occurring on each case (if the case has no marks one assigns a rating smaller than any of the ratings of explicitly marked locations, e.g., minus infinity). The recommended procedure is to assign the rating of the highest rated mark on each case, or \(-\infty\) if the case has no marks, as its inferred ROC rating. This has the effect of converting the FROC dataset to an inferred ROC dataset. The function DfFroc2Roc does exactly this:

dataset05$descriptions$type
## [1] "FROC"
ds <- DfFroc2Roc(dataset05)
ds$descriptions$type
## [1] "ROC"
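The inference rule applied by DfFroc2Roc can be sketched in a few lines (the list structure below is a toy stand-in, not the actual RJafroc data representation):

```r
# toy FROC data: one vector of mark ratings per case (empty = unmarked case)
marks <- list(c(80, 30, 15), numeric(0), 55, c(40, 60))

# highest-rating inference: a case's inferred ROC rating is that of its
# highest rated mark, or -Inf if the case has no marks
highest_rating <- function(x) if (length(x) == 0) -Inf else max(x)
sapply(marks, highest_rating)  # 80 -Inf 55 60
```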

HrSe is the abbreviation for “highest rating sensitivity”, i.e., sensitivity derived from the rating of the highest rated mark on each case. Replacing the possibly multiple ratings occurring on each case with the highest rating amounts to an assumption, a very good one in my opinion. Since the ratings are ordered labels (i.e., non-numeric values) any numerical computation, such as the average, would be invalid. It is also common sense: if a case has 3 marks rated 80, 30 and 15, why would the ROC rating be anything but 80? Finally, there is historical precedent for this assumption (Bunch et al. 1977; Swensson 1996).
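The invalidity of averaging ordered labels can be made concrete: any monotone relabeling of the rating scale preserves which case has the highest rating, but can reorder case averages (toy ratings and an arbitrary monotone transform):

```r
case_a <- c(80, 30, 15)  # three mark ratings on case a (toy values)
case_b <- c(55, 50)      # two mark ratings on case b (toy values)

relabel <- function(x) x^3  # an order-preserving relabeling of the scale

# the max-based inferred rating orders the cases the same way either way ...
max(case_a) > max(case_b)                      # TRUE
max(relabel(case_a)) > max(relabel(case_b))    # TRUE

# ... but the mean-based ordering flips under the relabeling
mean(case_a) > mean(case_b)                    # FALSE
mean(relabel(case_a)) > mean(relabel(case_b))  # TRUE
```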

Usage of FOM = HrSe is illustrated next for dataset05.

fom_HrSe <- UtilFigureOfMerit(dataset = dataset05, FOM = "HrSe")
fom <- as.matrix(fom_HrSe)
# generic treatment/reader labels, for display only
dimnames(fom) <- list(c("trt1", "trt2"), paste0("rdr", 1:9))
print(fom, digits = 4)
##        rdr1   rdr2   rdr3   rdr4   rdr5   rdr6   rdr7   rdr8   rdr9
## trt1 0.9362 0.8298 0.8936 0.7021 0.8298 0.9574 0.8723 0.8936 0.8936
## trt2 1.0000 0.9574 0.9574 0.8511 0.8511 1.0000 0.9362 0.9149 0.9787

Notice that most of the listed values are greater than the corresponding values obtained with FOM = "wAFROC" (the exceptions being readers 2 and 7, in both modalities). This should not come as a surprise, as HrSe measures sensitivity alone, ignoring the false positives and localization errors that depress the wAFROC FOM. The following loop compares the two FOMs for each modality-reader combination:

for (i in 1:2) {
    for (j in 1:9) {
        cat("i = ", i, ", j = ", j, "\n")
        if (fom_HrSe[i, j] > fom_wAFROC[i, j]) cat("TRUE \n") else cat("FALSE \n")
    }
}
## i =  1 , j =  1 
## TRUE 
## i =  1 , j =  2 
## FALSE 
## i =  1 , j =  3 
## TRUE 
## i =  1 , j =  4 
## TRUE 
## i =  1 , j =  5 
## TRUE 
## i =  1 , j =  6 
## TRUE 
## i =  1 , j =  7 
## FALSE 
## i =  1 , j =  8 
## TRUE 
## i =  1 , j =  9 
## TRUE 
## i =  2 , j =  1 
## TRUE 
## i =  2 , j =  2 
## FALSE 
## i =  2 , j =  3 
## TRUE 
## i =  2 , j =  4 
## TRUE 
## i =  2 , j =  5 
## TRUE 
## i =  2 , j =  6 
## TRUE 
## i =  2 , j =  7 
## FALSE 
## i =  2 , j =  8 
## TRUE 
## i =  2 , j =  9 
## TRUE

REFERENCES

Bunch, Philip C, John F Hamilton, Gary K Sanderson, and Arthur H Simmons. 1977. “A Free Response Approach to the Measurement and Characterization of Radiographic Observer Performance.” In Application of Optical Instrumentation in Medicine VI, 127:124–35. International Society for Optics and Photonics.
Swensson, Richard G. 1996. “Unified Measurement of Observer Performance in Detecting and Localizing Target Objects on Images.” Medical Physics 23 (10): 1709–25.