Chapter 38 Standalone CAD vs. Radiologists


38.2 Abstract

Computer aided detection (CAD) research for screening mammography has so far focused on measuring the performance of radiologists with and without CAD: typically a group of radiologists interprets a set of images with and without CAD assistance. Standalone performance of CAD algorithms is rarely measured. The stated reason is that in the clinic CAD is never used alone; rather it is always used in conjunction with a radiologist. For this reason interest has focused on the incremental improvement afforded by CAD.

Another reason for the lack of focus on standalone CAD performance is the lack of a clear methodology for measuring it. This chapter extends the methodology used in a recent study of standalone performance. That method is termed one-treatment random-reader fixed-case (1T-RRFC) analysis, since it accounts for reader variability but not for case-sampling variability. The extension includes the effect of case-sampling variability. Since in the proposed method CAD is treated as an additional reader within a single treatment, the method is termed one-treatment random-reader random-case (1T-RRRC) analysis. The new method is based on existing methodology allowing comparison of the average performance of readers in a single treatment to a specified value. The key modification is to regard the difference in performance between radiologists and CAD as the figure of merit, to which the existing work is then directly applicable. The 1T-RRRC method was compared to 1T-RRFC. It was also compared to an unorthodox usage of conventional ROC (receiver operating characteristic) analysis software, termed 2T-RRRC analysis, in which the CAD ratings are replicated as many times as there are radiologists, in effect simulating a second treatment, i.e., CAD is regarded as the second treatment. The proposed 1T-RRRC analysis has 3 random-effect parameters, as compared to 6 in 2T-RRRC and one in 1T-RRFC. As expected, since an additional source of variability is included, both RRRC analyses (1T and 2T) yielded larger p-values and wider confidence intervals than 1T-RRFC. Both 1T-RRRC and 2T-RRRC analyses yielded exactly the same F-statistic, degrees of freedom and p-value. However, the 2T-RRRC model parameter estimates were unrealistic; for example, it yielded zero between-reader variance, whereas 1T-RRRC yielded the expected non-zero value. All three methods are implemented in an open-source R package, RJafroc.

38.3 Keywords

Technology assessment, computer-aided detection (CAD), screening mammography, standalone performance, single-treatment multi-reader ROC analysis.

38.4 Introduction

In the US the majority of screening mammograms are analyzed by computer aided detection (CAD) algorithms (Rao et al. 2010). Almost all major imaging device manufacturers provide CAD as part of their imaging workstation display software. In the United States CAD is approved for use as a second reader (FDA 2018), i.e., the radiologist first interprets the images (typically 4 views, 2 views of each breast) without CAD, then the CAD information (i.e., cued suspicious regions, possibly shown with associated probabilities of malignancy) is shown, and the radiologist has the opportunity to revise the initial interpretation. In response to this second-reader usage, the evolution of CAD algorithms has been guided mainly by comparing the observer performance of radiologists with and without CAD.

Clinical CAD systems sometimes report only the locations of suspicious regions, i.e., they may not provide ratings. However, a (continuous variable) malignancy index for every CAD-found suspicious region is available to the algorithm designer (Edwards et al. 2002). Standalone performance, i.e., the performance of designer-level CAD by itself, regarded as an algorithmic reader and compared against radiologists, is rarely measured. In breast cancer screening I am aware of only one study (Hupse et al. 2013) where standalone performance was measured. [Standalone performance has been measured in CAD for computed tomography colonography, chest radiography and three dimensional ultrasound (Hein et al. 2010; Summers et al. 2008; Taylor et al. 2006; De Boo et al. 2011; Tan et al. 2012)].

One possible reason for not measuring standalone performance of CAD is the lack of an accepted assessment methodology for such measurements. The purpose of this work is to remove that impediment. It describes a method for comparing standalone performance of designer-level CAD to radiologists interpreting the same cases and compares the method to those described in two recent publications (Hupse et al. 2013; Kooi et al. 2016).

38.5 Methods

Summarized next are two recent studies of CAD vs. radiologists in mammography, followed by comments on the methodologies used in the two studies. The second study used multi-treatment multi-reader receiver operating characteristic (ROC) software in an unorthodox way. A statistical model and analysis method are then described that avoid this unorthodox, and perhaps unjustified, use of ROC software and have fewer model parameters.

38.5.1 Studies assessing performance of CAD vs. radiologists

The first study (Hupse et al. 2013) measured performance in finding and localizing lesions in mammograms, i.e., visual search was involved, while the second study (Kooi et al. 2016) measured lesion classification performance between non-diseased and diseased regions of interest (ROIs) previously found on mammograms by an independent algorithmic reader, i.e., visual search was not involved.

38.5.1.1 Study - 1

The first study (Hupse et al. 2013) compared standalone performance of a CAD device to that of 9 radiologists interpreting the same cases (120 non-diseased and 80 with a single malignant mass per case). It used the LROC (localization ROC) paradigm (Starr et al. 1975; Metz, Starr, and Lusted 1976; Richard G Swensson 1996), in which the observer gives an overall rating for presence of disease (an integer 0 to 100 scale was used) and indicates the location of the most suspicious region. On a non-diseased case the rating is classified as a false positive (FP) but on a diseased case it is classified as a correct localization (CL) if the location is sufficiently close to the lesion, and otherwise it is classified as an incorrect localization. For a given reporting threshold, the number of correct localizations divided by the number of diseased cases estimates the probability of correct localization (PCL) at that threshold. On non-diseased cases the number of false positives (FPs) divided by the number of non-diseased cases estimates the probability of a false positive, or false positive fraction (FPF), at that threshold. The plot of PCL (ordinate) vs. FPF defines the LROC curve. Study - 1 used as figures of merit (FOMs) the interpolated PCL at two values of FPF, specifically FPF = 0.05 and FPF = 0.2, denoted \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\), respectively. The t-test between the radiologist \(\text{PCL}_{\text{FPF}}\) values and that of CAD was used to compute the two-sided p-value for rejecting the NH of equal performance. Study - 1 reported p-value = 0.17 for \(\text{PCL}_{0.05}\) and p-value \(\leq\) 0.001, with CAD being inferior, for \(\text{PCL}_{0.2}\).
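The interpolation of PCL at a specified FPF can be made concrete with a short example. The following is a minimal R sketch, not the RJafroc implementation, assuming a simple data layout (a vector of ratings on non-diseased cases, and a vector of ratings plus correct-localization flags on diseased cases); the function name pclAtFpf and the made-up ratings are purely illustrative.

pclAtFpf <- function(fpRating, dzRating, correctLocal, fpfTarget) {
  # operating points at every observed threshold, highest threshold first
  thresholds <- sort(unique(c(fpRating, dzRating)), decreasing = TRUE)
  fpf <- sapply(thresholds, function(t) mean(fpRating >= t))
  pcl <- sapply(thresholds, function(t) mean(correctLocal & (dzRating >= t)))
  # prepend the origin and linearly interpolate PCL at the target FPF;
  # ties = max keeps the highest PCL at tied FPF values
  approx(c(0, fpf), c(0, pcl), xout = fpfTarget, ties = max)$y
}

# usage with made-up ratings: 120 non-diseased and 80 diseased cases
set.seed(1)
fp  <- runif(120, 0, 40)    # hypothetical ratings on non-diseased cases
dz  <- runif(80, 20, 100)   # hypothetical ratings on diseased cases
loc <- runif(80) < 0.8      # hypothetical correct-localization indicators
pclAtFpf(fp, dz, loc, fpfTarget = c(0.05, 0.2))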

38.5.1.2 Study - 2

The second study (Kooi et al. 2016) used 199 diseased and 199 non-diseased ROIs extracted by an independent CAD algorithm. These were interpreted using the ROC paradigm (i.e., rating only, no localization required) by a different CAD algorithmic observer from that used to determine the ROIs, and by four expert radiologists. The figure of merit was the empirical area (AUC) under the respective ROC curves (one per radiologist and one for CAD). The p-value for the difference in AUCs between the average radiologist and CAD was determined using an unorthodox application of the Dorfman-Berbaum-Metz (D. D. Dorfman, Berbaum, and Metz 1992) multiple-treatment multiple-reader multiple-case (DBM-MRMC) software with recent modifications (Stephen L Hillis, Berbaum, and Metz 2008). The unorthodox application was that in the input data file radiologists and CAD were entered as two treatments. In conventional (or orthodox) DBM-MRMC each reader provides two ratings per case and the data file would consist of paired ratings of a set of cases interpreted by 4 readers. To accommodate the paired data structure assumed by the software, the authors of Study - 2 replicated the CAD ratings four times in the input data file, as explained in the caption to Table 38.1. By this artifice they converted a single-treatment 5-reader (4 radiologists plus CAD) data file to a two-treatment 4-reader data file, in which the four readers in treatment 1 were the radiologists, and the four “readers” in treatment 2 were CAD replicated ratings. Note that for each case the four readers in the second treatment had identical ratings. In Table 38.1 the replicated CAD observers are labeled C1, C2, C3 and C4.

TABLE 38.1: The differences between the data structures in conventional DBM-MRMC analysis and the unorthodox application of the software used in Study - 2. There are four radiologists, labeled R1, R2, R3 and R4 interpreting 398 cases labeled 1, 2, …, 398, in two treatments, labeled 1 and 2. Sample ratings are shown only for the first and last radiologist and the first and last case. In the first four columns, labeled “Standard DBM-MRMC”, each radiologist interprets each case twice. In the next four columns, labeled “Unorthodox DBM-MRMC”, the radiologists interpret each case once. CAD ratings are replicated four times to effectively create the second “treatment”. The quotations emphasize that there is, in fact, only one treatment. The replicated CAD observers are labeled C1, C2, C3 and C4.
Standard DBM-MRMC (first four columns); Unorthodox DBM-MRMC (last four columns):

| Reader | Treatment | Case | Rating | Reader | Treatment | Case | Rating |
|--------|-----------|------|--------|--------|-----------|------|--------|
| R1 | 1 | 1 | 75 | R1 | 1 | 1 | 75 |
| R1 | 1 | 398 | 0 | R1 | 1 | 398 | 0 |
| R4 | 1 | 1 | 50 | R4 | 1 | 1 | 50 |
| R4 | 1 | 398 | 25 | R4 | 1 | 398 | 25 |
| R1 | 2 | 1 | 45 | C1 | 2 | 1 | 55 |
| R1 | 2 | 398 | 25 | C1 | 2 | 398 | 5 |
| R4 | 2 | 1 | 95 | C4 | 2 | 1 | 55 |
| R4 | 2 | 398 | 20 | C4 | 2 | 398 | 5 |
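
To make the artifice concrete, the following R sketch (with hypothetical ratings; all object names are illustrative) assembles the “unorthodox” data structure of Table 38.1: the radiologists form treatment 1 and the CAD ratings, replicated once per radiologist, form “treatment” 2.

J <- 4      # number of radiologists
K <- 398    # number of cases
set.seed(1)
radRatings <- matrix(round(runif(J * K, 0, 100)), nrow = J)  # hypothetical radiologist ratings
cadRatings <- round(runif(K, 0, 100))                        # hypothetical CAD ratings

dfUnorthodox <- rbind(
  data.frame(reader = rep(paste0("R", 1:J), each = K),  # treatment 1: the radiologists
             treatment = 1,
             case = rep(1:K, times = J),
             rating = as.vector(t(radRatings))),
  data.frame(reader = rep(paste0("C", 1:J), each = K),  # "treatment" 2: CAD replicated J times
             treatment = 2,
             case = rep(1:K, times = J),
             rating = rep(cadRatings, times = J))       # C1-C4 have identical ratings on every case
)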

Study – 2 reported a non-significant difference between CAD and the radiologists (p = 0.253).

38.5.1.3 Comments

For the purpose of this work, which focuses on the respective analysis methods, the difference in observer performance paradigms between the two studies, namely a search paradigm in Study - 1 vs. an ROI classification paradigm in Study – 2, is inconsequential. The paired t-test used in Study - 1 treats the case-sample as fixed: the analysis accounts for reader variability but not for case-sampling variability. While not explicitly stated, the reason for the unorthodox analysis in Study – 2 was the desire to include case-sampling variability (see footnote 1).

In what follows, the analysis in Study – 1 is referred to as random-reader fixed-case (1T-RRFC) while that in Study – 2 is referred to as dual-treatment random-reader random-case (2T-RRRC).

38.5.2 The 1T-RRFC analysis model

The sampling model for the FOM is:

\[\begin{equation} \left. \begin{aligned} \theta_j=\mu+R_j \\ \left (j = 1,2,...J \right ) \end{aligned} \right \} \tag{38.1} \end{equation}\]

Here \(\mu\) is a constant, \(\theta_j\) is the FOM for reader \(j\), and \(R_j\) is the random contribution for reader \(j\) distributed as:

\[\begin{equation} R_j \sim N\left ( 0,\sigma_R^2 \right ) \tag{38.2} \end{equation}\]

Because of the assumed normal distribution of \(R_j\), in order to compare the readers to a fixed value, that of CAD denoted \(\theta_0\), one uses a t-test (in effect a one-sample test of the radiologist FOMs against this fixed value), as done in Study – 1. As evident from the model, no allowance is made for case-sampling variability, which is the reason for calling it the 1T-RRFC method.

Performance of CAD on a fixed dataset does exhibit within-reader variability: the same algorithm applied repeatedly to a fixed dataset does not always produce the same mark-rating data. However, this source of CAD FOM variability is much smaller than the inter-reader FOM variability of radiologists interpreting the same dataset. In fact the within-reader variability of radiologists is smaller than their inter-reader variability, and the within-reader variability of CAD is smaller still. For this reason one is justified in regarding \(\theta_0\) as a fixed quantity for a given dataset. Varying the dataset will result in different values of \(\theta_0\), i.e., its case-sampling variability needs to be accounted for, as is done in the following analyses.
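
To illustrate, the 1T-RRFC computation can be carried out with a few lines of R. The sketch below uses the radiologist \(\text{PCL}_{0.2}\) values and the CAD value listed in the Appendix (Example 1); a one-sample t-test on the reader-minus-CAD differences reproduces the Tstat = 6.71, df = 8 and p-value reported there.

thetaRad <- c(0.69453125, 0.65, 0.80625, 0.725, 0.65982143,
              0.76845238, 0.7375, 0.675, 0.675)   # radiologist FOMs (PCL_0.2, Appendix)
theta0 <- 0.59166667                              # CAD FOM, treated as a fixed value
t.test(thetaRad - theta0)                         # NH: mean difference is zero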

38.5.3 The 2T-RRRC analysis model

This could be termed the conventional, or orthodox, method: there are two treatments and the study design is fully crossed, i.e., each reader interprets each case in each treatment, so the data structure is as in the left half of Table 38.1.

The following approach, termed 2T-RRRC, uses the Obuchowski and Rockette (OR) figure of merit sampling model (Obuchowski and Rockette 1995) instead of the pseudovalue-based model used in the original DBM paper (D. D. Dorfman, Berbaum, and Metz 1992). For the empirical FOM, Hillis has shown the two to be equivalent (Stephen L Hillis et al. 2005).

The OR model is:

\[\begin{equation} \theta_{ij\{c\}}=\mu+\tau_i+R_j+\left ( \tau \text{R} \right )_{ij}+\epsilon_{ij\{c\}} \tag{38.3} \end{equation}\]

Assuming two treatments, \(i\) (\(i = 1, 2\)) is the treatment index, \(j\) (\(j = 1, ..., J\)) is the reader index, and \(k\) (\(k = 1, ..., K\)) is the case index, and \(\theta_{ij\{c\}}\) is a figure of merit for reader \(j\) in treatment \(i\) and case-sample \(\{c\}\). A case-sample is a set or ensemble of cases, diseased and non-diseased, and different integer values of \(c\) correspond to different case-samples.

The first two terms on the right hand side of Eqn. (38.3) are fixed effects (average performance and treatment effect, respectively). The next two terms are random effect variables that, by assumption, are sampled as follows:

\[\begin{equation} \left. \begin{aligned} R_j \sim N\left ( 0,\sigma_R^2 \right )\\ \left ( \tau R \right )_{ij} \sim N\left ( 0,\sigma_{\tau R}^2 \right )\\ \end{aligned} \right \} \tag{38.4} \end{equation}\]

The terms \(R_j\) represents the random treatment-independent contribution of reader \(j\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_R^2\), \(\left ( \tau R \right )_{ij}\) represents the random treatment-dependent contribution of reader \(j\) in treatment \(i\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_{\tau R}^2\). The sampling of the last (error) term is described by:

\[\begin{equation} \epsilon_{ij\{c\}}\sim N_{I \times J}\left ( \vec{0} , \Sigma \right ) \tag{38.5} \end{equation}\]

Here \(N_{I \times J}\) is the \(I \times J\)-variate normal distribution and \(\vec{0}\), an \(I \times J\)-length zero vector, represents the mean of the distribution. The \(\left ( I \times J \right ) \times \left ( I \times J \right )\) covariance matrix \(\Sigma\) is defined by 4 parameters, \(\text{Var}\), \(\text{Cov}_1\), \(\text{Cov}_2\), \(\text{Cov}_3\), defined as follows:

\[\begin{equation} \text{Cov} \left (\epsilon_{ij\{c\}},\epsilon_{i'j'\{c\}} \right ) = \left\{\begin{matrix} \text{Var} \; (i=i',j=j') \\ \text{Cov1} \; (i\ne i',j=j')\\ \text{Cov2} \; (i = i',j \ne j')\\ \text{Cov3} \; (i\ne i',j \ne j') \end{matrix}\right\} \tag{38.6} \end{equation}\]

Software (the University of Iowa MRMC software and the RJafroc package) yields estimates of all terms appearing on the right hand side of Eqn. (38.6). Excluding fixed effects, the model represented by Eqn. (38.3) contains six parameters:

\[\begin{equation} \sigma_R^2, \sigma_{\tau R}^2, \text{Var}, \text{Cov}_1, \text{Cov}_2, \text{Cov}_3 \tag{38.7} \end{equation}\]

The meanings of the last four terms are described in (Stephen L Hillis 2007; Obuchowski and Rockette 1995; Stephen L Hillis et al. 2005; Chakraborty 2017). Briefly, \(\text{Var}\) is the variance of a reader’s FOMs, in a given treatment, over interpretations of different case-samples, averaged over readers and treatments; \(\text{Cov}_1/\text{Var}\) is the correlation of a reader’s FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment same-reader pairings; \(\text{Cov}_2/\text{Var}\) is the correlation of different readers’ FOMs, over interpretations of different case-samples in the same treatment, averaged over all same-treatment different-reader pairings; and finally, \(\text{Cov}_3/\text{Var}\) is the correlation of different readers’ FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment different-reader pairings. One expects the following inequalities to hold:

\[\begin{equation} \text{Var} \geq \text{Cov}_1 \geq \text{Cov}_2 \geq \text{Cov}_3 \tag{38.8} \end{equation}\]

In practice, since one is usually limited to one case-sample, i.e., \(c = 1\), resampling techniques (Efron and Tibshirani 1994) – e.g., the jackknife – are used to estimate these terms.
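
The structure of \(\Sigma\) in Eqn. (38.6) is easily visualized in code. The following R sketch builds the \(\left ( I \times J \right ) \times \left ( I \times J \right )\) matrix for \(I = 2\) treatments and \(J = 4\) readers; the numerical parameter values are assumed purely for illustration.

I <- 2; J <- 4
Var <- 0.0030; Cov1 <- 0.0020; Cov2 <- 0.0015; Cov3 <- 0.0010  # illustrative values only
trt <- rep(1:I, each = J)   # treatment index of each (treatment, reader) combination
rdr <- rep(1:J, times = I)  # reader index of each combination
Sigma <- matrix(NA, I * J, I * J)
for (a in seq_len(I * J)) {
  for (b in seq_len(I * J)) {
    # apply the four cases of Eqn. (38.6)
    Sigma[a, b] <- if (trt[a] == trt[b] && rdr[a] == rdr[b]) Var else
      if (trt[a] != trt[b] && rdr[a] == rdr[b]) Cov1 else
        if (trt[a] == trt[b]) Cov2 else Cov3
  }
}
round(Sigma, 4)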

38.5.4 The 1T-RRRC analysis model

This is the contribution of this work. The key difference from the approach in Study - 2 is to regard standalone CAD as an additional reader, not as a different treatment. What is needed, therefore, is a single-treatment method for analyzing radiologists and CAD, where the latter is regarded as an additional reader. Accordingly the proposed method is termed single-treatment RRRC (1T-RRRC) analysis.

The starting point is the Obuchowski and Rockette model (Obuchowski and Rockette 1995) for a single treatment, which, for the radiologists (i.e., excluding CAD) interpreting in a single treatment, reduces to the following:

\[\begin{equation} \theta_{j\{c\}}=\mu+R_j+\epsilon_{j\{c\}} \tag{38.9} \end{equation}\]

\(\theta_{j\{c\}}\) is the figure of merit for radiologist \(j\) (\(j = 1, 2, ..., J\)) interpreting case-sample \(\{c\}\); \(R_j\) is the random effect of radiologist \(j\) and \(\epsilon_{j\{c\}}\) is the error term. For single-treatment multiple-reader interpretations the error term is distributed as:

\[\begin{equation} \epsilon_{j\{c\}}\sim N_{J}\left ( \vec{0} , \Sigma \right ) \tag{38.10} \end{equation}\]

The \(J \times J\) covariance matrix \(\Sigma\) is defined by two parameters, \(\text{Var}\) and \(\text{Cov}_2\), as follows:

\[\begin{equation} \Sigma_{jj'} = \text{Cov}\left ( \epsilon_{j\{c\}}, \epsilon_{j'\{c\}} \right ) = \left\{\begin{matrix} \text{Var} & j = j'\\ \text{Cov}_2 & j \neq j' \end{matrix}\right. \tag{38.11} \end{equation}\]

The terms \(\text{Var}\) and \(\text{Cov}_2\) are estimated using resampling methods. Using the jackknife, denote by \(\psi_{j(k)}\) the FOM for reader \(j\) with case \(k\) removed (the index in parentheses denotes the deleted case; since one is dealing with a single case-sample, the case-sample index \(c\) is now superfluous). The covariance matrix is estimated using (the dot symbol represents an average over the replaced index):

\[\begin{equation} \Sigma_{jj'}|_\text{jack} = \frac{K-1}{K} \sum_{k=1}^{K} \left ( \psi_{j(k)} - \psi_{j(\bullet)} \right ) \left ( \psi_{j'(k)} - \psi_{j'(\bullet)} \right ) \tag{38.12} \end{equation}\]

The final estimates of \(\text{Var}\) and \(\text{Cov}_2\) are averaged (indicated in the following equation by the angular brackets) over all pairings of radiologists satisfying the relevant equalities/inequalities shown just below the closing angular bracket:

\[\begin{equation} \left. \begin{aligned} \text{Var} = \left \langle \Sigma_{jj'}|_{\text{jack}} \right \rangle_{j=j'}\\ \text{Cov}_2 = \left \langle \Sigma_{jj'}|_{\text{jack}} \right \rangle_{j \neq j'} \end{aligned} \right \} \tag{38.13} \end{equation}\]
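
As an illustration, the jackknife computation of Eqn. (38.12) and Eqn. (38.13) can be coded in a few lines. The sketch below is not the RJafroc implementation; it assumes the empirical AUC (Wilcoxon) figure of merit and a simple single-treatment layout, with reader-by-case ratings matrices z1 (non-diseased cases) and z2 (diseased cases).

wilcoxon <- function(nd, d) mean(outer(nd, d, function(x, y) (x < y) + 0.5 * (x == y)))

jackVarCov2 <- function(z1, z2) {
  J <- nrow(z1); K1 <- ncol(z1); K2 <- ncol(z2); K <- K1 + K2
  psi <- matrix(NA, J, K)                  # FOM of reader j with case k deleted
  for (j in 1:J) {
    for (k in 1:K1) psi[j, k] <- wilcoxon(z1[j, -k], z2[j, ])
    for (k in 1:K2) psi[j, K1 + k] <- wilcoxon(z1[j, ], z2[j, -k])
  }
  Sigma <- ((K - 1) / K) * tcrossprod(psi - rowMeans(psi))  # Eqn. (38.12)
  list(Var  = mean(diag(Sigma)),                            # Eqn. (38.13), j = j'
       Cov2 = mean(Sigma[row(Sigma) != col(Sigma)]))        # Eqn. (38.13), j != j'
}

# usage with made-up ratings: 5 readers, 60 non-diseased and 40 diseased cases
set.seed(1)
z1 <- matrix(rnorm(5 * 60), nrow = 5)
z2 <- matrix(rnorm(5 * 40, mean = 1), nrow = 5)
jackVarCov2(z1, z2)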

Hillis’ formulae (Stephen L Hillis et al. 2005; Stephen L Hillis 2007) permit one to test the NH: \(\mu = \mu_0\), where \(\mu_0\) is a pre-specified constant. One could set \(\mu_0\) equal to the performance of CAD, but that would not be accounting for the fact that the performance of CAD is itself a random variable, whose case-sampling variability needs to be accounted for.

Instead, the following model was used for the figure of merit of the radiologists and CAD (\(j = 0\) is used to denote the CAD algorithmic reader):

\[\begin{equation} \theta_{j\{c\}} = \theta_{0\{c\}} + \Delta \theta + R_j + \epsilon_{j\{c\}}\\ j=1,2,...J \tag{38.14} \end{equation}\]

\(\theta_{0\{c\}}\) is the CAD figure of merit for case-sample \(\{c\}\) and \(\Delta \theta\) is the average figure of merit increment of the radiologists over CAD. To reduce this model to one to which existing formulae are directly applicable, one subtracts the CAD figure of merit from each radiologist’s figure of merit (for the same case-sample), and defines this as the difference figure of merit \(\psi_{j\{c\}}\) , i.e.,

\[\begin{equation} \psi_{j\{c\}} = \theta_{j\{c\}} - \theta_{0\{c\}} \tag{38.15} \end{equation}\]

Then Eqn. (38.14) reduces to:

\[\begin{equation} \psi_{j\{c\}} = \Delta \theta + R_j + \epsilon_{j\{c\}}\\ j=1,2,...J \tag{38.16} \end{equation}\]

Eqn. (38.16) is identical in form to Eqn. (38.9), with the difference that the figure of merit on the left hand side of Eqn. (38.16) is a difference FOM, namely that between a radiologist and CAD. Eqn. (38.16) describes a model for \(J\) radiologists interpreting a common case set, each of whose performances is measured relative to that of CAD. Under the NH the expected difference is zero: \(\text{NH:} \Delta \theta = 0\). The method (Stephen L Hillis et al. 2005; Stephen L Hillis 2007) for single-treatment multiple-reader analysis is now directly applicable to the model described by Eqn. (38.16).

Apart from fixed effects, the model in Eqn. (38.16) contains three parameters:

\[\begin{equation} \sigma_R^2, \text{Var}, \text{Cov}_2 \tag{38.17} \end{equation}\]

Setting \(\text{Var} = 0, \text{Cov}_2 = 0\) yields the 1T-RRFC model, which contains only one random parameter, namely \(\sigma_R^2\). [One expects identical estimates of \(\sigma_R^2\) using 1T-RRFC, 2T-RRRC or 1T-RRRC analyses.]
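
The resulting significance test can be summarized in a short sketch. The formula below is my reading of Hillis’ single-treatment test applied to the difference FOMs of Eqn. (38.16) (an assumption, not a transcription of the RJafroc code); with the \(\text{PCL}_{0.2}\) values from the Appendix and the \(\text{Cov}_2\) estimate reported there it reproduces the 1T-RRRC results of Example 3: t = 2.039, ddf = 937.2, p = 0.042.

oneTreatmentRRRC <- function(psi, Cov2) {
  J <- length(psi)                        # number of radiologists
  msR <- var(psi)                         # mean square (sample variance) of the difference FOMs
  seDelta <- sqrt((msR + J * max(Cov2, 0)) / J)
  tStat <- mean(psi) / seDelta            # NH: expected difference is zero
  ddf <- (msR + J * max(Cov2, 0))^2 / (msR^2 / (J - 1))
  c(tStat = tStat, ddf = ddf, pval = 2 * pt(abs(tStat), ddf, lower.tail = FALSE))
}

# difference FOMs psi_j = theta_j - theta_0 for PCL_0.2 (values from the Appendix)
psi <- c(0.69453125, 0.65, 0.80625, 0.725, 0.65982143,
         0.76845238, 0.7375, 0.675, 0.675) - 0.59166667
oneTreatmentRRRC(psi, Cov2 = 0.0030657054)  # Cov2 as reported in Example 3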

38.6 Software implementation

The three analyses, namely random-reader fixed-case (1T-RRFC), dual-treatment random-reader random-case (2T-RRRC) and single-treatment random-reader random-case (1T-RRRC), are implemented in RJafroc, an R-package (D. Chakraborty, Philips, and Zhai 2020).

The following code shows usage of the software to generate the results corresponding to the three analyses. Note that datasetCadLroc is the LROC dataset and dataset09 is the corresponding ROC dataset.



# PCL at FPF = 0.05: the three analyses applied to the LROC dataset
RRFC_1T_PCL_0_05 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.05, method = "1T-RRFC")
RRRC_2T_PCL_0_05 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.05, method = "2T-RRRC")
RRRC_1T_PCL_0_05 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.05, method = "1T-RRRC")

# PCL at FPF = 0.2
RRFC_1T_PCL_0_2 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.2, method = "1T-RRFC")
RRRC_2T_PCL_0_2 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.2, method = "2T-RRRC")
RRRC_1T_PCL_0_2 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 0.2, method = "1T-RRRC")

# PCL at FPF = 1
RRFC_1T_PCL_1 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 1, method = "1T-RRFC")
RRRC_2T_PCL_1 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 1, method = "2T-RRRC")
RRRC_1T_PCL_1 <- StSignificanceTestingCadVsRad(datasetCadLroc,
  FOM = "PCL", FPFValue = 1, method = "1T-RRRC")

# Wilcoxon (empirical ROC AUC) figure of merit applied to the ROC dataset
RRFC_1T_AUC <- StSignificanceTestingCadVsRad(dataset09,
  FOM = "Wilcoxon", method = "1T-RRFC")
RRRC_2T_AUC <- StSignificanceTestingCadVsRad(dataset09,
  FOM = "Wilcoxon", method = "2T-RRRC")
RRRC_1T_AUC <- StSignificanceTestingCadVsRad(dataset09,
  FOM = "Wilcoxon", method = "1T-RRRC")

The results are organized as follows:

  • RRFC_1T_PCL_0_05 contains the results of 1T-RRFC analysis for figure of merit = \(PCL_{0.05}\).

  • RRRC_2T_PCL_0_05 contains the results of 2T-RRRC analysis for figure of merit = \(PCL_{0.05}\).

  • RRRC_1T_PCL_0_05 contains the results of 1T-RRRC analysis for figure of merit = \(PCL_{0.05}\).

  • RRFC_1T_PCL_0_2 contains the results of 1T-RRFC analysis for figure of merit = \(PCL_{0.2}\).

  • RRRC_2T_PCL_0_2 contains the results of 2T-RRRC analysis for figure of merit = \(PCL_{0.2}\).

  • RRRC_1T_PCL_0_2 contains the results of 1T-RRRC analysis for figure of merit = \(PCL_{0.2}\).

  • RRFC_1T_AUC contains the results of 1T-RRFC analysis for the Wilcoxon figure of merit.

  • RRRC_2T_AUC contains the results of 2T-RRRC analysis for the Wilcoxon figure of merit.

  • RRRC_1T_AUC contains the results of 1T-RRRC analysis for the Wilcoxon figure of merit.

The structures of these objects are illustrated with examples in the Appendix.

38.7 Results

The three methods, in historical order 1T-RRFC, 2T-RRRC and 1T-RRRC, were applied to an LROC dataset similar to that used in Study – 1 (I thank Prof. Karssemeijer for making this dataset available).

Shown next, Table 38.2, are the significance testing results corresponding to the three analyses.

TABLE 38.2: Significance testing results of the analyses for an LROC dataset. Three sets of results, namely 1T-RRFC, 2T-RRRC and 1T-RRRC, are shown for each figure of merit (FOM). Because they account for an additional source of variability, the rows labeled 2T-RRRC and 1T-RRRC yield larger p-values and wider confidence intervals than the corresponding rows labeled 1T-RRFC. [\(\theta_0\) = FOM of CAD; \(\theta_{\bullet}\) = average FOM of radiologists; \(\psi_{\bullet}\) = average FOM of radiologists minus CAD; CI = 95 percent confidence interval of the quantity indicated by the subscript; F = F-statistic; ddf = denominator degrees of freedom; p = p-value for rejecting the null hypothesis \(\psi_{\bullet} = 0\).]
| FOM | Analysis | \(\theta_0\) | \(CI_{\theta_0}\) | \(\theta_{\bullet}\) | \(CI_{\theta_{\bullet}}\) | \(\psi_{\bullet}\) | \(CI_{\psi_{\bullet}}\) | F | ddf | p |
|---|---|---|---|---|---|---|---|---|---|---|
| PCL_0_05 | 1T-RRFC | 4.5e-01 | 0 | 4.93e-01 | (4.18e-01, 5.68e-01) | 4.33e-02 | (-3.16e-02, 1.18e-01) | 1.77e+00 | 8e+00 | 2.2e-01 |
| PCL_0_05 | 2T-RRRC | 4.5e-01 | (2.58e-01, 6.42e-01) | 4.93e-01 | (3.76e-01, 6.11e-01) | 4.33e-02 | (-1.57e-01, 2.44e-01) | 1.79e-01 | 7.84e+02 | 6.7e-01 |
| PCL_0_05 | 1T-RRRC | 4.5e-01 | NA | 4.93e-01 | (2.93e-01, 6.94e-01) | 4.33e-02 | (-1.57e-01, 2.44e-01) | 1.79e-01 | 7.84e+02 | 6.7e-01 |
| PCL_0_2 | 1T-RRFC | 5.92e-01 | 0 | 7.1e-01 | (6.69e-01, 7.51e-01) | 1.19e-01 | (7.78e-02, 1.59e-01) | 4.5e+01 | 8e+00 | 1.51e-04 |
| PCL_0_2 | 2T-RRRC | 5.92e-01 | (4.78e-01, 7.05e-01) | 7.1e-01 | (6.33e-01, 7.87e-01) | 1.19e-01 | (4.45e-03, 2.33e-01) | 4.16e+00 | 9.37e+02 | 4.2e-02 |
| PCL_0_2 | 1T-RRRC | 5.92e-01 | NA | 7.1e-01 | (5.96e-01, 8.24e-01) | 1.19e-01 | (4.45e-03, 2.33e-01) | 4.16e+00 | 9.37e+02 | 4.2e-02 |
| PCL_1 | 1T-RRFC | 6.75e-01 | 0 | 7.83e-01 | (7.4e-01, 8.27e-01) | 1.08e-01 | (6.48e-02, 1.52e-01) | 3.3e+01 | 8e+00 | 4.33e-04 |
| PCL_1 | 2T-RRRC | 6.75e-01 | (5.71e-01, 7.79e-01) | 7.83e-01 | (7.12e-01, 8.54e-01) | 1.08e-01 | (4.5e-03, 2.12e-01) | 4.2e+00 | 4.93e+02 | 4.1e-02 |
| PCL_1 | 1T-RRRC | 6.75e-01 | NA | 7.83e-01 | (6.8e-01, 8.87e-01) | 1.08e-01 | (4.5e-03, 2.12e-01) | 4.2e+00 | 4.93e+02 | 4.1e-02 |
| Wilcoxon | 1T-RRFC | 8.17e-01 | 0 | 8.49e-01 | (8.26e-01, 8.71e-01) | 3.17e-02 | (8.96e-03, 5.45e-02) | 1.03e+01 | 8e+00 | 1.24e-02 |
| Wilcoxon | 2T-RRRC | 8.17e-01 | (7.52e-01, 8.82e-01) | 8.49e-01 | (8.07e-01, 8.9e-01) | 3.17e-02 | (-3.1e-02, 9.45e-02) | 9.86e-01 | 8.78e+02 | 3.2e-01 |
| Wilcoxon | 1T-RRRC | 8.17e-01 | NA | 8.49e-01 | (7.86e-01, 9.11e-01) | 3.17e-02 | (-3.1e-02, 9.45e-02) | 9.86e-01 | 8.78e+02 | 3.2e-01 |

Results are shown for the following FOMs: \(\text{PCL}_{0.05}\), \(\text{PCL}_{0.2}\), \(\text{PCL}_{1}\), and the empirical area (AUC) under the ROC curve estimated by the Wilcoxon statistic. The first two FOMs are identical to those used in Study – 1. Columns 3 and 4 list the CAD FOM \(\theta_0\) and its 95% confidence interval \(CI_{\theta_0}\), columns 5 and 6 list the average radiologist FOM \(\theta_{\bullet}\) (the dot symbol represents an average over the radiologist index) and its 95% confidence interval \(CI_{\theta_{\bullet}}\), columns 7 and 8 list the average difference FOM \(\psi_{\bullet}\), i.e., radiologist minus CAD, and its 95% confidence interval \(CI_{\psi_{\bullet}}\), and the last three columns list the F-statistic, the denominator degrees of freedom (ddf) and the p-value for rejecting the null hypothesis. The numerator degree of freedom of the F-statistic, not listed, is unity.

The last three columns of Table 38.2 show that 2T-RRRC and 1T-RRRC analyses yield identical F-statistics, ddf and p-values. So the intuition of the authors of Study – 2, namely that the unorthodox application of DBM-MRMC software would account for both reader and case-sampling variability, turns out to be correct. If interest is solely in these statistics one is justified in using the unorthodox method.

Commented on next are other aspects of the results evident in Table 38.2.

  1. Where a direct comparison is possible, namely 1T-RRFC analysis using \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\) as FOMs, the p-values in Table 38.2 are similar to those reported in Study – 1.
  2. All FOMs (i.e., \(\theta_0\), \(\theta_{\bullet}\) and \(\psi_{\bullet}\)) in Table 38.2 are independent of the method of analysis. However, the corresponding confidence intervals (i.e., \(CI_{\theta_0}\), \(CI_{\theta_{\bullet}}\) and \(CI_{\psi_{\bullet}}\)) depend on the analyses.
  3. Since 1T-RRFC analysis ignores case-sampling variability, the CAD figure of merit is a constant with a zero-width confidence interval; for compactness the CI is listed as 0 rather than as two identical values in parentheses. The confidence interval listed for 2T-RRRC analysis is centered on the corresponding CAD estimate, as are all confidence intervals in Table 38.2 on their respective estimates.
  4. The LROC FOMs increase as the value of FPF (the subscript) increases. This should be obvious, as PCL increases as FPF increases, a general feature of any partial curve based figure of merit.
  5. The area (AUC) under the ROC is larger than the largest PCL value, i.e., \(AUC \geq \text{PCL}_1\). This too should be obvious from the general features of the LROC (Richard G Swensson 1996).
  6. The p-value for either RRRC analyses (2T or 1T) is larger than the corresponding 1T-RRFC value. Accounting for case-sampling variability increases the p-value, leading to less possibility of finding a significant difference.
  7. Partial curve-based FOMs, such as \(\text{PCL}_{FPF}\), lead, depending on the choice of \(FPF\), to different conclusions. The p-values generally decrease as FPF increases. Measuring performance on the steep part of the LROC curve (i.e., small FPF) needs to account for greater reader variability and risks lower statistical power.
  8. Ignoring localization information (i.e., using the AUC FOM) led to a non-significant difference between CAD and the radiologists (\(p\) = 0.3210), while the corresponding localization FOM, \(\text{PCL}_1\), yielded a significant difference (\(p\) = 0.0409). Accounting for localization leads to a less “noisy” measurement. This has been demonstrated for the LROC paradigm (Richard G Swensson 1996) and I have demonstrated this for the FROC paradigm (Chakraborty 2008).
  9. For 1T-RRRC analysis, \(CI_{\theta_0}\) is listed as NA, for not applicable, since \(\theta_0\) is not a parameter of the fitted model, see Eqn. (38.16).

Shown next, Table 38.3, are the model-parameters corresponding to the three analyses.

TABLE 38.3: Parameter estimates for the analyses; NA = not applicable.
| FOM | Analysis | \(\sigma_R^2\) | \(\sigma_{\tau R}^2\) | Cov1 | Cov2 | Cov3 | Var |
|---|---|---|---|---|---|---|---|
| PCL_0_05 | 1T-RRFC | 9.5e-03 | NA | NA | NA | NA | NA |
| PCL_0_05 | 2T-RRRC | 1.84e-18 | -5.71e-03 | 1.31e-03 | 6.01e-03 | 1.31e-03 | 1.65e-02 |
| PCL_0_05 | 1T-RRRC | 9.5e-03 | NA | NA | 9.4e-03 | NA | 3.03e-02 |
| PCL_0_2 | 1T-RRFC | 2.81e-03 | NA | NA | NA | NA | NA |
| PCL_0_2 | 2T-RRRC | -7.59e-19 | 2.65e-04 | 7.61e-04 | 2.29e-03 | 7.61e-04 | 3.43e-03 |
| PCL_0_2 | 1T-RRRC | 2.81e-03 | NA | NA | 3.07e-03 | NA | 5.34e-03 |
| PCL_1 | 1T-RRFC | 3.2e-03 | NA | NA | NA | NA | NA |
| PCL_1 | 2T-RRRC | 1.63e-18 | 1e-03 | 6.43e-04 | 1.86e-03 | 6.43e-04 | 2.46e-03 |
| PCL_1 | 1T-RRRC | 3.2e-03 | NA | NA | 2.44e-03 | NA | 3.64e-03 |
| Wilcoxon | 1T-RRFC | 8.78e-04 | NA | NA | NA | NA | NA |
| Wilcoxon | 2T-RRRC | 2.98e-19 | 2.01e-04 | 2.62e-04 | 7.24e-04 | 2.62e-04 | 9.62e-04 |
| Wilcoxon | 1T-RRRC | 8.78e-04 | NA | NA | 9.24e-04 | NA | 1.4e-03 |

The following characteristics are evident from Table 38.3.

  1. For 2T-RRRC analyses \(\sigma_R^2 = 0\). Actually the analysis yielded very small values, of the order of \(10^{-18}\) to \(10^{-19}\) (Table 38.3), which are effectively zero, i.e., below double-precision round-off error. \(\sigma_R^2 = 0\) is clearly an incorrect result, as the radiologists do not have identical performance. In contrast, 1T-RRRC analyses yielded more realistic values, identical to those obtained by 1T-RRFC analyses, and consistent with expectation – see the comment following Eqn. (38.17).
  2. Because the four “readers” in the second treatment have identical ratings on every case, same-reader and different-reader pairings across treatments are indistinguishable, and it follows from the definitions of the covariances (Obuchowski and Rockette 1995) that \(Cov_1 = Cov_3\), as evident in the table.
  3. When they can be compared (i.e., \(\sigma_R^2\), \(\text{Cov}_2\) and \(\text{Var}\)), all variance and covariance estimates were smaller for the 2T method than for the 1T method.
  4. For the 2T method the expected inequalities, Eqn. (38.8), are not obeyed (specifically, \(Cov_1 \geq Cov_2 \geq Cov_3\) is not obeyed).

For an analysis method to be considered statistically valid it needs to be tested with simulations to determine if it has the proper null hypothesis behavior. The design of a ratings simulator to statistically match a given dataset is addressed in Chapter 23 of reference (Chakraborty 2017). Using this simulator, the 1T-RRRC method had the expected null hypothesis behavior (Table 23.5, ibid).

38.8 Discussion

The argument often made for not measuring standalone performance is that since CAD will be used only as a second reader, it is only necessary to measure the performance of radiologists without and with CAD. It has been stated (Nishikawa and Pesce 2011):

High stand-alone performance is neither a necessary nor a sufficient condition for CAD to be truly useful clinically.

Assessing CAD utility this way, i.e., by measuring performance with and without CAD, may have inadvertently set a low bar for CAD to be considered useful. As examples, CAD is not penalized for missing cancers as long as the radiologist finds them, and CAD is not penalized for excessive false positives (FPs) as long as the radiologist ignores them. Moreover, since both such measurements include the variability of radiologists, additional noise is introduced that presumably makes it harder to determine if the CAD system is optimal.

Described is an extension of the analysis used in Study – 1 that accounts for case-sampling variability. It extends the single-treatment analysis of (Stephen L Hillis et al. 2005) to a situation where one of the “readers” is a special reader, and the desire is to compare the performance of this special reader to the average performance of the remaining readers. The method, along with the two other methods, was used to analyze an LROC dataset using different figures of merit.

1T-RRRC analyses yielded identical overall results (specifically the F-statistic, degrees of freedom and p-value) to those yielded by the unorthodox application of DBM-MRMC software, termed 2T-RRRC analysis, in which CAD is regarded as a second treatment. However, the values of the model parameters of the dual-treatment analysis lacked clear physical meanings. In particular, the result \(\sigma_R^2 = 0\) is clearly an artifact. One can only speculate as to what happens when software is used in a manner for which it was not designed: perhaps finding that all readers in the second treatment have identical FOMs led the software to yield \(\sigma_R^2 = 0\). The single-treatment model has half as many parameters as the dual-treatment model, the parameters have clear physical meanings, and their estimates are realistic.

The paradigm used to collect the observer performance data - e.g., receiver operating characteristic (ROC) (Metz 1986), free-response ROC (FROC) (Chakraborty et al. 1986), localization ROC (LROC) (Starr et al. 1975) or region of interest (ROI) (Obuchowski, Lieber, and Powell 2000) - is irrelevant; all that is needed is a scalar performance measure for the paradigm actually used. In addition to PCL and AUC, RJafroc currently implements the partial area under the LROC curve, from FPF = 0 to a specified value, as well as other FROC-paradigm-based FOMs.

While there is consensus that CAD works for microcalcifications, for masses its performance is controversial [27, 28]. Two large clinical studies [29, 30] (222,135 and 684,956 women, respectively) showed that CAD actually had a detrimental effect on patient outcome. A more recent large clinical study has confirmed the negative view of CAD [31] and there has been a call for ending Medicare reimbursement for CAD interpretations [32].

In my opinion standalone performance is the most direct measure of CAD performance. The lack of a clear-cut methodology for assessing standalone CAD performance may have limited past CAD research. The current work hopefully removes that impediment. Going forward, assessment of standalone performance of CAD vs. expert radiologists is strongly encouraged.

38.9 Appendix

The structures of the R objects generated by the software are illustrated with three examples.

38.9.1 Example 1

The first example shows the structure of RRFC_1T_PCL_0_2.

print(fom_individual_rad)
#>         rdr1 rdr2    rdr3  rdr4       rdr5       rdr6   rdr7  rdr8  rdr9
#> 1 0.69453125 0.65 0.80625 0.725 0.65982143 0.76845238 0.7375 0.675 0.675
print(stats)
#>       fomCAD  avgRadFom avgDiffFom        varR     Tstat df          pval
#> 1 0.59166667 0.71017278 0.11850612 0.002808612 6.7083568  8 0.00015139664
print(ConfidenceIntervals)
#>       CIAvgRadFom CIAvgDiffFom
#> Lower  0.66943619  0.077769525
#> Upper  0.75090938  0.159242710

The results are displayed as three data frames.

The first data frame :

  • fom_individual_rad shows the figures of merit for the nine radiologists in the study.

The next data frame summarizes the statistics.

  • fomCAD is the figure of merit for CAD.
  • avgRadFom is the average figure of merit of the nine radiologists in the study.
  • avgDiffFom is the average difference figure of merit, RAD - CAD.
  • varR is the variance of the figures of merit for the nine radiologists in the study.
  • Tstat is the t-statistic for testing the NH that the average difference FOM avgDiffFom is zero, whose square is the F-statistic.
  • df is the degrees of freedom of the t-statistic.
  • pval is the p-value for rejecting the NH. In the example shown above the value is highly significant.

The last data frame summarizes the 95 percent confidence intervals.

  • CIAvgRadFom is the 95 percent confidence interval, listed as pairs Lower, Upper, for avgRadFom.
  • CIAvgDiffFom is the 95 percent confidence interval for avgDiffFom.
  • If the pair CIAvgDiffFom excludes zero, the difference is statistically significant.
  • In the example the interval excludes zero showing that the FOM difference is significant.

38.9.2 Example 2

The next example shows the structure of RRRC_2T_PCL_0_2.


print(fom_individual_rad)
#>         rdr1 rdr2    rdr3  rdr4       rdr5       rdr6   rdr7  rdr8  rdr9
#> 1 0.69453125 0.65 0.80625 0.725 0.65982143 0.76845238 0.7375 0.675 0.675
print(stats1)
#>       fomCAD  avgRadFom avgDiffFom
#> 1 0.59166667 0.71017278 0.11850612
print(stats2)
#>             varR         varTR          cov1         cov2          cov3
#> 1 -7.5894152e-19 0.00026488983 0.00076136841 0.0022942211 0.00076136841
#>            Var     FStat        df        pval
#> 1 0.0034336373 4.1576797 937.24371 0.041726262

In addition to the quantities defined previously, the output contains the variance and covariance parameters of the Obuchowski-Rockette model, summarized in Eqn. (38.3) – Eqn. (38.6).

  • varTR is \(\sigma_{\tau R}^2\).
  • cov1 is \(\text{Cov}_1\).
  • cov2 is \(\text{Cov}_2\).
  • cov3 is \(\text{Cov}_3\).
  • Var is \(\text{Var}\).
  • FStat is the F-statistic for testing the NH.
  • ndf is the numerator degrees of freedom, equal to unity.
  • df is denominator degrees of freedom of the F-statistic for testing the NH.
  • Tstat is the t-statistic for testing the NH that the average difference FOM avgDiffFom is zero.
  • pval is the p-value for rejecting the NH. In the example shown above the value is significant.

Notice that including the variability of cases results in a higher p-value for 2T-RRRC as compared to 1T-RRFC.

Shown next are the confidence interval statistics x$ciAvgRdrEachTrt (where x denotes the RRRC_2T_PCL_0_2 object) for the two treatments (“trt1” = CAD, “trt2” = RAD):


print(x$ciAvgRdrEachTrt)
#>        Estimate      StdErr        DF    CILower    CIUpper         Cov2
#> trt1 0.59166667 0.058028349       Inf 0.47793319 0.70540014 0.0033672893
#> trt2 0.71017278 0.039156365 193.10832 0.63294372 0.78740185 0.0012211529
  • Estimate contains the FOM estimate for the treatment.
  • StdErr contains the standard error of the FOM estimate.
  • DF contains the degrees of freedom used to construct the confidence interval.
  • CILower is the lower limit of the 95 percent confidence interval for the FOM.
  • CIUpper is the upper limit of the 95 percent confidence interval for the FOM.

Shown next are the confidence interval statistics x$ciDiffFom between the two treatments (“trt1-trt2” = CAD - RAD):


print(x$ciDiffFom)
#>             Estimate      StdErr        DF         t       PrGTt     CILower
#> trt2-trt1 0.11850612 0.058118615 937.24371 2.0390389 0.041726262 0.004448434
#>             CIUpper
#> trt2-trt1 0.2325638

The difference figure of merit statistics are contained in a dataframe x$ciDiffFom with elements:

  • Estimate contains the difference FOM estimate.
  • StdErr contains the standard error of the difference FOM estimate.
  • DF contains the degrees of freedom of the t-statistic.
  • t contains the value of the t-statistic.
  • PrGTt contains the probability of exceeding the magnitude of the t-statistic.
  • CILower is the lower confidence interval for the difference FOM.
  • CIUpper is the upper confidence interval for the difference FOM.

The rows of x$ciAvgRdrEachTrt, shown above, list the figure of merit statistics for the two treatments, where treatment 1 is CAD and treatment 2 is RAD.

  • trt1: statistics for CAD.
  • trt2: statistics for RAD.
  • Cov2: \(\text{Cov}_2\) calculated over the individual treatment.

38.9.3 Example 3

The last example shows the structure of RRRC_1T_PCL_0_2.

RRRC_1T_PCL_0_2
#> $fomCAD
#> [1] 0.59166667
#> 
#> $fomRAD
#> [1] 0.69453125 0.65000000 0.80625000 0.72500000 0.65982143 0.76845238 0.73750000
#> [8] 0.67500000 0.67500000
#> 
#> $avgRadFom
#> [1] 0.71017278
#> 
#> $CIAvgRad
#> [1] 0.59611510 0.82423047
#> 
#> $avgDiffFom
#> [1] 0.11850612
#> 
#> $CIAvgDiffFom
#> [1] 0.004448434 0.232563801
#> 
#> $varR
#> [1] 0.002808612
#> 
#> $varError
#> [1] 0.0053445377
#> 
#> $cov2
#> [1] 0.0030657054
#> 
#> $Tstat
#>      rdr2 
#> 2.0390389 
#> 
#> $df
#>      rdr2 
#> 937.24371 
#> 
#> $pval
#>        rdr2 
#> 0.041726262

The differences from RRFC_1T_PCL_0_2 are listed next:

  • varR is \(\sigma_R^2\) of the single treatment model for comparing CAD to RAD, Eqn. (38.17).
  • cov2 is \(\text{Cov}_2\) of the single treatment model for comparing CAD to RAD.
  • varError is \(\text{Var}\) of the single treatment model for comparing CAD to RAD.

Notice that the RRRC_1T_PCL_0_2 p value, i.e., 0.04172626, is identical to that of RRRC_2T_PCL_0_2, i.e., 0.04172626.

38.10 References


Chakraborty, Dev P. 2017. Observer Performance Methods for Diagnostic Imaging: Foundations, Modeling, and Applications with R-Based Examples. Boca Raton, FL: CRC Press.

Chakraborty, Dev P. 2008. “Validation and Statistical Power Comparison of Methods for Analyzing Free-Response Observer Performance Studies.” Journal Article. Acad Radiol 15 (12): 1554–66. http://www.sciencedirect.com/science/article/B75BK-4TW6D0R-9/2/8f59ae9ff4ba7d2aa596076694b7de09.

Chakraborty, Dev, Peter Philips, and Xuetong Zhai. 2020. RJafroc: Analyzing Diagnostic Observer Performance Studies. https://dpc10ster.github.io/RJafroc/.

Chakraborty, D. P., E. S. Breatnach, M. V. Yester, B. Soto, G. T. Barnes, and R. G. Fraser. 1986. “Digital and Conventional Chest Imaging: A Modified ROC Study of Observer Performance Using Simulated Nodules.” Journal Article. Radiology 158: 35–39. https://doi.org/10.1148/radiology.158.1.3940394.

De Boo, Diederick W, Martin Uffmann, Michael Weber, Shandra Bipat, Eelco F Boorsma, Maeke J Scheerder, Nicole J Freling, and Cornelia M Schaefer-Prokop. 2011. “Computer-Aided Detection of Small Pulmonary Nodules in Chest Radiographs: An Observer Study.” Journal Article. Academic Radiology 18 (12): 1507–14.

Dorfman, Donald D, Kevin S Berbaum, and Charles E Metz. 1992. “Receiver Operating Characteristic Rating Analysis: Generalization to the Population of Readers and Patients with the Jackknife Method.” Investigative Radiology 27 (9): 723–31.

Edwards, Darrin C, Matthew A Kupinski, Charles E Metz, and Robert M Nishikawa. 2002. “Maximum Likelihood Fitting of Froc Curves Under an Initial-Detection-and-Candidate-Analysis Model.” Medical Physics 29 (12): 2861–70.

Efron, Bradley, and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.

FDA, U. 2018. “Guidance for Industry and Fda Staff Clinical Performance Assessment: Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data—Premarket Approval (Pma) and Premarket Notification [510 (K)] Submission.”

Hein, Patrick A, Lasse D Krug, Valentina C Romano, Sonja Kandel, Bernd Hamm, and Patrik Rogalla. 2010. “Computer-Aided Detection in Computed Tomography Colonography with Full Fecal Tagging: Comparison of Standalone Performance of 3 Automated Polyp Detection Systems.” Journal Article. Canadian Association of Radiologists Journal 61 (2): 102–8.

Hillis, Stephen L. 2007. “A Comparison of Denominator Degrees of Freedom Methods for Multiple Observer (ROC) Analysis.” Statistics in Medicine 26 (3): 596–619.

Hillis, Stephen L, Kevin S Berbaum, and Charles E Metz. 2008. “Recent Developments in the Dorfman-Berbaum-Metz Procedure for Multireader Roc Study Analysis.” Academic Radiology 15 (5): 647–61.

Hillis, Stephen L, Nancy A Obuchowski, Kevin M Schartz, and Kevin S Berbaum. 2005. “A Comparison of the Dorfman–Berbaum–Metz and Obuchowski–Rockette Methods for Receiver Operating Characteristic (ROC) Data.” Statistics in Medicine 24 (10): 1579–1607.

Hupse, Rianne, Maurice Samulski, Marc Lobbes, Ard Heeten, MechliW Imhof-Tas, David Beijerinck, Ruud Pijnappel, Carla Boetes, and Nico Karssemeijer. 2013. “Standalone Computer-Aided Detection Compared to Radiologists’ Performance for the Detection of Mammographic Masses.” Journal Article. European Radiology 23 (1): 93–100. https://doi.org/10.1007/s00330-012-2562-7.

Kooi, Thijs, Albert Gubern-Merida, Jan-Jurre Mordang, Ritse Mann, Ruud Pijnappel, Klaas Schuur, Ard den Heeten, and Nico Karssemeijer. 2016. “A Comparison Between a Deep Convolutional Neural Network and Radiologists for Classifying Regions of Interest in Mammography.” In International Workshop on Breast Imaging, 51–56. Springer.

Metz, Charles E. 1986. “ROC Methodology in Radiologic Imaging.” Journal Article. Investigative Radiology 21 (9): 720–33. http://journals.lww.com/investigativeradiology/Fulltext/1986/09000/ROC_Methodology_in_Radiologic_Imaging.9.aspx.

Metz, Charles E, Stuart J Starr, and Lee B Lusted. 1976. “Observer Performance in Detecting Multiple Radiographic Signals: Prediction and Analysis Using a Generalized Roc Approach.” Radiology 121 (2): 337–47.

Nishikawa, Robert M, and Lorenzo L Pesce. 2011. “Fundamental Limitations in Developing Computer-Aided Detection for Mammography.” Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 648: S251–S254.

Obuchowski, Nancy A., Michael L. Lieber, and Kimerly A. Powell. 2000. “Data Analysis for Detection and Localization of Multiple Abnormalities with Application to Mammography.” Journal Article. Acad. Radiol. 7 (7): 516–25.

Obuchowski, Nancy A., and Howard E. Rockette. 1995. “Hypothesis Testing of Diagnostic Accuracy for Multiple Readers and Multiple Tests an Anova Approach with Dependent Observations.” Communications in Statistics-Simulation and Computation 24 (2): 285–308.

Rao, Vijay M, David C Levin, Laurence Parker, Barbara Cavanaugh, Andrea J Frangos, and Jonathan H Sunshine. 2010. “How Widely Is Computer-Aided Detection Used in Screening and Diagnostic Mammography?” Journal of the American College of Radiology 7 (10): 802–5.

Starr, Stuart J, Charles E Metz, Lee B Lusted, and David J Goodenough. 1975. “Visual Detection and Localization of Radiographic Images.” Radiology 116 (3): 533–38.

Summers, Ronald M, Laurie R Handwerker, Perry J Pickhardt, Robert L Van Uitert, Keshav K Deshpande, Srinath Yeshwant, Jianhua Yao, and Marek Franaszek. 2008. “Performance of a Previously Validated Ct Colonography Computer-Aided Detection System in a New Patient Population.” Journal Article. American Journal of Roentgenology 191 (1): 168–74.

Swensson, Richard G. 1996. “Unified Measurement of Observer Performance in Detecting and Localizing Target Objects on Images.” Medical Physics 23 (10): 1709–25.

Tan, Tao, Bram Platel, Henkjan Huisman, Clara Sánchez, Roel Mus, and Nico Karssemeijer. 2012. “Computer-Aided Lesion Diagnosis in Automated 3-d Breast Ultrasound Using Coronal Spiculation.” Journal Article. Medical Imaging, IEEE Transactions on 31 (5): 1034–42.

Taylor, Stuart A, Steve Halligan, David Burling, Mary E Roddie, Lesley Honeyfield, Justine McQuillan, Hamdam Amin, and Jamshid Dehmeshki. 2006. “Computer-Assisted Reader Software Versus Expert Reviewers for Polyp Detection on Ct Colonography.” Journal Article. American Journal of Roentgenology 186 (3): 696–702.


  1. Prof. Karssemeijer (private communication, 10/27/2017) had consulted with a few ROC experts to determine if the procedure used in Study – 2 was valid, and while the experts thought it was probably valid they were not sure.↩︎