Chapter 10 Standalone CAD

10.2 Introduction

In the US the majority of screening mammograms are analyzed by computer aided detection (CAD) algorithms (Rao et al. 2010). Almost all major imaging device manufacturers provide CAD as part of their imaging workstation display software. In the United States CAD is approved for use as a second reader: the radiologist first interprets the images (typically 4 views, 2 views of each breast) without CAD, and then the CAD information (i.e., cued suspicious regions, possibly shown with associated probabilities of malignancy) is displayed and the radiologist has the opportunity to revise the initial interpretation. In response to the FDA-approved second-reader usage, the evolution of CAD algorithms has been guided mainly by comparing observer performance of radiologists with and without CAD.

Clinical CAD systems sometimes report only the locations of suspicious regions, i.e., they may not provide ratings. Analysis of this type of data is deferred to a later (TBA) chapter. However, a malignancy index (a continuous variable) for every CAD-found suspicious region is available to the algorithm designer (Edwards et al. 2002). Standalone performance, i.e., the performance of designer-level CAD by itself, regarded as an algorithmic reader, vs. radiologists, is rarely measured. In breast cancer screening I am aware of only one study (Hupse et al. 2013) in which standalone performance was measured. [Standalone performance has been measured in CAD for computed tomography colonography, chest radiography and three-dimensional ultrasound (Hein et al. 2010; Summers et al. 2008; Taylor et al. 2006; De Boo et al. 2011; Tan et al. 2012).]

One possible reason for not measuring standalone performance of CAD is the lack of an accepted assessment method for such measurements. This chapter removes that impediment. It describes a method for comparing standalone performance of designer-level CAD to a group of radiologists interpreting the same cases and compares the method to those described in two relevant publications (Hupse et al. 2013; Kooi et al. 2016).

10.3 Overview

This chapter extends the method used in a study of standalone CAD performance (Hupse et al. 2013), termed one-treatment random-reader fixed-case or 1T-RRFC analysis, since CAD is treated as an additional reader within a single treatment and since the analysis accounts only for reader variability, not for case-sampling variability.

The extension includes the effect of case-sampling variability and is hence termed one-treatment random-reader random-case or 1T-RRRC analysis. The method is based on an existing method allowing comparison of the average performance of readers in a single treatment to a specified value. The key modification is to regard the difference in performance between the radiologists and CAD as a figure of merit to which the existing work is directly applicable. The 1T-RRRC method is compared to 1T-RRFC.

The 1T-RRRC method is also compared to an unorthodox usage of the conventional multiple-treatment multiple-reader method, termed 2T-RRRC analysis, which involves replicating the CAD ratings as many times as there are radiologists, in effect simulating a second treatment, i.e., CAD is regarded as the second treatment (with identical readers within this treatment) to which existing methods (DBM or OR, as described in RJafrocRocBook) are applied.

10.4 Methods

Summarized below are two relevant studies of CAD vs. radiologists in mammography, followed by comments on the methods used in the two studies. The second study used multiple-treatment multiple-reader receiver operating characteristic (ROC) software in an unorthodox way. A statistical model and analysis method are then described that avoid the unorthodox usage of ROC software and have fewer model parameters.

10.4.1 Studies assessing performance of CAD vs. radiologists

The first study (Hupse et al. 2013) measured performance in finding and localizing lesions in mammograms, i.e., visual search was involved, while the second study (Kooi et al. 2016) measured lesion classification performance between non-diseased and diseased regions of interest (ROIs) previously found on mammograms by an independent algorithmic reader, i.e., visual search was not involved.

10.4.1.1 Study - 1

The first study (Hupse et al. 2013) compared the standalone performance of a CAD device to that of 9 radiologists interpreting the same cases (120 non-diseased cases and 80 cases, each containing a single malignant mass). It used the LROC (localization ROC) paradigm (S. J. Starr et al. 1975; Charles E. Metz, Starr, and Lusted 1976; Swensson 1996), in which the observer gives an overall rating for presence of disease (an integer 0 to 100 scale was used) and indicates the location of the most suspicious region. On a non-diseased case the rating is classified as a false positive (FP); on a diseased case it is classified as a correct localization (CL) if the indicated location is sufficiently close to the lesion, and otherwise as an incorrect localization. For a given reporting threshold, the number of correct localizations divided by the number of diseased cases estimates the probability of correct localization (PCL) at that threshold, and the number of FPs divided by the number of non-diseased cases estimates the probability of a false positive, or false positive fraction (FPF), at that threshold. The plot of PCL (ordinate) vs. FPF defines the empirical LROC curve. Study – 1 used as figures of merit (FOMs) the interpolated PCL at two values of FPF, specifically FPF = 0.05 and FPF = 0.2, denoted \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\), respectively. A t-test between the radiologist \(\text{PCL}_{\text{FPF}}\) values and the corresponding CAD value was used to compute the two-sided p-value for rejecting the null hypothesis (NH) of equal performance. Study – 1 reported p-value = 0.17 for \(\text{PCL}_{0.05}\) and p-value \(\leq\) 0.001, with CAD being inferior, for \(\text{PCL}_{0.2}\).
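
The following minimal sketch (not the Study – 1 code; the simulated ratings and variable names are purely illustrative) shows how an interpolated \(\text{PCL}_\text{FPF}\) value can be computed from LROC data:

# Minimal sketch of estimating the empirical LROC operating points and the
# interpolated PCL at a specified FPF. Assumed inputs: fp = ratings on
# non-diseased cases; rating, correctLoc = rating and correct-localization
# indicator on diseased cases.
pclAtFpf <- function(fp, rating, correctLoc, fpfValue) {
  thresholds <- sort(unique(c(fp, rating)), decreasing = TRUE)
  fpf <- sapply(thresholds, function(t) mean(fp >= t))
  pcl <- sapply(thresholds, function(t) mean(rating >= t & correctLoc))
  # linear interpolation of PCL at the requested FPF along the empirical curve
  approx(x = c(0, fpf), y = c(0, pcl), xout = fpfValue, ties = max)$y
}
set.seed(1)
fp  <- rnorm(120)            # 120 non-diseased cases (illustrative ratings)
rat <- rnorm(80, mean = 1)   # 80 diseased cases, each with one lesion
loc <- runif(80) < 0.8       # was the most suspicious region the lesion?
pclAtFpf(fp, rat, loc, fpfValue = c(0.05, 0.2))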

10.4.1.2 Study - 2

The second study (Kooi et al. 2016) used 199 diseased and 199 non-diseased ROIs extracted by an independent CAD algorithm. These were analyzed by a different CAD algorithmic observer from that used to determine the ROIs and by four expert radiologists. In both cases the ROC paradigm was used, i.e., a rating was obtained for each ROI. The figure of merit was the empirical area (AUC) under the respective ROC curves (one for each radiologist and one for CAD). The p-value for the difference between the average radiologist AUC and the CAD AUC was determined using an unorthodox application of the Dorfman-Berbaum-Metz (Dorfman, Berbaum, and Metz 1992) multiple-treatment multiple-reader multiple-case (DBM-MRMC) software.

The application was unorthodox in the sense that in the input data file radiologists and CAD were entered as two treatments. In conventional (orthodox) DBM-MRMC analysis each reader provides two ratings per case and the data file consists of paired ratings of a set of cases interpreted by the 4 readers. To accommodate the paired data structure assumed by the software, the authors of Study – 2 replicated the CAD ratings four times in the input data file, as explained in the caption to Table 10.1. By this artifice they converted a single-treatment 5-reader (4 radiologists plus CAD) data file to a two-treatment 4-reader data file in which the four readers in treatment 1 were the radiologists and the four “readers” in treatment 2 were the replicated CAD ratings. Note that for each case the four readers in the second treatment had identical ratings. In Table 10.1 the replicated CAD readers are labeled C1, C2, C3 and C4; the code sketch following the table illustrates the replication.

TABLE 10.1: The differences between the data structures in conventional DBM-MRMC analysis and the unorthodox application of the software used in Study - 2. There are four radiologists, labeled R1, R2, R3 and R4 interpreting 398 cases labeled 1, 2, …, 398, in two treatments, labeled 1 and 2. Sample ratings are shown only for the first and last radiologist and the first and last case. In the first four columns, labeled “Standard DBM-MRMC”, each radiologist interprets each case twice. In the next four columns, labeled “Unorthodox DBM-MRMC”, the radiologists interpret each case once. CAD ratings are replicated four times to effectively create the second “treatment”. The quotation marks emphasize that there is, in fact, only one treatment. The replicated CAD observers are labeled C1, C2, C3 and C4.
Columns 1–4: Standard DBM-MRMC; columns 5–8: Unorthodox DBM-MRMC.

| Reader | Treatment | Case | Rating | Reader | Treatment | Case | Rating |
|:------:|:---------:|:----:|:------:|:------:|:---------:|:----:|:------:|
| R1 | 1 | 1 | 75 | R1 | 1 | 1 | 75 |
| R1 | 1 | 398 | 0 | R1 | 1 | 398 | 0 |
| R4 | 1 | 1 | 50 | R4 | 1 | 1 | 50 |
| R4 | 1 | 398 | 25 | R4 | 1 | 398 | 25 |
| R1 | 2 | 1 | 45 | C1 | 2 | 1 | 55 |
| R1 | 2 | 398 | 25 | C1 | 2 | 398 | 5 |
| R4 | 2 | 1 | 95 | C4 | 2 | 1 | 55 |
| R4 | 2 | 398 | 20 | C4 | 2 | 398 | 5 |
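
The replication artifice can be made concrete with a short sketch; the data-frame layout and the random ratings below are illustrative and are not the Study – 2 data:

# Minimal sketch of the replication artifice of Table 10.1: a single-treatment
# file with 4 radiologists (R1-R4) plus CAD is recast as a two-"treatment" file
# in which the CAD ratings are repeated once per radiologist (C1-C4).
set.seed(1)
K <- 398                                                     # number of cases
radRatings <- matrix(round(runif(4 * K, 0, 100)), nrow = 4)  # radiologist ratings
cadRatings <- round(runif(K, 0, 100))                        # one CAD rating per case
unorthodox <- data.frame(
  reader    = rep(c(paste0("R", 1:4), paste0("C", 1:4)), each = K),
  treatment = rep(c(1, 2), each = 4 * K),
  case      = rep(1:K, times = 8),
  # radiologist ratings, then the CAD ratings replicated four times
  rating    = c(as.vector(t(radRatings)), rep(cadRatings, times = 4))
)
head(unorthodox)   # readers C1-C4 in "treatment" 2 have identical ratings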

Study – 2 reported a non-significant difference between CAD and the radiologists (p = 0.253).

10.4.1.3 Comments

For the purpose of this work, which focuses on the respective analysis methods, the difference in observer performance paradigms between the two studies, namely a search paradigm in Study – 1 vs. an ROI classification paradigm in Study – 2, is inconsequential. The t-test used in Study – 1 treats the case-sample as fixed: the analysis accounts for reader variability but not for case-sampling variability. While not explicitly stated, the reason for the unorthodox analysis in Study – 2 was the desire to include case-sampling variability. Prof. Karssemeijer (private communication, 10/27/2017) had consulted a few ROC experts on whether the procedure used in Study – 2 was valid; the experts thought it was probably valid but were not sure.

In what follows, the analysis in Study – 1 is referred to as single-treatment random-reader fixed-case (1T-RRFC) while that in Study – 2 is referred to as dual-treatment random-reader random-case (2T-RRRC).

10.4.2 The 1T-RRFC analysis model

The sampling model for the FOM is:

\[\begin{equation} \left. \begin{aligned} \theta_j=\mu+R_j \\ \left (j = 1,2,...,J \right ) \end{aligned} \right \} \tag{10.1} \end{equation}\]

Here \(\mu\) is a constant, \(\theta_j\) is the FOM for reader \(j\), and \(R_j\) is the random contribution for reader \(j\) distributed as:

\[\begin{equation} R_j \sim N\left ( 0,\sigma_R^2 \right ) \tag{10.2} \end{equation}\]

Because of the assumed normal distribution of \(R_j\), comparing the readers to a fixed value, namely that of CAD, denoted \(\theta_0\), amounts to a t-test of the reader FOMs against this fixed value, as done in Study – 1. As is evident from the model, no allowance is made for case-sampling variability, which is the reason for calling this the 1T-RRFC method.

Performance of CAD on a fixed dataset does exhibit within-CAD variability, i.e., CAD applied repeatedly to a fixed dataset does not always produce the same mark-rating data. However, this source of variability is much smaller than the inter-reader variability of radiologists interpreting the same dataset; indeed, the within-reader variability of radiologists is smaller than their inter-reader variability, and within-CAD variability is smaller still. For this reason one is justified in regarding \(\theta_0\) as a fixed quantity for a given dataset. Varying the dataset will result in different values of \(\theta_0\), reflecting case-sampling variability, which needs to be accounted for, as done in the following analyses.
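
In code, the 1T-RRFC calculation is just a one-sample t-test of the radiologist figures of merit against the fixed CAD value; the sketch below uses the \(\text{PCL}_{0.2}\) values listed in Appendix 1 (Example 1) and reproduces the corresponding row of Table 10.2.

# 1T-RRFC sketch: t-test of the radiologist FOMs against the fixed CAD FOM,
# equivalently a test that the mean difference FOM is zero. Case-sampling
# variability is ignored. FOM values are the PCL_0.2 estimates from Appendix 1.
thetaRad <- c(0.6945313, 0.65, 0.80625, 0.725, 0.6598214, 
              0.7684524, 0.7375, 0.675, 0.675)
thetaCAD <- 0.5916667
t.test(thetaRad, mu = thetaCAD)   # t = 6.71, df = 8, p = 0.00015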

10.4.3 The 2T-RRRC analysis model

This could be termed the conventional or the orthodox method. There are two treatments and the study design is fully crossed: each reader interprets each case in each treatment, i.e., the data structure is as in the left half of Table 10.1.

The following approach, termed 2T-RRRC, uses the Obuchowski and Rockette (OR) figure of merit sampling model (Obuchowski and Rockette 1995). The OR model is:

\[\begin{equation} \theta_{ij\{c\}}=\mu+\tau_i+R_j+\left ( \tau R \right )_{ij}+\epsilon_{ij\{c\}} \tag{10.3} \end{equation}\]

Assuming two treatments, \(i\) (\(i = 1, 2\)) is the treatment index, \(j\) (\(j = 1, ..., J\)) is the reader index, and \(k\) (\(k = 1, ..., K\)) is the case index; \(\theta_{ij\{c\}}\) is the figure of merit in treatment \(i\) for reader \(j\) and case-sample \(\{c\}\). A case-sample is a set or ensemble of cases, diseased and non-diseased, and different integer values of \(c\) correspond to different case-samples.

The first two terms on the right hand side of Eqn. (10.3) are fixed effects (average performance and treatment effect, respectively). The next two terms are random effect variables that, by assumption, are sampled as follows:

\[\begin{equation} \left. \begin{aligned} R_j \sim N\left ( 0,\sigma_R^2 \right )\\ \left ( \tau R \right )_{ij} \sim N\left ( 0,\sigma_{\tau R}^2 \right )\\ \end{aligned} \right \} \tag{10.4} \end{equation}\]

The term \(R_j\) represents the random treatment-independent contribution of reader \(j\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_R^2\), while \(\left ( \tau R \right )_{ij}\) represents the random treatment-dependent contribution of reader \(j\) in treatment \(i\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_{\tau R}^2\). The sampling of the last (error) term is described by:

\[\begin{equation} \epsilon_{ij\{c\}}\sim N_{I \times J}\left ( \vec{0} , \Sigma \right ) \tag{10.5} \end{equation}\]

Here \(N_{I \times J}\) is the \(I \times J\)-variate normal distribution and \(\vec{0}\), a zero-vector of length \(I \times J\), is the mean of the distribution. The \(\left ( I \times J \right ) \times \left ( I \times J \right )\) covariance matrix \(\Sigma\) is defined by four parameters, \(\text{Var}\), \(\text{Cov}_1\), \(\text{Cov}_2\), \(\text{Cov}_3\), as follows:

\[\begin{equation} \text{Cov} \left (\epsilon_{ij\{c\}},\epsilon_{i'j'\{c\}} \right ) = \left\{\begin{matrix} \text{Var} & (i=i',j=j') \\ \text{Cov}_1 & (i\ne i',j=j')\\ \text{Cov}_2 & (i = i',j \ne j')\\ \text{Cov}_3 & (i\ne i',j \ne j') \end{matrix}\right. \tag{10.6} \end{equation}\]
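
To make the block structure of \(\Sigma\) concrete, the following small sketch (with arbitrary illustrative values) fills in the four parameters according to Eqn. (10.6):

# Illustration of the block structure of the error covariance matrix in
# Eqn. (10.6): Var on the diagonal, Cov1 for same reader / different treatments,
# Cov2 for different readers / same treatment, Cov3 otherwise. Values arbitrary.
buildSigma <- function(I, J, Var, Cov1, Cov2, Cov3) {
  trt <- rep(1:I, each = J)          # treatment label of each (i, j) cell
  rdr <- rep(1:J, times = I)         # reader label of each (i, j) cell
  Sigma <- matrix(NA, I * J, I * J)
  for (a in seq_len(I * J)) for (b in seq_len(I * J)) {
    if (a == b) {
      Sigma[a, b] <- Var
    } else if (trt[a] != trt[b] && rdr[a] == rdr[b]) {
      Sigma[a, b] <- Cov1
    } else if (trt[a] == trt[b]) {
      Sigma[a, b] <- Cov2
    } else {
      Sigma[a, b] <- Cov3
    }
  }
  Sigma
}
buildSigma(I = 2, J = 3, Var = 4, Cov1 = 3, Cov2 = 2, Cov3 = 1)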

Software (the University of Iowa software and RJafroc) yields estimates of all terms appearing on the right hand side of Eqn. (10.6). Excluding fixed effects, the model represented by Eqn. (10.3) contains six parameters:

\[\begin{equation} \sigma_R^2, \sigma_{\tau R}^2, \text{Var}, \text{Cov}_1, \text{Cov}_2, \text{Cov}_3 \tag{10.7} \end{equation}\]

The meanings of the last four terms are described in (Hillis 2007; Obuchowski and Rockette 1995; Hillis et al. 2005; Dev P. Chakraborty 2017). Briefly, \(\text{Var}\) is the variance of a reader’s FOMs, in a given treatment, over interpretations of different case-samples, averaged over readers and treatments; \(\text{Cov}_1/\text{Var}\) is the correlation of a reader’s FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment same-reader pairings; \(\text{Cov}_2/\text{Var}\) is the correlation of different readers’ FOMs, over interpretations of different case-samples in the same treatment, averaged over all same-treatment different-reader pairings; and finally, \(\text{Cov}_3/\text{Var}\) is the correlation of different readers’ FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment different-reader pairings. One expects the following inequalities to hold:

\[\begin{equation} \text{Var} \geq \text{Cov}_1 \geq \text{Cov}_2 \geq \text{Cov}_3 \tag{10.8} \end{equation}\]

In practice, since one is usually limited to one case-sample, i.e., \(c = 1\), resampling techniques (Efron and Tibshirani 1994) – e.g., the jackknife – are used to estimate these terms.
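
A sketch of the jackknife estimation follows; the data layout (an \(I \times J \times K\) ratings array with a binary truth vector) and the use of the Wilcoxon FOM are assumptions for illustration, not the format required by any particular software package.

# Jackknife sketch of Var, Cov1, Cov2 and Cov3 of Eqn. (10.6). ratings is an
# I x J x K array of ROC ratings and truth is a length-K 0/1 vector.
wilcoxonFom <- function(x0, x1) {  # empirical AUC (Wilcoxon statistic)
  mean(outer(x1, x0, ">") + 0.5 * outer(x1, x0, "=="))
}

orCovariances <- function(ratings, truth) {
  I <- dim(ratings)[1]; J <- dim(ratings)[2]; K <- dim(ratings)[3]
  jkFom <- array(dim = c(I, J, K))                 # FOM with case k deleted
  for (k in 1:K) for (i in 1:I) for (j in 1:J) {
    r <- ratings[i, j, -k]; tr <- truth[-k]
    jkFom[i, j, k] <- wilcoxonFom(r[tr == 0], r[tr == 1])
  }
  jkCov <- function(i, j, ip, jp) {                # jackknife covariance
    x <- jkFom[i, j, ]; y <- jkFom[ip, jp, ]
    (K - 1) / K * sum((x - mean(x)) * (y - mean(y)))
  }
  sums <- numeric(4); counts <- numeric(4)
  for (i in 1:I) for (j in 1:J) for (ip in 1:I) for (jp in 1:J) {
    idx <- 1 + (i != ip) + 2 * (j != jp)           # 1 = Var, 2 = Cov1, 3 = Cov2, 4 = Cov3
    sums[idx] <- sums[idx] + jkCov(i, j, ip, jp)
    counts[idx] <- counts[idx] + 1
  }
  setNames(sums / counts, c("Var", "Cov1", "Cov2", "Cov3"))
}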

10.4.4 The 1T-RRRC analysis model

The difference from the approach in Study - 2, and the main contribution of this work, is to regard standalone CAD as a different reader, not as a different treatment. This section describes a single treatment method for analyzing readers and CAD, where CAD is regarded as an additional reader and artificially replicated CAD data becomes unnecessary. Accordingly the proposed method is termed single-treatment random-reader random-case (1T-RRRC) analysis.

The starting point is the single-treatment form of the Obuchowski-Rockette model (Obuchowski and Rockette 1995), which for the radiologists (i.e., excluding CAD) interpreting in a single treatment reduces to the following model:

\[\begin{equation} \theta_{j\{c\}}=\mu+R_j+\epsilon_{j\{c\}} \tag{10.9} \end{equation}\]

\(\theta_{j\{c\}}\) is the figure of merit for radiologist \(j\) (\(j = 1, 2, ..., J\)) interpreting case-sample \(\{c\}\); \(R_j\) is the random effect of radiologist \(j\) and \(\epsilon_{j\{c\}}\) is the error term. For single-treatment multiple-reader interpretations the error term is distributed as:

\[\begin{equation} \epsilon_{j\{c\}}\sim N_{J}\left ( \vec{0} , \Sigma \right ) \tag{10.10} \end{equation}\]

The \(J \times J\) covariance matrix \(\Sigma\) is defined by two parameters, \(\text{Var}\) and \(\text{Cov}_2\), as follows:

\[\begin{equation} \Sigma_{jj'} = \text{Cov}\left ( \epsilon_{j\{c\}}, \epsilon_{j'\{c\}} \right ) = \left\{\begin{matrix} \text{Var} & j = j'\\ \text{Cov}_2 & j \neq j' \end{matrix}\right. \tag{10.11} \end{equation}\]

In practice the terms \(\text{Var}\) and \(\text{Cov}_2\) are estimated using the jackknife method.

10.4.4.1 Single treatment analysis for radiologists

Hillis (Hillis et al. 2005; Hillis 2007) has described how to use the single-treatment model (10.9) to compare a group of radiologists’ average performance to a fixed value, in effect testing the \(\text{NH}: \mu = \mu_0\), where \(\mu_0\) is a pre-specified constant.

One might be tempted to set \(\mu_0\) equal to the performance of CAD, but that would ignore the fact that the performance of CAD is itself a random variable, whose case-sampling variability needs to be accounted for.

10.4.4.2 Adaptation of single treatment analysis to accommodate CAD

Instead, the following model is used for the figure of merit of the radiologists and CAD (note that \(j = 0\) is used to denote the CAD algorithmic reader):

\[\begin{equation} \theta_{j\{c\}} = \theta_{0\{c\}} + \Delta \theta + R_j + \epsilon_{j\{c\}}\\ \left ( j=1,2,...,J \right ) \tag{10.12} \end{equation}\]

\(\theta_{0\{c\}}\) is the CAD figure of merit for case-sample \(\{c\}\) and \(\Delta \theta\) is the average figure of merit increment of the radiologists over CAD. To reduce this model to one to which Hillis’ formulae are directly applicable, one subtracts the CAD figure of merit from each radiologist’s figure of merit for the same case-sample, and defines this as the difference figure of merit \(\psi_{j\{c\}}\) , i.e.,

\[\begin{equation} \psi_{j\{c\}} = \theta_{j\{c\}} - \theta_{0\{c\}} \tag{10.13} \end{equation}\]

Then Eqn. (10.12) reduces to:

\[\begin{equation} \psi_{j\{c\}} = \Delta \theta + R_j + \epsilon_{j\{c\}} \tag{10.14} \end{equation}\]

Eqn. (10.14) is identical in form to Eqn. (10.9) except that the figure of merit on the left hand side is a difference FOM, namely that of the radiologist minus that of CAD; i.e., it describes a model for \(J\) radiologists interpreting a common case set, each of whose performances is measured relative to that of CAD. Under the NH the expected difference is zero: \(\text{NH:} \Delta \theta = 0\). The method (Hillis et al. 2005; Hillis 2007) for single-treatment multiple-reader analysis is now directly applicable to the model described by Eqn. (10.14).

Apart from fixed effects, the model in Eqn. (10.14) contains three parameters:

\[\begin{equation} \sigma_R^2, \text{Var}, \text{Cov}_2 \tag{10.15} \end{equation}\]

Setting \(\text{Var} = 0, \text{Cov}_2 = 0\) yields the 1T-RRFC model which contains only one random parameter, namely \(\sigma_R^2\). One expects an identical estimate of this parameter using 1T-RRRC analyses.
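
A sketch of the resulting significance test is shown below. It follows my reading of the Hillis single-treatment formulas: the observed between-reader mean square of the difference FOMs plus \(J\) times the (jackknife) \(\text{Cov}_2\) estimate gives the variance of the average difference FOM, and a Satterthwaite-style denominator degrees of freedom is used. With the \(\text{PCL}_{0.2}\) values and the \(\text{Cov}_2\) estimate listed in Appendix 1 (Example 3) it reproduces the corresponding 1T-RRRC row of Table 10.2.

# 1T-RRRC sketch (my reading of the Hillis single-treatment formulas; not a
# substitute for the RJafroc implementation). psi = radiologist minus CAD
# difference FOMs on the common case-sample; cov2 = jackknife Cov2 estimate
# for the difference FOMs.
oneTreatmentRRRC <- function(psi, cov2) {
  J     <- length(psi)
  msR   <- var(psi)                               # between-reader mean square
  sePsi <- sqrt((msR + J * max(cov2, 0)) / J)     # std. error of mean difference
  tStat <- mean(psi) / sePsi
  ddf   <- (msR + J * max(cov2, 0))^2 / (msR^2 / (J - 1))
  c(F = tStat^2, ddf = ddf, pval = 2 * pt(-abs(tStat), ddf))
}
# PCL_0.2 example using the values listed in Appendix 1, Example 3
thetaRad <- c(0.6945313, 0.65, 0.80625, 0.725, 0.6598214, 
              0.7684524, 0.7375, 0.675, 0.675)
psi <- thetaRad - 0.5916667                       # subtract the CAD FOM
oneTreatmentRRRC(psi, cov2 = 0.003065705)         # F = 4.16, ddf = 937, p = 0.042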

10.5 Implementation

The three analyses, namely random-reader fixed-case (1T-RRFC), dual-treatment random-reader random-case (2T-RRRC) and single-treatment random-reader random-case (1T-RRRC), are implemented in RJafroc.

The following code shows usage of the software to generate the results. Note that RJafroc::datasetCadLroc is the LROC dataset and RJafroc::dataset09 is the corresponding ROC dataset.

RRFC_1T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                        FOM = "PCL", FPFValue = 0.05, method = "1T-RRFC")
RRRC_2T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                        FOM = "PCL", FPFValue = 0.05, method = "2T-RRRC")
RRRC_1T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                        FOM = "PCL", FPFValue = 0.05, method = "1T-RRRC")

RRFC_1T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                       FOM = "PCL", FPFValue = 0.2, method = "1T-RRFC")
RRRC_2T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                       FOM = "PCL", FPFValue = 0.2, method = "2T-RRRC")
RRRC_1T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                       FOM = "PCL", FPFValue = 0.2, method = "1T-RRRC")

RRFC_1T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                     FOM = "PCL", FPFValue = 1, method = "1T-RRFC")
RRRC_2T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                     FOM = "PCL", FPFValue = 1, method = "2T-RRRC")
RRRC_1T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
                                     FOM = "PCL", FPFValue = 1, method = "1T-RRRC")

RRFC_1T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
                                   FOM = "Wilcoxon", method = "1T-RRFC")
RRRC_2T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
                                   FOM = "Wilcoxon", method = "2T-RRRC")
RRRC_1T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
                                   FOM = "Wilcoxon", method = "1T-RRRC")

The results are organized as follows:

  • RRFC_1T_PCL_0_05 contains the results of 1T-RRFC analysis for figure of merit = \(\text{PCL}_{0.05}\).

  • RRRC_2T_PCL_0_05 contains the results of 2T-RRRC analysis for figure of merit = \(\text{PCL}_{0.05}\).

  • RRRC_1T_PCL_0_05 contains the results of 1T-RRRC analysis for figure of merit = \(\text{PCL}_{0.05}\).

  • RRFC_1T_PCL_0_2 contains the results of 1T-RRFC analysis for figure of merit = \(\text{PCL}_{0.2}\).

  • RRRC_2T_PCL_0_2 contains the results of 2T-RRRC analysis for figure of merit = \(\text{PCL}_{0.2}\).

  • RRRC_1T_PCL_0_2 contains the results of 1T-RRRC analysis for figure of merit = \(\text{PCL}_{0.2}\).

  • RRFC_1T_AUC contains the results of 1T-RRFC analysis for the Wilcoxon figure of merit.

  • RRRC_2T_AUC contains the results of 2T-RRRC analysis for the Wilcoxon figure of merit.

  • RRRC_1T_AUC contains the results of 1T-RRRC analysis for the Wilcoxon figure of merit.

The structures of these objects are illustrated with examples in the Appendix.

10.6 Results

The three methods, 1T-RRFC, 2T-RRRC and 1T-RRRC, were applied to an LROC dataset similar to that used in Study – 1 (I thank Prof. Karssemeijer for making this dataset available), Table 10.2.

TABLE 10.2: Significance testing results for an LROC dataset. For each figure of merit (FOM) shown are the results of 1T-RRFC, 2T-RRRC and 1T-RRRC analyses. Because they account for an additional source of variability, the rows labeled 2T-RRRC and 1T-RRRC yield larger p-values and wider confidence intervals than the corresponding row labeled 1T-RRFC. [\(\theta_0\) = FOM of CAD; \(\theta_{\bullet}\) = average FOM of the radiologists; \(\psi_{\bullet}\) = average FOM of the radiologists minus CAD; CI = 95 percent confidence interval of the quantity indicated by the subscript; F = F-statistic; ddf = denominator degrees of freedom; p = p-value for rejecting the null hypothesis \(\psi_{\bullet} = 0\).]

| FOM | Analysis | \(\theta_0\) | \(CI_{\theta_0}\) | \(\theta_{\bullet}\) | \(CI_{\theta_{\bullet}}\) | \(\psi_{\bullet}\) | \(CI_{\psi_{\bullet}}\) | F | ddf | p |
|:----|:---------|:----|:----|:----|:----|:----|:----|:---|:---|:---|
| PCL_0_05 | 1T-RRFC | 0.45 | NA | 0.493 | (0.42,0.57) | 0.0433 | (-0.032,0.12) | 1.8 | 8 | 0.22 |
|  | 2T-RRRC |  | (0.26,0.64) |  | (0.38,0.61) |  | (-0.16,0.24) | 0.18 | 784 | 0.67 |
|  | 1T-RRRC |  | NA |  | (0.29,0.69) |  | (-0.16,0.24) | 0.18 | 784 | 0.67 |
| PCL_0_2 | 1T-RRFC | 0.592 | NA | 0.71 | (0.67,0.75) | 0.119 | (0.078,0.16) | 45 | 8 | 0.00015 |
|  | 2T-RRRC |  | (0.48,0.71) |  | (0.63,0.79) |  | (0.0044,0.23) | 4.2 | 937 | 0.042 |
|  | 1T-RRRC |  | NA |  | (0.6,0.82) |  | (0.0044,0.23) | 4.2 | 937 | 0.042 |
| PCL_1 | 1T-RRFC | 0.675 | NA | 0.783 | (0.74,0.83) | 0.108 | (0.065,0.15) | 33 | 8 | 0.00043 |
|  | 2T-RRRC |  | (0.57,0.78) |  | (0.71,0.85) |  | (0.0045,0.21) | 4.2 | 493 | 0.041 |
|  | 1T-RRRC |  | NA |  | (0.68,0.89) |  | (0.0045,0.21) | 4.2 | 493 | 0.041 |
| Wilcoxon | 1T-RRFC | 0.817 | NA | 0.849 | (0.83,0.87) | 0.0317 | (0.009,0.055) | 10 | 8 | 0.012 |
|  | 2T-RRRC |  | (0.75,0.88) |  | (0.81,0.89) |  | (-0.031,0.094) | 0.99 | 878 | 0.32 |
|  | 1T-RRRC |  | NA |  | (0.79,0.91) |  | (-0.031,0.094) | 0.99 | 878 | 0.32 |

Results are shown for the following FOMs: \(\text{PCL}_{0.05}\), \(\text{PCL}_{0.2}\), \(\text{PCL}_{1}\) and the empirical area (AUC) under the ROC curve estimated by the Wilcoxon statistic. The first two FOMs are identical to those used in Study – 1. Columns 3 and 4 list the CAD FOM \(\theta_0\) and its 95% confidence interval \(CI_{\theta_0}\), columns 5 and 6 list the average radiologist FOM \(\theta_{\bullet}\) (the dot symbol represents an average over the non-zero radiologist index j = 1,2,…, 9) and its 95% confidence interval \(CI_{\theta_{\bullet}}\), columns 7 and 8 list the average difference FOM \(\psi_{\bullet}\), i.e., radiologist average minus CAD, and its 95% confidence interval \(CI_{\psi_{\bullet}}\), and the last three columns list the F-statistic, the denominator degrees of freedom (ddf) and the p-value for rejecting the null hypothesis (the numerator degree of freedom of the F-statistic is unity).

The last three columns show that 2T-RRRC and 1T-RRRC analyses yield identical F-statistics, ddf and p-values. So the intuition of the authors of Study – 2, namely that the unorthodox method of using DBM-MRMC software accounts for both reader and case-sampling variability, turns out to be correct. If interest is solely in these statistics one is justified in using the unorthodox method. Important caveats are noted below.

Other results evident in Table 10.2:

  1. Where a direct comparison is possible, namely 1T-RRFC analysis using \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\) as FOMs, the p-values in Table 10.2 are very close to those reported in Study – 1.
  2. All FOMs (i.e., \(\theta_0\), \(\theta_{\bullet}\) and \(\psi_{\bullet}\)) in Table 10.2 are independent of the method of analysis. However, the corresponding confidence intervals (i.e., \(CI_{\theta_0}\), \(CI_{\theta_{\bullet}}\) and \(CI_{\psi_{\bullet}}\)) depend on the analyses.
  3. Since the CAD figure of merit is a constant no confidence interval is appropriate for it for either 1T-RRFC or 1T-RRRC analysis and the listed values are NA (not applicable). Since 2T-RRRC analysis assumes CAD is a different treatment the analysis lists a confidence interval that is correctly centered on the CAD value but is otherwise meaningless, i.e., it is an artifact of the unintended usage of the OR analysis method.
  4. The p-value for either RRRC analyses (2T or 1T) is larger than the corresponding 1T-RRFC value. Accounting for case-sampling variability increases the p-value leading to less possibility of finding a significant difference.
  5. The LROC FOMs increase as the value of FPF (the subscript) increases, a general feature of any partial curve based figure of merit, as is the observation that the area (AUC) under the ROC is larger than the largest PCL value.
  6. Using either RRRC analyses ignoring localization information (i.e., using the AUC FOM) leads to a not-significant difference between CAD and the radiologists (\(p\) = 0.32) while using localization information via the \(\text{PCL}_1\) FOM yields a significant difference (\(p\) = 0.041), consistent with the expectation that using localization information leads to increased statistical power.
  7. Partial curve-based FOMs, such as \(\text{PCL}_\text{FPF}\), lead, depending on the choice of \(\text{FPF}\), to different conclusions on whether to reject the NH. Using either RRRC analysis the p-values decrease as \(\text{FPF}\) increases (e.g., \(0.67 > 0.042 > 0.041\)). This trend is not observed for 1T-RRFC analysis, which shows a “sweet-spot” effect where the p-value has a minimum at \(\text{FPF} = 0.2\).

Shown next, Table 10.3, are the model-parameters corresponding to the three analyses.

TABLE 10.3: Model-parameter estimates for the LROC dataset. For each figure of merit (FOM) shown are the parameter estimates yielded by 1T-RRFC, 2T-RRRC and 1T-RRRC analyses. [\(\sigma_R^2\) = reader variance; \(\sigma_{\tau R}^2\) = treatment-reader variance; \(\text{Cov}_1\), \(\text{Cov}_2\), \(\text{Cov}_3\) and \(\text{Var}\) = the error covariance parameters defined in Eqn. (10.6) and Eqn. (10.11); NA = not applicable to the analysis in question.]

| FOM | Analysis | \(\sigma_R^2\) | \(\sigma_{\tau R}^2\) | \(\text{Cov}_1\) | \(\text{Cov}_2\) | \(\text{Cov}_3\) | \(\text{Var}\) |
|:----|:---------|:----|:----|:----|:----|:----|:----|
| PCL_0_05 | 1T-RRFC | 0.0095 | NA | NA | NA | NA | NA |
|  | 2T-RRRC | -1.1e-19 | -0.00571 | 0.00131 | 0.00601 | 0.00131 | 0.0165 |
|  | 1T-RRRC | 0.0095 | NA | NA | 0.0094 | NA | 0.0303 |
| PCL_0_2 | 1T-RRFC | 0.00281 | NA | NA | NA | NA | NA |
|  | 2T-RRRC | -4.9e-19 | 0.000265 | 0.000761 | 0.00229 | 0.000761 | 0.00343 |
|  | 1T-RRRC | 0.00281 | NA | NA | 0.00307 | NA | 0.00534 |
| PCL_1 | 1T-RRFC | 0.0032 | NA | NA | NA | NA | NA |
|  | 2T-RRRC | 6e-19 | 0.001 | 0.000643 | 0.00186 | 0.000643 | 0.00246 |
|  | 1T-RRRC | 0.0032 | NA | NA | 0.00244 | NA | 0.00364 |
| Wilcoxon | 1T-RRFC | 0.000878 | NA | NA | NA | NA | NA |
|  | 2T-RRRC | 7.9e-19 | 0.000201 | 0.000262 | 0.000724 | 0.000262 | 0.000962 |
|  | 1T-RRRC | 0.000878 | NA | NA | 0.000924 | NA | 0.0014 |

From Table 10.3 some inconsistencies are evident for 2T-RRRC analysis:

  1. For 2T-RRRC analysis the listed values of \(\sigma_R^2\) are smaller than machine accuracy, so one concludes that in fact \(\sigma_R^2 = 0\), which is clearly an incorrect result as the radiologists do not have identical performances. In contrast, 1T-RRRC analysis yields the expected non-zero values, identical to those obtained by 1T-RRFC analysis (see the comment following Eqn. (10.15)).
  2. For the 2T-RRRC method the expected ordering of the inequalities, Eqn. (10.8), is not observed: one expects \(\text{Cov}_1 \geq \text{Cov}_2 \geq \text{Cov}_3\) but instead one observes \(\text{Cov}_1 = \text{Cov}_3\) and \(\text{Cov}_2 > \text{Cov}_1\).

The design of a ratings simulator to statistically match a given dataset is addressed in Chapter 23 of my print book (Dev P. Chakraborty 2017). Using this simulator, the 1T-RRRC method had the expected null hypothesis behavior (Table 23.5, ibid).

10.7 Discussion

Described is an extension of the analysis used in Study – 1 that accounts for case-sampling variability. It extends the single-treatment analysis of (Hillis et al. 2005) to a situation where one of the “readers” is a special reader subject to case-sampling variability only, and the desire is to compare the performance of this special reader to the average performance of the remaining readers. Usage of the method, along with the two other methods, is illustrated with an LROC dataset.

The proposed method, 1T-RRRC analysis, yields identical “overall” results (specifically the F-statistic, degrees of freedom and p-value) to those yielded by the unorthodox application of commonly available software, termed 2T-RRRC analysis, in which CAD is regarded as a second treatment (specifically, the CAD ratings are replicated to match the number of radiologists). If interest is in just these values one is justified in using the 2T-RRRC method. However, the 2T-RRRC model parameter estimates were unrealistic: for example, the method yielded zero between-reader variance. The result \(\sigma_R^2 = 0\) is clearly an artifact. One can only speculate as to what happens when software is used in a manner for which it was not designed: perhaps finding that all readers in the second treatment have identical FOMs led the software to yield \(\sigma_R^2 = 0\). Additionally, the covariance estimates are incorrect. Since sample-size estimation requires some of the covariance values, the 2T-RRRC method should never be used to perform sample-size estimation for a prospective study.

The 1T-RRRC method described here is applicable to any scalar figure of merit. The paradigm used to collect the observer performance data - ROC, FROC, LROC or ROI - is irrelevant.

Assessing CAD utility by measuring performance with and without CAD may have inadvertently set a low bar for CAD to be considered useful. As examples, CAD is not penalized for missing cancers as long as the radiologist finds them, and CAD is not penalized for excessive false positives (FPs) as long as the radiologist ignores them. Moreover, since both such measurements include the variability of radiologists, additional noise is introduced that presumably makes it harder to determine if the CAD system is optimal.

In my opinion standalone performance is the most direct measure of CAD performance. Lack of a clear-cut method for assessing standalone CAD performance may have limited past CAD research. The current work hopefully removes that impediment. Going forward, assessment of standalone performance of CAD vs. expert radiologists is strongly encouraged.

10.8 Appendix 1

The structures of the R objects generated by the software are illustrated with three examples.

10.8.1 Example 1

The first example shows the structure of RRFC_1T_PCL_0_2.

x <- RRFC_1T_PCL_0_2
fom_individual_rad <- as.data.frame(t(x$fomRAD))
colnames(fom_individual_rad) <- paste0("rdr", seq(1:9))

stats <- data.frame(fomCAD = x$fomCAD, avgRadFom = x$avgRadFom, avgDiffFom = x$avgDiffFom, varR = x$varR, Tstat = x$Tstat, df = x$df, pval = x$pval)

ConfidenceIntervals <- data.frame(CIAvgRadFom = x$CIAvgRadFom, CIAvgDiffFom = x$CIAvgDiffFom)
rownames(ConfidenceIntervals) <- c("Lower", "Upper")
print(fom_individual_rad)
#>        rdr1 rdr2    rdr3  rdr4      rdr5      rdr6   rdr7  rdr8  rdr9
#> 1 0.6945313 0.65 0.80625 0.725 0.6598214 0.7684524 0.7375 0.675 0.675
print(stats)
#>      fomCAD avgRadFom avgDiffFom        varR    Tstat df         pval
#> 1 0.5916667 0.7101728  0.1185061 0.002808612 6.708357  8 0.0001513966
print(ConfidenceIntervals)
#>       CIAvgRadFom CIAvgDiffFom
#> Lower   0.6694362   0.07776953
#> Upper   0.7509094   0.15924271

The results are displayed as three data frames.

The first data frame:

  • fom_individual_rad shows the figures of merit for the nine radiologists in the study.

The next data frame summarizes the statistics.

  • fomCAD is the figure of merit for CAD.
  • avgRadFom is the average figure of merit of the nine radiologists in the study.
  • avgDiffFom is the average difference figure of merit, RAD - CAD.
  • varR is the variance of the figures of merit for the nine radiologists in the study.
  • Tstat is the t-statistic for testing the NH that the average difference FOM avgDiffFom is zero, whose square is the F-statistic.
  • df is the degrees of freedom of the t-statistic.
  • pval is the p-value for rejecting the NH. In the example shown above the value is highly significant.

The last data frame summarizes the 95 percent confidence intervals.

  • CIAvgRadFom is the 95 percent confidence interval, listed as pairs Lower, Upper, for avgRadFom.
  • CIAvgDiffFom is the 95 percent confidence interval for avgDiffFom.
  • If the pair CIAvgDiffFom excludes zero, the difference is statistically significant.
  • In the example the interval excludes zero showing that the FOM difference is significant.

10.8.2 Example 2

The next example shows the structure of RRRC_2T_PCL_0_2.

x <- RRRC_2T_PCL_0_2

fom_individual_rad <- as.data.frame(t(x$fomRAD))
colnames(fom_individual_rad) <- paste0("rdr", seq(1:9))

stats1 <- data.frame(fomCAD = x$fomCAD, avgRadFom = x$avgRadFom, avgDiffFom = x$avgDiffFom)

stats2 <- data.frame(varR = x$varR, varTR = x$varTR, 
                 cov1 = x$cov1, cov2 = x$cov2 , 
                 cov3 = x$cov3 , Var = x$varError, 
                 FStat = x$FStat, df = x$df, pval = x$pval)

print(fom_individual_rad)
#>        rdr1 rdr2    rdr3  rdr4      rdr5      rdr6   rdr7  rdr8  rdr9
#> 1 0.6945313 0.65 0.80625 0.725 0.6598214 0.7684524 0.7375 0.675 0.675
print(stats1)
#>      fomCAD avgRadFom avgDiffFom
#> 1 0.5916667 0.7101728  0.1185061
print(stats2)
#>           varR        varTR         cov1        cov2         cov3         Var
#> 1 -4.87891e-19 0.0002648898 0.0007613684 0.002294221 0.0007613684 0.003433637
#>     FStat       df       pval
#> 1 4.15768 937.2437 0.04172626

In addition to the quantities defined previously, the output contains the covariance parameters of the Obuchowski-Rockette model, summarized in Eqn. (10.3) – Eqn. (10.6).

  • varTR is \(\sigma_{\tau R}^2\).
  • cov1 is \(\text{Cov}_1\).
  • cov2 is \(\text{Cov}_2\).
  • cov3 is \(\text{Cov}_3\).
  • Var is \(\text{Var}\).
  • FStat is the F-statistic for testing the NH.
  • ndf is the numerator degrees of freedom, equal to unity.
  • df is denominator degrees of freedom of the F-statistic for testing the NH.
  • Tstat is the t-statistic for testing the NH that the average difference FOM avgDiffFom is zero.
  • pval is the p-value for rejecting the NH. In the example shown above the value is significant.

Notice that including the variability of cases results in a higher p-value for 2T-RRRC as compared to 1T-RRFC.

Shown next are the confidence interval statistics x$ciAvgRdrEachTrt for the two treatments (“trt1” = CAD, “trt2” = RAD):


print(x$ciAvgRdrEachTrt)
#>       Estimate     StdErr       DF   CILower   CIUpper        Cov2
#> trt1 0.5916667 0.05802835      Inf 0.4779332 0.7054001 0.003367289
#> trt2 0.7101728 0.03915636 193.1083 0.6329437 0.7874018 0.001221153
  • Estimate is the FOM estimate for the treatment: trt1 is CAD and trt2 is the radiologist average.
  • StdErr is the standard error of the estimate.
  • DF is the degrees of freedom.
  • CILower and CIUpper are the lower and upper limits of the 95 percent confidence interval.
  • Cov2 is \(\text{Cov}_2\) calculated over the readers within the individual treatment.

Shown next are the confidence interval statistics x$ciDiffFom between the two treatments (“trt1-trt2” = CAD - RAD):


print(x$ciDiffFom)
#>            Estimate     StdErr       DF        t      PrGTt     CILower
#> trt2-trt1 0.1185061 0.05811861 937.2437 2.039039 0.04172626 0.004448434
#>             CIUpper
#> trt2-trt1 0.2325638

The difference figure of merit statistics are contained in a dataframe x$ciDiffFom with elements:

  • Estimate contains the difference FOM estimate.
  • StdErr contains the standard error of the difference FOM estimate.
  • DF contains the degrees of freedom of the t-statistic.
  • t contains the value of the t-statistic.
  • PrGTt contains the probability of exceeding the magnitude of the t-statistic, i.e., the p-value.
  • CILower is the lower limit of the confidence interval for the difference FOM.
  • CIUpper is the upper limit of the confidence interval for the difference FOM.


10.8.3 Example 3

The last example shows the structure of RRRC_1T_PCL_0_2.

RRRC_1T_PCL_0_2
#> $fomCAD
#> [1] 0.5916667
#> 
#> $fomRAD
#> [1] 0.6945313 0.6500000 0.8062500 0.7250000 0.6598214 0.7684524 0.7375000
#> [8] 0.6750000 0.6750000
#> 
#> $avgRadFom
#> [1] 0.7101728
#> 
#> $CIAvgRad
#> [1] 0.5961151 0.8242305
#> 
#> $avgDiffFom
#> [1] 0.1185061
#> 
#> $CIAvgDiffFom
#> [1] 0.004448434 0.232563801
#> 
#> $varR
#> [1] 0.002808612
#> 
#> $varError
#> [1] 0.005344538
#> 
#> $cov2
#> [1] 0.003065705
#> 
#> $Tstat
#>     rdr2 
#> 2.039039 
#> 
#> $df
#>     rdr2 
#> 937.2437 
#> 
#> $pval
#>       rdr2 
#> 0.04172626

The differences from RRFC_1T_PCL_0_2 are listed next:

  • varR is \(\sigma_R^2\) of the single treatment model for comparing CAD to RAD, Eqn. (10.15).
  • cov2 is \(\text{Cov}_2\) of the single treatment model for comparing CAD to RAD.
  • varError is \(\text{Var}\) of the single treatment model for comparing CAD to RAD.

Notice that the RRRC_1T_PCL_0_2 p value, i.e., 0.0417263, is identical to that of RRRC_2T_PCL_0_2, i.e., 0.0417263.

10.9 Appendix 2

Two text files in the directory R/standalone-cad/, one of which is jaf_truth.txt, were provided by Prof. Nico Karssemeijer. These are read into a dataset object by the following code.

source(here::here("R/standalone-cad/DfReadLrocDataFile.R"))
lrocDataset <- DfReadLrocDataFile()

REFERENCES

Chakraborty, Dev P. 2017. Observer Performance Methods for Diagnostic Imaging: Foundations, Modeling, and Applications with R-Based Examples. Boca Raton, FL: CRC Press.
De Boo, Diederick W, Martin Uffmann, Michael Weber, Shandra Bipat, Eelco F Boorsma, Maeke J Scheerder, Nicole J Freling, and Cornelia M Schaefer-Prokop. 2011. “Computer-Aided Detection of Small Pulmonary Nodules in Chest Radiographs: An Observer Study.” Academic Radiology 18 (12): 1507–14.
Dorfman, Donald D, Kevin S Berbaum, and Charles E Metz. 1992. “Receiver Operating Characteristic Rating Analysis: Generalization to the Population of Readers and Patients with the Jackknife Method.” Investigative Radiology 27 (9): 723–31.
Edwards, Darrin C, Matthew A Kupinski, Charles E Metz, and Robert M Nishikawa. 2002. “Maximum Likelihood Fitting of FROC Curves Under an Initial-Detection-and-Candidate-Analysis Model.” Medical Physics 29 (12): 2861–70.
Efron, Bradley, and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC press.
Hein, Patrick A, Lasse D Krug, Valentina C Romano, Sonja Kandel, Bernd Hamm, and Patrik Rogalla. 2010. “Computer-Aided Detection in Computed Tomography Colonography with Full Fecal Tagging: Comparison of Standalone Performance of 3 Automated Polyp Detection Systems.” Canadian Association of Radiologists Journal 61 (2): 102–8.
Hillis, Stephen L. 2007. “A Comparison of Denominator Degrees of Freedom Methods for Multiple Observer (ROC) Analysis.” Statistics in Medicine 26 (3): 596–619.
Hillis, Stephen L, Nancy A Obuchowski, Kevin M Schartz, and Kevin S Berbaum. 2005. “A Comparison of the Dorfman–Berbaum–Metz and Obuchowski–Rockette Methods for Receiver Operating Characteristic (ROC) Data.” Statistics in Medicine 24 (10): 1579–1607.
Hupse, Rianne, Maurice Samulski, Marc Lobbes, Ard den Heeten, Mechli W Imhof-Tas, David Beijerinck, Ruud Pijnappel, Carla Boetes, and Nico Karssemeijer. 2013. “Standalone Computer-Aided Detection Compared to Radiologists’ Performance for the Detection of Mammographic Masses.” European Radiology 23 (1): 93–100. https://doi.org/10.1007/s00330-012-2562-7.
Kooi, Thijs, Albert Gubern-Merida, Jan-Jurre Mordang, Ritse Mann, Ruud Pijnappel, Klaas Schuur, Ard den Heeten, and Nico Karssemeijer. 2016. “A Comparison Between a Deep Convolutional Neural Network and Radiologists for Classifying Regions of Interest in Mammography.” In International Workshop on Breast Imaging, 51–56. Springer.
Metz, Charles E, Stuart J Starr, and Lee B Lusted. 1976. “Observer Performance in Detecting Multiple Radiographic Signals: Prediction and Analysis Using a Generalized ROC Approach.” Radiology 121 (2): 337–47.
Obuchowski, Nancy A., and Howard E. Rockette. 1995. “Hypothesis Testing of Diagnostic Accuracy for Multiple Readers and Multiple Tests an Anova Approach with Dependent Observations.” Communications in Statistics-Simulation and Computation 24 (2): 285–308.
Rao, Vijay M, David C Levin, Laurence Parker, Barbara Cavanaugh, Andrea J Frangos, and Jonathan H Sunshine. 2010. “How Widely Is Computer-Aided Detection Used in Screening and Diagnostic Mammography?” Journal of the American College of Radiology 7 (10): 802–5.
Starr, Stuart J, Charles E Metz, Lee B Lusted, and David J Goodenough. 1975. “Visual Detection and Localization of Radiographic Images.” Radiology 116 (3): 533–38.
Summers, Ronald M, Laurie R Handwerker, Perry J Pickhardt, Robert L Van Uitert, Keshav K Deshpande, Srinath Yeshwant, Jianhua Yao, and Marek Franaszek. 2008. “Performance of a Previously Validated CT Colonography Computer-Aided Detection System in a New Patient Population.” American Journal of Roentgenology 191 (1): 168–74.
Swensson, Richard G. 1996. “Unified Measurement of Observer Performance in Detecting and Localizing Target Objects on Images.” Medical Physics 23 (10): 1709–25.
Tan, Tao, Bram Platel, Henkjan Huisman, Clara Sánchez, Roel Mus, and Nico Karssemeijer. 2012. “Computer-Aided Lesion Diagnosis in Automated 3-D Breast Ultrasound Using Coronal Spiculation.” IEEE Transactions on Medical Imaging 31 (5): 1034–42.
Taylor, Stuart A, Steve Halligan, David Burling, Mary E Roddie, Lesley Honeyfield, Justine McQuillan, Hamdam Amin, and Jamshid Dehmeshki. 2006. “Computer-Assisted Reader Software Versus Expert Reviewers for Polyp Detection on CT Colonography.” American Journal of Roentgenology 186 (3): 696–702.