Chapter 10 Standalone CAD
10.2 Introduction
In the US the majority of screening mammograms are analyzed by computer-aided detection (CAD) algorithms (Rao et al. 2010), and almost all major imaging device manufacturers provide CAD as part of their imaging workstation display software. In the United States CAD is approved for use as a second reader: the radiologist first interprets the images (typically four views, two of each breast) without CAD; the CAD information (i.e., cued suspicious regions, possibly shown with associated probabilities of malignancy) is then displayed and the radiologist has the opportunity to revise the initial interpretation. In response to this FDA-approved second-reader usage, the evolution of CAD algorithms has been guided mainly by comparing the observer performance of radiologists with and without CAD.
Clinical CAD systems sometimes report only the locations of suspicious regions, i.e., they may not provide ratings. Analysis of this type of data is deferred to a following TBA chapter. However, a malignancy index (a continuous variable) for every CAD-found suspicious region is available to the algorithm designer (Edwards et al. 2002). Standalone performance, i.e., the performance of designer-level CAD by itself, regarded as an algorithmic reader, relative to radiologists, is rarely measured. In breast cancer screening I am aware of only one study (Hupse et al. 2013) in which standalone performance was measured. [Standalone performance has been measured in CAD for computed tomography colonography, chest radiography and three-dimensional ultrasound (Hein et al. 2010; Summers et al. 2008; Taylor et al. 2006; De Boo et al. 2011; Tan et al. 2012).]
One possible reason for not measuring standalone performance of CAD is the lack of an accepted assessment method for such measurements. This chapter removes that impediment. It describes a method for comparing standalone performance of designer-level CAD to a group of radiologists interpreting the same cases and compares the method to those described in two relevant publications (Hupse et al. 2013; Kooi et al. 2016).
10.3 Overview
This chapter extends the method used in a study of standalone CAD performance (Hupse et al. 2013), termed one-treatment random-reader fixed-case or 1T-RRFC analysis: CAD is treated as an additional reader within a single treatment, and the analysis accounts for reader variability but not for case variability.
The extension includes the effect of case-sampling variability and is hence termed one-treatment random-reader random-case or 1T-RRRC analysis. The method is based on an existing method allowing comparison of the average performance of readers in a single treatment to a specified value. The key modification is to regard the difference in performance between the radiologists and CAD as the figure of merit, to which the existing work is directly applicable. The 1T-RRRC method is compared to 1T-RRFC.
The 1T-RRRC method is also compared to an unorthodox usage of the conventional multiple-treatment multiple-reader method, termed 2T-RRRC analysis, which involves replicating the CAD ratings as many times as there are radiologists, in effect simulating a second treatment, i.e., CAD is regarded as the second treatment (with identical readers within this treatment), to which existing methods (DBM or OR, as described in RJafrocRocBook) are applied.
10.4 Methods
Summarized are two relevant studies of CAD vs. radiologists in mammography. This is followed by comments on the methods used in the two studies. The second study used multi-treatment multi-reader receiver operating characteristic (ROC) software in an unorthodox way. A statistical model and analysis method is described that avoids the unorthodox usage of ROC software and has fewer model parameters.
10.4.1 Studies assessing performance of CAD vs. radiologists
The first study (Hupse et al. 2013) measured performance in finding and localizing lesions in mammograms, i.e., visual search was involved, while the second study (Kooi et al. 2016) measured lesion classification performance between non-diseased and diseased regions of interest (ROIs) previously found on mammograms by an independent algorithmic reader, i.e., visual search was not involved.
10.4.1.1 Study - 1
The first study (Hupse et al. 2013) compared the standalone performance of a CAD device to that of 9 radiologists interpreting the same cases (120 non-diseased and 80 with a single malignant mass per case). It used the LROC (localization ROC) paradigm (Starr et al. 1975; Metz, Starr, and Lusted 1976; Swensson 1996), in which the observer gives an overall rating for presence of disease (an integer 0 to 100 scale was used) and indicates the location of the most suspicious region. On a non-diseased case the rating is classified as a false positive (FP); on a diseased case it is classified as a correct localization (CL) if the indicated location is sufficiently close to the lesion, and otherwise as an incorrect localization. For a given reporting threshold, the number of correct localizations divided by the number of diseased cases estimates the probability of correct localization (PCL) at that threshold, and the number of FPs divided by the number of non-diseased cases estimates the probability of a false positive, or false positive fraction (FPF), at that threshold. The plot of PCL (ordinate) vs. FPF defines the empirical LROC curve. Study - 1 used as figures of merit (FOMs) the interpolated PCL at two values of FPF, namely FPF = 0.05 and FPF = 0.2, denoted \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\), respectively. A t-test between the radiologist \(\text{PCL}_{\text{FPF}}\) values and that of CAD was used to compute the two-sided p-value for rejecting the NH of equal performance. Study - 1 reported p-value = 0.17 for \(\text{PCL}_{0.05}\) and p-value \(\leq\) 0.001, with CAD being inferior, for \(\text{PCL}_{0.2}\).
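The following minimal sketch illustrates how \(\text{PCL}_\text{FPF}\) can be computed from LROC data; the input names (`fpRatings`, `clRatings`, `correctLocalization`) are hypothetical, and the linear interpolation of the empirical curve is one reasonable choice, not necessarily the one used in Study - 1:

```r
# Minimal sketch: interpolated PCL at a chosen FPF from LROC data.
# Hypothetical inputs: fpRatings = rating of each non-diseased case,
# clRatings = rating of the most suspicious region on each diseased case,
# correctLocalization = TRUE if that region was close enough to the lesion.
empiricalPCL <- function(fpRatings, clRatings, correctLocalization, FPFValue) {
  thresholds <- sort(unique(c(fpRatings, clRatings)), decreasing = TRUE)
  FPF <- sapply(thresholds, function(z) mean(fpRatings >= z))
  PCL <- sapply(thresholds, function(z) mean(clRatings >= z & correctLocalization))
  # linearly interpolate the empirical LROC curve at FPF = FPFValue
  approx(c(0, FPF, 1), c(0, PCL, max(PCL)), xout = FPFValue, ties = max)$y
}
```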
10.4.1.2 Study - 2
The second study (Kooi et al. 2016) used 199 diseased and 199 non-diseased ROIs extracted by an independent CAD algorithm. These were analyzed by a different CAD algorithmic observer from that used to determine the ROIs, and by four expert radiologists. In both cases the ROC paradigm was used, i.e., a rating was obtained for each ROI. The figure of merit was the empirical area (AUC) under the respective ROC curves (one for each radiologist and one for CAD). The p-value for the difference between the average radiologist AUC and the CAD AUC was determined using an unorthodox application of the Dorfman-Berbaum-Metz (Dorfman, Berbaum, and Metz 1992) multiple-treatment multiple-reader multiple-case (DBM-MRMC) software.
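For reference, the empirical AUC equals the Wilcoxon statistic; a minimal sketch follows (the rating vectors `x` and `y` are hypothetical names, not part of either study):

```r
# Empirical AUC via the Wilcoxon statistic: the fraction of all
# (non-diseased, diseased) rating pairs that are correctly ordered,
# with ties counted as one half.
# x = ratings of non-diseased ROIs, y = ratings of diseased ROIs.
wilcoxonAUC <- function(x, y) {
  mean(outer(y, x, ">") + 0.5 * outer(y, x, "=="))
}
```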
The application was unorthodox in the sense that in the input data file radiologists and CAD were entered as two treatments. In conventional (or orthodox) DBM-MRMC each reader provides two ratings per case, and the data file would consist of paired ratings of a set of cases interpreted by 4 readers. To accommodate the paired data structure assumed by the software, the authors of Study - 2 replicated the CAD ratings four times in the input data file, as explained in the caption to Table 10.1. By this artifice they converted a single-treatment 5-reader (4 radiologists plus CAD) data file to a two-treatment 4-reader data file in which the four readers in treatment 1 were the radiologists, and the four "readers" in treatment 2 were CAD replicated ratings. Note that for each case the four readers in the second treatment have identical ratings. In Table 10.1 the replicated CAD readers are labeled C1, C2, C3 and C4.
Table 10.1: Data structure assumed by DBM-MRMC software. Left half: a conventional two-treatment data file in which each of four radiologists (R1-R4) interprets each of 398 cases in both treatments. Right half: the Study - 2 data file, in which treatment 1 contains the radiologists' ratings and treatment 2 contains the CAD ratings replicated four times (labeled C1-C4); for each case the four "readers" in treatment 2 have identical ratings.

Reader | Treatment | Case | Rating | Reader | Treatment | Case | Rating
---|---|---|---|---|---|---|---
R1 | 1 | 1 | 75 | R1 | 1 | 1 | 75
… | … | … | … | … | … | … | …
R1 | 1 | 398 | 0 | R1 | 1 | 398 | 0
… | … | … | … | … | … | … | …
R4 | 1 | 1 | 50 | R4 | 1 | 1 | 50
… | … | … | … | … | … | … | …
R4 | 1 | 398 | 25 | R4 | 1 | 398 | 25
R1 | 2 | 1 | 45 | C1 | 2 | 1 | 55
… | … | … | … | … | … | … | …
R1 | 2 | 398 | 25 | C1 | 2 | 398 | 5
… | … | … | … | … | … | … | …
R4 | 2 | 1 | 95 | C4 | 2 | 1 | 55
… | … | … | … | … | … | … | …
R4 | 2 | 398 | 20 | C4 | 2 | 398 | 5
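A short sketch of the replication artifice follows; the matrices `radRatings` and `cadRatings` are hypothetical stand-ins for the actual ratings:

```r
# Build the unorthodox two-treatment file: treatment 1 = 4 radiologists,
# treatment 2 = the CAD ratings replicated 4 times ("readers" C1-C4).
J <- 4; K <- 398
set.seed(1)
radRatings <- matrix(round(runif(J * K, 0, 100)), J, K)  # hypothetical ratings
cadRatings <- round(runif(K, 0, 100))                    # hypothetical CAD ratings
twoTreatmentFile <- data.frame(
  reader    = c(rep(paste0("R", 1:J), each = K), rep(paste0("C", 1:J), each = K)),
  treatment = rep(1:2, each = J * K),
  case      = rep(1:K, times = 2 * J),
  rating    = c(as.vector(t(radRatings)), rep(cadRatings, times = J))
)
```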
Study - 2 reported a non-significant difference between CAD and the radiologists (p = 0.253).
10.4.1.3 Comments
For the purpose of this work, which focuses on the respective analysis methods, the difference in observer performance paradigms between the two studies, namely a search paradigm in Study - 1 vs. an ROI classification paradigm in Study - 2, is inconsequential. The t-test used in Study - 1 treats the case-sample as fixed: the analysis accounts for reader variability but not for case-sampling variability. While not explicitly stated, the reason for the unorthodox analysis in Study - 2 was the desire to include case-sampling variability. Prof. Karssemeijer (private communication, 10/27/2017) had consulted with a few ROC experts to determine if the procedure used in Study - 2 was valid, and while the experts thought it was probably valid they were not sure.
In what follows, the analysis in Study – 1 is referred to as single-treatment random-reader fixed-case (1T-RRFC) while that in Study – 2 is referred to as dual-treatment random-reader random-case (2T-RRRC).
10.4.2 The 1T-RRFC analysis model
The sampling model for the FOM is:
\[\begin{equation} 
\left. 
\begin{aligned} 
\theta_j=\mu+R_j \\ 
\left (j = 1,2,...,J \right ) 
\end{aligned} 
\right \} 
\tag{10.1} 
\end{equation}\]

Here \(\mu\) is a constant, \(\theta_j\) is the FOM for reader \(j\), and \(R_j\) is the random contribution for reader \(j\), distributed as:
\[\begin{equation} 
R_j \sim N\left ( 0,\sigma_R^2 \right ) 
\tag{10.2} 
\end{equation}\]

Because of the assumed normal distribution of \(R_j\), comparing the readers to a fixed value, namely that of CAD, denoted \(\theta_0\), reduces to a t-test (in effect a one-sample t-test of the differences \(\theta_j - \theta_0\)), as done in Study - 1. As evident from the model, no allowance is made for case-sampling variability, which is the reason for calling this the 1T-RRFC method.
Performance of CAD on a fixed dataset does exhibit within-CAD variability, i.e., CAD applied repeatedly to a fixed dataset does not always produce the same mark-rating data. However, this source of within-CAD variability is much smaller than the inter-reader variability of radiologists interpreting the same dataset. The within-reader variability of radiologists is smaller than their inter-reader variability, and within-CAD variability is smaller still. For this reason one is justified in regarding \(\theta_0\) as a fixed quantity for a given dataset. Varying the dataset will result in different values of \(\theta_0\), reflecting case-sampling variability, which needs to be accounted for as done in the following analyses.
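A minimal sketch of the 1T-RRFC computation follows; `thetaRad` (the \(J\) radiologist FOMs) and `thetaCAD` (the fixed CAD FOM) are hypothetical input names:

```r
# 1T-RRFC: one-sample t-test of the radiologist-minus-CAD difference FOMs,
# treating the case-sample (and hence thetaCAD) as fixed.
oneT_RRFC <- function(thetaRad, thetaCAD) {
  psi <- thetaRad - thetaCAD                 # difference FOMs
  J <- length(psi)
  tStat <- mean(psi) / sqrt(var(psi) / J)    # t-statistic, J - 1 df
  pval <- 2 * pt(abs(tStat), df = J - 1, lower.tail = FALSE)
  list(avgDiffFom = mean(psi), varR = var(psi),
       Tstat = tStat, df = J - 1, pval = pval)
}
```

Applied to the nine radiologist \(\text{PCL}_{0.2}\) values and the CAD value listed in Appendix Example 1, this reproduces the `Tstat` = 6.71, `df` = 8 and `pval` = 0.00015 shown there.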
10.4.3 The 2T-RRRC analysis model
This could be termed the conventional or the orthodox method. There are two treatments and the study design is fully crossed: each reader interprets each case in each treatment, i.e., the data structure is as in the left half of Table 10.1.
The following approach, termed 2T-RRRC, uses the Obuchowski and Rockette (OR) figure of merit sampling model (Obuchowski and Rockette 1995). The OR model is:
\[\begin{equation} 
\theta_{ij\{c\}}=\mu+\tau_i+R_j+\left ( \tau R \right )_{ij}+\epsilon_{ij\{c\}} 
\tag{10.3} 
\end{equation}\]

Here \(i\) (\(i = 1, 2\)) is the treatment index, \(j\) (\(j = 1, ..., J\)) is the reader index, and \(\theta_{ij\{c\}}\) is the figure of merit in treatment \(i\) for reader \(j\) and case-sample \(\{c\}\). A case-sample is a set or ensemble of cases, diseased and non-diseased, and different integer values of \(c\) correspond to different case-samples; the cases within a case-sample are indexed by \(k\) (\(k = 1, ..., K\)).
The first two terms on the right hand side of Eqn. (10.3) are fixed effects (average performance and treatment effect, respectively). The next two terms are random effect variables that, by assumption, are sampled as follows:
\[\begin{equation} 
\left. 
\begin{aligned} 
R_j \sim N\left ( 0,\sigma_R^2 \right )\\ 
\left ( \tau R \right )_{ij} \sim N\left ( 0,\sigma_{\tau R}^2 \right )\\ 
\end{aligned} 
\right \} 
\tag{10.4} 
\end{equation}\]

The term \(R_j\) represents the random treatment-independent contribution of reader \(j\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_R^2\); \(\left ( \tau R \right )_{ij}\) represents the random treatment-dependent contribution of reader \(j\) in treatment \(i\), modeled as a sample from a zero-mean normal distribution with variance \(\sigma_{\tau R}^2\). The sampling of the last (error) term is described by:
\[\begin{equation} 
\epsilon_{ij\{c\}}\sim N_{I \times J}\left ( \vec{0} , \Sigma \right ) 
\tag{10.5} 
\end{equation}\]

Here \(N_{I \times J}\) is the \(I \times J\) variate normal distribution and \(\vec{0}\), an \(I \times J\) length zero-vector, represents the mean of the distribution. The \(\{I \times J\} \times \{I \times J\}\) dimensional covariance matrix \(\Sigma\) is defined by four parameters, \(\text{Var}\), \(\text{Cov}_1\), \(\text{Cov}_2\), \(\text{Cov}_3\), as follows:
\[\begin{equation} 
\text{Cov} \left (\epsilon_{ij\{c\}},\epsilon_{i'j'\{c\}} \right ) = \left\{\begin{matrix} 
\text{Var} & (i=i',j=j') \\ 
\text{Cov}_1 & (i\ne i',j=j')\\ 
\text{Cov}_2 & (i = i',j \ne j')\\ 
\text{Cov}_3 & (i\ne i',j \ne j') 
\end{matrix}\right. 
\tag{10.6} 
\end{equation}\]

Software (the University of Iowa MRMC software and `RJafroc`) yields estimates of all terms appearing on the right hand side of Eqn. (10.6). Excluding fixed effects, the model represented by Eqn. (10.3) contains six parameters:

\[\begin{equation} 
\sigma_R^2, \sigma_{\tau R}^2, \text{Var}, \text{Cov}_1, \text{Cov}_2, \text{Cov}_3 
\tag{10.7} 
\end{equation}\]
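As a worked illustration of Eqn. (10.6), for \(I = J = 2\), with the error components ordered \((ij) = (11, 12, 21, 22)\), the covariance matrix is:

\[
\Sigma = \begin{pmatrix}
\text{Var} & \text{Cov}_2 & \text{Cov}_1 & \text{Cov}_3\\
\text{Cov}_2 & \text{Var} & \text{Cov}_3 & \text{Cov}_1\\
\text{Cov}_1 & \text{Cov}_3 & \text{Var} & \text{Cov}_2\\
\text{Cov}_3 & \text{Cov}_1 & \text{Cov}_2 & \text{Var}
\end{pmatrix}
\]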
The meanings of the last four terms are described in (Hillis 2007; Obuchowski and Rockette 1995; Hillis et al. 2005; Dev P. Chakraborty 2017). Briefly, \(\text{Var}\) is the variance of a reader's FOMs, in a given treatment, over interpretations of different case-samples, averaged over readers and treatments; \(\text{Cov}_1/\text{Var}\) is the correlation of a reader's FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment same-reader pairings; \(\text{Cov}_2/\text{Var}\) is the correlation of different readers' FOMs, over interpretations of different case-samples in the same treatment, averaged over all same-treatment different-reader pairings; and finally, \(\text{Cov}_3/\text{Var}\) is the correlation of different readers' FOMs, over interpretations of different case-samples in different treatments, averaged over all different-treatment different-reader pairings. One expects the following inequalities to hold:
\[\begin{equation} 
\text{Var} \geq \text{Cov}_1 \geq \text{Cov}_2 \geq \text{Cov}_3 
\tag{10.8} 
\end{equation}\]

In practice, since one is usually limited to one case-sample, i.e., \(c = 1\), resampling techniques (Efron and Tibshirani 1994), e.g., the jackknife, are used to estimate these terms.
10.4.4 The 1T-RRRC analysis model
The difference from the approach in Study - 2, and the main contribution of this work, is to regard standalone CAD as a different reader, not as a different treatment. This section describes a single treatment method for analyzing readers and CAD, where CAD is regarded as an additional reader and artificially replicated CAD data becomes unnecessary. Accordingly the proposed method is termed single-treatment random-reader random-case (1T-RRRC) analysis.
The starting point is the (Obuchowski and Rockette 1995) model for a single treatment, which for the radiologists (i.e., excluding CAD) interpreting a common case set reduces to the following model:
\[\begin{equation} 
\theta_{j\{c\}}=\mu+R_j+\epsilon_{j\{c\}} 
\tag{10.9} 
\end{equation}\]

Here \(\theta_{j\{c\}}\) is the figure of merit for radiologist \(j\) (\(j = 1, 2, ..., J\)) interpreting case-sample \(\{c\}\); \(R_j\) is the random effect of radiologist \(j\) and \(\epsilon_{j\{c\}}\) is the error term. For single-treatment multiple-reader interpretations the error term is distributed as:
\[\begin{equation} 
\epsilon_{j\{c\}}\sim N_{J}\left ( \vec{0} , \Sigma \right ) 
\tag{10.10} 
\end{equation}\]

The \(J \times J\) covariance matrix \(\Sigma\) is defined by two parameters, \(\text{Var}\) and \(\text{Cov}_2\), as follows:
\[\begin{equation} 
\Sigma_{jj'} = \text{Cov}\left ( \epsilon_{j\{c\}}, \epsilon_{j'\{c\}} \right ) = \left\{\begin{matrix} 
\text{Var} & j = j'\\ 
\text{Cov}_2 & j \neq j' 
\end{matrix}\right. 
\tag{10.11} 
\end{equation}\]

In practice the terms \(\text{Var}\) and \(\text{Cov}_2\) are estimated using the jackknife method, as sketched below.
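The sketch assumes a hypothetical function `fomVector(dataset)` that returns the \(J\) figures of merit and a hypothetical `dropCase(dataset, k)` that deletes case \(k\); neither name is part of `RJafroc`:

```r
# Jackknife estimates of Var and Cov2 for the single-treatment model:
# delete one case at a time, recompute the J FOMs, and apply the
# jackknife scaling to the resulting sample covariance matrix.
jackVarCov2 <- function(dataset, K, fomVector, dropCase) {
  J <- length(fomVector(dataset))
  jkFoms <- matrix(0, K, J)                # row k = FOMs with case k deleted
  for (k in 1:K) jkFoms[k, ] <- fomVector(dropCase(dataset, k))
  covMat <- cov(jkFoms) * (K - 1)^2 / K    # jackknife scaling
  Var  <- mean(diag(covMat))               # average diagonal element
  Cov2 <- mean(covMat[upper.tri(covMat)])  # average off-diagonal element
  c(Var = Var, Cov2 = Cov2)
}
```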
10.4.4.1 Single treatment analysis for radiologists
Hillis (Hillis et al. 2005; Hillis 2007) has described how to use the single-treatment model (10.9) to compare a group of radiologists' average performance to a fixed value, in effect testing \(\text{NH}: \mu = \mu_0\), where \(\mu_0\) is a pre-specified constant.
One might be tempted to set \(\mu_0\) equal to the performance of CAD but that would not be accounting for the fact that the performance of CAD is itself a random variable whose case-sampling variability needs to be accounted for.
10.4.4.2 Adaptation of single treatment analysis to accommodate CAD
Instead, the following model is used for the figure of merit of the radiologists and CAD (note that \(j = 0\) is used to denote the CAD algorithmic reader):
\[\begin{equation} 
\theta_{j\{c\}} = \theta_{0\{c\}} + \Delta \theta + R_j + \epsilon_{j\{c\}}\\ 
\left (j=1,2,...,J \right ) 
\tag{10.12} 
\end{equation}\]

Here \(\theta_{0\{c\}}\) is the CAD figure of merit for case-sample \(\{c\}\) and \(\Delta \theta\) is the average figure of merit increment of the radiologists over CAD. To reduce this model to one to which Hillis' formulae are directly applicable, one subtracts the CAD figure of merit from each radiologist's figure of merit for the same case-sample, and defines this as the difference figure of merit \(\psi_{j\{c\}}\), i.e.,
\[\begin{equation} 
\psi_{j\{c\}} = \theta_{j\{c\}} - \theta_{0\{c\}} 
\tag{10.13} 
\end{equation}\]

Then Eqn. (10.12) reduces to:
\[\begin{equation} 
\psi_{j\{c\}} = \Delta \theta + R_j + \epsilon_{j\{c\}} 
\tag{10.14} 
\end{equation}\]

Eqn. (10.14) is identical in form to Eqn. (10.9) except that the figure of merit on the left hand side is a difference FOM, that between a radiologist and CAD; i.e., it describes a model for \(J\) radiologists interpreting a common case set, each of whose performances is measured relative to that of CAD. Under the NH the expected difference is zero: \(\text{NH:} \Delta \theta = 0\). The method (Hillis et al. 2005; Hillis 2007) for single-treatment multiple-reader analysis is now directly applicable to the model described by Eqn. (10.14).
Apart from fixed effects, the model in Eqn. (10.14) contains three parameters:
\[\begin{equation} 
\sigma_R^2, \text{Var}, \text{Cov}_2 
\tag{10.15} 
\end{equation}\]

Setting \(\text{Var} = 0, \text{Cov}_2 = 0\) yields the 1T-RRFC model, which contains only one random parameter, namely \(\sigma_R^2\). One expects an identical estimate of this parameter using 1T-RRRC analysis.
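The following sketch shows how the 1T-RRRC test statistic can be computed from the difference FOMs using the single-treatment formulae (Hillis et al. 2005; Hillis 2007); `psi` (the \(J\) difference FOMs) and `Cov2` (its jackknife estimate) are assumed given:

```r
# 1T-RRRC significance test of NH: Delta theta = 0, based on the
# difference FOMs psi_j = theta_j - theta_0 (a sketch, not RJafroc code).
oneT_RRRC <- function(psi, Cov2) {
  J <- length(psi)
  msR <- var(psi)                             # between-reader mean square
  stdErr <- sqrt(msR / J + max(Cov2, 0))      # SE of the average difference FOM
  tStat <- mean(psi) / stdErr
  ddf <- (msR + J * max(Cov2, 0))^2 / (msR^2 / (J - 1))  # Hillis ddf
  pval <- 2 * pt(abs(tStat), df = ddf, lower.tail = FALSE)
  list(avgDiffFom = mean(psi), FStat = tStat^2, ddf = ddf, pval = pval)
}
```

With the \(\text{PCL}_{0.2}\) values listed in Appendix Example 3 (\(\text{var}(\psi) = 0.00281\), \(\text{Cov}_2 = 0.00307\)) this reproduces the \(F = 4.16\), \(\text{ddf} = 937.2\) and \(p = 0.042\) of Table 10.2.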
10.5 Implementation
The three analyses, namely random-reader fixed-case (1T-RRFC), dual-treatment random-reader random-case (2T-RRRC) and single-treatment random-reader random-case (1T-RRRC), are implemented in `RJafroc`.
The following code shows usage of the software to generate the results. Note that `RJafroc::datasetCadLroc` is the LROC dataset and `RJafroc::dataset09` is the corresponding ROC dataset.
```r
RRFC_1T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.05, method = "1T-RRFC")
RRRC_2T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.05, method = "2T-RRRC")
RRRC_1T_PCL_0_05 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.05, method = "1T-RRRC")

RRFC_1T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.2, method = "1T-RRFC")
RRRC_2T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.2, method = "2T-RRRC")
RRRC_1T_PCL_0_2 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 0.2, method = "1T-RRRC")

RRFC_1T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 1, method = "1T-RRFC")
RRRC_2T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 1, method = "2T-RRRC")
RRRC_1T_PCL_1 <- RJafroc::StCadVsRad(RJafroc::datasetCadLroc, 
  FOM = "PCL", FPFValue = 1, method = "1T-RRRC")

RRFC_1T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
  FOM = "Wilcoxon", method = "1T-RRFC")
RRRC_2T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
  FOM = "Wilcoxon", method = "2T-RRRC")
RRRC_1T_AUC <- RJafroc::StCadVsRad(RJafroc::dataset09, 
  FOM = "Wilcoxon", method = "1T-RRRC")
```
The results are organized as follows:

- `RRFC_1T_PCL_0_05` contains the results of 1T-RRFC analysis for figure of merit = \(\text{PCL}_{0.05}\).
- `RRRC_2T_PCL_0_05` contains the results of 2T-RRRC analysis for figure of merit = \(\text{PCL}_{0.05}\).
- `RRRC_1T_PCL_0_05` contains the results of 1T-RRRC analysis for figure of merit = \(\text{PCL}_{0.05}\).
- `RRFC_1T_PCL_0_2` contains the results of 1T-RRFC analysis for figure of merit = \(\text{PCL}_{0.2}\).
- `RRRC_2T_PCL_0_2` contains the results of 2T-RRRC analysis for figure of merit = \(\text{PCL}_{0.2}\).
- `RRRC_1T_PCL_0_2` contains the results of 1T-RRRC analysis for figure of merit = \(\text{PCL}_{0.2}\).
- `RRFC_1T_PCL_1` contains the results of 1T-RRFC analysis for figure of merit = \(\text{PCL}_{1}\).
- `RRRC_2T_PCL_1` contains the results of 2T-RRRC analysis for figure of merit = \(\text{PCL}_{1}\).
- `RRRC_1T_PCL_1` contains the results of 1T-RRRC analysis for figure of merit = \(\text{PCL}_{1}\).
- `RRFC_1T_AUC` contains the results of 1T-RRFC analysis for the Wilcoxon figure of merit.
- `RRRC_2T_AUC` contains the results of 2T-RRRC analysis for the Wilcoxon figure of merit.
- `RRRC_1T_AUC` contains the results of 1T-RRRC analysis for the Wilcoxon figure of merit.
The structures of these objects are illustrated with examples in the Appendix.
10.6 Results
The three methods, 1T-RRFC, 2T-RRRC and 1T-RRRC, were applied to an LROC dataset similar to that used in Study - 1 (I thank Prof. Karssemeijer for making this dataset available); the results are summarized in Table 10.2.
Table 10.2: Results of the three analyses. The FOM point estimates (\(\theta_0\), \(\theta_{\bullet}\), \(\psi_{\bullet}\)) are independent of the analysis and are listed only in the first row of each FOM group.

FOM | Analysis | \(\theta_0\) | \(CI_{\theta_0}\) | \(\theta_{\bullet}\) | \(CI_{\theta_{\bullet}}\) | \(\psi_{\bullet}\) | \(CI_{\psi_{\bullet}}\) | F | ddf | p
---|---|---|---|---|---|---|---|---|---|---
PCL_0_05 | 1T-RRFC | 0.45 | NA | 0.493 | (0.42,0.57) | 0.0433 | (-0.032,0.12) | 1.8 | 8 | 0.22
 | 2T-RRRC | | (0.26,0.64) | | (0.38,0.61) | | (-0.16,0.24) | 0.18 | 784 | 0.67
 | 1T-RRRC | | NA | | (0.29,0.69) | | (-0.16,0.24) | 0.18 | 784 | 0.67
PCL_0_2 | 1T-RRFC | 0.592 | NA | 0.71 | (0.67,0.75) | 0.119 | (0.078,0.16) | 45 | 8 | 0.00015
 | 2T-RRRC | | (0.48,0.71) | | (0.63,0.79) | | (0.0044,0.23) | 4.2 | 937 | 0.042
 | 1T-RRRC | | NA | | (0.6,0.82) | | (0.0044,0.23) | 4.2 | 937 | 0.042
PCL_1 | 1T-RRFC | 0.675 | NA | 0.783 | (0.74,0.83) | 0.108 | (0.065,0.15) | 33 | 8 | 0.00043
 | 2T-RRRC | | (0.57,0.78) | | (0.71,0.85) | | (0.0045,0.21) | 4.2 | 493 | 0.041
 | 1T-RRRC | | NA | | (0.68,0.89) | | (0.0045,0.21) | 4.2 | 493 | 0.041
Wilcoxon | 1T-RRFC | 0.817 | NA | 0.849 | (0.83,0.87) | 0.0317 | (0.009,0.055) | 10 | 8 | 0.012
 | 2T-RRRC | | (0.75,0.88) | | (0.81,0.89) | | (-0.031,0.094) | 0.99 | 878 | 0.32
 | 1T-RRRC | | NA | | (0.79,0.91) | | (-0.031,0.094) | 0.99 | 878 | 0.32
Results are shown for the following FOMs: \(\text{PCL}_{0.05}\), \(\text{PCL}_{0.2}\), \(\text{PCL}_{1}\) and the empirical area (AUC) under the ROC curve estimated by the Wilcoxon statistic. The first two FOMs are identical to those used in Study – 1. Columns 3 and 4 list the CAD FOM \(\theta_0\) and its 95% confidence interval \(CI_{\theta_0}\), columns 5 and 6 list the average radiologist FOM \(\theta_{\bullet}\) (the dot symbol represents an average over the non-zero radiologist index j = 1,2,…, 9) and its 95% confidence interval \(CI_{\theta_{\bullet}}\), columns 7 and 8 list the average difference FOM \(\psi_{\bullet}\), i.e., radiologist average minus CAD, and its 95% confidence interval \(CI_{\psi_{\bullet}}\), and the last three columns list the F-statistic, the denominator degrees of freedom (ddf) and the p-value for rejecting the null hypothesis (the numerator degree of freedom of the F-statistic is unity).
The last three columns show that 2T-RRRC and 1T-RRRC analyses yield identical F-statistics, ddf and p-values. So the intuition of the authors of Study - 2, namely that the unorthodox method of using DBM-MRMC software accounts for both reader and case-sampling variability, turns out to be correct. If interest is solely in these statistics one is justified in using the unorthodox method. Important caveats are noted below.
Other results evident in Table 10.2:
- Where a direct comparison is possible, namely 1T-RRFC analysis using \(\text{PCL}_{0.05}\) and \(\text{PCL}_{0.2}\) as FOMs, the p-values in Table 10.2 are very close to those reported in Study – 1.
- All FOMs (i.e., \(\theta_0\), \(\theta_{\bullet}\) and \(\psi_{\bullet}\)) in Table 10.2 are independent of the method of analysis. However, the corresponding confidence intervals (i.e., \(CI_{\theta_0}\), \(CI_{\theta_{\bullet}}\) and \(CI_{\psi_{\bullet}}\)) depend on the analyses.
- Since the CAD figure of merit is a constant no confidence interval is appropriate for it for either 1T-RRFC or 1T-RRRC analysis and the listed values are NA (not applicable). Since 2T-RRRC analysis assumes CAD is a different treatment the analysis lists a confidence interval that is correctly centered on the CAD value but is otherwise meaningless, i.e., it is an artifact of the unintended usage of the OR analysis method.
- The p-value for either RRRC analysis (2T or 1T) is larger than the corresponding 1T-RRFC value: accounting for case-sampling variability increases the p-value, reducing the chance of finding a significant difference.
- The LROC FOMs increase as the value of FPF (the subscript) increases, a general feature of any partial-curve-based figure of merit; likewise, the area (AUC) under the ROC curve is larger than the largest PCL value.
- With either RRRC analysis, ignoring localization information (i.e., using the AUC FOM) leads to a non-significant difference between CAD and the radiologists (\(p = 0.32\)), while using localization information via the \(\text{PCL}_1\) FOM yields a significant difference (\(p = 0.041\)), consistent with the expectation that using localization information increases statistical power.
- Partial-curve-based FOMs, such as \(\text{PCL}_\text{FPF}\), lead, depending on the choice of \(\text{FPF}\), to different conclusions on whether to reject the NH. With either RRRC analysis the p-values decrease as \(\text{FPF}\) increases (e.g., \(0.67 > 0.042 > 0.041\)). This trend is not observed for 1T-RRFC analysis, which shows a "sweet-spot" effect: the p-value has a minimum at \(\text{FPF} = 0.2\).
Shown next, in Table 10.3, are the model parameters corresponding to the three analyses.
Table 10.3: Model parameter estimates corresponding to the three analyses. NA = not applicable to the model in question.

FOM | Analysis | \(\sigma_R^2\) | \(\sigma_{\tau R}^2\) | \(\text{Cov}_1\) | \(\text{Cov}_2\) | \(\text{Cov}_3\) | \(\text{Var}\)
---|---|---|---|---|---|---|---
PCL_0_05 | 1T-RRFC | 0.0095 | NA | NA | NA | NA | NA
 | 2T-RRRC | -1.1e-19 | -0.00571 | 0.00131 | 0.00601 | 0.00131 | 0.0165
 | 1T-RRRC | 0.0095 | NA | NA | 0.0094 | NA | 0.0303
PCL_0_2 | 1T-RRFC | 0.00281 | NA | NA | NA | NA | NA
 | 2T-RRRC | -4.9e-19 | 0.000265 | 0.000761 | 0.00229 | 0.000761 | 0.00343
 | 1T-RRRC | 0.00281 | NA | NA | 0.00307 | NA | 0.00534
PCL_1 | 1T-RRFC | 0.0032 | NA | NA | NA | NA | NA
 | 2T-RRRC | 6e-19 | 0.001 | 0.000643 | 0.00186 | 0.000643 | 0.00246
 | 1T-RRRC | 0.0032 | NA | NA | 0.00244 | NA | 0.00364
Wilcoxon | 1T-RRFC | 0.000878 | NA | NA | NA | NA | NA
 | 2T-RRRC | 7.9e-19 | 0.000201 | 0.000262 | 0.000724 | 0.000262 | 0.000962
 | 1T-RRRC | 0.000878 | NA | NA | 0.000924 | NA | 0.0014
From Table 10.3 some inconsistencies are evident for 2T-RRRC analysis:
- For 2T-RRRC analyses the listed values of \(\sigma_R^2\) are smaller than machine accuracy, i.e., effectively zero, which is clearly an incorrect result since the radiologists do not have identical performances. In contrast, 1T-RRRC analysis yields the expected non-zero values, identical to those obtained by 1T-RRFC analysis (see comment following Eqn. (10.15)).
- For the 2T-RRRC method the expected ordering of the inequalities, Eqn. (10.8), is not observed: one expects \(\text{Cov}_1 \geq \text{Cov}_2 \geq \text{Cov}_3\) but instead one observes \(\text{Cov}_1 = \text{Cov}_3\) and \(\text{Cov}_2 > \text{Cov}_1\).
The design of a ratings simulator to statistically match a given dataset is addressed in Chapter 23 of my print book (Dev P. Chakraborty 2017). Using this simulator, the 1T-RRRC method had the expected null hypothesis behavior (Table 23.5, ibid).
10.7 Discussion
Described is an extension of the analysis used in Study - 1 that accounts for case-sampling variability. It extends the (Hillis et al. 2005) single-treatment analysis to a situation where one of the "readers" is a special reader subject only to case-sampling variability, and the desire is to compare the performance of this special reader to the average of the remaining readers. Usage of the method, along with two other methods, is illustrated using an LROC dataset.
The proposed method, 1T-RRRC analysis, yields "overall" results (specifically the F-statistic, degrees of freedom and p-value) identical to those yielded by the unorthodox application of commonly available software, termed 2T-RRRC analysis, in which CAD is regarded as a second treatment (specifically, the CAD ratings are replicated to match the number of radiologists). If interest is in just these values one is justified in using the 2T-RRRC method. However, the 2T-RRRC model parameter estimates are unrealistic: for example, the analysis yields zero between-reader variance. The result \(\sigma_R^2 = 0\) is clearly an artifact. One can only speculate as to what happens when software is used in a manner it was not designed for: perhaps finding that all readers in the second treatment have identical FOMs led the software to yield \(\sigma_R^2 = 0\). Additionally, the covariance estimates are incorrect. Since sample-size estimation requires some of the covariance values, the 2T-RRRC method should never be used to perform sample-size estimation for a prospective study.
The 1T-RRRC method described here is applicable to any scalar figure of merit. The paradigm used to collect the observer performance data - ROC, FROC, LROC or ROI - is irrelevant.
Assessing CAD utility by measuring performance with and without CAD may have inadvertently set a low bar for CAD to be considered useful. For example, CAD is not penalized for missing cancers as long as the radiologist finds them, and CAD is not penalized for excessive false positives (FPs) as long as the radiologist ignores them. Moreover, since both such measurements include the variability of radiologists, additional noise is introduced that presumably makes it harder to determine if the CAD system is optimal.
In my opinion standalone performance is the most direct measure of CAD performance. Lack of a clear-cut method for assessing standalone CAD performance may have limited past CAD research. The current work hopefully removes that impediment. Going forward, assessment of standalone performance of CAD vs. expert radiologists is strongly encouraged.
10.8 Appendix 1
The structures of the R objects generated by the software are illustrated with three examples.
10.8.1 Example 1
The first example shows the structure of `RRFC_1T_PCL_0_2`.
```r
x <- RRFC_1T_PCL_0_2
fom_individual_rad <- as.data.frame(t(x$fomRAD))
colnames(fom_individual_rad) <- paste0("rdr", seq(1:9))
stats <- data.frame(fomCAD = x$fomCAD, 
                    avgRadFom = x$avgRadFom, 
                    avgDiffFom = x$avgDiffFom, 
                    varR = x$varR, 
                    Tstat = x$Tstat, 
                    df = x$df, 
                    pval = x$pval)
ConfidenceIntervals <- data.frame(CIAvgRadFom = x$CIAvgRadFom, 
                                  CIAvgDiffFom = x$CIAvgDiffFom)
rownames(ConfidenceIntervals) <- c("Lower", "Upper")

print(fom_individual_rad)
#>        rdr1 rdr2    rdr3  rdr4      rdr5      rdr6   rdr7  rdr8  rdr9
#> 1 0.6945313 0.65 0.80625 0.725 0.6598214 0.7684524 0.7375 0.675 0.675
print(stats)
#>      fomCAD avgRadFom avgDiffFom        varR    Tstat df         pval
#> 1 0.5916667 0.7101728  0.1185061 0.002808612 6.708357  8 0.0001513966
print(ConfidenceIntervals)
#>       CIAvgRadFom CIAvgDiffFom
#> Lower   0.6694362   0.07776953
#> Upper   0.7509094   0.15924271
```
The results are displayed as three data frames. The first data frame, `fom_individual_rad`, shows the figures of merit for the nine radiologists in the study. The next data frame, `stats`, summarizes the statistics:

- `fomCAD` is the figure of merit for CAD.
- `avgRadFom` is the average figure of merit of the nine radiologists in the study.
- `avgDiffFom` is the average difference figure of merit, RAD - CAD.
- `varR` is the variance of the figures of merit for the nine radiologists in the study.
- `Tstat` is the t-statistic for testing the NH that the average difference FOM `avgDiffFom` is zero; its square is the F-statistic.
- `df` is the degrees of freedom of the t-statistic.
- `pval` is the p-value for rejecting the NH. In the example shown above the value is highly significant.
The last data frame, `ConfidenceIntervals`, summarizes the 95 percent confidence intervals:

- `CIAvgRadFom` is the 95 percent confidence interval, listed as a (`Lower`, `Upper`) pair, for `avgRadFom`.
- `CIAvgDiffFom` is the 95 percent confidence interval for `avgDiffFom`.
- If the `CIAvgDiffFom` pair excludes zero, the difference is statistically significant. In the example the interval excludes zero, showing that the FOM difference is significant.
10.8.2 Example 2
The next example shows the structure of `RRRC_2T_PCL_0_2`.
```r
x <- RRRC_2T_PCL_0_2
fom_individual_rad <- as.data.frame(t(x$fomRAD))
colnames(fom_individual_rad) <- paste0("rdr", seq(1:9))
stats1 <- data.frame(fomCAD = x$fomCAD, 
                     avgRadFom = x$avgRadFom, 
                     avgDiffFom = x$avgDiffFom)
stats2 <- data.frame(varR = x$varR, varTR = x$varTR, 
                     cov1 = x$cov1, cov2 = x$cov2, 
                     cov3 = x$cov3, Var = x$varError, 
                     FStat = x$FStat, df = x$df, pval = x$pval)

print(fom_individual_rad)
#>        rdr1 rdr2    rdr3  rdr4      rdr5      rdr6   rdr7  rdr8  rdr9
#> 1 0.6945313 0.65 0.80625 0.725 0.6598214 0.7684524 0.7375 0.675 0.675
print(stats1)
#>      fomCAD avgRadFom avgDiffFom
#> 1 0.5916667 0.7101728  0.1185061
print(stats2)
#>           varR        varTR         cov1        cov2         cov3         Var
#> 1 -4.87891e-19 0.0002648898 0.0007613684 0.002294221 0.0007613684 0.003433637
#>     FStat       df       pval
#> 1 4.15768 937.2437 0.04172626
```
In addition to the quantities defined previously, the output contains the parameters of the covariance matrix of the Obuchowski-Rockette model, Eqn. (10.3) - Eqn. (10.6):

- `varTR` is \(\sigma_{\tau R}^2\).
- `cov1` is \(\text{Cov}_1\).
- `cov2` is \(\text{Cov}_2\).
- `cov3` is \(\text{Cov}_3\).
- `Var` is \(\text{Var}\).
- `FStat` is the F-statistic for testing the NH.
- `ndf` is the numerator degrees of freedom, equal to unity.
- `df` is the denominator degrees of freedom of the F-statistic for testing the NH.
- `Tstat` is the t-statistic for testing the NH that the average difference FOM `avgDiffFom` is zero.
- `pval` is the p-value for rejecting the NH. In the example shown above the value is significant.
Notice that including the variability of cases results in a higher p-value for 2T-RRRC as compared to 1T-RRFC.
Shown next are the confidence interval statistics `x$ciAvgRdrEachTrt` for the two treatments ("trt1" = CAD, "trt2" = RAD):

```r
print(x$ciAvgRdrEachTrt)
#>       Estimate     StdErr       DF   CILower   CIUpper        Cov2
#> trt1 0.5916667 0.05802835      Inf 0.4779332 0.7054001 0.003367289
#> trt2 0.7101728 0.03915636 193.1083 0.6329437 0.7874018 0.001221153
```
Row `trt1` contains the statistics for CAD and row `trt2` those for RAD (the average radiologist). The columns are:

- `Estimate` contains the FOM estimate for the treatment.
- `StdErr` contains the standard error of the FOM estimate.
- `DF` contains the degrees of freedom used to compute the confidence interval.
- `CILower` is the lower 95 percent confidence bound for the FOM.
- `CIUpper` is the upper 95 percent confidence bound for the FOM.
- `Cov2` is \(\text{Cov}_2\) calculated over the individual treatment.
Shown next are the confidence interval statistics `x$ciDiffFom` for the difference between the two treatments ("trt2-trt1" = RAD - CAD):

```r
print(x$ciDiffFom)
#>            Estimate     StdErr       DF        t      PrGTt     CILower
#> trt2-trt1 0.1185061 0.05811861 937.2437 2.039039 0.04172626 0.004448434
#>             CIUpper
#> trt2-trt1 0.2325638
```
The difference figure of merit statistics are contained in the data frame `x$ciDiffFom` with elements:

- `Estimate` contains the difference FOM estimate.
- `StdErr` contains the standard error of the difference FOM estimate.
- `DF` contains the degrees of freedom of the t-statistic.
- `t` contains the value of the t-statistic.
- `PrGTt` contains the probability of exceeding the magnitude of the t-statistic, i.e., the p-value.
- `CILower` is the lower 95 percent confidence bound for the difference FOM.
- `CIUpper` is the upper 95 percent confidence bound for the difference FOM.
10.8.3 Example 3
The last example shows the structure of `RRRC_1T_PCL_0_2`.
```r
RRRC_1T_PCL_0_2
#> $fomCAD
#> [1] 0.5916667
#> 
#> $fomRAD
#> [1] 0.6945313 0.6500000 0.8062500 0.7250000 0.6598214 0.7684524 0.7375000
#> [8] 0.6750000 0.6750000
#> 
#> $avgRadFom
#> [1] 0.7101728
#> 
#> $CIAvgRad
#> [1] 0.5961151 0.8242305
#> 
#> $avgDiffFom
#> [1] 0.1185061
#> 
#> $CIAvgDiffFom
#> [1] 0.004448434 0.232563801
#> 
#> $varR
#> [1] 0.002808612
#> 
#> $varError
#> [1] 0.005344538
#> 
#> $cov2
#> [1] 0.003065705
#> 
#> $Tstat
#>     rdr2 
#> 2.039039 
#> 
#> $df
#>     rdr2 
#> 937.2437 
#> 
#> $pval
#>       rdr2 
#> 0.04172626
```
The differences from `RRFC_1T_PCL_0_2` are listed next:

- `varR` is \(\sigma_R^2\) of the single-treatment model for comparing CAD to RAD, Eqn. (10.15).
- `cov2` is \(\text{Cov}_2\) of the single-treatment model for comparing CAD to RAD.
- `varError` is \(\text{Var}\) of the single-treatment model for comparing CAD to RAD.
Notice that the `RRRC_1T_PCL_0_2` p-value, i.e., 0.0417263, is identical to that of `RRRC_2T_PCL_0_2`, i.e., 0.0417263.