Chapter 11 Sample size estimation: OR method
11.3 Statistical Power
\[\begin{equation} Power = 1 - \beta \tag{11.1} \end{equation}\]
11.3.1 Sample size estimation for random-reader random-cases
For convenience the DBM model is repeated below with the case-set index suppressed:
\[\begin{equation} Y_{n(ijk)}=\mu+\tau_i+R_j+C_k+(\tau R)_{ij}+(\tau C)_{ik}+(RC)_{jk}+(\tau RC)_{ijk}+\epsilon_{n(ijk)} \tag{11.2} \end{equation}\]
As usual, the treatment effects \(\tau_i\) are subject to the constraint that they sum to zero. The observed effect size (a random variable) is defined by:
\[\begin{equation} d=\theta_{1\bullet}-\theta_{2\bullet} \tag{11.3} \end{equation}\]
It is a realization of a random variable, so one has some leeway in the choice of the anticipated effect size. In the significance-testing procedure described in TBA Chapter 09 interest was in the distribution of the F-statistic when the NH is true. For sample size estimation one needs the distribution of the statistic when the NH is false. It was shown there that under the AH the observed F-statistic, TBA Eqn. (9.35), follows a non-central F-distribution \(F_{ndf,ddf,\Delta}\) with non-centrality parameter \(\Delta\):
\[\begin{equation} F_{DBM|AH} \sim F_{ndf,ddf,\Delta} \tag{11.4} \end{equation}\]
The non-centrality parameter was defined, Eqn. TBA (9.34), by:
\[\begin{equation} \Delta=\frac{JK\sigma_{Y;\tau}^2}{\left ( \sigma_{Y;\epsilon}^2 + \sigma_{Y;\tau RC}^2 \right )+K\sigma_{Y;\tau R}^2+J\sigma_{Y;\tau C}^2} \tag{11.5} \end{equation}\]
To minimize confusion, this equation has been rewritten here using the subscript \(Y\) to explicitly denote pseudovalue-derived quantities (in TBA Chapter 09 this subscript was suppressed).
The estimate of \(\sigma_{Y;\tau C}^2\) can turn out to be negative. To avoid a negative denominator, Hillis suggests the following modification:
\[\begin{equation} \Delta=\frac{JK\sigma_{Y;\tau}^2}{\left ( \sigma_{Y;\epsilon}^2 + \sigma_{Y;\tau RC}^2 \right )+K\sigma_{Y;\tau R}^2+\max \left (J\sigma_{Y;\tau C}^2 ,0 \right )} \tag{11.6} \end{equation}\]
This expression depends on three variance components: \(\sigma_{Y;\epsilon}^2 + \sigma_{Y;\tau RC}^2\) (the two terms are inseparable), \(\sigma_{Y;\tau R}^2\) and \(\sigma_{Y;\tau C}^2\). The \(ddf\) term appearing in TBA Eqn. (11.4) was defined by TBA Eqn. (9.24); this quantity does not change between the NH and the AH:
\[\begin{equation} ddf_H=\frac{\left [MSTR+\max(MSTC-MSTRC,0) \right ]^2}{\frac{[MSTR]^2}{(I-1)(J-1)}} \tag{11.7} \end{equation}\]
The mean squares in this expression can be expressed in terms of the three variance components appearing in TBA Eqn. (11.6). Hillis and Berbaum (Stephen L. Hillis and Berbaum 2004) have derived these expressions (Eqn. 4 in the cited reference) and they are not repeated here. RJafroc implements a function to calculate the mean squares, UtilMeanSquares(), which allows ddf to be calculated using Eqn. TBA (11.7). The sample size functions in this package need only the three variance components (the formula for \(ddf_H\) is implemented internally).
For two treatments, since the individual treatment effects must be the negatives of each other (because they sum to zero), it is easily shown that:
\[\begin{equation} \sigma_{Y;\tau}^2=\frac{d^2}{2} \tag{11.8} \end{equation}\]
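Collecting the preceding formulae, the following is a minimal R sketch of RRRC power for two treatments; the function name PowerRRRC and its argument names are mine, not part of RJafroc. It converts the anticipated effect size to \(\sigma_{Y;\tau}^2\) via Eqn. (11.8), forms the non-centrality parameter via Eqn. (11.6), computes \(ddf_H\) via Eqn. (11.7) after re-expressing the mean squares in terms of the variance components (assumed here to follow the expected mean square relations given by Hillis and Berbaum (2004)), and returns power via Eqns. (11.1) and (11.4).

```r
# Minimal sketch (not RJafroc code): RRRC power from the three pseudovalue
# variance components. varEpsTRC = sigma^2_eps + sigma^2_tauRC (inseparable),
# varTR = sigma^2_tauR, varTC = sigma^2_tauC; d = anticipated effect size.
PowerRRRC <- function(d, J, K, varEpsTRC, varTR, varTC, alpha = 0.05) {
  I <- 2                                           # Eqn. (11.8) assumes two treatments
  varTau <- d^2 / 2                                # Eqn. (11.8)
  ncp <- J * K * varTau /
    (varEpsTRC + K * varTR + max(J * varTC, 0))    # Eqn. (11.6)
  # Mean squares re-expressed via their expected values (assumption; see lead-in)
  msTR  <- varEpsTRC + K * varTR
  msTC  <- varEpsTRC + J * varTC
  msTRC <- varEpsTRC
  ddfH <- (msTR + max(msTC - msTRC, 0))^2 /
    (msTR^2 / ((I - 1) * (J - 1)))                 # Eqn. (11.7)
  fCrit <- qf(1 - alpha, I - 1, ddfH)              # NH critical value
  1 - pf(fCrit, I - 1, ddfH, ncp = ncp)            # Eqns. (11.1) and (11.4)
}
```

The “open” implementation in Example 1 below instead computes \(ddf_H\) from the mean squares estimated directly from the pilot data, so small numerical differences between that implementation and this sketch are to be expected.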
11.3.2 Dependence of statistical power on estimates of model parameters
Examination of the expression for the non-centrality parameter \(\Delta\), Eqn. (11.5), shows that statistical power increases if (a numerical sketch illustrating these points follows the list):
- The numerator is large. This occurs if: (a) the anticipated effect-size \(d\) is large; since the effect-size enters as its square, TBA Eqn. (11.8), it has a particularly strong effect; (b) \(J \times K\) is large. Both of these results should be obvious, as a large effect size and a large sample size should result in increased probability of rejecting the NH.
- The denominator is small. The first term in the denominator is \(\left ( \sigma_{Y;\epsilon}^2 + \sigma_{Y;\tau RC}^2 \right )\); these two terms cannot be separated and together represent the residual variability of the jackknife pseudovalues. The smaller this variability, the larger the non-centrality parameter and the statistical power.
- The next term in the denominator is \(K\sigma_{Y;\tau R}^2\), the treatment-reader variance component multiplied by the total number of cases. The reader variance \(\sigma_{Y;R}^2\) has no effect on statistical power, because it affects both treatments equally and cancels out in the difference. Instead, it is the treatment-reader variance \(\sigma_{Y;\tau R}^2\) that contributes “noise” tending to confound the estimate of the effect-size.
- The variance components estimated by the ANOVA procedure are realizations of random variables and as such subject to noise (there actually exists such a beast as the variance of a variance estimate). The presence of the \(K\) term, usually large, can amplify the effect of noise in the estimate of \(\sigma_{Y;\tau R}^2\), making the sample size estimation procedure less accurate.
- The final term in the denominator is \(J\sigma_{Y;\tau C}^2\). The case variance \(\sigma_{Y;C}^2\) has no impact on statistical power, as it cancels out in the difference; it is the treatment-case variance component that introduces “noise” into the estimate of the effect size, thereby decreasing power. Since this term is multiplied by \(J\), the number of readers, and typically \(J \ll K\), the error-amplification effect on the accuracy of the sample size estimate is not as severe as for the treatment-reader variance component.
- Accuracy of sample size estimation, essentially estimating confidence intervals for statistical power, is addressed in (Dev Prasad Chakraborty 2010).
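The following sketch puts numbers to these observations using the PowerRRRC function defined above; the variance-component values are arbitrary round numbers chosen purely for illustration, not estimates from any dataset.

```r
# Illustration only: arbitrary variance-component values (not from any dataset).
varEpsTRC <- 0.0008  # sigma^2_eps + sigma^2_tauRC (inseparable)
varTC     <- 0.0002  # treatment-case variance component
for (varTR in c(0.0001, 0.0004)) {  # small vs. large treatment-reader variance
  pwr <- sapply(c(50, 100, 200, 400), function(K)
    PowerRRRC(d = 0.04, J = 10, K = K,
              varEpsTRC = varEpsTRC, varTR = varTR, varTC = varTC))
  cat("varTR =", varTR, "power at K = 50, 100, 200, 400:", round(pwr, 3), "\n")
}
```

Because of the \(K\sigma_{Y;\tau R}^2\) term, power does not increase without limit as \(K\) grows: as \(K \rightarrow \infty\) the non-centrality parameter approaches \(J\sigma_{Y;\tau}^2/\sigma_{Y;\tau R}^2\), so beyond a point adding cases alone cannot compensate for a large treatment-reader variance component.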
11.3.6 Comparing DBM to Obuchowski and Rockette for single-reader multiple-treatments
Having performed a pilot study, and planning a pivotal study, one estimates the required sample size using the following procedure, which assumes that both reader and case are treated as random factors. Different formulae, described later, apply when either reader or case is treated as a fixed factor.
- Perform DBM analysis on the pilot data. This yields the observed effect size as well as estimates of all relevant variance components and mean squares appearing in TBA Eqn. (11.5) and Eqn. (11.7).
- This is the difficult but critical part: make an educated guess regarding the effect-size, \(d\), that one is interested in “detecting” (i.e., hoping to reject the NH with probability \(1-\beta\)). The author prefers the term “anticipated” effect-size to “true” effect-size (the latter implies knowledge of the true difference between the modalities which, as noted earlier, would obviate the need for a pivotal study).
- Two scenarios are considered below. In the first scenario, the effect-size is assumed equal to that observed in the pilot study, i.e., \(d = d_{obs}\).
- In the second, so-called “best-case” scenario, one assumes that the anticipated value of \(d\) is the observed value plus two sigma of the confidence interval, in the correct direction, of course, i.e., \(d=\left | d_{obs} \right |+2\sigma\). Here \(\sigma\) is one-fourth the width of the 95% confidence interval for \(d_{obs}\). Anticipating more than \(2\sigma\) above the observed effect-size would be overly optimistic: the width of the CI implies that chances are less than 2.5% that the anticipated value is at or beyond the overly optimistic value. These points will become clearer when example datasets are analyzed below.
- Calculate statistical power using the distribution implied by Eqn. (11.4), i.e., the probability that a random value of the relevant F-statistic will exceed the critical value, as in §11.3.2.
- If power is below the desired or “target” power, one tries successively larger values of \(J\) and / or \(K\) until the target power is reached (a sketch of this iteration follows the list).
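The following is a sketch of this iteration, built on the PowerRRRC function defined earlier; the function name SsKGivenJ and its arguments are mine and are not the RJafroc sample-size routines.

```r
# Sketch only (names mine): smallest number of cases K achieving the target
# RRRC power for a given number of readers J and anticipated effect size d.
SsKGivenJ <- function(d, J, varEpsTRC, varTR, varTC,
                      targetPower = 0.8, alpha = 0.05, Kmax = 2000) {
  for (K in seq(20, Kmax)) {
    pwr <- PowerRRRC(d, J, K, varEpsTRC, varTR, varTC, alpha)
    if (pwr >= targetPower) return(list(K = K, power = pwr))
  }
  warning("target power not reached for K <= Kmax")
  invisible(NULL)
}

# Scenario 1: anticipated effect size equal to the observed value dObs
# SsKGivenJ(d = abs(dObs), J = 10, varEpsTRC, varTR, varTC)
# Scenario 2 ("best case"): dObs plus two sigma, where sigma is one-fourth the
# width of the 95% confidence interval of dObs
# SsKGivenJ(d = abs(dObs) + 2 * sigma, J = 10, varEpsTRC, varTR, varTC)
```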
11.4 Formulae for fixed-reader random-case (FRRC) sample size estimation
It was shown in TBA §9.8.2 that for fixed-reader analysis the non-centrality parameter is defined by:
\[\begin{equation} \Delta=\frac{JK\sigma_{Y;\tau}^2}{\sigma_{Y;\epsilon}^2+\sigma_{Y;\tau RC}^2+J\sigma_{Y;\tau C}^2} \tag{11.9} \end{equation}\]
The sampling distribution of the F-statistic under the AH is:
\[\begin{equation} F_{AH|R}\equiv \frac{MST}{MSTC}\sim F_{I-1,(I-1)(K-1),\Delta} \tag{11.10} \end{equation}\]
11.4.1 Formulae for random-reader fixed-case (RRFC) sample size estimation
It is shown in TBA §9.9 that for fixed-case analysis the non-centrality parameter is defined by:
\[\begin{equation} \Delta=\frac{JK\sigma_{Y;\tau}^2}{\sigma_{Y;\epsilon}^2+\sigma_{Y;\tau RC}^2+K\sigma_{Y;\tau R}^2} \tag{11.11} \end{equation}\]
Under the AH, the test statistic is distributed as a non-central F-distribution as follows:
\[\begin{equation} F_{AH|C}\equiv \frac{MST}{MSTR}\sim F_{I-1,(I-1)(J-1),\Delta} \tag{11.12} \end{equation}\]
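For two treatments (\(I = 2\)) the denominator degrees of freedom above reduce to \(K-1\) (FRRC) and \(J-1\) (RRFC), the values used in the “open” implementation of Example 1 below. The following sketch parallels the PowerRRRC function; the names PowerFRRC and PowerRRFC are mine, not RJafroc's.

```r
# Sketch (names mine): FRRC and RRFC power for two treatments, Eqns. (11.9)-(11.12).
# The max() guards against a negative variance-component estimate, as in Eqn. (11.6).
PowerFRRC <- function(d, J, K, varEpsTRC, varTC, alpha = 0.05) {
  I <- 2
  ncp <- J * K * (d^2 / 2) / (varEpsTRC + max(J * varTC, 0))  # Eqn. (11.9)
  ddf <- (I - 1) * (K - 1)                                    # Eqn. (11.10)
  1 - pf(qf(1 - alpha, I - 1, ddf), I - 1, ddf, ncp = ncp)
}
PowerRRFC <- function(d, J, K, varEpsTRC, varTR, alpha = 0.05) {
  I <- 2
  ncp <- J * K * (d^2 / 2) / (varEpsTRC + K * varTR)          # Eqn. (11.11)
  ddf <- (I - 1) * (J - 1)                                    # Eqn. (11.12)
  1 - pf(qf(1 - alpha, I - 1, ddf), I - 1, ddf, ncp = ncp)
}
```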
11.4.2 Example 1
In the first example the Van Dyke dataset is regarded as a pilot study. Two implementations are shown: a direct (“open”) application of the relevant formulae, including usage of the mean squares, which in principle can be calculated from the three variance components; this is then compared to the RJafroc implementation.

Shown first is the “open” implementation.
alpha <- 0.05;cat("alpha = ", alpha, "\n")
#> alpha = 0.05
rocData <- dataset02 # select Van Dyke dataset
retDbm <- StSignificanceTesting(dataset = rocData, FOM = "Wilcoxon", method = "DBM")
varYTR <- retDbm$ANOVA$VarCom["VarTR","Estimates"]
varYTC <- retDbm$ANOVA$VarCom["VarTC","Estimates"]
varYEps <- retDbm$ANOVA$VarCom["VarErr","Estimates"]
effectSize <- retDbm$FOMs$trtMeanDiffs["trt0-trt1","Estimate"]
cat("effect size = ", effectSize, "\n")
#> effect size = -0.043800322
#RRRC
J <- 10; K <- 163
ncp <- (0.5*J*K*(effectSize)^2)/(K*varYTR+max(J*varYTC,0)+varYEps) # Eqns. (11.6), (11.8)
MS <- UtilMeanSquares(rocData, FOM = "Wilcoxon", method = "DBM")
ddf <- (MS$msTR+max(MS$msTC-MS$msTRC,0))^2/(MS$msTR^2)*(J-1)       # Eqn. (11.7)
FCrit <- qf(1 - alpha, 1, ddf)
Power <- 1-pf(FCrit, 1, ddf, ncp = ncp)
data.frame("J"= J, "K" = K, "FCrit" = FCrit, "ddf" = ddf, "ncp" = ncp, "RRRCPower" = Power)
#>    J   K     FCrit       ddf       ncp  RRRCPower
#> 1 10 163 4.1270572 34.334268 8.1269825 0.79111255
#FRRC
J <- 10; K <- 133
ncp <- (0.5*J*K*(effectSize)^2)/(max(J*varYTC,0)+varYEps)          # Eqn. (11.9)
ddf <- (K-1)                                                       # Eqn. (11.10), I = 2
FCrit <- qf(1 - alpha, 1, ddf)
Power <- 1-pf(FCrit, 1, ddf, ncp = ncp)
data.frame("J"= J, "K" = K, "FCrit" = FCrit, "ddf" = ddf, "ncp" = ncp, "RRRCPower" = Power)
#>    J   K    FCrit ddf       ncp RRRCPower
#> 1 10 133 3.912875 132 7.9873835 0.80111671
#RRFC
J <- 10; K <- 53
ncp <- (0.5*J*K*(effectSize)^2)/(K*varYTR+varYEps)                 # Eqn. (11.11)
ddf <- (J-1)                                                       # Eqn. (11.12), I = 2
FCrit <- qf(1 - alpha, 1, ddf)
Power <- 1-pf(FCrit, 1, ddf, ncp = ncp)
data.frame("J"= J, "K" = K, "FCrit" = FCrit, "ddf" = ddf, "ncp" = ncp, "RRRCPower" = Power)
#>    J  K    FCrit ddf       ncp RRRCPower
#> 1 10 53 5.117355   9 10.048716 0.80496663
For 10 readers, the number of cases needed for 80% power is largest (163) for RRRC and smallest (53) for RRFC. For all three analyses the expectation of 80% power is met; the numbers of cases and readers were chosen to achieve close to 80% statistical power. Intermediate quantities, such as the critical value of the F-statistic, ddf and ncp, are also shown. The reader should confirm that the code does in fact implement the relevant formulae. Shown next is the RJafroc implementation. The relevant file is mainSsDbm.R, a listing of which follows: