Chapter 7 Obuchowski Rockette (OR) Analysis

7.1 TBA How much finished

80%

7.2 Introduction

In previous chapters the DBM significance testing procedure (D. D. Dorfman, Berbaum, and Metz 1992) for analyzing MRMC ROC data, along with improvements (Stephen L. Hillis 2014), has been described. Because the method assumes that jackknife pseudovalues can be regarded as independent and identically distributed case-level figures of merit, it has been rightly criticized by Hillis and others (Zhou, McClish, and Obuchowski 2009). Hillis states that the method “works” but lacks firm statistical foundations (S. L. Hillis et al. 2005; Stephen L. Hillis 2007; Stephen L. Hillis, Berbaum, and Metz 2008). I would add that it “works” as long as one restricts to the empirical AUC figure of merit. In my book I gave a justification for why the method “works”. Specifically, the empirical AUC pseudovalues qualify as case-level FOMs - this property has also been noted by (Hajian-Tilaki et al. 1997). However, this property applies only to the empirical AUC, so an alternate approach that applies to any figure of merit is highly desirable.

Hillis has proposed that a method based on an earlier publication (N. A. Obuchowski and Rockette 1995), which does not depend on pseudovalues, is preferable from both conceptual and practical points of view. This chapter is named “OR Analysis”, where OR stands for Obuchowski and Rockette. The OR method has advantages in being able to handle more complex study designs (Stephen L. Hillis 2014) that are addressed in subsequent chapters, and applications to other FOMs (e.g., the FROC paradigm uses a rather different FOM from empirical ROC-AUC) are best performed with the OR method.

This chapter delves into the significance testing procedure employed in OR analysis.

The case of multiple readers interpreting a common case-set in multiple treatments is analyzed, and the DBM and OR results are compared for the same dataset. The special cases of fixed-reader and fixed-case analyses are described. Single treatment analysis, where interest is in comparing average performance of readers to a fixed value, is described. Three methods of estimating the covariance matrix are described.

Before proceeding, it is understood that datasets analyzed in this chapter follow a factorial design, sometimes called a fully-factorial or fully-crossed design. Basically, the data structure is symmetric, e.g., all readers interpret all cases in all modalities. The next chapter will describe the analysis of split-plot datasets, where, for example, some readers interpret all cases in one modality, while the remaining readers interpret all cases in the other modality.

7.3 Random-reader random-case

In conventional ANOVA models, such as used in DBM, the covariance matrix of the error term is diagonal with all diagonal elements equal to a common variance, represented in the DBM model by the scalar \(\epsilon\) term. Because of the correlated structure of the error term, in OR analysis, a customized ANOVA is needed. The null hypothesis (NH) is that the true figures-of-merit of all treatments are identical, i.e.,

\[\begin{equation} NH:\tau_i=0\;\; (i=1,2,...,I) \tag{7.1} \end{equation}\]

The analysis described next considers both readers and cases as random effects. The F-statistic is denoted \(F_{ORH}\), defined by:

\[\begin{equation} F_{ORH}=\frac{MS(T)}{MS(TR)+J\max(\text{Cov2}-\text{Cov3},0)} \tag{7.2} \end{equation}\]

Eqn. (7.2) incorporates Hillis’ modification of the original OR F-statistic. The modification ensures that the constraint Eqn. (6.23) is always obeyed and also avoids a possibly negative (and hence illegal) F-statistic. The relevant mean squares are defined by (note that these are calculated using FOM values, not pseudovalues):

\[\begin{align} \left.\begin{array}{rcl} MS(T)&=&\frac{J}{I-1}\sum_{i=1}^{I}(\theta_{i\bullet}-\theta_{\bullet\bullet})^2\\ \\ MS(R)&=&\frac{I}{J-1}\sum_{j=1}^{J}(\theta_{\bullet j}-\theta_{\bullet\bullet})^2\\ \\ MS(TR)&=&\frac{1}{(I-1)(J-1)}\sum_{i=1}^{I}\sum_{j=1}^{J}(\theta_{ij}-\theta_{i\bullet}-\theta_{\bullet j}+\theta_{\bullet\bullet})^2 \end{array}\right\} \tag{7.3} \end{align}\]
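The mean squares in Eqn. (7.3) and the statistic in Eqn. (7.2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular package: the function names `or_mean_squares` and `f_orh`, the FOM values, and the covariance values are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def or_mean_squares(theta):
    """Return MS(T), MS(R), MS(TR) per Eqn. (7.3).

    theta[i, j] is the FOM for treatment i and reader j (I x J array).
    """
    theta = np.asarray(theta, dtype=float)
    I, J = theta.shape
    trt_means = theta.mean(axis=1)   # theta_{i.}, average over readers
    rdr_means = theta.mean(axis=0)   # theta_{.j}, average over treatments
    grand = theta.mean()             # theta_{..}

    ms_t = J / (I - 1) * np.sum((trt_means - grand) ** 2)
    ms_r = I / (J - 1) * np.sum((rdr_means - grand) ** 2)
    # Interaction residuals; each one is squared before summing.
    resid = theta - trt_means[:, None] - rdr_means[None, :] + grand
    ms_tr = np.sum(resid ** 2) / ((I - 1) * (J - 1))
    return ms_t, ms_r, ms_tr

def f_orh(ms_t, ms_tr, cov2, cov3, J):
    """F-statistic of Eqn. (7.2), with the max() constraint on Cov2 - Cov3."""
    return ms_t / (ms_tr + J * max(cov2 - cov3, 0.0))

# Hypothetical FOMs: 2 treatments (rows) by 3 readers (columns).
theta = np.array([[0.80, 0.84, 0.78],
                  [0.85, 0.90, 0.86]])
ms_t, ms_r, ms_tr = or_mean_squares(theta)
# Hypothetical covariance estimates; in practice these come from the
# covariance-matrix estimation methods described in this chapter.
F = f_orh(ms_t, ms_tr, cov2=0.0003, cov3=0.0002, J=theta.shape[1])
print(ms_t, ms_r, ms_tr, F)
```

Note that the calculation uses FOM values directly; no pseudovalues are involved.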

The original paper (N. A. Obuchowski and Rockette 1995) actually proposed a different test statistic \(F_{OR}\):

\[\begin{equation} F_{OR}=\frac{MS(T)}{MS(TR)+J(\text{Cov2}-\text{Cov3})} \tag{7.4} \end{equation}\]

Note that Eqn. (7.4) lacks the constraint, subsequently proposed by Hillis, which ensures that the denominator cannot be negative. The following distribution was proposed for the test statistic.

\[\begin{equation} F_{OR}\sim F_{\text{ndf},\text{ddf}} \tag{7.5} \end{equation}\]

The original degrees of freedom were defined by:

\[\begin{align} \begin{split} \text{ndf}&=I-1\\ \text{ddf}&=(I-1)\times(J-1) \end{split} \tag{7.6} \end{align}\]

It turns out that the Obuchowski-Rockette test statistic is very conservative, meaning it is highly biased against rejecting the null hypothesis (the data simulator used in the validation described in their publication did not detect this behavior). Because of the conservative behavior, the predicted sample sizes tended to be quite large (if the test statistic does not reject the NH as often as it should, one way to overcome this tendency is to use a larger sample size). In this connection I have two informative anecdotes.

7.3.1 Two anecdotes

The late Dr. Robert F. Wagner once stated to me (ca. 2001) that the sample-size tables published by Obuchowski (Nancy A. Obuchowski 1998, 2000), using the version of Eqn. (7.2) with the ddf as originally suggested by Obuchowski and Rockette, predicted such high numbers of readers and cases that he was doubtful about the chances of anyone conducting a practical ROC study!
The second story is that I once conducted NH simulations and analyses using a Roe-Metz simulator (Roe and Metz 1997a) and the significance testing described in the Obuchowski-Rockette paper: the method did not reject the null hypothesis even once in 2000 trials! Recall that with \(\alpha = 0.05\) a valid test should reject the null hypothesis about \(100\pm20\) times in 2000 trials. I recall (ca. 2004) telling Dr. Steve Hillis about this issue, and he suggested a different denominator degrees of freedom ddf, see next, substitution of which magically solved the problem, i.e., the simulations rejected the null hypothesis 5% of the time.
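The \(100\pm20\) figure quoted above follows from the binomial distribution of the rejection count. The short sketch below reproduces the arithmetic; reading the \(\pm20\) as roughly two standard deviations is an assumption made here for illustration.

```python
import math

n_trials = 2000   # number of NH simulations
alpha = 0.05      # nominal rejection probability for a valid test

expected = n_trials * alpha                        # mean rejection count
sd = math.sqrt(n_trials * alpha * (1 - alpha))     # binomial standard deviation
lower, upper = expected - 2 * sd, expected + 2 * sd

print(f"expected rejections: {expected:.0f}")
print(f"approximate 2-SD range: {lower:.1f} to {upper:.1f}")
```

Zero rejections in 2000 trials lies far outside this range, which is what flagged the original test as overly conservative.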

7.3.2 Hillis ddf

Hillis’ proposed new ddf is shown below (ndf is unchanged), with the subscript \(H\) denoting the Hillis modification:

\[\begin{equation} \text{ddf}_H = \frac{\left [ MS(TR) + J \max(\text{Cov2}-\text{Cov3},0)\right ]^2}{\frac{\left [ MS(TR) \right ]^2}{(I-1)(J-1)}} \tag{7.7} \end{equation}\]

From the previous chapter, the ordering of the covariances is as follows:

\[\begin{equation*} \text{Cov3} \leq \text{Cov2} \leq \text{Cov1} \leq \text{Var} \end{equation*}\]

If \(\text{Cov2} < \text{Cov3}\) (which is the exact opposite of the expected ordering), \(\text{ddf}_H\) reduces to \((I-1)\times(J-1)\), the value originally proposed by Obuchowski and Rockette. With Hillis’ proposed changes, under the null hypothesis the observed statistic \(F_{ORH}\), defined in Eqn. (7.2), is distributed as an F-statistic with \(\text{ndf} = I-1\) and ddf = \(\text{ddf}_H\) degrees of freedom (S. L. Hillis et al. 2005; Stephen L. Hillis 2007; Stephen L. Hillis, Berbaum, and Metz 2008):

\[\begin{equation} F_{ORH}\sim F_{\text{ndf},\text{ddf}_H} \tag{7.8} \end{equation}\]

If the expected ordering is true, i.e., \(\text{Cov2} > \text{Cov3}\), which is the more likely situation, then \(\text{ddf}_H\) is larger than \((I-1)\times(J-1)\), i.e., the Obuchowski-Rockette ddf, so the p-value decreases and there is a larger probability of rejecting the NH. The modified OR method is more likely to have the correct NH behavior, i.e., it will reject the NH 5% of the time when \(\alpha\) is set to 0.05 (statisticians refer to this as “passing the 5% test”). The correct NH behavior has been confirmed in simulation testing using the Roe-Metz simulator (Stephen L. Hillis, Berbaum, and Metz 2008).
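The behavior of Eqn. (7.7) under the two orderings of \(\text{Cov2}\) and \(\text{Cov3}\) can be checked numerically. The sketch below is illustrative only: the function name `ddf_hillis` and all numerical inputs are hypothetical.

```python
def ddf_hillis(ms_tr, cov2, cov3, I, J):
    """Hillis denominator degrees of freedom, Eqn. (7.7)."""
    numerator = (ms_tr + J * max(cov2 - cov3, 0.0)) ** 2
    denominator = ms_tr ** 2 / ((I - 1) * (J - 1))
    return numerator / denominator

I, J = 2, 5        # hypothetical numbers of treatments and readers
ms_tr = 0.0004     # hypothetical treatment-reader mean square

# Expected ordering, Cov2 > Cov3: ddf_H exceeds (I-1)(J-1).
ddf_expected = ddf_hillis(ms_tr, cov2=0.0003, cov3=0.0002, I=I, J=J)
# Reversed ordering, Cov2 < Cov3: the max() term vanishes and
# ddf_H reduces to the original Obuchowski-Rockette value (I-1)(J-1).
ddf_reversed = ddf_hillis(ms_tr, cov2=0.0002, cov3=0.0003, I=I, J=J)

print(ddf_expected, ddf_reversed, (I - 1) * (J - 1))
```

With these inputs the Hillis value is roughly five times the original ddf of 4, which illustrates why the original choice led to such a conservative test.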

7.3.3 Decision rule, p-value and confidence interval

The critical value of the F-statistic for rejection of the null hypothesis is \(F_{1-\alpha,\text{ndf},\text{ddf}_H}\), i.e., that value such that fraction \((1-\alpha)\) of the area under the distribution lies to the left of the critical value. From Eqn. (7.2), rejection of the NH is more likely if:

\(MS(T)\) increases, meaning the treatment effect is larger;
\(MS(TR)\) decreases, meaning there is less contamination of the treatment effect by treatment-reader variability;
The greater of \(\text{Cov2}\) or \(\text{Cov3}\), which is usually \(\text{Cov2}\), decreases, meaning there is less “noise” in the measurement due to between-reader variability (recall that \(\text{Cov2}\) involves different-reader same-treatment pairings);
\(\alpha\) increases, meaning one is allowing a greater probability of Type I errors;
\(\text{ndf}\) increases, as this lowers the critical value of the F-statistic (with more treatment pairings, the chance that at least one paired-difference will reject the NH is larger);
\(\text{ddf}_H\) increases, as this lowers the critical value of the F-statistic.

The p-value of the test is the probability, under the NH, that an equal or larger value of the F-statistic than \(F_{ORH}\) could be observed by chance. In other words, it is the area under the F-distribution \(F_{\text{ndf},\text{ddf}_H}\) that lies above the observed value \(F_{ORH}\):

\[\begin{equation} p=\Pr(F>F_{ORH} \mid F\sim F_{\text{ndf},\text{ddf}_H}) \tag{7.9} \end{equation}\]
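The decision rule and Eqn. (7.9) translate directly into calls to the F-distribution. The following is a minimal sketch assuming SciPy is available; the function name `rrrc_f_test` and the input values are hypothetical.

```python
from scipy.stats import f

def rrrc_f_test(f_obs, ndf, ddf_h, alpha=0.05):
    """Return (critical value, p-value, reject?) for an observed F_ORH."""
    crit = f.ppf(1 - alpha, ndf, ddf_h)   # area (1 - alpha) lies to the left
    p = f.sf(f_obs, ndf, ddf_h)           # area above the observed value, Eqn. (7.9)
    return crit, p, f_obs > crit

# Hypothetical observed statistic with I = 2 treatments (ndf = 1)
# and a non-integer Hillis ddf, which the F-distribution permits.
crit, p, reject = rrrc_f_test(f_obs=5.0, ndf=1, ddf_h=20.25)
print(crit, p, reject)
```

Comparing the statistic to the critical value and comparing the p-value to \(\alpha\) always give the same decision, since both are statements about the same tail area.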

The \((1-\alpha)\) confidence interval for \(\theta_{i \bullet} - \theta_{i' \bullet}\) is given by:

\[\begin{equation} \begin{split} CI_{1-\alpha,RRRC,\theta_{i \bullet} - \theta_{i' \bullet}} =& \theta_{i \bullet} - \theta_{i' \bullet} \\ &\pm t_{\alpha/2, \text{ddf}_H}\sqrt{\textstyle\frac{2}{J}(MS(TR)+J\max(\text{Cov2}-\text{Cov3},0))} \end{split} \tag{7.10} \end{equation}\]
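As a sketch of Eqn. (7.10), the interval can be computed as follows. SciPy is assumed to be available, the function name `rrrc_diff_ci` is hypothetical, and the numerical inputs are invented for illustration. Here \(t_{\alpha/2,\text{ddf}_H}\) is taken to be the upper \(\alpha/2\) quantile of the t-distribution.

```python
import math
from scipy.stats import t

def rrrc_diff_ci(theta_i, theta_ip, ms_tr, cov2, cov3, J, ddf_h, alpha=0.05):
    """(1 - alpha) CI for the reader-averaged FOM difference, Eqn. (7.10)."""
    diff = theta_i - theta_ip
    se = math.sqrt(2.0 / J * (ms_tr + J * max(cov2 - cov3, 0.0)))
    half_width = t.ppf(1 - alpha / 2, ddf_h) * se
    return diff - half_width, diff + half_width

# Hypothetical reader-averaged FOMs and variance components.
lo, hi = rrrc_diff_ci(0.87, 0.80, ms_tr=0.0004, cov2=0.0003, cov3=0.0002,
                      J=5, ddf_h=20.25)
print(lo, hi)
```

For two treatments the squared t-statistic equals \(F_{ORH}\), so this interval excludes zero exactly when the F-test rejects the NH.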

Define \(\text{df}_i\), the degrees of freedom for modality \(i\):

\[\begin{equation} \text{df}_i = \frac{\left[\text{MS(R)}_i + J\max(\text{Cov2}_{i}, 0)\right]^2}{\frac{\left[\text{MS(R)}_i\right]^2}{J-1}} \tag{7.11} \end{equation}\]

Here \(\text{MS(R)}_i\) is the reader mean-square for modality \(i\), and \(\text{Cov2}_i\) is \(\text{Cov2}\) for modality \(i\). Note that all quantities with an \(i\) index are calculated using data from modality \(i\) only.

The \((1-\alpha)\) confidence interval for \(\theta_{i \bullet}\), i.e., \(CI_{1-\alpha,RRRC,\theta_{i \bullet}}\), is given by:

\[\begin{equation} CI_{1-\alpha,RRRC,\theta_{i \bullet}} = \theta_{i \bullet} \pm t_{\alpha/2, \text{df}_i}\sqrt{\textstyle\frac{1}{J}(\text{MS(R)}_i + J\max(\text{Cov2}_{i}, 0))} \tag{7.12} \end{equation}\]
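Eqns. (7.11) and (7.12) can be sketched together, since the interval needs the modality-specific degrees of freedom. SciPy is assumed to be available; the function names `df_single` and `rrrc_single_ci` and all numerical inputs are hypothetical.

```python
import math
from scipy.stats import t

def df_single(ms_r_i, cov2_i, J):
    """Degrees of freedom for modality i, Eqn. (7.11)."""
    return (ms_r_i + J * max(cov2_i, 0.0)) ** 2 / ms_r_i ** 2 * (J - 1)

def rrrc_single_ci(theta_i, ms_r_i, cov2_i, J, alpha=0.05):
    """(1 - alpha) CI for the reader-averaged FOM of modality i, Eqn. (7.12)."""
    df_i = df_single(ms_r_i, cov2_i, J)
    se = math.sqrt((ms_r_i + J * max(cov2_i, 0.0)) / J)
    half_width = t.ppf(1 - alpha / 2, df_i) * se
    return theta_i - half_width, theta_i + half_width

# Hypothetical values computed from the data of one modality only.
lo, hi = rrrc_single_ci(theta_i=0.85, ms_r_i=0.001, cov2_i=0.0002, J=5)
print(df_single(0.001, 0.0002, 5), lo, hi)
```

When \(\text{Cov2}_i \leq 0\) the degrees of freedom reduce to \(J-1\).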

7.4 Fixed-reader random-case

Using the vertical bar notation \(\mid R\) to denote that reader is regarded as a fixed effect (Roe and Metz 1997b), the F-statistic for testing the null hypothesis \(NH: \tau_i = 0 \; (i=1,2,...,I)\) is (Stephen L. Hillis 2007):

\[\begin{equation} F_{ORH \mid R}=\frac{MS(T)}{\text{Var}-\text{Cov1}+(J-1)\max(\text{Cov2}-\text{Cov3},0)} \tag{7.13} \end{equation}\]

[For \(J\) = 1, Eqn. (7.13) reduces to Eqn. (6.8), i.e., the single-reader analysis described in the previous chapter.]

\(F_{ORH \mid R}\) is distributed as an F-statistic with \(\text{ndf} = I-1\) and \(\text{ddf} = \infty\):

\[\begin{equation} F_{ORH \mid R} \sim F_{I-1,\infty} \tag{7.14} \end{equation}\]

One can get rid of the infinite denominator degrees of freedom by recognizing, as in the previous chapter, that \((I-1) F_{I-1,\infty}\) is distributed as a \(\chi^2\) distribution with \(I-1\) degrees of freedom, i.e., as \(\chi^2_{I-1}\). Therefore, one has, analogous to Eqn. (6.7),

\[\begin{equation} \chi^2_{ORH \mid R} \equiv (I-1)F_{ORH \mid R} \sim \chi^2_{I-1} \tag{7.15} \end{equation}\]
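The limiting step behind Eqn. (7.15) can be spelled out. By definition, an F-distributed variable is the ratio of two independent \(\chi^2\) variables, each divided by its degrees of freedom. The denominator \(\chi^2_{\text{ddf}}/\text{ddf}\) has mean 1 and variance \(2/\text{ddf}\), so it converges to 1 as \(\text{ddf}\to\infty\), leaving only the scaled numerator:

\[\begin{equation*} F_{I-1,\text{ddf}} = \frac{\chi^2_{I-1}/(I-1)}{\chi^2_{\text{ddf}}/\text{ddf}} \;\xrightarrow{\;\text{ddf}\to\infty\;}\; \frac{\chi^2_{I-1}}{I-1} \quad\Longrightarrow\quad (I-1)\,F_{I-1,\infty} \sim \chi^2_{I-1} \end{equation*}\]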

The critical value of the \(\chi^2\) statistic is \(\chi^2_{1-\alpha,I-1}\), which is that value such that fraction \((1-\alpha)\) of the area under the \(\chi^2_{I-1}\) distribution lies to the left of the critical value. The null hypothesis is rejected if the observed value of the \(\chi^2\) statistic exceeds the critical value, i.e.,

\[\chi^2_{ORH \mid R} > \chi^2_{1-\alpha,I-1}\]

The p-value of the test is the probability that a random sample from the chi-square distribution \(\chi^2_{I-1}\) exceeds the observed value of the test statistic \(\chi^2_{ORH \mid R}\) defined in Eqn. (7.15):

\[\begin{equation} p=\Pr(\chi^2 > \chi^2_{ORH \mid R} \mid \chi^2 \sim \chi^2_{I-1}) \tag{7.16} \end{equation}\]
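The fixed-reader test of Eqns. (7.13), (7.15) and (7.16) can be sketched as below. SciPy is assumed to be available; the function name `frrc_chisq_test` and the numerical inputs are hypothetical.

```python
from scipy.stats import chi2

def frrc_chisq_test(ms_t, var, cov1, cov2, cov3, I, J, alpha=0.05):
    """Return (chi-square statistic, critical value, p-value) for FRRC analysis."""
    denom = var - cov1 + (J - 1) * max(cov2 - cov3, 0.0)   # Eqn. (7.13)
    f_stat = ms_t / denom
    chisq = (I - 1) * f_stat                               # Eqn. (7.15)
    crit = chi2.ppf(1 - alpha, I - 1)
    p = chi2.sf(chisq, I - 1)                              # Eqn. (7.16)
    return chisq, crit, p

# Hypothetical mean square and covariance estimates, I = 2, J = 5.
chisq, crit, p = frrc_chisq_test(ms_t=0.006, var=0.0008, cov1=0.0003,
                                 cov2=0.0003, cov3=0.0002, I=2, J=5)
print(chisq, crit, p, chisq > crit)
```

Working with the \(\chi^2\) form avoids having to pass an infinite denominator degrees of freedom to an F-distribution routine.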

The \((1-\alpha)\) (symmetric) confidence interval for the difference figure of merit is given by:

\[\begin{equation} \begin{split} CI_{1-\alpha,FRRC,\theta_{i \bullet} - \theta_{i' \bullet}} =&(\theta_{i \bullet} - \theta_{i' \bullet}) \\ &\pm t_{\alpha/2, \infty}\sqrt{\textstyle\frac{2}{J}(\text{Var}-\text{Cov1}+(J-1)\max(\text{Cov2}-\text{Cov3},0))} \end{split} \tag{7.17} \end{equation}\]

The NH is rejected if any of the following equivalent conditions is met (these statements are also true for RRRC analysis, and RRFC analysis to be described next):

The observed value of the \(\chi^2\) statistic exceeds the critical value \(\chi^2_{1-\alpha,I-1}\).
The p-value is less than \(\alpha\).
The \((1-\alpha)\) confidence interval for at least one treatment-pairing does not include zero.

Additional confidence intervals are stated below:

The confidence interval for the reader-averaged FOM for each treatment, denoted \(CI_{1-\alpha,FRRC,\theta_{i \bullet}}\).
The confidence interval for treatment FOM differences for each reader, denoted \(CI_{1-\alpha,FRRC,\theta_{i j} - \theta_{i' j}}\).

\[\begin{equation} CI_{1-\alpha,FRRC,\theta_{i \bullet}} = \theta_{i \bullet} \pm z_{\alpha/2}\sqrt{\textstyle\frac{1}{J}(\text{Var}_i+(J-1)\max(\text{Cov2}_i,0))} \tag{7.18} \end{equation}\]

\[\begin{equation} CI_{1-\alpha,FRRC,\theta_{i j} - \theta_{i' j}} = (\theta_{i j} - \theta_{i' j}) \pm z_{\alpha/2}\sqrt{2(\text{Var}_j - \text{Cov1}_j)} \tag{7.19} \end{equation}\]

In these equations \(\text{Var}_i\) and \(\text{Cov2}_i\) are computed using the data for treatment \(i\) only, and \(\text{Var}_j\) and \(\text{Cov1}_j\) are computed using the data for reader \(j\) only.
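The three fixed-reader intervals, Eqns. (7.17) to (7.19), share the same structure and differ only in the standard error. The sketch below uses only the Python standard library, relying on the fact that a t-distribution with infinite degrees of freedom is the standard normal, so \(t_{\alpha/2,\infty} = z_{\alpha/2}\). All function names and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_upper(alpha):
    """Upper alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def _interval(center, se, alpha):
    half_width = z_upper(alpha) * se
    return center - half_width, center + half_width

def frrc_diff_ci(theta_i, theta_ip, var, cov1, cov2, cov3, J, alpha=0.05):
    """Reader-averaged FOM difference, Eqn. (7.17)."""
    se = math.sqrt(2.0 / J * (var - cov1 + (J - 1) * max(cov2 - cov3, 0.0)))
    return _interval(theta_i - theta_ip, se, alpha)

def frrc_treatment_ci(theta_i, var_i, cov2_i, J, alpha=0.05):
    """Reader-averaged FOM for treatment i, Eqn. (7.18)."""
    se = math.sqrt((var_i + (J - 1) * max(cov2_i, 0.0)) / J)
    return _interval(theta_i, se, alpha)

def frrc_reader_diff_ci(theta_ij, theta_ipj, var_j, cov1_j, alpha=0.05):
    """Treatment FOM difference for reader j, Eqn. (7.19)."""
    se = math.sqrt(2.0 * (var_j - cov1_j))
    return _interval(theta_ij - theta_ipj, se, alpha)

# Hypothetical inputs for a single reader.
lo, hi = frrc_reader_diff_ci(0.90, 0.80, var_j=0.0008, cov1_j=0.0003)
print(lo, hi)
```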

7.5 Random-reader fixed-case

When case is treated as a fixed factor, the appropriate F-statistic for testing the null hypothesis \(NH: \tau_i = 0 \; (i=1,2,...,I)\) is:

\[\begin{equation} F_{ORH \mid C}=\frac{MS(T)}{MS(TR)} \tag{7.20} \end{equation}\]

\(F_{ORH \mid C}\) is distributed as an F-statistic with \(\text{ndf} = I-1\) and \(\text{ddf} = (I-1)(J-1)\):

\[\begin{equation} \left.\begin{array}{rcl} \text{ndf}&=&I-1\\ \text{ddf}&=&(I-1)(J-1)\\ F_{ORH \mid C} &\sim& F_{\text{ndf},\text{ddf}} \end{array}\right\} \tag{7.21} \end{equation}\]

Here is a situation where the degrees of freedom agree with those originally proposed by Obuchowski-Rockette. The critical value of the statistic is \(F_{1-\alpha,I-1,(I-1)(J-1)}\), which is that value such that fraction \((1-\alpha)\) of the distribution lies to the left of the critical value. The null hypothesis is rejected if the observed value of the F statistic exceeds the critical value:

\[F_{ORH \mid C}>F_{1-\alpha,I-1,(I-1)(J-1)}\]

The p-value of the test is the probability that a random sample from the distribution exceeds the observed value:

\[p=\Pr(F>F_{ORH \mid C} \mid F \sim F_{I-1,(I-1)(J-1)})\]

The \((1-\alpha)\) confidence interval for the reader-averaged difference FOM, \(CI_{1-\alpha,RRFC,\theta_{i \bullet} - \theta_{i' \bullet}}\), is given by:

\[\begin{equation} CI_{1-\alpha,RRFC,\theta_{i \bullet} - \theta_{i' \bullet}} = (\theta_{i \bullet} - \theta_{i' \bullet}) \pm t_{\alpha/2, (I-1)(J-1)}\sqrt{\textstyle\frac{2}{J}MS(TR)} \tag{7.22} \end{equation}\]

The \((1-\alpha)\) confidence interval for the reader-averaged FOM for each treatment, \(CI_{1-\alpha,RRFC,\theta_{i \bullet}}\), is given by:

\[\begin{equation} CI_{1-\alpha,RRFC,\theta_{i \bullet}} = \theta_{i \bullet} \pm t_{\alpha/2, J-1}\sqrt{\textstyle\frac{1}{J}\text{MS(R)}_i} \tag{7.23} \end{equation}\]

Here \(\text{MS(R)}_i\) is the reader mean-square for modality \(i\).
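The random-reader fixed-case test and the difference interval, Eqns. (7.20) to (7.22), need only the two mean squares. The following is a minimal sketch assuming SciPy is available; the function names `rrfc_test` and `rrfc_diff_ci` and the numerical inputs are hypothetical.

```python
import math
from scipy.stats import f, t

def rrfc_test(ms_t, ms_tr, I, J, alpha=0.05):
    """Return (F statistic, critical value, p-value) for RRFC analysis."""
    ndf, ddf = I - 1, (I - 1) * (J - 1)   # Eqn. (7.21)
    f_stat = ms_t / ms_tr                 # Eqn. (7.20)
    crit = f.ppf(1 - alpha, ndf, ddf)
    p = f.sf(f_stat, ndf, ddf)
    return f_stat, crit, p

def rrfc_diff_ci(theta_i, theta_ip, ms_tr, I, J, alpha=0.05):
    """(1 - alpha) CI for the reader-averaged FOM difference, Eqn. (7.22)."""
    diff = theta_i - theta_ip
    half_width = t.ppf(1 - alpha / 2, (I - 1) * (J - 1)) * math.sqrt(2.0 / J * ms_tr)
    return diff - half_width, diff + half_width

# Hypothetical mean squares with I = 2 treatments and J = 5 readers.
f_stat, crit, p = rrfc_test(ms_t=0.006, ms_tr=0.0004, I=2, J=5)
lo, hi = rrfc_diff_ci(0.87, 0.80, ms_tr=0.0004, I=2, J=5)
print(f_stat, crit, p, lo, hi)
```

No covariance estimates appear here because, with case treated as fixed, the case-sampling contributions drop out of the test statistic.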

7.6 Single treatment analysis

TBA

References

Dorfman, D. D., K. S. Berbaum, and C. E. Metz. 1992. “Receiver Operating Characteristic Rating Analysis: Generalization to the Population of Readers and Patients with the Jackknife Method.” Invest. Radiol. 27 (9): 723–31. https://pubmed.ncbi.nlm.nih.gov/1399456.

Hajian-Tilaki, K. O., James A. Hanley, L. Joseph, and J. P. Collet. 1997. “Extension of Receiver Operating Characteristic Analysis to Data Concerning Multiple Signal Detection Tasks.” Acad. Radiol. 4: 222–29. https://doi.org/10.1016/S1076-6332(05)80295-8.

Hillis, S. L., N. A. Obuchowski, K. M. Schartz, and K. S. Berbaum. 2005. “A Comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette Methods for Receiver Operating Characteristic (ROC) Data.” Statistics in Medicine 24 (10): 1579–1607. https://doi.org/10.1002/sim.2024.

Hillis, Stephen L. 2014. “A Marginal-mean ANOVA Approach for Analyzing Multireader Multicase Radiological Imaging Data.” Statistics in Medicine 33 (2): 330–60. https://doi.org/10.1002/sim.5926.

Hillis, Stephen L. 2007. “A Comparison of Denominator Degrees of Freedom Methods for Multiple Observer (ROC) Studies.” Statistics in Medicine 26: 596–619. https://doi.org/10.1002/sim.2532.

Hillis, Stephen L., K. S. Berbaum, and C. E. Metz. 2008. “Recent Developments in the Dorfman-Berbaum-Metz Procedure for Multireader (ROC) Study Analysis.” Acad. Radiol. 15 (5): 647–61. https://doi.org/10.1016/j.acra.2007.12.015.

Obuchowski, N. A., and H. E. Rockette. 1995. “Hypothesis Testing of the Diagnostic Accuracy for Multiple Diagnostic Tests: An ANOVA Approach with Dependent Observations.” Communications in Statistics: Simulation and Computation 24: 285–308. https://doi.org/10.1080/03610919508813243.

Obuchowski, Nancy A. 1998. “Sample Size Calculations in Studies of Test Accuracy.” Statistical Methods in Medical Research 7 (4): 371–92. https://doi.org/10.1177/096228029800700405.

Obuchowski, Nancy A. 2000. “Sample Size Tables for Receiver Operating Characteristic Studies.” Am. J. Roentgenol. 175 (3): 603–8. http://www.ajronline.org/cgi/content/abstract/175/3/603.

Roe, C. A., and C. E. Metz. 1997a. “Dorfman-Berbaum-Metz Method for Statistical Analysis of Multireader, Multimodality Receiver Operating Characteristic Data: Validation with Computer Simulation.” Acad. Radiol. 4: 298–303. https://doi.org/10.1016/S1076-6332(97)80032-3.

Roe, C. A., and C. E. Metz. 1997b. “Variance-Component Modeling in the Analysis of Receiver Operating Characteristic Index Estimates.” Acad. Radiol. 4 (8): 587–600. https://doi.org/10.1016/S1076-6332(97)80210-3.

Zhou, Xiao-Hua, Donna K. McClish, and Nancy A. Obuchowski. 2009. Statistical Methods in Diagnostic Medicine. Vol. 569. John Wiley & Sons. https://doi.org/10.1002/9780470906514.