Artificial Intelligence Systems and Observer Performance

RJafroc analyzes the performance of artificial intelligence (AI) systems/algorithms characterized by a search-and-report strategy. Historically, observer performance methodology has dealt with measuring radiologists' performance in search tasks, e.g., searching for lesions in medical images and reporting them, but the implicit location information has been ignored. The methods here apply to any task involving searching for and reporting arbitrary targets in images. The implemented methods apply to analyzing the absolute and relative performances of AI systems, comparing AI performance to that of a group of human readers, or optimizing the reporting threshold of an AI system. In addition to performing historical receiver operating characteristic (ROC) analysis (localization information ignored), the software also performs free-response receiver operating characteristic (FROC) analysis, where the implicit lesion localization information is used.

A book describing the underlying methodology, and which uses the software, has been published: Chakraborty DP: Observer Performance Methods for Diagnostic Imaging - Foundations, Modeling, and Applications with R-Based Examples, Taylor-Francis LLC; 2017: https://www.routledge.com/Observer-Performance-Methods-for-Diagnostic-Imaging-Foundations-Modeling/Chakraborty/p/book/9781482214840. Online updates to this book, which use the software, are at https://dpc10ster.github.io/RJafrocQuickStart/, https://dpc10ster.github.io/RJafrocRocBook/ and https://dpc10ster.github.io/RJafrocFrocBook/.

Supported data collection paradigms are the ROC, FROC and the location ROC (LROC). ROC data consists of a single rating per image, where a rating is the perceived confidence level that the image is that of a diseased patient. An ROC curve is a plot of true positive fraction vs. false positive fraction.
FROC data consists of a variable number (zero or more) of mark-rating pairs per image, where a mark is the location of a reported suspicious region and the rating is the confidence level that it is a real lesion. LROC data consists of a rating and a location of the most suspicious region, for every image.

Four models of observer performance, and curve-fitting software, are implemented: the binormal model (BM), the contaminated binormal model (CBM), the correlated contaminated binormal model (CORCBM), and the radiological search model (RSM). Unlike the binormal model, CBM, CORCBM and RSM predict "proper" ROC curves that do not inappropriately cross the chance diagonal. Additionally, RSM parameters are related to search performance (not measured in conventional ROC analysis) and classification performance. Search performance refers to finding lesions, i.e., true positives, while simultaneously not finding false positive locations. Classification performance measures the ability to distinguish between true and false positive locations. Knowing these separate performances allows principled optimization of reader or AI system performance.

This package supersedes Windows JAFROC (jackknife alternative FROC) software V4.2.1, https://github.com/dpc10ster/WindowsJafroc. Package functions are organized as follows. Data file related function names are prefixed by Df, curve fitting functions by Fit, included data sets by dataset, plotting functions by Plot, significance testing functions by St, sample size related functions by Ss, data simulation functions by Simulate and utility functions by Util. Implemented are figures of merit (FOMs) for quantifying performance and functions for visualizing empirical operating characteristics, e.g., ROC, FROC, alternative FROC (AFROC) and weighted AFROC (wAFROC) curves.
For fully crossed study designs, significance testing of reader-averaged FOM differences between modalities is implemented via both the Dorfman-Berbaum-Metz and the Obuchowski-Rockette methods. Also implemented are single-modality analyses, allowing comparison of the performance of a group of radiologists to a specified value, or comparison of AI to a group of radiologists/algorithms interpreting the same cases. Crossed-modality analysis is implemented, wherein there are two crossed modality factors and the aim is to determine performance in each modality factor averaged over all levels of the other factor. Sample size estimation tools are provided for ROC and FROC studies; these use estimates of the relevant variances from a pilot study to predict the numbers of readers and cases required in a pivotal study to achieve the desired power.

Utility and data file manipulation functions allow data to be read in any of the currently used input formats, including Excel, and the results of the analysis can be viewed in text or Excel output files. The methods are illustrated with several included datasets from the author's collaborations. This update includes improvements to the code, some as a result of user-reported bugs and new feature requests, and others discovered during ongoing testing and code simplification. All changes are noted in NEWS.md.

Details

Package:	RJafroc
Type:	Package
Version:	2.1.3
Date:	2025-04-15
License:	GPL-3
URL:	https://dpc10ster.github.io/RJafroc/

Definitions and abbreviations

a: The separation or "a" parameter of the binormal model
AFROC curve: plot of LLF (ordinate) vs. FPF, where FPF is inferred using highest rating of NL marks on non-diseased cases
AFROC: alternative FROC, see Chakraborty 1989
AFROC1 curve: plot of LLF (ordinate) vs. FPF1, where FPF1 is inferred using highest rating of NL marks on ALL cases
$\alpha$: The significance level $\alpha$ of the test of the null hypothesis of no modality effect
AUC: area under curve; e.g., ROC-AUC = area under ROC curve, an example of a FOM
b: The width or "b" parameter of the conventional binormal model
Binormal model: two unequal variance normal distributions, one at zero and one at $\mu$, for modeling ROC ratings; $\sigma$ is the std. dev. ratio of diseased to non-diseased distributions
CAD: computer aided detection algorithm
CBM: contaminated binormal model: two equal variance normal distributions for modeling ROC ratings; the diseased distribution is bimodal, with one peak at zero and one at $\mu$; the integrated fraction at $\mu$ is $\alpha$ (not to be confused with $\alpha$ of NH testing)
CI: The (1-$\alpha$) confidence interval for the stated statistic
Crossed-modality: a dataset containing two modality (i.e., treatment) factors, with the levels of the two factors crossed, see paper by Thompson et al
DBM: Dorfman-Berbaum-Metz, a significance testing method for detecting a modality effect in MRMC studies, with Hillis suggested modification to ddf.
ddf: Denominator degrees of freedom of appropriate $F$-test; the corresponding ndf is I - 1
Empirical AUC: trapezoidal area under curve, same as the Wilcoxon statistic for ROC paradigm
FN: false negative, a diseased case classified as non-diseased
FOM: figure of merit, a quantitative measure of performance, performance metric
FP: false positive, a non-diseased case classified as diseased
FPF: number of FPs divided by number of non-diseased cases
FROC curve: plot of LLF (ordinate) vs. NLF
FROC: free-response ROC (a data collection paradigm where each image yields a random number, 0, 1, 2,..., of mark-rating pairs)
FRRC: Analysis that treats readers as fixed and cases as random factors
I: total number of modalities, indexed by $i$
image/case: used interchangeably; a case can consist of several images of the same patient in the same modality
iMRMC: A text file format used for ROC data by FDA/CDRH researchers
individual: A single-modality single-reader dataset.
Intrinsic: Used in connection with RSM; a parameter that is independent of the RSM $\mu$ parameter, but whose meaning may not be as transparent as the corresponding physical parameter
J: number of readers, indexed by $j$
JAFROC file format: A .xlsx format file, applicable to ROC, ROI, FROC and LROC paradigms
JAFROC: jackknife AFROC: Windows software for analyzing observer performance data: no longer updated, replaced by current package; the name is a misnomer as the jackknife is used only for significance testing; alternatively, the bootstrap could be used; what distinguishes FROC from ROC analysis is the use of the AFROC-AUC as the FOM. With this change, the DBM or the OR method can be used for significance testing
K: total number of cases, K = K1 + K2, indexed by $k$
K1: total number of non-diseased cases, indexed by $k1$
K2: total number of diseased cases, indexed by $k2$
LL: lesion localization i.e., a mark that correctly locates an existing localized lesion; TP is a special case, when the proximity criterion is lax (i.e., "acceptance radius" is large)
LLF: number of LLs divided by the total number of lesions
LROC: location receiver operating characteristic, a data collection paradigm where each image yields a single rating and one location
lrc/MRMC: A text file format used for ROC data by University of Iowa researchers
mark: the location of a suspected diseased region
maxLL: maximum number of lesions per case in dataset
maxNL: maximum number of NL marks per case in dataset
MRMC: multiple reader multiple case (each reader interprets each case in each modality, i.e. fully crossed study design)
ndf: Numerator degrees of freedom of appropriate $F$-test, usually number of treatments minus one
NH: The null hypothesis that all modality effects are zero; rejected if the $p$-value is smaller than $\alpha$
NL: non-lesion localization, of which FP is a special case, i.e., a mark that does not correctly locate any existing localized lesion(s)
NLF: number of NLs divided by the total number of cases
Operating characteristic: A plot of normalized correct decisions on diseased cases along ordinate vs. normalized incorrect decisions on non-diseased cases
Operating point: A point on an operating characteristic, e.g., (FPF, TPF) represents an operating point on an ROC
OR: Obuchowski-Rockette, a significance testing method for detecting a modality effect in MRMC studies, with Hillis suggested modifications
Physical parameter: Used in connection with RSM; a parameter whose meaning is more transparent than the corresponding intrinsic parameter, but which depends on the RSM $\mu$ parameter
Proximity criterion / acceptance radius: Used in connection with FROC (or LROC data); the "nearness" criterion is used to determine if a mark is close enough to a lesion to be counted as a LL (or correct localization); otherwise it is counted as a NL (or incorrect localization)
p-value: the probability, under the null hypothesis, that the observed modality effects, or larger, could occur by chance
Proper: a proper fit does not inappropriately fall below the chance diagonal, i.e., it does not display a "hook" near the upper right corner
PROPROC: Metz's binormal model based fitting of proper ROC curves
RSM, Radiological Search Model: two unit variance normal distributions for modeling NL and LL ratings; four parameters, $\mu$, $\nu$', $\lambda$' and $\zeta$1
Rating: Confidence level assigned to a case; higher values indicate greater confidence in presence of disease; -Inf is allowed but NA is not allowed
Reader/observer/radiologist/CAD: used interchangeably
RJafroc: the current software
ROC: receiver operating characteristic, a data collection paradigm where each image yields a single rating and location information is ignored
ROC curve: plot of TPF (ordinate) vs. FPF, as threshold is varied; an example of an operating characteristic
ROCFIT: Metz software for binormal model based fitting of ROC data
ROI: region-of-interest (each case is divided into a number of ROIs and the reader assigns an ROC rating to each ROI)
RRFC: Analysis that treats readers as random and cases as fixed factors
RRRC: Analysis that treats both readers and cases as random factors
RSCORE-II: original software for binormal model based fitting of ROC data
RSM: Radiological search model, also method for fitting a proper ROC curve to ROC data
RSM-$\zeta$1: Lowest reporting threshold, determines if suspicious region is actually marked
RSM-$\lambda$: Intrinsic parameter of RSM corresponding to $\lambda$', independent of $\mu$
RSM-$\lambda$': Physical Poisson parameter of RSM, average number of latent NLs per case; depends on $\mu$
RSM-$\mu$: separation of the unit variance distributions of RSM
RSM-$\nu$: Intrinsic parameter of RSM, corresponding to $\nu$', independent of $\mu$
RSM-$\nu$': binomial parameter of RSM, probability that lesion is found
SE: sensitivity, same as $TPF$
Significance testing: determining the p-value of a statistical test
SP: specificity, same as $1-FPF$
Threshold: Reporting criteria: if confidence exceeds a threshold value, report case as diseased, otherwise report non-diseased
TN: true negative, a non-diseased case classified as non-diseased
TP: true positive, a diseased case classified as diseased
TPF: number of TPs divided by number of diseased cases
Treatment/modality: used interchangeably, for example, computed tomography (CT) images vs. magnetic resonance imaging (MRI) images
wAFROC curve: plot of weighted LLF (ordinate) vs. FPF, where FPF is inferred using highest rating of NL marks on non-diseased cases ONLY
wAFROC1 curve: plot of weighted LLF (ordinate) vs. FPF1, where FPF1 is inferred using highest rating of NL marks on ALL cases
wAFROC1 FOM: weighted trapezoidal area under the AFROC1 curve; use only if there are zero non-diseased cases
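
Several of the definitions above (operating point, FPF, TPF, empirical AUC) can be illustrated with a few lines of base R. The following is a minimal sketch using made-up ratings; it does not call any RJafroc function, and the variable names are chosen for illustration only. It computes the empirical ROC operating points and shows that the trapezoidal area agrees with the Wilcoxon statistic.

```r
# Toy ROC ratings (hypothetical values, not taken from any RJafroc dataset).
fp <- c(1, 2, 2, 3, 4)      # ratings on K1 = 5 non-diseased cases
tp <- c(2, 3, 4, 4, 5, 5)   # ratings on K2 = 6 diseased cases

# Operating points: a case is reported as diseased if its rating exceeds
# the threshold. The -Inf threshold yields the trivial point (1, 1).
thresholds <- c(sort(unique(c(fp, tp)), decreasing = TRUE), -Inf)
FPF <- sapply(thresholds, function(t) mean(fp > t))
TPF <- sapply(thresholds, function(t) mean(tp > t))

# Trapezoidal area under the empirical ROC curve, from (0, 0) to (1, 1).
aucTrapezoid <- sum(diff(FPF) * (head(TPF, -1) + tail(TPF, -1)) / 2)

# Wilcoxon statistic: fraction of (non-diseased, diseased) pairs that are
# correctly ordered, with ties counted as one half.
psi <- outer(fp, tp, function(a, b) (a < b) + 0.5 * (a == b))
aucWilcoxon <- mean(psi)

print(c(trapezoid = aucTrapezoid, wilcoxon = aucWilcoxon))
```

Both numbers are equal for any set of ratings, which is why the empirical ROC-AUC can be computed without constructing the curve.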

Dataset

A standard dataset object has 3 list elements: $ratings, $lesions and $descriptions, where:

dataset$ratings: contains 3 elements as sub-lists: $NL, $LL and $LL_IL; these describe the structure of the ratings;
dataset$lesions: contains 3 elements as sub-lists: $perCase, $IDs and $weights; these describe the structure of the lesions;
dataset$descriptions: contains 7 elements as sub-lists: $fileName, $type, $name, $truthTableStr, $design, $modalityID and $readerID; these describe other characteristics of the dataset as detailed next.

Note: -Inf is used to indicate the ratings of unmarked lesions and/or missing values. As an example of the latter, if the maximum number of NLs in a dataset is 4, but some images have fewer than 4 NL marks, the corresponding "empty" positions would be filled with -Infs. Do not use NA to denote a missing rating.
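
The reason for preferring -Inf over NA can be seen in a small base R sketch. The ratings below are hypothetical and the variable names are illustrative; the point is only that -Inf participates in comparisons and maxima like an ordinary (very low) rating, whereas NA propagates.

```r
# Hypothetical NL ratings for 3 cases with maxNL = 4; unused slots hold -Inf.
nl <- rbind(c(2.1, 0.7, -Inf, -Inf),
            c(-Inf, -Inf, -Inf, -Inf),   # a case with no NL marks
            c(1.5, 3.2, 0.4, 2.8))

# Number of NL marks per case: count the finite entries.
marksPerCase <- rowSums(is.finite(nl))

# Highest NL rating per case; an unmarked case yields -Inf, which is
# lower than any real rating and so never exceeds a finite threshold.
highest <- apply(nl, 1, max)

# With NA padding the same kind of comparison would itself be NA.
naCompare <- max(c(2.1, 0.7, NA, NA)) > 1

print(marksPerCase)
print(highest)
print(naCompare)
```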

Note: A standard dataset always represents R object(s) with the following structure(s):

Data structure, e.g., `dataset02`, an ROC dataset, and `dataset05`, an FROC dataset.

ratings$NL: a float array with dimensions c(I, J, K, maxNL), containing the ratings of NL marks. The first K1 locations of the third index correspond to NL marks on non-diseased cases and the remaining locations correspond to NL marks on diseased cases. The 4th dimension allows for multiple NL marks on a case: the first index holds the first NL rating on the image, the second holds the second NL rating on the image, etc. The value of maxNL is determined by the case with the maximum number of NL marks in the dataset. For FROC datasets missing NL ratings are assigned the -Inf rating. For ROC datasets, FP ratings are assigned to NL[,,1:K1,1] and the remaining K2 elements, NL[,,(K1+1):K,1], are set to -Inf.
ratings$LL: for non-LROC datasets a float array with dimensions c(I, J, K2, maxLL) containing the ratings of LL marks. The value of maxLL is determined by the maximum number of lesions per case in the dataset. Unmarked lesions are assigned the -Inf rating. For ROC datasets TP ratings are assigned to LL[,,1:K2,1]. For LROC datasets it is a float array with dimensions c(I, J, K2, 1) containing the ratings of correct localizations, otherwise the rating is recorded in the incorrect localization array described next.
ratings$LL_IL: for LROC datasets the ratings of incorrect localization marks on abnormal cases. It is a float array with dimensions c(I, J, K2, 1). For non-LROC datasets this array is filled with NAs.
lesions$perCase: an integer array with length K2, the number of lesions on each diseased case. The maximum value of this array equals maxLL. For example, dataset05$lesions$perCase[4] is 2, meaning the 4th diseased case has two lesions.
lesions$IDs: an integer array with dimensions [K2, maxLL], labeling (or naming) the lesions on the diseased cases. For example, dataset05$lesions$IDs[4,] is c(1,2,-Inf), meaning the 4th diseased case has two lesions, labeled 1 and 2.
lesions$weights: a floating point array with dimensions c(K2, maxLL), representing the relative importance of detecting each lesion. The weights for an abnormal case must sum to unity. For example, dataset05$lesions$weights[4,] is c(0.5,0.5, -Inf), corresponding to equal weights (0.5) assigned to each of the two lesions in the case.
descriptions$fileName: a character variable containing the file name of the source data for this dataset. This is generated automatically by the DfReadDataFile function used to read the file. For a simulated dataset it is set to "NA" (i.e., a character vector, not the variable NA).
descriptions$type: a character variable describing the data type: "ROC", "LROC", "ROI" or "FROC".
descriptions$name: a character variable containing the name of the dataset: e.g., "dataset02" or "dataset05". This is generated automatically by the DfReadDataFile function used to read the file.
descriptions$truthTableStr: a c(I, J, L, maxLL+1) object. For normal cases elements c(I, J, L, 1) are filled with 1s if the corresponding interpretations occurred or NAs otherwise. For abnormal cases elements c(I, J, L, 2:(maxLL+1)) are filled with 1s if the corresponding interpretations occurred or NAs otherwise. This object is necessary for analyzing more complex designs.
descriptions$design: a character variable: "FCTRL", corresponding to factorial design.
descriptions$modalityID: a character vector of length $I$, which labels/names the modalities in the dataset. For non-JAFROC data file formats, they must be unique integers.
descriptions$readerID: a character vector of length $J$, which labels/names the readers in the dataset. For non-JAFROC data file formats, they must be unique integers.
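
To make the array shapes concrete, here is a minimal base R sketch that assembles a list laid out as described above. All sizes and ratings are invented for illustration, only some of the description fields are filled in, and there is no guarantee that RJafroc functions would accept this hand-built object; datasets are normally created by reading a data file or by one of the simulation functions.

```r
I <- 2; J <- 3; K1 <- 4; K2 <- 5; K <- K1 + K2
maxNL <- 2; maxLL <- 3

# Start with every slot empty (-Inf), then fill in a few hypothetical marks.
NL <- array(-Inf, dim = c(I, J, K, maxNL))
LL <- array(-Inf, dim = c(I, J, K2, maxLL))
NL[1, 1, 2, 1] <- 1.2                    # one NL mark on non-diseased case 2
NL[1, 1, K1 + 1, 1:2] <- c(0.8, 2.0)     # two NL marks on the 1st diseased case
LL[1, 1, 1, 1] <- 3.1                    # lesion 1 on the 1st diseased case

perCase <- c(1L, 3L, 2L, 1L, 2L)         # lesions per diseased case
IDs <- weights <- matrix(-Inf, nrow = K2, ncol = maxLL)
for (k in seq_len(K2)) {
  n <- perCase[k]
  IDs[k, 1:n] <- 1:n
  weights[k, 1:n] <- 1 / n               # equal weights, summing to one
}

toy <- list(
  ratings = list(NL = NL, LL = LL, LL_IL = NA),
  lesions = list(perCase = perCase, IDs = IDs, weights = weights),
  descriptions = list(type = "FROC", design = "FCTRL",
                      modalityID = as.character(1:I),
                      readerID = as.character(1:J))
)

str(toy, max.level = 2)
```

Note that the third index of NL runs over all K cases, while that of LL runs over the K2 diseased cases only.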

ROI data structure, example `datasetROI`

Only changes from the previously described structure are described below:

ratings$NL: a float array with dimensions c(I, J, K, Q) containing the ratings of each of Q quadrants for each non-diseased case.
ratings$LL: a float array with dimensions c(I, J, K2, Q) containing the ratings of quadrants for each diseased case.
lesions$perCase: this contains the locations, on abnormal cases, containing at least one lesion.

Crossed-modality dataset structure, example `datasetXModality`

Only changes from the previously described structure are described below:

dataset$ratings$NL: a float array with dimension c(I1, I2, J, K, maxNL) containing the ratings of NL marks. Note the existence of two modality indices.
dataset$ratings$LL: a float array with dimension c(I1, I2, J, K2, maxLL) containing the ratings of all LL marks. Note the existence of two modality indices.
dataset$descriptions$modalityID1: corresponding to first modality factor.
dataset$descriptions$modalityID2: corresponding to second modality factor.
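
The idea of averaging over one modality factor can be sketched with a plain array of figure-of-merit values. The numbers below are arbitrary placeholders and the array is hypothetical; this is not how the package computes or stores its results, only an illustration of the index bookkeeping implied by two modality indices.

```r
I1 <- 2; I2 <- 3; J <- 4

# Hypothetical FOM values indexed by (first modality, second modality, reader).
fom <- array(seq(0.60, by = 0.01, length.out = I1 * I2 * J),
             dim = c(I1, I2, J))

# Average over the second modality factor: one value per (level of first
# factor, reader), i.e., an I1 x J matrix.
fomAvgOver2 <- apply(fom, c(1, 3), mean)

# Average over the first modality factor: an I2 x J matrix.
fomAvgOver1 <- apply(fom, c(2, 3), mean)

print(dim(fomAvgOver2))
print(dim(fomAvgOver1))
```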

Fitting Functions

FitBinormalRoc: Fit the binormal model to ROC data (R equivalent of ROCFIT or RSCORE).
FitCbmRoc: Fit the contaminated binormal model (CBM) to ROC data.
FitCorCbm: Fit the correlated contaminated binormal model (CORCBM) to paired ROC data.
FitRsmRoc: Fit the radiological search model (RSM) to ROC data.
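
As background for the fitting functions, the binormal-model ROC curve in the (a, b) parameterization can be written down directly in base R. The sketch below uses the standard textbook expressions TPF = pnorm(a + b * qnorm(FPF)) and AUC = pnorm(a / sqrt(1 + b^2)); the parameter values are hypothetical and the helper name binormalTPF is introduced here for illustration, not taken from the package. It also shows the "hook" that makes a binormal fit with b different from 1 improper.

```r
a <- 1.5; b <- 0.5    # hypothetical binormal parameters

# Binormal-model ROC curve: TPF as a function of FPF.
binormalTPF <- function(FPF, a, b) pnorm(a + b * qnorm(FPF))

# Closed-form AUC and a numerical check by integrating the curve.
aucClosedForm <- pnorm(a / sqrt(1 + b^2))
aucNumeric <- integrate(binormalTPF, lower = 0, upper = 1, a = a, b = b)$value

# With b < 1 the curve dips below the chance diagonal near (1, 1).
FPF <- 0.9999
hook <- binormalTPF(FPF, a, b) < FPF

print(c(closedForm = aucClosedForm, numeric = aucNumeric))
print(hook)
```

The proper models (CBM, CORCBM, RSM) are designed to avoid this crossing of the chance diagonal.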

Plotting Functions

PlotBinormalFit: Plot binormal-predicted ROC curve with provided BM parameters.
PlotEmpOpChrs: Plot empirical operating characteristics for specified dataset.
PlotRsmOpChrs: Plot RSM-fitted ROC curves.

Simulation Functions

SimulateFrocDataset: Simulates an uncorrelated FROC dataset using the RSM.
SimulateRocDataset: Simulates an uncorrelated binormal model ROC dataset.
SimulateCorCbmDataset: Simulates a paired ROC dataset using the correlated contaminated binormal model (CORCBM).
SimulateLrocDataset: Simulates an uncorrelated LROC dataset.
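
The sampling scheme behind an RSM-based FROC simulation can be sketched in base R from the parameter definitions given earlier: a Poisson number of latent NLs per case with mean $\lambda$', each lesion found with probability $\nu$', unit variance normal ratings separated by $\mu$, and marking only above $\zeta$1. This is an illustrative single-modality, single-reader sketch with hypothetical parameter values and variable names; it is not the package's SimulateFrocDataset and does not reproduce its arguments or output format.

```r
set.seed(1)
mu <- 2; lambdaP <- 1; nuP <- 0.8; zeta1 <- 0   # hypothetical physical parameters
K1 <- 5; K2 <- 5; K <- K1 + K2
perCase <- rep(1L, K2)                           # one lesion per diseased case

# NL marks: Poisson number of latent NLs, N(0, 1) ratings, marked if > zeta1.
simulateCaseNL <- function() {
  n <- rpois(1, lambdaP)
  z <- rnorm(n)
  z[z > zeta1]
}
nlMarks <- replicate(K, simulateCaseNL(), simplify = FALSE)
maxNL <- max(1L, lengths(nlMarks))
NL <- matrix(-Inf, nrow = K, ncol = maxNL)       # -Inf marks empty slots
for (k in seq_len(K)) {
  n <- length(nlMarks[[k]])
  if (n > 0) NL[k, seq_len(n)] <- nlMarks[[k]]
}

# LL marks: each lesion is found with probability nuP, rated N(mu, 1),
# and marked only if the rating exceeds zeta1.
LL <- matrix(-Inf, nrow = K2, ncol = max(perCase))
for (k in seq_len(K2)) {
  for (l in seq_len(perCase[k])) {
    found <- rbinom(1, 1, nuP) == 1
    z <- rnorm(1, mean = mu)
    if (found && z > zeta1) LL[k, l] <- z
  }
}

print(NL)
print(LL)
```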

Sample size Functions

SsPowerGivenJK: Calculate statistical power given numbers of readers J and cases K.
SsPowerTable: Generate a power table.
SsSampleSizeKGivenJ: Calculate number of cases K, for specified number of readers J, to achieve desired power for an ROC study.

Significance Testing Functions

St: Performs significance testing, DBM or OR, with factorial or crossed modalities.
StCadVsRad: Perform significance testing, CAD vs. radiologists.

Miscellaneous and Utility Functions

UtilAucBIN: Binormal model AUC function.
UtilAucCBM: CBM AUC function.
UtilAucPROPROC: PROPROC AUC function.
UtilAnalyticalAucsRSM: RSM ROC/AFROC AUC calculator.
UtilFigureOfMerit: Calculate empirical figures of merit (FOMs) for specified dataset.
Util2Physical: Convert from intrinsic to physical RSM parameters.
UtilLesDistr: Calculates the lesion distribution dataframe.
UtilLesWghts: Calculates the lesion weights matrix.
UtilMeanSquares: Calculates the mean squares used in the DBM and OR methods.
Util2Intrinsic: Convert RSM physical parameters to intrinsic parameters.
UtilPseudoValues: Return jackknife pseudovalues.
UtilDBMVarComp: Utility for Dorfman-Berbaum-Metz variance components.
UtilORVarComp: Utility for Obuchowski-Rockette variance components.
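
The jackknife pseudovalues used by the DBM method can be illustrated for a single modality and reader with the Wilcoxon figure of merit. The sketch below is plain base R with hypothetical ratings; the helper wilcoxon() is defined here for illustration and is not a package function. Each pseudovalue is K times the full-sample FOM minus (K - 1) times the FOM with one case removed.

```r
fp <- c(1, 2, 2, 3, 4)      # hypothetical non-diseased ratings
tp <- c(2, 3, 4, 4, 5, 5)   # hypothetical diseased ratings
K1 <- length(fp); K2 <- length(tp); K <- K1 + K2

# Empirical ROC-AUC (Wilcoxon statistic), ties counted as one half.
wilcoxon <- function(fp, tp) {
  mean(outer(fp, tp, function(a, b) (a < b) + 0.5 * (a == b)))
}

theta <- wilcoxon(fp, tp)   # FOM using all cases
pseudo <- numeric(K)
for (k in seq_len(K)) {
  # FOM with case k removed (non-diseased cases first, then diseased).
  thetaMinusK <- if (k <= K1) {
    wilcoxon(fp[-k], tp)
  } else {
    wilcoxon(fp, tp[-(k - K1)])
  }
  pseudo[k] <- K * theta - (K - 1) * thetaMinusK
}

print(round(pseudo, 4))
print(c(theta = theta, meanPseudo = mean(pseudo)))
```

For this figure of merit the mean of the pseudovalues reproduces the full-sample value.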

Author

Author: Dev Chakraborty dpc10ster@gmail.com.
Author: Xuetong Zhai xuetong.zhai@gmail.com.
Contributor: Peter Phillips peter.phillips@cumbria.ac.uk.

References

Basics of ROC

Metz, CE (1978). Basic principles of ROC analysis. In Seminars in nuclear medicine (Vol. 8, pp. 283–298). Elsevier.

Metz, CE (1986). ROC Methodology in Radiologic Imaging. Investigative Radiology, 21(9), 720.

Metz, CE (1989). Some practical issues of experimental design and data analysis in radiological ROC studies. Investigative Radiology, 24(3), 234.

Metz, CE (2008). ROC analysis in medical imaging: a tutorial review of the literature. Radiological Physics and Technology, 1(1), 2–12.

Wagner, R. F., Beiden, S. V, Campbell, G., Metz, CE, & Sacks, W. M. (2002). Assessment of medical imaging and computer-assist systems: lessons from recent experience. Academic Radiology, 9(11), 1264–77.

Wagner, R. F., Metz, CE, & Campbell, G. (2007). Assessment of medical imaging systems and computer aids: a tutorial review. Academic Radiology, 14(6), 723–48.

DBM/OR methods and extensions

Dorfman, DD, Berbaum, KS, & Metz, CE (1992). Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology, 27(9), 723.

Obuchowski, NA, & Rockette, HE (1995). Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Communications in Statistics-Simulation and Computation, 24(2), 285–308.

Hillis, SL, Berbaum, KS, & Metz, CE (2008). Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Academic Radiology, 15(5), 647–61.

Hillis, SL, Obuchowski, NA, & Berbaum, KS (2011). Power Estimation for Multireader ROC Methods: An Updated and Unified Approach. Acad Radiol, 18, 129–142.

Hillis, SL (2007). A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Statistics in Medicine, 26(3), 596–619.

FROC paradigm

Chakraborty DP. Maximum Likelihood analysis of free-response receiver operating characteristic (FROC) data. Med Phys. 1989;16(4):561–568.

Chakraborty, DP, & Berbaum, KS (2004). Observer studies involving detection and localization: modeling, analysis, and validation. Medical Physics, 31(8), 1–18.

Chakraborty, DP (2006). A search model and figure of merit for observer data acquired according to the free-response paradigm. Physics in Medicine and Biology, 51(14), 3449–62.

Chakraborty, DP (2006). ROC curves predicted by a model of visual search. Physics in Medicine and Biology, 51(14), 3463–82.

Chakraborty, DP (2011). New Developments in Observer Performance Methodology in Medical Imaging. Seminars in Nuclear Medicine, 41(6), 401–418.

Chakraborty, DP (2013). A Brief History of Free-Response Receiver Operating Characteristic Paradigm Data Analysis. Academic Radiology, 20(7), 915–919.

Chakraborty, DP, & Yoon, H.-J. (2008). Operating characteristics predicted by models for diagnostic tasks involving lesion localization. Medical Physics, 35(2), 435.

Thompson JD, Chakraborty DP, Szczepura K, et al. (2016) Effect of reconstruction methods and x-ray tube current-time product on nodule detection in an anthropomorphic thorax phantom: a crossed-modality JAFROC observer study. Medical Physics. 43(3):1265-1274.

Zhai X, Chakraborty DP. (2017) A bivariate contaminated binormal model for robust fitting of proper ROC curves to a pair of correlated, possibly degenerate, ROC datasets. Medical Physics. pp. 2207–2222. doi: 10.1002/mp.12263.

Hillis SL, Chakraborty DP, Orton CG. ROC or FROC? It depends on the research question. Medical Physics. 2017.

Chakraborty DP, Nishikawa RM, Orton CG. Due to potential concerns of bias and conflicts of interest, regulatory bodies should not do evaluation methodology research related to their regulatory missions. Medical Physics. 2017.

Dobbins III JT, McAdams HP, Sabol JM, Chakraborty DP, et al. (2016) Multi-Institutional Evaluation of Digital Tomosynthesis, Dual-Energy Radiography, and Conventional Chest Radiography for the Detection and Management of Pulmonary Nodules. Radiology. 282(1):236-250.

Warren LM, Mackenzie A, Cooke J, et al. Effect of image quality on calcification detection in digital mammography. Medical Physics. 2012;39(6):3202-3213.

Chakraborty DP, Zhai X. On the meaning of the weighted alternative free-response operating characteristic figure of merit. Medical physics. 2016;43(5):2548-2557.

Chakraborty DP. (2017) Observer Performance Methods for Diagnostic Imaging - Foundations, Modeling, and Applications with R-Based Examples. Taylor-Francis, LLC.