ROC Curve in R with ggplot2

January 15, 2024

Cover

In this tutorial, we will explore the application of the ggplot2 and plotROC packages for visualizing Receiver Operating Characteristic (ROC) curves in R. ROC curves are commonly examined when assessing machine learning models for binary classification. They are closely associated with the evaluation metric Area Under the Curve (AUC), which quantifies the area beneath a ROC curve in two-dimensional space. While typically used for machine learning models, ROC curves are also relevant for assessing medical tests, particularly those involving continuous biomarkers. In essence, operating characteristics are characterized by a pair of values: False Positive Fraction (FPF) and True Positive Fraction (TPF), denoted as (FPF, TPF). The ROC curve illustrates the complete range of operating characteristics. In general, a pair of operating characteristics can be obtained in a 2-by-2 table as below

Confusion matrix (classification model):

	Actual label = 0	Actual label = 1
Predicted label = 0	True negative (TN)	False negative (FN)
Predicted label = 1	False positive (FP)	True positive (TP)

Contingency table (medical test performance):

	Disease status = 0	Disease status = 1
Medical test = 0	True negative (TN)	False negative (FN)
Medical test = 1	False positive (FP)	True positive (TP)

Hence, we can derive three key summary statistics from the tables above: true positive fraction (sensitivity), true negative fraction (specificity), and false positive fraction (1 - specificity). Note that FPF is also known as type I error, and FNF is referred to as type II error.

Assuming that all abbreviations in the above tables represent the number of observations in their respective cells, we can estimate them as follows:

\[\hat{TPF} = \frac{TP}{TP + FN}, \quad \hat{FPF} = \frac{FP}{FP + TN}, \quad \hat{TNF} = \frac{TN}{TN + FP}.\]

If observations/predictions are independent, confidence intervals for the aforementioned summary statistics can be obtained using exact binomial or asymptotic inference. Essentially, a ROC curve is depicted in a two-dimensional space defined by the pair \((FPF, TPF)\), with FPF on the x-axis and TPF on the y-axis. To bridge the concept of the 2-by-2 tables with the ROC curve, let’s assume predicted probabilities or values of a continuous medical test as \(Y\), the true label as \(D\), and a continuous threshold value \(c\). We can mathematically express TPF, FPF, and the ROC curve as follows:

\[TPF(c) = P(Y \geq c | D = 1), \quad FPF(c) = P(Y \geq c | D = 0), \quad ROC(*) = \{(FPF(c), TPF(c)), c \in (-\infty, \infty) \}.\]

It’s important to note that a ROC curve is a monotone increasing function that maps the interval \((0, 1)\) onto \((0, 1)\). In the case of an uninformative classification, where the predicted \(Y\) is unrelated to the true label \(D\), it is represented by \(TPF(c) = FPF(c)\) for any arbitrary threshold value \(c\). This scenario corresponds to a line with a unit slope in the ROC curve figure. In contrast, a perfect ROC curve demonstrates a scenario where, for some threshold \(c\), we have \(TPF(c) = 1\) and \(FPF(c) = 0\). This corresponds to the left and upper borders of the positive unit quadrant. In a more rigorous formulation, we can view a ROC curve as mapping \(t\) to \(TPF(c)\), where \(c\) is the threshold corresponding to \(t = FPF(c)\). Denoting the survivor function for \(Y\) in the actual label or disease status = 1 and in the actual label or disease status = 0 as

\[S_1(y) = P(Y \geq y | D = 1), \quad S_0(y) = P(Y \geq y | D = 0),\]

respectively, the ROC model is represented as follows:

\[ROC(t) = S_1(S_0^{-1}(t)), \quad t \in (0, 1).\]

The initial appearance of this equation might not be very intuitive, but it essentially aligns with the definition of the ROC curve. If we let \(c = S_0^{-1}(t)\), noting that \(c\) is the threshold corresponding to false positive fraction \(t\) then

\[t = FPF(c) = P(Y \geq c | D = 0).\]

The corresponding TPF is

\[TPF(c) = P(Y \geq c | D = 1) = S_1(c).\]

Therefore, we can have

\[ROC(t) = S_1(c) = S_1(S_0^{-1}(t)).\]

You might wonder why we can take the inverse of \(S_0(y)\) in this context. (Hint: consider the nature of a survival function and its interpretation–monotonically decreasing). The content appears rather technical and theoretical, and more comprehensive explanations of the mathematical properties related to ROC curves can be found elsewhere. For now, let’s skip this material lol.

1. Dataset

Now let’s start to talk about the visualization of ROC curve with ggplot2 and plotROC (an extension of ggplot2) packages. We will use a dataset from a study investigated by Wieand et al., demonstrating nonparametric statistics for comparing diagnostic markers. The dataset is straightforward, including three columns:

CA_19.9: it is a protein found in human blood known as cancer antigen 19-9, measured as a continuous value in units of U/mL (numerical);
CA_125: it is a protein found in human blood known as cancer antigen 125, measured as a continuous value in units of U/mL (numerical);
Pancreatic_Cancer: it serves as a binary indicator for a patient, distinguishing between non-disease (Pancreatic_Cancer = 0) and disease groups (Pancreatic_Cancer = 1) (numerical).

Hence, we intend to illustrate the utilization of these two continuous biomarkers for visualizing ROC curves in the subsequent sections.

2. ROC curve

In essence, we will employ the geom_roc() function, with the option to adjust certain display parameters:

n.cuts: a numerical value should be provided to control the number of cut points displayed along each curve (by default = 10).
cutoffs.at: a vector of user supplied cutoffs to plot as points should be provided to display specific cut points.

Additionally, the aesthetic mappings created by aes() differ from the initial ggplot, and it necessitates the provision of two arguments:

d: it is utilized to input an actual binary label or the true disease status.
m: it is employed to provide continuous biomarkers or predicted probabilities from a well-trained model.

Additional parameters like col or fill can still be passed into the aes() function. As mentioned earlier, confidence intervals for both TPF and FPF can also be obtained and displayed along with a ROC curve at a given cutoff point. This requires using the function geom_rocci(), which can be configured with the following arguments:

sig.level: it is used to specify the significance level for the confidence regions (by default = 0.05).
ci.at: it is used to specify a vector of user supplied values in the range of the biomarker (by default = NULL)

The final optional function for adjusting plot style is style_roc(), designed to incorporate guides and annotations to a ROC curve. For example, users can customize x/y-axis labels using arguments such as xlab or ylab within this function, or alter the theme function via the theme argument.

(1). Continuous medical test performance

Now, let’s proceed to visualize the initial basic version of a ROC curve for the CA 19-9 biomarker.

require(plotROC)
require(ggplot2)
# Version 1
p_ROC1 <- 
  ggplot(Dt, aes(d = Pancreatic_Cancer,            # true disease label
                 m = CA_19.9)) +                   # continuous biomarker
    geom_roc(labelsize = 3) +                      # draw ROC curve
    geom_abline(slope = 1, intercept = 0) +        # add a unit slope line
    style_roc(xlab = 'False Positive Fraction',    # modify x/y-axis labels
              ylab = 'True Positive Fraction') + 
    theme(axis.text.x = element_text(color = 'black', size = 11),    
          axis.text.y = element_text(color = 'black', size = 11),
          axis.title.x = element_text(color = 'black', size = 11, face = 'bold'),                              
          axis.title.y = element_text(color = 'black', size = 11, face = 'bold'))
p_ROC1

In this basic version of the ROC curve, the geom_roc() function, by default, exhibits 10 cutoff points along with their associated values, displayed as points on the curve. Note that the value for each cutoff point is indeed the value of the continuous biomarker CA 19-9 in U/mL. Upon initial observation of the figure, a reasonable cutoff level for CA 19-9 appears to be 21.8 U/mL, where FPF is only 25%, while the associated TPF is slightly greater than 75%. Based on the dataset, this suggests that if a patient without pancreatic cancer has a CA 19-9 level below this cutoff, there’s a 25% chance of a false cancer diagnosis. On the other hand, for patients with pancreatic cancer, the test using this cutoff level can detect 75% of true cancer cases. In the next version, we will illustrate the usage of a 95% confidence region for the selected cutoff point (CA 19-9 = 21.8 U/mL) and display only 5 cutoff points in the ROC curve.

# Version 2
p_ROC2 <- 
  ggplot(Dt, aes(d = Pancreatic_Cancer,            # true disease label
                 m = CA_19.9)) +                   # continuous biomarker
    geom_roc(n.cuts = 5) +                         # draw ROC curve
    geom_rocci(ci.at = 21.8) +                     # draw 95% CI for a selected point
    geom_abline(slope = 1, intercept = 0) +        # add a unit slope line
    style_roc(xlab = 'False Positive Fraction',    # modify x/y-axis labels
              ylab = 'True Positive Fraction') + 
    theme(axis.text.x = element_text(color = 'black', size = 11),    
          axis.text.y = element_text(color = 'black', size = 11),
          axis.title.x = element_text(color = 'black', size = 11, face = 'bold'),                              
          axis.title.y = element_text(color = 'black', size = 11, face = 'bold'))
p_ROC2

Note that the confidence region is asymmetric, despite its rectangular shape. This asymmetry arises because if \((FPF_L, TPF_L)\) and \((FPF_U, TPF_U)\) represent \(1-\alpha^*\) univariate confidence intervals for FPF and TPF, where \(\alpha^* = 1 - (1-\alpha)^{1/2}\), then the rectangle \(R = (FPF_L, TPF_L) \times (FPF_U, TPF_U)\) forms a \(1 - \alpha\) confidence interval for a pair of (FPF, TPF). (Hint: statistically, FPF is independent of TPF if data is independent.) In the next version, we will illustrate how to plot two ROC curves together to compare the performance of two different continuous biomarkers for the diagnosis or prognosis of diseases. We will use the melt_roc() function from the plotROC package to transform a wide dataset into a long version. By implementing this function, we can create a long dataset and subsequently plot two ROC curves for CA 19-9 and CA 125, each with a distinct color.

require(ggthemes)
# Obtain the long version of dataset and modify columns/labels 
Dt_long <- melt_roc(Dt, d = 'Pancreatic_Cancer', c('CA_19.9', 'CA_125'))
colnames(Dt_long) <- c('Pancreatic Cancer', 'Level', 'Biomarker')
Dt_long$Biomarker <- ifelse(Dt_long$Biomarker == 'CA_19.9', 'CA 19-9', 'CA 125')

# Version 3
p_ROC3 <- 
  ggplot(Dt_long, aes(d = `Pancreatic Cancer`,     # true disease label
                      m = Level,                   # continuous biomarker level
                      col = Biomarker)) +          # color for different biomarkers
    geom_roc(pointsize = 0, labels = F) +          # draw ROC curve and remove labels/points
    geom_rocci(ci.at = 21.8) + 
    geom_abline(slope = 1, intercept = 0) +        # add a unit slope line
    style_roc(xlab = 'False Positive Fraction',    # modify x/y-axis labels
              ylab = 'True Positive Fraction') + 
    scale_color_tableau() + 
    theme(axis.text.x = element_text(color = 'black', size = 11),    
          axis.text.y = element_text(color = 'black', size = 11),
          axis.title.x = element_text(color = 'black', size = 11, face = 'bold'),                              
          axis.title.y = element_text(color = 'black', size = 11, face = 'bold'))
p_ROC3

In this figure, CA 125 and CA 19-9 are represented by blue and orange, respectively. When controlling FPF to be around 0.25, it is evident that TPF for CA 19-9 exceeds 0.75, while the TPF for CA 125 is approximately 0.50. This observation indicates that, based on this dataset, the performance of CA 19-9 is superior to that of CA 125.

(2). Classification model performance

In addition to visualizing continuous biomarkers, these functions can be employed to visualize the ROC curve for evaluating classification models. This necessitates the model to output predicted probabilities from a classification model. Meanwhile, users may desire to evaluate the ROC curve with an AUC score. To achieve this, we will use the roc() function from the pROC package and implement a logistic regression for CA 19-9.

# Using pROC package to calculate AUC
require(pROC)
# Classification model: logistic regression
model <- glm(Pancreatic_Cancer ~ CA_19.9, family = binomial(), data = Dt)
# Predicted probability
predProb <- predict(model, newdata = Dt, type = 'response')
# Combined prob with real labels
Result <- data.frame('Predicted_Probability' = predProb, 
                     'Pancreatic_Cancer' = Dt$Pancreatic_Cancer)
# Calculate AUC
AUC <- round(auc(response = Result$Pancreatic_Cancer, 
                 predictor = Result$Predicted_Probability), 2)

# Version 4
p_ROC4 <- 
  ggplot(Result, aes(d = Pancreatic_Cancer,            # true disease label
                     m = Predicted_Probability)) +     # continuous biomarker
    geom_roc(n.cuts = 0) +                             # draw ROC curve
    geom_abline(slope = 1, intercept = 0) +            # add a unit slope line
    annotate("text", x = .75, y = .25,                 # add AUC score
             label = paste("AUC =", AUC)) + 
    style_roc(xlab = 'False Positive Fraction',        # modify x/y-axis labels
              ylab = 'True Positive Fraction') + 
    theme(axis.text.x = element_text(color = 'black', size = 11),    
          axis.text.y = element_text(color = 'black', size = 11),
          axis.title.x = element_text(color = 'black', size = 11, face = 'bold'),                              
          axis.title.y = element_text(color = 'black', size = 11, face = 'bold'))
p_ROC4

The AUC score for the logistic regression model assessing the binary pancreatic cancer outcome based on the continuous biomarker CA 19-9 is 0.86. This value suggests that the performance of the logistic regression model is decent.