Model Evaluation and Comparison

library(clinpubr)
library(dplyr)
library(survival)
library(ggplot2)

Introduction

Evaluating clinical prediction models requires assessing discrimination, calibration, and clinical utility. This vignette demonstrates a comprehensive evaluation workflow:

  1. Model Comparison — Compare multiple models using discrimination and calibration metrics
  2. Time-Dependent ROC — Evaluate survival prediction models at specific time points
  3. C-Index — Global discrimination measure for survival models
  4. Variable Importance — Identify key predictive features

Preparing Data

We’ll use the NCCTG Lung Cancer dataset and fit three logistic regression models of increasing complexity:

data(cancer, package = "survival")
cancer$dead <- cancer$status == 2
cancer$event <- ifelse(cancer$status == 2, 1, 0)

cancer_clean <- cancer %>%
  mutate(sex = factor(sex, labels = c("Male", "Female"))) %>%
  na.omit()

# Fit three models with increasing complexity
cancer_clean$pred_model1 <- predict(
  glm(dead ~ age + sex + ph.karno, data = cancer_clean, family = binomial),
  type = "response"
)
cancer_clean$pred_model2 <- predict(
  glm(dead ~ age + sex + ph.karno + pat.karno, data = cancer_clean, family = binomial),
  type = "response"
)
cancer_clean$pred_model3 <- predict(
  glm(dead ~ age + sex + ph.karno + wt.loss, data = cancer_clean, family = binomial),
  type = "response"
)

knitr::kable(head(cancer_clean[, c("dead", "pred_model1", "pred_model2", "pred_model3")]),
  caption = "Model Predictions vs Actual Outcomes (First 6 Rows)"
)
Model Predictions vs Actual Outcomes (First 6 Rows)
dead pred_model1 pred_model2 pred_model3
2 TRUE 0.7868731 0.7661243 0.7858287
4 TRUE 0.7436147 0.8215218 0.7434448
6 FALSE 0.9260244 0.8954047 0.9273628
7 TRUE 0.7052554 0.7566565 0.7049340
8 TRUE 0.7704063 0.7279677 0.7724185
9 TRUE 0.8207969 0.8021612 0.8203195

Model Comparison

classif_model_compare() provides a comprehensive comparison using multiple metrics (AUC, Accuracy, Sensitivity, Specificity, PPV, NPV, PRAUC, Brier Score) and generates ROC, PR, calibration, and DCA plots:

model_comparison <- classif_model_compare(
  data = cancer_clean,
  target_var = "dead",
  model_names = c("pred_model1", "pred_model2", "pred_model3"),
  save_output = FALSE
)

knitr::kable(model_comparison$metric_table, caption = "Model Performance Comparison")
Model Performance Comparison
Model AUC PRAUC Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value F1 Kappa Brier cutoff Youden HosLem
3 pred_model3 0.684 (0.594, 0.774) 0.820 0.581 0.50 0.787 0.857 0.381 0.632 0.217 0.185 0.777 0.287 0.985
1 pred_model1 0.684 (0.594, 0.774) 0.817 0.557 0.45 0.830 0.871 0.371 0.593 0.203 0.185 0.786 0.280 0.967
2 pred_model2 0.684 (0.590, 0.777) 0.821 0.605 0.55 0.745 0.846 0.393 0.667 0.232 0.182 0.765 0.295 0.353

Visualization

model_comparison$roc_plot

model_comparison$pr_plot

model_comparison$calibration_plot

model_comparison$dca_plot

Time-Dependent ROC Analysis

For survival outcomes, time-dependent ROC accounts for the timing of events. First, we fit a Cox proportional hazards model and obtain risk scores:

# Fit Cox model for survival prediction
cox_model <- coxph(Surv(time, event) ~ age + sex + ph.karno, data = cancer_clean)
cancer_clean$risk_score <- predict(cox_model, type = "risk")

Now we can evaluate the model using time-dependent ROC at specific time points:

time_roc_result <- time_roc_plot(
  data = cancer_clean,
  event_var = "event",
  time_var = "time",
  marker_var = "risk_score",
  times = c(200, 365),
  time_unit = "days",
  save_plot = FALSE
)

time_roc_result
#> $time_roc
#> Time-dependent-Roc curve estimated using IPCW  (n=167, without competing risks). 
#>       Cases Survivors Censored AUC (%)
#> t=200    48       111        8   67.32
#> t=365    87        49       31   60.85
#> 
#> Method used for estimating IPCW:marginal 
#> 
#> Total computation time : 0  secs.
#> 
#> $plot

C-Index

The C-index (concordance index) measures discrimination for survival models. Interpretation: 0.5 = no discrimination, 0.7-0.8 = acceptable, >0.8 = excellent.

c_index <- calc_cindex(cancer_clean, "time", "event", "risk_score")
c_index
#>   C Index 
#> 0.6243374

Variable Importance

The variable importance list could be obtained from the different models like random forest, gradient boosting machine, etc. Here, we assume we have the variable importance scores from some random model.

importance_scores <- c(
  Age = 0.25, Sex = 0.15, Karnofsky = 0.30,
  WeightLoss = 0.12, Calories = 0.08, PatKarnofsky = 0.10
)

importance_plot(x = importance_scores, x_lab = "Variable Importance")

Show only top variables with custom styling:

importance_plot(
  x = importance_scores,
  x_lab = "Relative Importance",
  top_n = 4,
  color = "steelblue"
)

Summary

Key Functions

Evaluation Checklist

  1. Discrimination: AUC (binary) or C-index (survival), sensitivity/specificity at clinical thresholds
  2. Calibration: Calibration plots, Brier score
  3. Clinical utility: Decision curve analysis, net benefit at relevant thresholds
  4. External validation: Performance in independent populations