Extrinsic variable selection

library("flevr")
#> flevr version 0.0.4: Flexible, Ensemble-Based Variable Selection with Potentially Missing Data

Introduction

In the main vignette, I discussed how to perform flexible variable selection in a simple, simulated example. In this document, I will discuss extrinsic selection in more detail.

Extrinsic variable selection is variable selection performed using extrinsic variable importance [@williamson2021]. Extrinsic variable importance is a summary of how a particular fitted algorithm makes use of the input features. For example, the extrinsic importance of a feature in a penalized regression model could be its estimated regression coefficient, while in a random forest, it could be the random forest variable importance measure.
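
To make these definitions concrete, the snippet below computes both types of extrinsic importance on a small simulated dataset; the toy data and both calls are for illustration only and are not specific to flevr.

set.seed(42)
# simulate a toy dataset with three features, only the first of which matters
x_toy <- matrix(rnorm(100 * 3), ncol = 3,
                dimnames = list(NULL, paste0("x", 1:3)))
y_toy <- rbinom(100, 1, plogis(x_toy[, 1]))
# penalized regression: importance as the absolute estimated coefficient
lasso_fit <- glmnet::cv.glmnet(x_toy, y_toy, family = "binomial")
abs(coef(lasso_fit, s = "lambda.min")[-1, ])
# random forest: importance as the impurity-based importance measure
rf_fit <- ranger::ranger(y ~ ., data = data.frame(y = y_toy, x_toy),
                         importance = "impurity")
rf_fit$variable.importance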

In flevr, we use the following definitions of extrinsic variable importance:

Algorithm                                  R package implementation
Generalized linear models                  stats::glm
Penalized linear models                    glmnet::glmnet [@friedman2010;@tay2023;@glmnet]
Mean                                       stats::mean
Multivariate adaptive regression splines   polspline::polymars [@polspline]
Random forests                             ranger::ranger [@ranger;@wright2017]
Support vector machines                    kernlab::ksvm [@karatzoglou2004;@kernlab]
Gradient boosted trees                     xgboost::xgboost [@xgboost]

Further, we define the extrinsic importance of a Super Learner ensemble [@vanderlaan2007] in one of two ways. The first is the weighted average of the variable importance ranks of the individual learners, with weights given by the ensemble weights estimated by the Super Learner itself (since the algorithm determines a weighted combination of the individual-learner predictions). The second is the variable importance from the best-performing individual learner.
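
As a hypothetical illustration of the first definition, suppose the Super Learner assigns weights 0, 0.77, and 0.23 to three learners, each of which ranks three features (rank 1 = most important); all numbers below are made up for illustration.

# hypothetical learner weights and per-learner importance ranks
sl_weights <- c(glm = 0, ranger = 0.77, lasso = 0.23)
per_learner_ranks <- rbind(
  glm    = c(feature1 = 1, feature2 = 2, feature3 = 3),
  ranger = c(feature1 = 2, feature2 = 1, feature3 = 3),
  lasso  = c(feature1 = 1, feature2 = 3, feature3 = 2)
)
# ensemble importance: the weighted average of the per-learner ranks
drop(sl_weights %*% per_learner_ranks)
#> feature1 feature2 feature3 
#>     1.77     1.46     2.77

Under this definition, feature2 would be ranked most important by the ensemble.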

Throughout this vignette, I will use a dataset inspired by data collected by the Early Detection Research Network (EDRN). Biomarkers developed at six “labs” are validated at one or more of four “validation sites” on 306 cysts. The data also include two binary outcome variables: whether the cyst was classified as mucinous, and whether the cyst was determined to have high malignant potential. The data contain some missing values, which complicates variable selection; only 212 cysts have complete information. In the first section below, we drop the observations with missing data; in the second, we handle the missing data appropriately using multiple imputation.

# load the dataset
data("biomarkers")
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# set up the outcome vector ("y") and feature data.frame ("x"), restricted to complete cases
cc <- complete.cases(biomarkers)
y_cc <- biomarkers$mucinous[cc]
x_cc <- biomarkers %>%
  na.omit() %>%
  select(starts_with("lab"), starts_with("cea"))
x_df <- as.data.frame(x_cc)
x_names <- names(x_df)

Extrinsic variable selection

The first step in performing extrinsic variable selection is to fit a Super Learner. This requires specifying a library of candidate learners [@vanderlaan2007;@superlearner]; throughout, I use a very simple library (for illustration and speed only; in practice, I suggest using a large library of learners).
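
One learner in the library below, SL.ranger.imp, is a custom wrapper rather than a built-in SuperLearner algorithm. A minimal sketch of such a wrapper follows, assuming it mirrors SuperLearner::SL.ranger but requests impurity-based variable importance (which the importance extraction below relies on); the exact definition used when this vignette was built may differ.

# a sketch of a ranger wrapper that computes impurity-based importance,
# modeled on SuperLearner::SL.ranger
SL.ranger.imp <- function(Y, X, newX, family, obsWeights = rep(1, length(Y)),
                          num.trees = 500, mtry = floor(sqrt(ncol(X))), ...) {
  if (family$family == "binomial") {
    Y <- as.factor(Y)
  }
  fit <- ranger::ranger(`_Y` ~ ., data = cbind(`_Y` = Y, X),
                        num.trees = num.trees, mtry = mtry,
                        probability = family$family == "binomial",
                        case.weights = obsWeights,
                        importance = "impurity")
  pred <- predict(fit, data = newX)$predictions
  if (family$family == "binomial") {
    # probability forests return a matrix of class probabilities
    pred <- pred[, "1"]
  }
  fit <- list(object = fit, verbose = FALSE)
  class(fit) <- "SL.ranger"
  list(pred = pred, fit = fit)
}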

set.seed(1234)
# fit a Super Learner ensemble; note its simplicity, for speed
library("SuperLearner")
#> Loading required package: nnls
#> Loading required package: gam
#> Loading required package: splines
#> Loading required package: foreach
#> Loaded gam 1.22-2
#> Super Learner
#> Version: 2.0-28.1
#> Package created on 2021-05-04
learners <- c("SL.glm", "SL.ranger.imp", "SL.glmnet")
V <- 2
fit <- SuperLearner::SuperLearner(Y = y_cc, X = x_df,
                                  SL.library = learners,
                                  cvControl = list(V = V),
                                  family = "binomial")
#> Loading required namespace: glmnet
#> Loading required namespace: ranger
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
#> prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# check the SL weights
fit$coef
#>        SL.glm_All SL.ranger.imp_All     SL.glmnet_All 
#>         0.0000000         0.7667208         0.2332792
# extract importance based on the whole Super Learner
sl_importance_all <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "all"
)
sl_importance_all
#> # A tibble: 21 × 2
#>    feature                  rank
#>    <chr>                   <dbl>
#>  1 lab3_muc3ac_score        3.33
#>  2 lab6_ab_score            4.10
#>  3 lab1_actb                4.87
#>  4 lab1_telomerase_score    5.63
#>  5 cea                      6.40
#>  6 lab4_glucose_score       7.17
#>  7 lab4_areg_score          7.93
#>  8 lab2_fluorescence_score  8.70
#>  9 lab3_muc5ac_score        9.47
#> 10 lab1_molecules_score    10.2 
#> # ℹ 11 more rows

Using the ensemble weights here means that weight approximately 0.77 is given to the random forest and weight approximately 0.23 is given to the lasso, while weight 0 is given to the logistic regression model.

To instead look at the importance of the best individual learner, use the following code:

sl_importance_best <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "best"
)
sl_importance_best
#> # A tibble: 21 × 2
#>    feature                  rank
#>    <chr>                   <dbl>
#>  1 lab3_muc3ac_score           1
#>  2 lab6_ab_score               2
#>  3 lab1_actb                   3
#>  4 lab1_telomerase_score       4
#>  5 cea                         5
#>  6 lab4_glucose_score          6
#>  7 lab4_areg_score             7
#>  8 lab2_fluorescence_score     8
#>  9 lab3_muc5ac_score           9
#> 10 lab1_molecules_score       10
#> # ℹ 11 more rows

In this case, this shows the importance in the random forest, the best-performing individual learner here.

Finally, to perform variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only 21 variables, we choose a threshold of 5.

extrinsic_selected <- extrinsic_selection(
  fit = fit, feature_names = x_names, threshold = 5, import_type = "all"
)
extrinsic_selected
#> # A tibble: 21 × 3
#>    feature                  rank selected
#>    <chr>                   <dbl> <lgl>   
#>  1 lab3_muc3ac_score        3.33 TRUE    
#>  2 lab6_ab_score            4.10 TRUE    
#>  3 lab1_actb                4.87 TRUE    
#>  4 lab1_telomerase_score    5.63 FALSE   
#>  5 cea                      6.40 FALSE   
#>  6 lab4_glucose_score       7.17 FALSE   
#>  7 lab4_areg_score          7.93 FALSE   
#>  8 lab2_fluorescence_score  8.70 FALSE   
#>  9 lab3_muc5ac_score        9.47 FALSE   
#> 10 lab1_molecules_score    10.2  FALSE   
#> # ℹ 11 more rows

Here, we select three variables, since these had weighted rank less than 5.
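
Equivalently, we could select by hand from the importance tibble computed earlier; the sketch below assumes that extrinsic_selection keeps features whose weighted rank falls below the threshold, which is consistent with the output above.

# manually select features with weighted rank below the threshold of 5
manually_selected <- sl_importance_all$feature[sl_importance_all$rank < 5]
manually_selected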

Extrinsic selection with missing data

n_imp <- 2

To properly handle the missing data, we first perform multiple imputation using the R package mice, here with only 2 imputations (in practice, a larger number of imputations is recommended).

library("mice")
set.seed(20231121)
mi_biomarkers <- mice::mice(data = biomarkers, m = n_imp, printFlag = FALSE)
imputed_biomarkers <- mice::complete(mi_biomarkers, action = "long") %>%
  rename(imp = .imp, id = .id)

We can perform extrinsic variable selection using the imputed data. First, we fit a Super Learner and perform extrinsic variable selection within each imputed dataset. Then, we select a final set of variables consisting of those selected in at least a pre-specified number of the imputed datasets (e.g., 3 of 5) [@heymans2007]. Again, we use a rank threshold of 5 within each imputed dataset.

set.seed(20231121)
# set up a list to collect the selected set from each imputed dataset
all_selected_vars <- vector("list", length = n_imp)
# for each imputed dataset, do extrinsic selection
for (i in 1:n_imp) {
  # fit a Super Learner
  these_data <- imputed_biomarkers %>%
    filter(imp == i)
  this_y <- these_data$mucinous
  this_x <- these_data %>%
    select(starts_with("lab"), starts_with("cea"))
  this_x_df <- as.data.frame(this_x)
  fit <- SuperLearner::SuperLearner(Y = this_y, X = this_x_df,
                                  SL.library = learners,
                                  cvControl = list(V = V),
                                  family = "binomial")
  # do extrinsic selection
  all_selected_vars[[i]] <- extrinsic_selection(
    fit = fit, feature_names = x_names, threshold = 5, import_type = "all"
  )$selected
}
# pool the selected sets across the imputed datasets
selected_vars <- pool_selected_sets(sets = all_selected_vars, threshold = 1 / n_imp)
x_names[selected_vars]
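
For intuition, the pooling rule can be sketched by hand; this is a conceptual illustration, not necessarily the flevr internals. A variable is retained if it is selected in at least a threshold fraction of the imputed datasets.

# conceptual sketch of pooling: keep variables whose selection proportion
# across imputations meets the threshold (here, threshold = 1 / n_imp
# keeps any variable selected in at least one imputed dataset)
pool_by_hand <- function(sets, threshold) {
  selection_rate <- Reduce(`+`, sets) / length(sets)
  selection_rate >= threshold
}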

In this case, only one variable is selected by both the multiple-imputation approach and the complete-case approach. This highlights the importance of handling missing data appropriately when performing variable selection.