README

The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:

Installation

Basic idea

Consider the following setting where two samples are available: non-probability (denoted as \(S_A\) ) and probability (denoted as \(S_B\)) where set of auxiliary variables (denoted as \(\boldsymbol{X}\)) is available for both sources while \(Y\) and \(\boldsymbol{d}\) (or \(\boldsymbol{w}\)) is present only in probability sample.

Basic functionalities

Sample		Auxiliary variables \(\boldsymbol{X}\)	Target variable \(Y\)	Design (\(\boldsymbol{d}\)) or calibrated (\(\boldsymbol{w}\)) weights
\(S_A\) (non-probability)	1	\(\checkmark\)	\(\checkmark\)	?
	…	\(\checkmark\)	\(\checkmark\)	?
	\(n_A\)	\(\checkmark\)	\(\checkmark\)	?
\(S_B\) (probability)	\(n_A+1\)	\(\checkmark\)	?	\(\checkmark\)
	…	\(\checkmark\)	?	\(\checkmark\)
	\(n_A+n_B\)	\(\checkmark\)	?	\(\checkmark\)

Suppose \(Y\) is the target variable, \(\boldsymbol{X}\) is a matrix of auxiliary variables, \(R\) is the inclusion indicator. Then, if we are interested in estimating the mean \(\bar{\tau}_Y\) or the sum \(\tau_Y\) of the of the target variable given the observed data set \((y_k, \boldsymbol{x}_k, R_k)\), we can approach this problem with the possible scenarios:

When unit-level data is available for non-probability survey only

When unit-level data are available for both surveys

Examples

Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]

Estimator	Example code
Mass imputation based on regression imputation	nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, pop_totals = c(`(Intercept)`= N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_outcome = "glm", family_outcome = "gaussian" )
Inverse probability weighting	nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, pop_totals = c(`(Intercept)` = N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_selection = "logit" )
Inverse probability weighting with calibration constraint	nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, pop_totals = c(`(Intercept)`= N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_selection = "logit", control_selection = controlSel(est_method_sel = "gee", h = 1) )
Doubly robust estimator	nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + …, + xk, pop_totals = c(`(Intercept)` = N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" )

Estimator	Example code
Mass imputation based on regression imputation	`nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" )`
Mass imputation based on nearest neighbour imputation	`nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "nn", family_outcome = "gaussian", control_outcome = controlOutcome(k = 2) )`
Mass imputation based on predictive mean matching	`nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian" )`
Mass imputation based on regression imputation with variable selection (LASSO)	`nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian", control_outcome = controlOut(penalty = "lasso"), control_inference = controlInf(vars_selection = TRUE) )`
Inverse probability weighting	`nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_selection = "logit" )`
Inverse probability weighting with calibration constraint	`nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_selection = "logit", control_selection = controlSel(est_method_sel = "gee", h = 1) )`
Inverse probability weighting with calibration constraint with variable selection (SCAD)	`nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian", control_inference = controlInf(vars_selection = TRUE) )`
Doubly robust estimator	`nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" )`
Doubly robust estimator with variable selection (SCAD) and bias minimization	`nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian", control_inference = controlInf( vars_selection = TRUE, bias_correction = TRUE ) )`

Estimate population mean of y1 based on doubly robust estimator using IPW with calibration constraints.

Funding

Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.

References (selected)

Chen, Yilin, Pengfei Li, and Changbao Wu. 2020. “Doubly Robust Inference With Nonprobability Survey Samples.” Journal of the American Statistical Association 115 (532): 2011–21. https://doi.org/10.1080/01621459.2019.1677241.

Kim, Jae Kwang, Seho Park, Yilin Chen, and Changbao Wu. 2021. “Combining Non-Probability and Probability Survey Samples Through Mass Imputation.” Journal of the Royal Statistical Society Series A: Statistics in Society 184 (3): 941–63. https://doi.org/10.1111/rssa.12696.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.

———. 2023. “Survey: Analysis of Complex Survey Samples.”

Wu, Changbao. 2023. “Statistical Inference with Non-Probability Survey Samples.” Survey Methodology 48 (2): 283–311. https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm.

Yang, Shu, Jae Kwang Kim, and Youngdeok Hwang. 2021. “Integration of Data from Probability Surveys and Big Found Data for Finite Population Inference Using Mass Imputation.” Survey Methodology 47 (1): 29–58. https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm.

Yang, Shu, Jae Kwang Kim, and Rui Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (2): 445–65. https://doi.org/10.1111/rssb.12354.

`nonprobsvy`: an R package for modern statistical inference methods based on non-probability samples

Basic information

Installation

Basic idea

Basic functionalities

When unit-level data is available for non-probability survey only

When unit-level data are available for both surveys

Examples

Funding

References (selected)

nonprobsvy: an R package for modern statistical inference methods based on non-probability samples

Basic information

Installation

Basic idea

Basic functionalities

When unit-level data is available for non-probability survey only

When unit-level data are available for both surveys

Examples

Funding

References (selected)

`nonprobsvy`: an R package for modern statistical inference methods based on non-probability samples