| Title: | Core Survey Analysis Infrastructure |
|---|---|
| Description: | A modern, 'S7'-based foundation for survey analysis spanning both probability and non-probability samples. Probability sample designs include Taylor series linearization, replicate weights (BRR, Fay, jackknife, bootstrap), and two-phase estimation, following 'Lumley' (2004) <doi:10.18637/jss.v009.i08>. Non-probability sample designs support bootstrap and jackknife variance estimation for opt-in panels and convenience samples. Provides a unified estimator interface for means, frequencies, totals, quantiles, ratios, correlations, regression, and t-tests, with weighted 'polychoric' and 'polyserial' correlation following 'Mannan' (2025) <doi:10.2139/ssrn.6580480>. A metadata system preserves 'haven'-style variable labels, value labels, and question-preface attributes through all operations. Uses a 'tidyselect' interface throughout. |
| Authors: | Jacob Dennen [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3006-7364>), Thomas Lumley [ctb, cph] (Author of variance estimation code vendored from the 'survey' package) |
| Maintainer: | Jacob Dennen <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-06-08 20:25:20 UTC |
| Source: | https://github.com/jdenn0514/surveycore |
Returns a flat character vector of all design-variable column names
(ids, weights, strata, fpc) for any survey design class. NULL entries
are dropped; names are unique. Exported for use by extension packages
(e.g., surveytidy); not intended for end users.
.get_design_vars_flat(design)
design |
A survey design object ( |
A character vector of column names.
All person records from the 2022 American Community Survey (ACS) 1-Year Public Use Microdata Sample (PUMS) for Wyoming (state FIPS 56). Wyoming is the least-populous U.S. state, making this the smallest state-level PUMS file — ideal for fast tests and examples.
acs_pums_wy
A data frame with 5,962 rows and 96 variables.
Columns pwgtp1 through pwgtp80 are the 80 successive difference
replicate weights for variance estimation; the remaining 16 variables are:
puma: Public Use Microdata Area code. Use as the cluster ID (PSU)
for variance estimation.
st: State FIPS code (all 56 = Wyoming).
pwgtp: Person weight. Represents the number of people in the
Wyoming population that this record represents.
agep: Age (0–99 years).
sex: Sex (1 = male, 2 = female).
rac1p: Recoded detailed race (1 = White alone, 2 = Black or African
American alone, 3 = American Indian alone, 6 = Asian alone,
9 = Two or more races).
hisp: Recoded Hispanic origin (01 = Not Spanish/Hispanic/Latino;
02–24 = specific Hispanic origin).
schl: Educational attainment (24 categories: 01 = no schooling,
16 = regular high school diploma, 21 = bachelor's degree,
24 = doctorate degree).
esr: Employment status recode (1 = civilian employed at work,
2 = civilian employed with job but not at work, 3 = unemployed,
4 = Armed Forces at work, 5 = Armed Forces not at work,
6 = Not in labor force).
pincp: Total person income in the past 12 months (dollars, signed;
negative values indicate a net loss). Multiply by adjinc / 1e6 to
adjust to constant dollars.
wagp: Wages or salary income in the past 12 months (dollars).
NA if not applicable.
hicov: Health insurance coverage (1 = with health insurance,
2 = without health insurance).
dis: Disability recode (1 = with a disability, 2 = without a
disability).
povpip: Income-to-poverty ratio (0–501; 501 means 501% or more).
wkhp: Usual hours worked per week in the past 12 months. NA if
not in the labor force.
adjinc: Adjustment factor for income and earnings. Divide by
1,000,000 and multiply income variables to convert to 2022 constant
dollars.
Survey design: Successive difference replication (SDR). Use
as_survey_replicate() with all 80 replicate weights:
svy <- as_survey_replicate(
  acs_pums_wy,
  weights = pwgtp,
  repweights = pwgtp1:pwgtp80,
  type = "successive-difference"
)
Income adjustment: Income variables (pincp, wagp) are in survey-year
dollars. Multiply by adjinc / 1e6 to convert to 2022 inflation-adjusted
dollars before comparing across ACS years.
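The adjustment described above can be sketched in base R. The three records below are made-up stand-ins for PUMS rows (not values from acs_pums_wy), and pincp_adj is a hypothetical column name used only for illustration.

```r
# Toy stand-in for three person records; values are illustrative only.
pums <- data.frame(
  pincp  = c(52000, -1500, NA),              # signed income; NA = not applicable
  adjinc = c(1042311, 1042311, 1042311)      # adjustment factor, stored x 1e6
)

# Divide the factor by 1,000,000 and multiply the income variable.
pums$pincp_adj <- pums$pincp * pums$adjinc / 1e6

pums
```

Negative incomes stay negative and NA stays NA, so the adjusted column can be used wherever the raw one was.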
Metadata:
The ACS PUMS source is a plain CSV with no embedded labels. Columns in
acs_pums_wy carry no "label", "labels", or "question_preface"
attributes. Variable descriptions are documented here in ?acs_pums_wy and
in data-raw/README.md. Use set_var_label() and
set_val_labels() to attach labels manually before analysis if needed.
U.S. Census Bureau. 2022 ACS 1-Year PUMS. https://www.census.gov/programs-surveys/acs/microdata/access.html
# Wyoming population represented
sum(acs_pums_wy$pwgtp)

# Age distribution
hist(acs_pums_wy$agep, main = "Age distribution, Wyoming 2022", xlab = "Age")

# Confirm 80 replicate weights are present
sum(grepl("^pwgtp[0-9]", names(acs_pums_wy)))
survey_collection
Appends one or more surveys to an existing collection and returns a new
survey_collection. The original collection is unchanged. Surveys may be
passed with explicit names or as bare symbols (auto-named, like
as_survey_collection()). Duplicate names are repaired by appending
_1, _2, … Existing names are never modified during repair.
add_survey(.collection, ...)
.collection |
A |
... |
One or more surveys to append. Accepts named arguments
( |
Calling add_survey(x) with no additional surveys returns x unchanged;
no error is raised.
A new survey_collection with the appended surveys.
as_survey_collection(), remove_survey()
Other collections:
as_survey_collection(),
remove_survey(),
set_collection_id(),
set_collection_if_missing_var(),
survey_collection()
d1 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
d2 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
coll <- as_survey_collection(a = d1)
coll2 <- add_survey(coll, b = d2)
names(coll2)
A 19-variable extract from the 2024 American National Election Studies
(ANES) Time Series Study, a landmark biennial pre- and post-election survey
of the American electorate. Fielded via face-to-face interview and web
(n = 5,521). This extract uses the FTF + Web combined design variables
(v240103a–v240103d), the recommended set for most analyses.
anes_2024
A data frame with 5,521 rows and 19 variables:
v240103a: Pre-election weight (FTF+Web combined). Use for variables asked before November 5, 2024.
v240103b: Post-election weight (FTF+Web combined). Use for variables asked after November 5, 2024.
v240103c: PSU (FTF+Web combined). Use as the cluster ID for variance estimation.
v240103d: Stratum (FTF+Web combined). Use as the stratification variable.
2024 Time Series Case ID. Unique respondent identifier.
Sample type: 1 = Panel, 2 = Fresh Web, 3 = Fresh
FTF, 4 = GSS.
Pre/Post interview completion: 1 = Pre-election only,
2 = Pre- and post-election.
State FIPS code.
Census region: 1 = Northeast, 2 = Midwest,
3 = South, 4 = West.
Age on Election Day (summary). Top-coded at 80.
-2 = missing.
Sex: 1 = male, 2 = female.
Race/ethnicity (5-category summary): White non-Hispanic, Black non-Hispanic, Hispanic, Asian/NHPI non-Hispanic, Other/Multiracial non-Hispanic.
Education (5-category summary): 1 = less than HS,
2 = HS diploma, 3 = some college, 4 = bachelor's degree,
5 = graduate degree.
Household income (28 categories from < $5,000 to $250,000+).
Liberal-conservative self-placement (7-point scale):
1 = extremely liberal, 7 = extremely conservative.
99 = haven't thought about this.
Party identification strength: 1 = strong,
2 = not very strong.
Party identification lean (Independents): 1 = closer to
Republican, 2 = neither, 3 = closer to Democrat.
Did respondent vote for President (POST): 1 = yes,
2 = no.
Presidential vote choice (POST): 1 = Harris,
2 = Trump, 3 = RFK Jr., 4 = West, 5 = Stein, 6 = Other.
Survey design: Stratified cluster — use Taylor series linearization. Two weights are available depending on whether the analysis uses pre- or post-election variables:
# Pre-election analysis (party ID, ideology, candidate preference)
svy_pre <- as_survey(
  anes_2024,
  ids = v240103c,
  strata = v240103d,
  weights = v240103a,
  nest = TRUE
)

# Post-election analysis (validated vote choice)
svy_post <- as_survey(
  anes_2024,
  ids = v240103c,
  strata = v240103d,
  weights = v240103b,
  nest = TRUE
)
Missing value codes: The ANES uses negative integer codes for missing
data throughout: -9 = Refused, -8 = Don't know, -4 = Technical error,
-1 = Inapplicable, and others. These must be recoded to NA before
analysis. Check attr(anes_2024$v241177, "labels") for the full set of
codes for a given variable.
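One way to do the recode is sketched below in base R on a toy labelled vector. The helper name recode_negative_to_na() is hypothetical, not a surveycore function, and it only handles negative codes; positive special codes (such as 99 on the ideology item) would need separate handling.

```r
# Toy vector mimicking an ANES item with negative missing-value codes.
x <- c(1, 2, -9, -8, 7, -1)
attr(x, "labels") <- c("Refused" = -9, "Don't know" = -8, "Inapplicable" = -1)

# Hypothetical helper: set every negative code to NA.
# Subassignment keeps the "label"/"labels" attributes on the vector.
recode_negative_to_na <- function(v) {
  v[!is.na(v) & v < 0] <- NA
  v
}

x_clean <- recode_negative_to_na(x)
x_clean
```

Applying the helper column by column before constructing the design keeps the missing codes out of means and totals.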
Metadata:
All columns carry variable labels and value labels as R attributes from the
original Stata file, automatically extracted into surveycore's metadata
system when you call as_survey().
Variable labels ("label" attribute): A human-readable description of
each column. Example: attr(anes_2024$v241550, "label") returns
"PRE: What is your sex?" (or similar ANES phrasing).
Value labels ("labels" attribute): A named numeric vector mapping
each code to its meaning, including all missing-value codes. Example:
attr(anes_2024$v241550, "labels") returns a vector with entries for
Male, Female, and the applicable negative missing codes.
American National Election Studies. 2024 Time Series Study.
Available at electionstudies.org (free account required to download raw
data; the processed .rda is included in the package).
Prepared by data-raw/prepare-anes-2024.R.
# Variables in the dataset
names(anes_2024)

# Create pre-election design
svy <- as_survey(
  anes_2024,
  ids = v240103c,
  strata = v240103d,
  weights = v240103a,
  nest = TRUE
)

# Inspect variable label (ANES uses opaque V-codes; labels give context)
attr(anes_2024$v241177, "label")

# Inspect value labels, including missing-value codes
attr(anes_2024$v241177, "labels")
S3 method that dispatches to get_anova(). Pass one or two
survey_glm_fit objects; the single-model or pairwise path is chosen
automatically.
## S3 method for class 'survey_glm_fit'
anova(object, ..., method = "LRT", test = "F", null = NULL)
object |
A survey_glm_fit object. |
... |
An optional second survey_glm_fit for pairwise comparison; anything else errors. |
method |
Character(1). |
test |
Character(1). |
null |
Numeric or |
A survey_anova tibble; see get_anova() for column details.
Constructs a calibration data object from base sampling weights, g-weights
(calibration factors), and a model matrix of calibration covariates. The
returned list is suitable for assignment to the @calibration slot of a
survey_taylor or survey_replicate object.
as_caldata(base_weights, g_weights, model_matrix)
base_weights |
A numeric vector of positive, finite base sampling
weights (length n). These are the original sampling weights before
calibration is applied. Must not contain |
g_weights |
A numeric vector of positive, finite g-factors (length n).
The calibrated weights are |
model_matrix |
A numeric matrix with |
The resulting calibration object is used by the variance estimation routines to apply a Deville-Sarndal (1992) calibration correction to Taylor-series and replicate-weight variance estimates.
GLM limitation: Using a calibrated survey_taylor object with
survey_glm() produces correct but conservative standard errors until
GREG-GLM variance is implemented in a future release. The calibration
correction is not applied in the GLM variance path.
A named list with four elements:
qr: A QR decomposition (class "qr") of
sqrt(base_weights) * model_matrix. Used for calibration projection
in variance estimation.
w: A numeric vector of length n equal to
g_weights * sqrt(base_weights). This intermediate quantity (the
square root of the base weights, scaled by g) is used directly
in the GREG variance projection formula.
stage: Integer scalar 0L. Currently only between-PSU
calibration (stage 0) is supported.
index: NULL. Reserved for future within-PSU calibration
support.
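The four documented elements can be reproduced by hand in base R, which may help when checking an object returned by as_caldata(). This is a sketch of the quantities as described above, not the package's implementation; the toy weights and the two-column model matrix are made up.

```r
# Toy inputs (illustrative values only).
base_weights <- c(2.5, 3.0, 4.0)
g_weights    <- c(1.02, 0.98, 1.01)
model_matrix <- cbind(intercept = 1, x = c(0.2, 1.4, 0.9))

# sqrt(base_weights) recycles down the rows, so row i is scaled by sqrt(d_i).
qr_part <- qr(sqrt(base_weights) * model_matrix)

# g-factor times the square root of the base weight.
w_part <- g_weights * sqrt(base_weights)

# list() keeps the NULL element, so the result has four named entries.
sketch <- list(qr = qr_part, w = w_part, stage = 0L, index = NULL)
names(sketch)
```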
GREG calibration (single auxiliary variable or multiple uncorrelated
variables): pass the model matrix from model.matrix(formula, data)
directly – one column per calibration variable. The intercept column
((Intercept)) is included by default from model.matrix(); it
contributes one degree of freedom to the calibration adjustment.
Raking (multiple calibration margins, Architecture A): combine all
margin indicator matrices into a single matrix before calling
as_caldata(). Column-bind the matrices and drop one reference column
per margin to avoid rank deficiency (e.g., drop the last column of each
per-margin block). Pass this single combined matrix. Do not call
as_caldata() once per margin; that uses Architecture B (sequential),
which requires a separate as_caldata() element per calibration pass
and stores them all in the @calibration list.
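A minimal base R sketch of the combined-matrix construction follows. The data frame and the drop_last() helper are hypothetical, and the text does not say whether an intercept column should be added to the combined matrix, so this sketch leaves it out.

```r
# Toy frame with two raking margins (values are illustrative).
df <- data.frame(
  sex    = factor(c("f", "m", "f", "m", "f", "m")),
  region = factor(c("ne", "mw", "s", "w", "ne", "s"))
)

# One indicator block per margin (no intercept, one column per level).
m_sex    <- model.matrix(~ 0 + sex, df)
m_region <- model.matrix(~ 0 + region, df)

# Hypothetical helper: drop the last column of a block as the reference level.
drop_last <- function(m) m[, -ncol(m), drop = FALSE]

# Single combined matrix to pass as model_matrix.
combined <- cbind(drop_last(m_sex), drop_last(m_region))

colnames(combined)
qr(combined)$rank == ncol(combined)  # full column rank check
```

Checking the rank before calling as_caldata() is a cheap way to catch a margin whose reference column was not dropped.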
q_k = 1 assumption: as_caldata() always assumes q_k = 1
(uniform calibration weights). If your calibration uses a
non-unity q_k (variance-function weights from survey::calibrate(
calfun = "linear", variance = ...)), you must absorb those weights into
model_matrix before calling as_caldata().
survey_replicate designs: @calibration on a survey_replicate object is
provenance-only: it documents that the replicate weights were
derived from a calibrated design, but the variance estimator does not apply
any GREG projection. Calibration is already encoded in the replicate weights
themselves. Do not expect get_means() SE to differ between a
survey_replicate with and without @calibration set.
Deville, J.-C., and Sarndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382.
as_survey() for Taylor series designs,
as_survey_replicate() for replicate-weight designs,
as_survey_twophase() for two-phase designs
Other constructors:
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# Minimal example: 3-unit design, intercept-only calibration
base_weights <- c(2.5, 3.0, 4.0)
g_weights <- c(1.02, 0.98, 1.01)
model_matrix <- matrix(1, nrow = 3, ncol = 1)
cd <- as_caldata(base_weights, g_weights, model_matrix)
names(cd) # "qr", "w", "stage", "index"

# Assign to a survey_taylor design
df <- data.frame(y = c(1.2, 2.3, 3.4), wt = base_weights)
design <- as_survey(df, weights = wt)
design@calibration <- list(cd)
is.null(design@calibration) # FALSE
Creates a survey design object using Taylor series (linearization) for variance estimation. Supports simple random samples, stratified designs, single- and multi-stage cluster designs, and designs with finite population correction. Uses a tidy-select interface for all design variable arguments.
as_survey(
  data,
  ids = NULL,
  probs = NULL,
  weights = NULL,
  strata = NULL,
  fpc = NULL,
  nest = FALSE,
  calibration = NULL
)
data |
A |
ids |
< |
probs |
< |
weights |
< |
strata |
< |
fpc |
< |
nest |
Logical. If |
calibration |
A list of calibration data elements, each produced by
Known limitations (not validated at construction time):
|
A survey_taylor object.
All design variable arguments (ids, probs, weights, strata,
fpc) support tidy-select syntax: bare column names, c() to combine
multiple columns (multi-stage ids = c(psu, ssu), multi-stage fpc),
and tidyselect helpers like starts_with(). See the Examples section
below for runnable demonstrations.
When no ids or strata are specified, the result is a survey_taylor
object with NULL ids and strata — i.e., a simple random sample (SRS).
The Taylor variance machinery produces the same estimates as the classical
SRS formula (1 - f) * s^2 / n. If weights and probs are also both
omitted, uniform weights are assigned and a warning is issued.
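For reference, the classical SRS variance of a sample mean can be computed directly in base R. The sample values and the population size N are made up for illustration.

```r
# Toy simple random sample drawn from a population of N units.
y <- c(12, 15, 9, 20, 14, 11, 18, 16)
n <- length(y)
N <- 400            # assumed population size
f <- n / N          # sampling fraction

# Classical SRS variance of the mean: (1 - f) * s^2 / n
v_srs  <- (1 - f) * var(y) / n
se_srs <- sqrt(v_srs)

c(mean = mean(y), se = se_srs)
```

Without a finite population correction, f is treated as 0 and the formula reduces to s^2 / n.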
as_survey() does not support probability-proportional-to-size (PPS)
variance estimation. Taylor series linearization treats all designs as
with-replacement, which overestimates (is conservative for) variance in
PPS-without-replacement designs. The Yates-Grundy and Brewer/Overton
estimators available in survey::svydesign() via its pps and variance
arguments are not supported.
If your design requires PPS-specific variance estimation, create the design
with survey::svydesign() and convert it with from_svydesign():
d_survey <- survey::svydesign(
  ids = ~psu,
  weights = ~wt,
  strata = ~stratum,
  pps = "brewer",
  data = mydata
)
d <- from_svydesign(d_survey)
Deville, J.-C. and Sarndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., Sarndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Lumley, T. (2004) Analysis of complex survey samples. Journal of Statistical Software 9(8), 1–19.
Lumley, T. (2010) Complex Surveys: A Guide to Analysis Using R. John Wiley and Sons.
Rao, J.N.K., Yung, W. and Hidiroglou, M.A. (2002) Estimating equations for the analysis of survey data using poststratification information. Sankhya 64-A, 364–378.
Sarndal, C-E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Springer.
as_survey_replicate() for replicate-weight designs,
as_survey_twophase() for two-phase designs,
set_var_label() to add variable labels
Other constructors:
as_caldata(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# Full NHANES design: stratified cluster with PSU IDs nested within strata
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)

# Stratified design without PSU cluster IDs
d_strat <- as_survey(nhanes_2017, weights = wtint2yr, strata = sdmvstra)

# Blood pressure analysis: filter to exam participants, use MEC weight
exam <- nhanes_2017[nhanes_2017$ridstatr == 2, ]
d_bp <- as_survey(
  exam,
  ids = sdmvpsu,
  weights = wtmec2yr,
  strata = sdmvstra,
  nest = TRUE
)

# c() to combine multiple columns: sketched on a synthetic two-stage frame
df <- data.frame(
  psu = rep(1:5, each = 4),
  ssu = 1:20,
  wt = runif(20, 0.5, 2)
)
d_ms <- as_survey(df, ids = c(psu, ssu), weights = wt)

# Tidy-select helpers like starts_with() also work
d_h <- as_survey(
  gss_2024,
  ids = vpsu,
  strata = vstrat,
  weights = starts_with("wtssn"),
  nest = TRUE
)
Builds a survey_collection from one or more survey design objects for comparative analysis across waves, cross-sections, or sub-populations. Each element is stored independently — designs are never combined, and variance estimation is never re-specified.
as_survey_collection(..., group, .id = ".survey", .if_missing_var = "error")
... |
One or more |
group |
< |
.id |
Character(1). Identifier column name used when dispatching
analysis functions across the collection. Default |
.if_missing_var |
Character(1), one of |
Arguments may be passed with explicit names ("wave1" = d1) or as bare
symbols (d1, auto-named to "d1"). An unnamed argument that is not a
bare symbol (e.g., an inline as_survey(...) call) raises
surveycore_error_collection_unnamed_expr — name such arguments
explicitly.
Duplicate names are repaired by appending _1, _2, … to subsequent
occurrences (first occurrence preserved). When any rename occurs,
a surveycore_warning_collection_duplicate_name_repaired warning is
emitted showing the original -> repaired mapping.
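The repair rule as described (first occurrence kept, later occurrences suffixed _1, _2, …) can be illustrated in base R. repair_names_sketch() is a hypothetical helper written for this illustration; the package's actual logic, including how it handles a suffix that collides with an existing name, is not shown in this documentation.

```r
# Hypothetical illustration of the described rule, not surveycore's code.
repair_names_sketch <- function(nms) {
  out <- nms
  for (nm in unique(nms[duplicated(nms)])) {
    idx <- which(nms == nm)
    later <- idx[-1]                               # keep the first occurrence
    out[later] <- paste0(nm, "_", seq_along(later))
  }
  out
}

old <- c("wave", "panel", "wave", "wave")
new <- repair_names_sketch(old)

# Show the original -> repaired mapping for the names that changed.
changed <- old != new
paste(old[changed], "->", new[changed])
```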
A survey_collection object containing the supplied surveys.
survey_collection, add_survey(), remove_survey()
Other collections:
add_survey(),
remove_survey(),
set_collection_id(),
set_collection_if_missing_var(),
survey_collection()
d1 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
d2 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)

# Explicit names
coll <- as_survey_collection("2020" = d1, "2024" = d2)
names(coll)

# Bare-symbol auto-naming
coll2 <- as_survey_collection(d1, d2)
names(coll2)

# Uniform grouping across members
coll3 <- as_survey_collection(d1, d2, group = vstrat)
names(survey_data(coll3[[1L]]))
Creates a survey design object for non-probability samples (e.g., online panels, quota samples, volunteer panels). Accepts pre-computed calibration weights (including raking and post-stratification) or inverse probability weighting (IPW) pseudo-weights.
as_survey_nonprob(
  data,
  weights,
  repweights = NULL,
  type = "bootstrap",
  scale = NULL,
  rscales = NULL,
  mse = TRUE,
  reference_sample = NULL,
  calibration = NULL
)
data |
A |
weights |
< |
repweights |
< |
type |
Character scalar. Replicate variance type. When
|
scale |
Numeric scalar. Scaling factor for the replicate variance
formula. Default |
rscales |
Numeric vector of length |
mse |
Logical. If |
reference_sample |
Optional. A survey_taylor object representing the
probability-based reference sample used to estimate propensity scores or
calibration targets. Stored in |
calibration |
Optional. A calibration provenance object returned by
a surveywts weighting function. Stored in |
Unlike probability samples, non-probability samples have no design weights derived from known selection probabilities, which means estimates carry additional uncertainty not captured by standard design-based variance formulas. Per Elliott and Valliant (2017), Valliant, Dever, and Kreuter (2018), and Brick (2015), bootstrap or jackknife replicate weights are the recommended approach for variance estimation — they propagate calibration uncertainty into standard errors. Note, however, that replicate variance addresses calibration uncertainty only; it does not resolve uncertainty about the selection mechanism itself, which requires untestable modeling assumptions about the relationship between sample membership and the survey variables of interest. Without replicate weights, standard errors use a model-assisted SRS approximation that systematically underestimates variance for non-probability samples.
When repweights is supplied, the variance estimator uses the replicate
formula: V = scale * sum(rscales * (theta_r - theta)^2). For bootstrap
replicates (type = "bootstrap"), the default scale = 1/R follows Wu
(2022) and Chen et al. (2021). For jackknife replicates (type = "JK1",
"JK2", or "JKn"), scale and rscales follow the standard jackknife
variance conventions; see type for defaults.
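The replicate formula can be worked through by hand in base R for a weighted mean. This is a sketch under simplifying assumptions: the data are simulated, the bootstrap replicates simply rescale the full-sample weights by multinomial resampling counts, and no calibration is re-applied within replicates as the recommended workflow would require.

```r
set.seed(42)
n <- 100
y <- rnorm(n, mean = 50, sd = 10)   # simulated outcome
w <- runif(n, 0.5, 2)               # simulated full-sample weights
R <- 200                            # number of bootstrap replicates

# Full-sample estimate (theta).
theta <- weighted.mean(y, w)

# Replicate estimates (theta_r): weights times bootstrap resampling counts.
theta_r <- vapply(seq_len(R), function(r) {
  counts <- as.vector(rmultinom(1, size = n, prob = rep(1 / n, n)))
  weighted.mean(y, w * counts)
}, numeric(1))

# V = scale * sum(rscales * (theta_r - theta)^2), with bootstrap scale = 1/R.
scale   <- 1 / R
rscales <- rep(1, R)
V  <- scale * sum(rscales * (theta_r - theta)^2)
se <- sqrt(V)

c(estimate = theta, se = se)
```

Deviations are taken around the full-sample estimate here, which corresponds to the mse = TRUE default in the usage above.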
When repweights = NULL, standard errors use an SRS approximation (treating
each observation as its own PSU). This understates calibration uncertainty;
see vignette("creating-survey-objects") for details.
A survey_nonprob object.
Use as_survey_nonprob() instead of as_survey() when:
Your data comes from a non-probability sample (online panel, quota sample, MTurk/Prolific, etc.)
You have calibration or raking weights but no probability sampling design structure (no PSU IDs, strata, etc.)
You want to explicitly record the provenance of your calibration weights for reproducibility
If your data comes from a probability sample with known design structure,
use as_survey(), as_survey_replicate(), or as_survey_twophase()
instead.
Two modes are available, depending on whether repweights is supplied:
repweights = NULL (the default): Standard errors treat the calibrated weights as fixed and assume simple random sampling. This is a model-assisted approximation that understates calibration uncertainty. Use this mode only when replicate weights are unavailable; interpret standard errors with caution (Valliant 2020; Elliott and Valliant 2017).
repweights supplied: Each replicate weight column must contain calibrated weights re-estimated on one bootstrap draw (i.e., raking or post-stratification was re-applied within each replicate). This propagates calibration uncertainty into the variance estimate and is the recommended approach (Chrostowski et al. 2025; Kolenikov 2014).
See vignette("creating-survey-objects") for guidance on choosing between
these modes and on the limitations of SRS-based variance for calibrated
non-probability samples.
Valliant, R. (2020). Comparing alternatives for estimation from nonprobability samples. Journal of Survey Statistics and Methodology 8(2), 231–263. doi:10.1093/jssam/smz003
Elliott, M.R. and Valliant, R. (2017). Inference for nonprobability samples. Statistical Science 32(2), 249–264.
Chrostowski, M.J., Guzman, C.A. and Malm, L. (2025). Variance estimation for non-probability surveys. Journal of Survey Statistics and Methodology (forthcoming).
Brick, J.M. (2015). Compositional model inference. In Proceedings of the Section on Survey Research Methods, pp. 299–307. American Statistical Association, Alexandria, VA.
Valliant, R., Dever, J.A. and Kreuter, F. (2018). Practical Tools for Designing and Weighting Survey Samples, 2nd ed. Springer, New York.
Kolenikov, S. (2014). Calibrating variance estimation with proxy variables. Survey Methodology 40(1), 21–38.
Wu, C. (2022). Statistical inference with non-probability survey samples. Survey Methodology 48(2), 283–311.
Chen, Y., Li, P. and Wu, C. (2021). Doubly robust inference with non-probability survey samples. Journal of the American Statistical Association 115(532), 2011–2021.
as_survey() for probability designs with Taylor variance,
as_survey_replicate() for replicate-weight designs
Other constructors:
as_caldata(),
as_survey(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# Minimal: pre-computed calibration weights, SRS-based variance
df <- data.frame(
  y = rnorm(200),
  age = sample(c("18-34", "35-54", "55+"), 200, replace = TRUE),
  cal_wt = runif(200, 0.5, 2.5)
)
d <- as_survey_nonprob(df, weights = cal_wt)

# Bootstrap variance: replicate weights with calibration re-applied in each
set.seed(1)
R <- 50
rep_cols <- setNames(
  as.data.frame(
    matrix(runif(200 * R, 0.5, 2.5), nrow = 200)
  ),
  paste0("rep_", seq_len(R))
)
df_rep <- cbind(df, rep_cols)
d_boot <- as_survey_nonprob(
  df_rep,
  weights = cal_wt,
  repweights = starts_with("rep_"),
  type = "bootstrap"
)

# Jackknife variance (JK1): delete-one replicate weights
d_jk <- as_survey_nonprob(
  df_rep,
  weights = cal_wt,
  repweights = starts_with("rep_"),
  type = "JK1"
)
Creates a survey design object using replicate weights for variance estimation. Supports all common replicate methods: jackknife (JK1, JK2, JKn), balanced repeated replication (BRR, Fay), bootstrap, ACS, successive-difference, and user-defined types. Uses a tidy-select interface for weight and replicate-weight columns.
as_survey_replicate(
  data,
  weights,
  repweights,
  type = c("JK1", "JK2", "JKn", "BRR", "Fay", "bootstrap", "ACS",
    "successive-difference", "other"),
  scale = NULL,
  rscales = NULL,
  fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = TRUE,
  calibration = NULL
)
data |
A |
weights |
< |
repweights |
< |
type |
Character. Replicate weight method. One of |
scale |
Numeric. Scaling factor applied to the replicate variance
formula. If |
rscales |
Numeric vector of replicate-specific scaling factors, or
|
fpc |
< |
fpctype |
Character. How |
mse |
Logical. If |
calibration |
A list of calibration data elements, each produced by
Known limitations (not validated at construction time):
|
A survey_replicate object.
Both weights and repweights support tidy-select syntax:
# Bare name for weights
as_survey_replicate(
df, weights = wt, repweights = starts_with("repwt"), type = "BRR"
)
# c() for explicit replicate columns
as_survey_replicate(
df, weights = wt, repweights = c(rep1, rep2, rep3), type = "JK1"
)
The replicate weight matrix is not stored in the object. Only the
column names are stored in @variables$repweights. Variance estimation
computes the matrix on demand:
as.matrix(design@data[, design@variables$repweights]).
Each call to an estimation function (e.g., get_means(), get_totals())
materialises the full replicate weight matrix from the data frame, which
takes roughly nrow * n_replicates * 8 bytes per call: about 3.8 MB for the
5,962-row ACS Wyoming file with 80 replicates, and about 320 MB for a design
with 500,000 rows and 80 replicates. If you are estimating many variables,
this cost is repeated for each call.
This behaviour matches the survey package reference implementation.
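The size estimate above can be checked with a few lines of base R. This is only a sketch: the helper `repweight_matrix_bytes()` is a hypothetical name, not a surveycore function, and it counts only the 8 bytes per double-precision cell, ignoring R's small per-object overhead.

```r
# Hypothetical helper (not part of surveycore): bytes needed for a dense
# numeric matrix of replicate weights, at 8 bytes per double.
repweight_matrix_bytes <- function(n_rows, n_replicates) {
  n_rows * n_replicates * 8
}

# acs_pums_wy: 5,962 rows x 80 replicate weights
wy_bytes <- repweight_matrix_bytes(5962, 80)
wy_bytes / 1e6    # size in (decimal) megabytes

# A larger design: 500,000 rows x 80 replicate weights
big_bytes <- repweight_matrix_bytes(5e5, 80)
big_bytes / 1e6
```

Because the matrix is rebuilt on every estimation call, the per-call figure matters more than the size of the stored object.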
Canty, A.J. and Davison, A.C. (1999) Resampling-based variance estimation for labour force surveys. The Statistician 48(3), 379–391.
Deville, J.-C. and Sarndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., Sarndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Judkins, D.R. (1990) Fay's method for variance estimation. Journal of Official Statistics 6(3), 223–239.
Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992) Some recent work on resampling methods for complex surveys. Survey Methodology 18(2), 209–217.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. Springer.
as_survey() for Taylor series designs,
as_survey_twophase() for two-phase designs,
set_var_label() to add variable labels
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# ACS PUMS Wyoming: 80 successive-difference replicate weights
d_acs <- as_survey_replicate(
  acs_pums_wy,
  weights = pwgtp,
  repweights = pwgtp1:pwgtp80,
  type = "successive-difference"
)

# Explicit replicate columns using c()
d_sub <- as_survey_replicate(
  acs_pums_wy,
  weights = pwgtp,
  repweights = c(pwgtp1, pwgtp2, pwgtp3, pwgtp4),
  type = "JK1"
)
Creates a two-phase (double) sampling design from an existing
survey_taylor Phase 1 object. Phase 1 covers all rows; Phase 2 is a
strict subset indicated by a logical column. Uses a tidy-select interface
for all Phase 2 design variable arguments.
as_survey_twophase(
  phase1,
  ids2 = NULL,
  strata2 = NULL,
  probs2 = NULL,
  fpc2 = NULL,
  subset,
  method = c("full", "approx", "simple")
)
phase1 |
A survey design object (inheriting from |
ids2 |
< |
strata2 |
< |
probs2 |
< |
fpc2 |
< |
subset |
< |
method |
Character. Variance estimation method for combining Phase 1
and Phase 2 variability. One of |
"full" — Full two-phase variance formula. Accounts for variability in
both phases. Requires Phase 2 design information (probs2, ids2,
strata2) when Phase 2 is not a simple random subsample. If none of
these are provided, an error is raised.
"approx" — Approximation that ignores Phase 1 sampling variability.
Faster but less accurate than "full" when the Phase 1 sampling fraction
is non-negligible.
"simple" — Treats Phase 2 as a single-phase design, ignoring Phase 1.
Only valid when Phase 1 is a census (no sampling). Issues a warning when
Phase 1 has PSU cluster variables, because this understates variance for
clustered designs.
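The `method` default is written as a vector of choices, an R idiom that is conventionally resolved with `match.arg()`, which picks the first element when the argument is omitted. The sketch below assumes as_survey_twophase() follows that convention, which this page does not confirm; `resolve_method()` is a hypothetical stand-in.

```r
# Hypothetical wrapper illustrating the match.arg() idiom; this is not the
# surveycore implementation.
resolve_method <- function(method = c("full", "approx", "simple")) {
  match.arg(method)
}

resolve_method()           # argument omitted: the first choice is used
resolve_method("approx")   # explicit choice

# An unknown value is rejected with an error.
bad <- tryCatch(resolve_method("exact"), error = function(e) "error")
bad
```

Under this assumption, calling the constructor without `method` is equivalent to `method = "full"`.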
A survey_twophase object.
Sarndal, C-E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Springer.
Breslow, N.E. and Chatterjee, N. (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Applied Statistics 48, 457–468.
Breslow, N., Lumley, T., Ballantyne, C.M., Chambless, L.E. and Kulick, M. (2009) Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences. doi:10.1007/s12561-009-9001-6
as_survey() for Taylor series designs,
as_survey_replicate() for replicate-weight designs
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# Minimal two-phase design: Phase 1 = full cohort, Phase 2 = random subset
df <- data.frame(
  id = 1:20,
  wt = rep(2, 20),
  in_phase2 = c(rep(TRUE, 10), rep(FALSE, 10)),
  y = rnorm(20)
)
phase1 <- as_survey(df, ids = id, weights = wt)
d2 <- as_survey_twophase(phase1, subset = in_phase2)

# With Phase 2 stratification and inclusion probabilities
df2 <- data.frame(
  id = 1:30,
  wt = rep(3, 30),
  in_phase2 = c(rep(TRUE, 15), rep(FALSE, 15)),
  arm = rep(c("A", "B", "C"), 10),
  subsamprate = rep(c(0.5, 0.7, 0.3), 10),
  y = rnorm(30)
)
phase1b <- as_survey(df2, ids = id, weights = wt)
d2b <- as_survey_twophase(
  phase1b,
  strata2 = arm,
  probs2 = subsamprate,
  subset = in_phase2,
  method = "full"
)
Converts a survey_taylor, survey_replicate, or survey_twophase object
to the corresponding survey package object: svydesign, svrepdesign,
or twophase. Useful for accessing survey package estimation functions
or for round-trip testing.
as_svydesign(x)
x |
A |
Metadata (variable labels, value labels) is NOT carried over — the survey
package has no metadata system.
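Because labels are not carried over, one option is to capture them before converting and re-attach them afterwards. The following base R sketch works on a toy data frame with haven-style "label" attributes. The column values and label text are illustrative, and the attribute-stripping step only simulates a lossy conversion; it does not call as_svydesign().

```r
# Toy data frame with haven-style "label" attributes (illustrative values).
df <- data.frame(age = c(34, 51), sex = c(1L, 2L))
attr(df$age, "label") <- "Age in years"
attr(df$sex, "label") <- "Sex of respondent"

# Capture labels before conversion; columns without a label yield NULL.
saved_labels <- lapply(df, attr, which = "label", exact = TRUE)

# Simulate a conversion that drops column attributes.
stripped <- data.frame(lapply(df, as.vector))
is.null(attr(stripped$age, "label"))

# Re-attach the saved labels by column name.
for (nm in names(saved_labels)) {
  attr(stripped[[nm]], "label") <- saved_labels[[nm]]
}
attr(stripped$age, "label")
```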
A survey::svydesign, survey::svrepdesign, or survey::twophase
object.
from_svydesign() to convert back from a survey design
Other conversion:
as_tbl_svy(),
from_svydesign(),
from_tbl_svy()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
if (requireNamespace("survey", quietly = TRUE)) {
  sv <- as_svydesign(d)
  survey::svymean(~ridageyr, sv, na.rm = TRUE)
}
Converts a surveycore design object to an srvyr tbl_svy by first
converting to a survey design via as_svydesign() and then wrapping
with srvyr::as_survey(). Requires both survey and srvyr.
as_tbl_svy(x)
x |
A |
Metadata (variable labels, value labels) is NOT carried over.
A srvyr::tbl_svy object.
from_tbl_svy() to convert back from a tbl_svy object
Other conversion:
as_svydesign(),
from_svydesign(),
from_tbl_svy()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
if (
  requireNamespace("survey", quietly = TRUE) &&
    requireNamespace("srvyr", quietly = TRUE)
) {
  ts <- as_tbl_svy(d)
}
A simple random sample from the 2000 California Academic Performance
Index (API) study. 200 schools were randomly sampled. This is the same
underlying data as apisrs in the survey package, reformatted to
surveycore conventions.
ca_api_2000
A data frame with 200 rows and 38 variables:
Sampling weight (inverse probability of selection).
FPC (number of schools in the California API system).
County/district/school code (character, 14-digit).
School number (integer).
District number (integer).
Short school name (character).
Full school name (character).
District name (character).
County name (character).
County number (integer).
API score 2000 (integer).
API score 1999 (integer).
API growth target (integer).
API score change, api00 - api99 (integer).
Percent of students tested (integer).
Met school-wide growth target (integer, 0 = No, 1 = Yes).
Met comparable improvement target (integer, 0 = No, 1 = Yes).
Met both targets (integer, 0 = No, 1 = Yes).
Eligible for awards program (integer, 0 = No, 1 = Yes).
School type (integer): 1 = Elementary, 2 = High, 3 = Middle.
Year-round school (integer, 0 = No, 1 = Yes).
Percent of students receiving free meals (integer).
Number of English language learners (integer).
Percent of students in first year at school (integer).
Total number of students (integer).
Number of students included in API 2000 (integer).
Average class size, grades K–3 (integer; NA for high and middle schools).
Average class size, grades 4–6 (integer; NA for high schools and some others).
Average class size, core academic courses (integer; NA for most elementary schools).
Percent of parents who did not complete high school (integer).
Percent of parents who are high school graduates (integer).
Percent of parents with some college (integer).
Percent of parents who are college graduates (integer).
Percent of parents with graduate school education (integer).
Average parent education level (numeric).
Percent of parents who responded to the survey (integer).
Percent of teachers fully credentialed (integer).
Percent of teachers on emergency credentials (integer).
Survey design: Simple random sample. Use as_survey() with
weights = pw and fpc = fpc:
svy <- as_survey(ca_api_2000, weights = pw, fpc = fpc)
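For a simple random sample, the sampling weight and the finite population correction follow directly from the population and sample sizes. The arithmetic below is a sketch: the sample size of 200 comes from this page, while the population size of 6,194 schools is an assumption taken from the fpc value in survey::apisrs and should be checked against ca_api_2000$fpc.

```r
N <- 6194   # assumed population size (schools in the API system)
n <- 200    # sample size documented for ca_api_2000

# SRS weight: inverse of the selection probability n / N
pw <- N / n
pw

# Sampling fraction and the finite population correction factor that
# multiplies the with-replacement variance of a mean or total
sampling_fraction <- n / N
fpc_factor <- 1 - sampling_fraction
fpc_factor
```

With a sampling fraction of roughly 3%, the correction shrinks variances only slightly, but it is still part of the stated design.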
Missing values: Several columns have NA for schools where the value is
inapplicable: acs_k3 (grades K–3) is NA for high schools and middle
schools, where those grade spans do not exist; acs_46 (grades 4–6) is
NA for all high schools and some elementary and middle schools; acs_core is NA for
most elementary schools.
Metadata: All 38 columns carry "label" attributes (human-readable
variable descriptions). The six categorical columns (stype, sch_wide,
comp_imp, both, awards, yr_rnd) additionally carry "labels"
attributes mapping integer codes to category names, compatible with
surveycore's metadata system.
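To show how a "labels" attribute can be used outside the package, the sketch below turns labelled integer codes into a factor with base R. It assumes the haven convention in which the attribute is a named vector whose names are the category labels and whose values are the codes. The codes shown are the stype mapping listed above, applied to a made-up vector.

```r
# Made-up school-type codes carrying a haven-style "labels" attribute.
stype <- c(1L, 3L, 2L, 1L)
attr(stype, "labels") <- c(Elementary = 1L, High = 2L, Middle = 3L)

# Use the codes as factor levels and the names as factor labels.
labs <- attr(stype, "labels")
stype_factor <- factor(
  as.vector(stype),          # drop attributes before building the factor
  levels = unname(labs),
  labels = names(labs)
)
as.character(stype_factor)
```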
Relationship to apisrs: This dataset contains the same observations
as survey::apisrs, with three differences: (1) the all-NA flag
column is dropped; (2) factor columns are stored as plain integers with
labels attributes; (3) column names are in snake_case.
Lumley T (2004). Analysis of complex survey samples. Journal of Statistical
Software, 9(8):1–19. doi:10.18637/jss.v009.i08. Data distributed with the survey R package.
California Department of Education, Academic Performance Index 2000.
head(ca_api_2000[, c("pw", "fpc", "api00", "enroll")])

# Create an SRS design
svy <- as_survey(ca_api_2000, weights = pw, fpc = fpc)
svy

# Inspect variable label
attr(ca_api_2000$api00, "label")

# Inspect value labels for school type
attr(ca_api_2000$stype, "labels")
Groups variables by their shared question_preface metadata and classifies
each group as one of "single", "sata", or "battery". This is the single
source of truth used by downstream export functions to decide how to render
each question.
classify_question_type(x, ..., variable = NULL)
x |
A survey design object or |
... |
< |
variable |
|
The classification rules, applied per requested variable:
If the variable has no question_preface, or is the only requested
variable sharing its preface, type = "single".
If a question_preface is shared by 2+ requested variables and at least
one is flagged via set_sata(), all variables in that group get
type = "sata".
Otherwise (shared preface, no SATA flag), all variables in the group
get type = "battery".
Group numbers are assigned sequentially by first appearance in the input.
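The rules above can be sketched in base R as follows. `classify_sketch()` is a hypothetical function written for illustration, not the package's implementation. It takes plain vectors rather than a design object, and it assumes that a variable without a preface receives its own group id, which this page does not spell out.

```r
# Hypothetical re-statement of the classification rules on plain vectors.
classify_sketch <- function(variable, preface, sata) {
  type <- rep("single", length(variable))

  # Prefaces shared by two or more requested variables.
  shared_prefaces <- unique(preface[duplicated(preface) & !is.na(preface)])
  for (p in shared_prefaces) {
    idx <- which(!is.na(preface) & preface == p)
    # One SATA flag in the group makes the whole group "sata".
    type[idx] <- if (any(sata[idx])) "sata" else "battery"
  }

  # Group ids by first appearance. Assumption: variables with no preface
  # each get their own id.
  key <- ifelse(is.na(preface),
                paste0(".no_preface_", seq_along(variable)),
                preface)
  group <- match(key, unique(key))

  data.frame(variable, question_preface = preface, type, group,
             stringsAsFactors = FALSE)
}

res <- classify_sketch(
  variable = c("riagendr", "ridageyr", "bpxsy1"),
  preface  = c("Demographics", "Demographics", NA),
  sata     = c(TRUE, FALSE, FALSE)
)
res
```

Here only one of the two "Demographics" variables is flagged, yet both are classified as "sata", which is the second rule in action.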
A tibble with columns:
variable (character) — variable name
question_preface (character) — the preface, or NA if none
type (character) — one of "single", "sata", or "battery"
group (integer) — group id; variables with the same non-NA preface
share a group
set_sata(), extract_sata(), set_question_preface()
Other metadata:
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_question_preface(
  d,
  riagendr = "Demographics",
  ridageyr = "Demographics"
)
d <- set_sata(d, riagendr, ridageyr)
classify_question_type(d, riagendr, ridageyr, bpxsy1)
Converts a survey_glm_fit object into a survey_glm_tidy result tibble
with one row per model coefficient (plus optional reference rows for factor
predictors), design-based standard errors, confidence intervals, and
structured metadata.
clean(
  model,
  conf_level = 0.95,
  include_reference = TRUE,
  n = FALSE,
  statistic = TRUE,
  exponentiate = FALSE,
  interaction_sep = " * ",
  ...
)
model |
A |
conf_level |
Numeric scalar in |
include_reference |
Logical. If |
n |
Logical. If |
statistic |
Logical. If |
exponentiate |
Logical. If |
interaction_sep |
Character scalar. Separator for interaction term
labels. Default |
... |
Currently unused. |
A survey_glm_tidy object: a tibble with S3 class
c("survey_glm_tidy", "survey_result", "tbl_df", "tbl", "data.frame").
Metadata is accessed via meta().
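As a rough illustration of how `conf_level` and `exponentiate` interact, the sketch below computes a normal-approximation Wald interval and optionally exponentiates it. This is an assumption-laden stand-in: `wald_ci()` and its output names are hypothetical, and clean() may use a different reference distribution (for example a t distribution with design degrees of freedom).

```r
# Hypothetical helper: Wald interval on the link scale, optionally
# exponentiated (e.g. to report odds ratios from a logit model).
wald_ci <- function(estimate, std_error, conf_level = 0.95,
                    exponentiate = FALSE) {
  z <- stats::qnorm(1 - (1 - conf_level) / 2)
  out <- c(
    estimate  = estimate,
    conf_low  = estimate - z * std_error,
    conf_high = estimate + z * std_error
  )
  if (exponentiate) exp(out) else out
}

ci95 <- wald_ci(0.5, 0.1)
ci99 <- wald_ci(0.5, 0.1, conf_level = 0.99)
ci_exp <- wald_ci(0.5, 0.1, exponentiate = TRUE)
ci95
ci_exp
```

Exponentiating the endpoints rather than recomputing the interval keeps the bounds consistent with the interval on the original scale.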
survey_glm() to fit the model, meta() to access metadata.
Other analysis:
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
fit <- survey_glm(d, age ~ sex)
clean(fit)
clean(fit, conf_level = 0.99, exponentiate = FALSE)
Returns the direction-of-improvement ("better" or "worse") for one or
more variables in a survey design object or data frame. Variables with no
direction set return NA_character_.
extract_higher_is(x, ..., variable = NULL)
x |
A survey design object or |
... |
< |
variable |
|
A named character vector. Unset variables return NA_character_.
Returns character(0) (named, zero-length) when all specified variables
are missing from x.
set_higher_is() to set direction attributes
Other metadata:
classify_question_type(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_higher_is(d, bpxsy1 = "worse")
extract_higher_is(d, bpxsy1)
extract_higher_is(d)
Returns a summary of all metadata fields for one or more variables in a survey design object or data frame. Useful for auditing metadata state or building codebooks.
extract_metadata(x, ..., fill = NULL)
x |
A survey design object or |
... |
< |
fill |
|
A named list. Each entry is a named list with keys:
variable_label, value_labels, question_preface, note,
universe, missing_codes, transformations.
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_universe(d, ridageyr = "All participants 0+")
extract_metadata(d, ridageyr)
extract_metadata(d, fill = "include")
Returns missing value sentinel codes for one or more variables in a survey design object or data frame.
extract_missing_codes(x, ..., format = "list", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"list" (default): named list of atomic vectors. Empty: list().
"data_frame": long-format tibble with columns variable, description
(NA if codes vector is unnamed), code (coerced to character). Empty:
zero-row tibble.
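The relationship between the two formats can be sketched in base R. The code below is illustrative only: it builds a plain data frame rather than a tibble, and the variable `income` with its unnamed codes is hypothetical, added to show the NA description case.

```r
# "list" shape: a named list of atomic code vectors.
codes <- list(
  ridageyr = c("Not applicable" = 999L),
  income   = c(-1L, -2L)   # hypothetical variable with unnamed codes
)

# "data_frame" shape: one row per code, with codes coerced to character.
rows <- lapply(names(codes), function(v) {
  x <- codes[[v]]
  desc <- if (is.null(names(x))) rep(NA_character_, length(x)) else names(x)
  data.frame(
    variable    = v,
    description = desc,
    code        = as.character(unname(x)),
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, rows)
long
```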
set_missing_codes() to set missing value codes
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_missing_codes(d, ridageyr = c("Not applicable" = 999L))
extract_missing_codes(d, ridageyr)
extract_missing_codes(d, ridageyr, format = "data_frame")
Returns question preface text for one or more variables in a survey design object or data frame.
extract_question_preface(x, ..., format = "named_vector", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"named_vector" (default): named character vector. Empty: character(0).
"list": named list of character scalars. Empty: list().
"data_frame": tibble with columns variable and preface. Empty:
zero-row tibble.
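The three output shapes carry the same information. The base R sketch below shows how they relate, using a plain data frame in place of a tibble. The second variable, `life`, and its preface text are hypothetical.

```r
# "named_vector": names are variables, values are preface text.
prefaces <- c(
  happy = "Taken all together...",
  life  = "In general..."   # hypothetical variable and preface
)

# "list": a named list of character scalars.
prefaces_list <- as.list(prefaces)

# "data_frame": one row per variable.
prefaces_df <- data.frame(
  variable = names(prefaces),
  preface  = unname(prefaces),
  stringsAsFactors = FALSE
)

prefaces_list$happy
prefaces_df
```

The same pattern applies to the other extractors that offer these three formats, with the second column named after the field being extracted.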
set_question_preface() to set a question preface
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
d <- set_question_preface(d, happy = "Taken all together...")
extract_question_preface(d, happy)
Returns the reverse-coded status for one or more variables in a survey design object or data frame.
extract_reverse_coded(x, ..., variable = NULL)
x |
A survey design object or |
... |
< |
variable |
|
A named logical vector. Variables not marked as reverse-coded return
FALSE. When all specified variables are missing, returns logical(0).
set_reverse_coded() to set reverse-coded flags
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_reverse_coded(d, bpxsy1)
extract_reverse_coded(d, bpxsy1)
extract_reverse_coded(d)
Returns the SATA status for one or more variables in a survey design object or a data frame.
extract_sata(x, ..., format = "named_vector", fill = FALSE)
x |
A survey design object or |
... |
< |
format |
|
fill |
|
"named_vector" (default): named logical vector. Empty: logical(0).
"list": named list of logical scalars. Empty: list().
"data_frame": tibble with columns variable (character) and sata
(logical). Empty: zero-row tibble.
set_sata() to set SATA flags
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_sata(d, riagendr)
extract_sata(d, riagendr)
extract_sata(d, fill = NULL)
Returns universe (eligibility) descriptions for one or more variables in a survey design object or data frame.
extract_universe(x, ..., format = "named_vector", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"named_vector" (default): named character vector. Empty: character(0).
"list": named list of character scalars. Empty: list().
"data_frame": tibble with columns variable and universe. Empty:
zero-row tibble.
set_universe() to set a universe description
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
d <- set_universe(d, ridageyr = "All participants 0+")
extract_universe(d)
extract_universe(d, ridageyr, format = "data_frame")
Returns value labels for one or more variables in a survey design object or data frame.
extract_val_labels(x, ..., format = "list", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"list" (default): named list of named vectors. Empty: list().
"data_frame": long-format tibble with columns variable, label,
value (codes coerced to character). Empty: zero-row tibble.
set_val_labels() to set value labels
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
extract_val_labels(d, riagendr)
extract_val_labels(d, riagendr, format = "data_frame")
Returns variable labels for one or more variables in a survey design object or data frame.
extract_var_label(x, ..., format = "named_vector", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"named_vector" (default): named character vector. Empty: character(0).
"list": named list of character scalars. Empty: list().
"data_frame": tibble with columns variable and label. Empty:
zero-row tibble.
set_var_label() to set a variable label
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
extract_var_label(d)
extract_var_label(d, riagendr, ridageyr)
extract_var_label(d, format = "data_frame")
extract_var_label(d, fill = NA_character_)
Returns analyst notes for one or more variables in a survey design object or data frame.
extract_var_note(x, ..., format = "named_vector", fill = NULL)
x |
A survey design object or |
... |
< |
format |
|
fill |
Scalar or |
"named_vector" (default): named character vector. Empty: character(0).
"list": named list of character scalars. Empty: list().
"data_frame": tibble with columns variable and note. Empty:
zero-row tibble.
set_var_note() to set a note
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
d <- set_var_note(d, age = "Top-coded at 89")
extract_var_note(d, age)
Converts a survey package design object (svydesign, svrepdesign, or
twophase) to the corresponding surveycore S7 object. The data, design
variables, and replicate weights are preserved; metadata (variable labels,
value labels) is not — the survey package has no metadata system.
from_svydesign(x)
x |
A |
Weight column names are recovered from the design call when available. When
the call does not contain a formula (e.g., weights were passed as a vector),
the weight column is identified by matching the stored weight values against
columns in the data. If no match is found, a ..surveycore_wt.. column is
added.
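The value-matching fallback described above might look roughly like the base R sketch below. `recover_weight_column()` is a hypothetical helper, not the package's code. It matches with `all.equal()`, so tiny floating-point differences are tolerated, and it returns the first matching column, which is an assumption about tie-breaking.

```r
# Hypothetical helper: find the data column whose values equal the stored
# weights, or append a ..surveycore_wt.. column when nothing matches.
recover_weight_column <- function(data, w) {
  matches <- vapply(
    data,
    function(col) {
      is.numeric(col) &&
        length(col) == length(w) &&
        isTRUE(all.equal(as.numeric(col), as.numeric(w)))
    },
    logical(1)
  )
  if (any(matches)) {
    return(list(data = data, weight_col = names(data)[which(matches)[1]]))
  }
  data[["..surveycore_wt.."]] <- w
  list(data = data, weight_col = "..surveycore_wt..")
}

df <- data.frame(y = c(1, 2, 3), wt = c(2, 2.5, 4))
found <- recover_weight_column(df, c(2, 2.5, 4))
found$weight_col

added <- recover_weight_column(df, c(9, 9, 9))
names(added$data)
```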
A survey_taylor, survey_replicate, or survey_twophase object.
as_svydesign() to convert in the other direction
Other conversion:
as_svydesign(),
as_tbl_svy(),
from_tbl_svy()
if (requireNamespace("survey", quietly = TRUE)) {
  sv <- survey::svydesign(
    ids = ~sdmvpsu,
    weights = ~wtint2yr,
    strata = ~sdmvstra,
    data = nhanes_2017,
    nest = TRUE
  )
  d <- from_svydesign(sv)
  survey_data(d)
}
Converts an srvyr tbl_svy to a surveycore design object by delegating
to from_svydesign(). A tbl_svy IS a survey.design, so the conversion
is structurally identical. Requires both survey and srvyr.
from_tbl_svy(x)
x |
A |
A survey_taylor, survey_replicate, or survey_twophase object.
as_tbl_svy() to convert in the other direction
Other conversion:
as_svydesign(),
as_tbl_svy(),
from_svydesign()
if (
  requireNamespace("survey", quietly = TRUE) &&
    requireNamespace("srvyr", quietly = TRUE)
) {
  ts <- srvyr::as_survey(
    survey::svydesign(
      ids = ~sdmvpsu,
      weights = ~wtint2yr,
      strata = ~sdmvstra,
      data = nhanes_2017,
      nest = TRUE
    )
  )
  d <- from_tbl_svy(ts)
}
Rao-Scott design-based ANOVA and design-based Wald tests for
survey_glm() fits. The object argument accepts three input shapes: a single
survey_glm_fit, a list of survey_glm_fit objects, or a survey design.
get_anova(
  object,
  formula = NULL,
  response = NULL,
  predictors = NULL,
  ...,
  method = c("LRT", "Wald"),
  test = c("F", "Chisq"),
  null = NULL,
  tolerance = sqrt(.Machine$double.eps),
  decimals = NULL,
  label_vars = TRUE,
  name_style = "surveycore"
)
object |
A survey_glm_fit, a list of survey_glm_fit objects, or a survey design (survey_base subclass). |
formula |
A model formula (e.g. |
response |
Character string naming the outcome variable. Only used
when |
predictors |
Character vector of predictor variable names. Only used
when |
... |
Additional arguments forwarded to |
method |
Character(1). |
test |
Character(1). |
null |
Numeric or |
tolerance |
Numeric(1). Reciprocal-condition-number threshold for the
naive-covariance near-singular gate in the Rao-Scott LRT. Default
|
decimals |
Integer(1) or |
label_vars |
Logical(1). When |
name_style |
Character(1). |
A single survey_glm_fit — sequential mode, one row per term.
A list of survey_glm_fit objects — chained pairwise comparison,
producing length(object) - 1 rows.
A survey design (any survey_base subclass) — fits the model internally
via survey_glm() using formula (or response + predictors), then
runs sequential anova on the fit.
Supports the four method × test combinations shared with
survey::anova.svyglm(): Rao-Scott working-LRT with F or Chisq reference,
and design-based Wald with F or Chisq reference.
A survey_anova tibble with columns term, statistic, df,
ddf, deff, p_value, stars and a .meta attribute. When
name_style = "broom", p_value is renamed to p.value and ddf
is renamed to df.residual.
Other analysis:
clean(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
gss_cc <- gss_2024[stats::complete.cases(gss_2024[, c("age", "sex", "educ")]), ]
gss_design <- as_survey(
  gss_cc,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)

# Single fit
fit <- survey_glm(gss_design, age ~ sex + educ)
get_anova(fit)

# Design + formula (fits internally)
get_anova(gss_design, age ~ sex + educ)

# List of fits (chained pairwise comparison)
fit_s <- survey_glm(gss_design, age ~ sex)
fit_b <- survey_glm(gss_design, age ~ sex + educ)
get_anova(list(fit_s, fit_b))
Compute pairwise correlations between two or more variables in a survey
design, with design-based standard errors and confidence intervals. Returns
results in long or wide format. The estimator is selected by method:
"pearson" (default) for two numeric variables, "polychoric" for two
ordinal variables under a bivariate-normal latent model (Olsson 1979),
or "polyserial" for one ordinal + one continuous variable (Cox 1974).
The survey-weighted polychoric and polyserial estimators (point estimates
and design-based variance) are implemented from scratch following
Mannan (2025); they are not derived from the survey package, which does
not provide these estimators.
get_corr(
  design,
  x,
  group = NULL,
  format = c("long", "wide"),
  redundant = FALSE,
  diagonal = FALSE,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  method = "pearson",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
<
|
group |
< |
format |
|
redundant |
Logical. If |
diagonal |
Logical. If |
variance |
|
conf_level |
Numeric scalar in (0, 1). Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum pairwise unweighted count before
|
na.rm |
Logical. Controls |
label_values |
Logical. If |
label_vars |
Logical. If |
name_style |
|
method |
Character(1). Estimator applied to every pair. One of
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
Polychoric / polyserial semantics. For method != "pearson", each pair
is fit by a two-step MLE: weighted marginal thresholds (and, for
polyserial, a weighted standardization of the continuous side) are
estimated first, then rho is maximised over the weighted
log-likelihood via stats::optimize() on (-1 + 1e-6, 1 - 1e-6).
Confidence intervals are constructed on the Fisher-z scale
(atanh(rho)) and back-transformed via tanh with truncation to
[-1, 1]. The Wald statistic zeta.hat / SE(zeta.hat) is referred to
a standard normal distribution, so df = NA_integer_ — distinct from
the Pearson case where df = n - 2 and the t-distribution is used.
Column label attributes are method-neutral (e.g. "statistic", not
"t-statistic" / "z-statistic"); check meta(result)$method to
interpret the values.
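The Fisher-z interval construction described above can be sketched in base R. This is an illustration only: fisher_z_ci() is a hypothetical helper rather than a surveycore function, the input values are made up, and it assumes that zeta.hat denotes atanh(rho) and that the standard error is already expressed on that scale.

```r
# Hypothetical helper (not part of surveycore): Fisher-z confidence interval
# and Wald z-test for a correlation estimate.
#   rho_hat    - point estimate of the correlation
#   se_z       - standard error of atanh(rho_hat), assumed to be on the z scale
#   conf_level - confidence level in (0, 1)
fisher_z_ci <- function(rho_hat, se_z, conf_level = 0.95) {
  z_hat <- atanh(rho_hat)                       # transform to the Fisher-z scale
  crit <- stats::qnorm((1 + conf_level) / 2)    # standard normal critical value
  bounds <- tanh(z_hat + c(-1, 1) * crit * se_z) # back-transform the z-scale bounds
  bounds <- pmin(pmax(bounds, -1), 1)           # truncate to [-1, 1]
  statistic <- z_hat / se_z                     # Wald statistic
  p_value <- 2 * stats::pnorm(-abs(statistic))  # two-tailed, standard normal reference
  list(
    ci_low = bounds[1],
    ci_high = bounds[2],
    statistic = statistic,
    p_value = p_value,
    df = NA_integer_                            # no finite df under the normal reference
  )
}

# Made-up inputs for illustration.
res <- fisher_z_ci(rho_hat = 0.40, se_z = 0.08)
res
```

Because the bounds are computed on the atanh scale and then back-transformed, the resulting interval is generally not symmetric around the point estimate.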
Bivariate-normal assumption. The polychoric / polyserial MLEs assume the underlying latent variables are jointly bivariate-normal. This is an unverified assumption; no runtime diagnostic is performed.
Taylor-path cost. On a survey_taylor design, the variance path
for method != "pearson" is O(n) re-optimisations per variable pair
(a perturbation-based influence function). For large n and many
pairs, passing a survey_replicate design (one re-fit per replicate,
not per respondent) is substantially faster.
Replicate-type caveat. Mannan (2025) verifies the replicate-weight
variance formula for jackknife and bootstrap replicates. BRR and Fay
replicates are admitted mechanically via the design's stored scale
/ rscales coefficients, but the paper does not validate their
behaviour for this non-linear pseudo-likelihood estimator.
A survey_corr tibble (also inheriting survey_result).
When group is active, group variable columns are prepended before all
other columns in both long and wide formats.
Long format columns:
[group_cols...] — group variable columns (when active), first.
var1, var2 — variable names (or labels when
label_vars = TRUE).
r — correlation coefficient. For method = "pearson", the
weighted product-moment correlation; for "polychoric" /
"polyserial", the MLE of rho under a bivariate-normal latent
model.
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance.
p_value — two-tailed p-value.
statistic — test statistic. A t-statistic for
method = "pearson"; a Wald z-statistic for latent methods.
df — degrees of freedom. n - 2 for method = "pearson";
NA_integer_ for "polychoric" and "polyserial" (asymptotic
normal distribution is used).
n — pairwise unweighted count.
n_weighted — pairwise sum of weights (only when requested).
Wide format columns:
[group_cols...] — group variable columns (when active), first.
variable — row variable names (or labels).
One column per focal variable, containing r values.
Use meta(result) to access design type, variable labels, and
method ("pearson", "polychoric", or "polyserial"). For
method != "pearson", meta(result)$bivariate_normal_cdf is
"pbivnorm" (the bivariate-normal CDF used internally). When the
replicate variance path observed one or more non-converged replicates,
meta(result)$n_failed_replicates_total carries the scalar total.
Cox, N. R. (1974). Estimation of the correlation between a continuous and a discrete variable. Biometrics, 30(1), 171-178.
Mannan, H. (2025). SAS programs for estimation of weighted polychoric and weighted polyserial correlations in a complex survey. SSRN. doi:10.2139/ssrn.6580480
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.
Other analysis:
clean(),
get_anova(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
get_corr(d, x = c(ridageyr, bpxsy1))

# Wide correlation matrix
get_corr(d, x = c(ridageyr, bpxsy1), format = "wide")

# AAPOR-compliant
get_corr(
  d,
  x = c(ridageyr, bpxsy1),
  variance = c("ci", "moe"),
  n_weighted = TRUE
)

# Polychoric correlation between two ordinal variables
df <- data.frame(
  id = 1:200,
  wt = runif(200, 0.5, 2),
  o1 = factor(sample(1:4, 200, replace = TRUE), ordered = TRUE),
  o2 = factor(sample(1:4, 200, replace = TRUE), ordered = TRUE)
)
d_ord <- as_survey(df, weights = wt)
get_corr(d_ord, x = c(o1, o2), method = "polychoric")
Compute the design-based estimate of the finite-population Pearson
covariance for every (unordered, by default) pair of numeric variables
selected from x, with optional grouping, uncertainty quantification,
and metadata-driven labelling. Matches the off-diagonal entries of
survey::svyvar() (Kish n/(n-1) correction) on Taylor, replicate,
twophase, and nonprob designs at numerical parity.
get_covariance(
  design,
  x,
  group = NULL,
  redundant = FALSE,
  diagonal = FALSE,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
group |
< |
redundant |
Logical. If |
diagonal |
Logical. If |
variance |
|
conf_level |
Numeric scalar in |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum pairwise unweighted count before
|
na.rm |
Logical. If |
label_values |
Logical. If |
label_vars |
Logical. If |
name_style |
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
Confidence intervals use the normal-Wald approximation on the SE of the
covariance estimate: ci_low = covariance - z * se,
ci_high = covariance + z * se, where z = qnorm((1 + conf_level) / 2).
The bounds are not clamped. Covariance is unbounded — ci_low and
ci_high may have opposite signs and may cross zero. Users who want
clamped intervals can post-process. This behaviour matches
survey::svyvar().
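As a small worked example of the normal-Wald interval above, the following base R sketch plugs made-up numbers into the stated formulas. The values of covariance and se are illustrative assumptions and are not taken from any real design.

```r
# Illustrative values only (assumed for this sketch).
covariance <- 12.5
se <- 8.0
conf_level <- 0.95

# Critical value and unclamped Wald bounds, as in the formulas above.
z <- stats::qnorm((1 + conf_level) / 2)
ci_low <- covariance - z * se
ci_high <- covariance + z * se

c(ci_low = ci_low, ci_high = ci_high)
```

With these inputs the lower bound is negative and the upper bound is positive, which shows an interval crossing zero without any clamping.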
NA handling is pairwise-complete per pair: each ordered pair drops
rows where either variable is NA. There is no na_handling argument;
pairwise is the only policy. This matches survey::svyvar() off-diagonal
pair-at-a-time semantics, not svyvar()'s default listwise deletion
across a multi-variable formula. Numerical parity therefore only holds
when oracle calls are made pair-at-a-time
(survey::svyvar(~x + y, design) per pair).
Under diagonal = TRUE, the self-pair (x, x) returns the design-based
Kish-corrected variance of x on the active domain — not 1 as in
get_corr(). The covariance matrix diagonal is the variance vector, not
the identity. The diagonal-parity gate guarantees that
get_covariance(d, c(x, x), diagonal = TRUE)$covariance and $se equal
get_variance(d, x)$variance and $se numerically (point at 1e-10,
SE at 1e-8) when the active domains match.
Design effect (deff) uses the Goodnight / Mood-Graybill SRS reference
SE_SRS(cov) = sqrt((Var(x) * Var(y) + cov^2) / (n - 1)). When both
the design SE and SRS SE are zero (constant-variable pairs), deff is
set to exactly 0 (0 / 0 guard).
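The SRS reference formula quoted above can be written out in base R as follows. Both helpers are hypothetical names introduced for illustration. The sketch also assumes that deff is the squared ratio of the design SE to the SRS SE, which the text implies but does not state explicitly.

```r
# Hypothetical helper: SRS reference standard error of a covariance,
# SE_SRS(cov) = sqrt((Var(x) * Var(y) + cov^2) / (n - 1)).
srs_se_cov <- function(var_x, var_y, cov_xy, n) {
  sqrt((var_x * var_y + cov_xy^2) / (n - 1))
}

# Hypothetical helper: design effect as a variance ratio (assumption),
# returning exactly 0 when both standard errors are zero (the 0 / 0 guard).
deff_cov <- function(se_design, se_srs) {
  if (se_design == 0 && se_srs == 0) {
    return(0)
  }
  (se_design / se_srs)^2
}

# Made-up inputs: Var(x) = 4, Var(y) = 9, cov = 3, n = 101.
se_srs <- srs_se_cov(var_x = 4, var_y = 9, cov_xy = 3, n = 101)
deff_cov(se_design = 1.0, se_srs = se_srs)
deff_cov(se_design = 0, se_srs = 0)
```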
A survey_covariance tibble (also inheriting survey_result).
Columns, in order:
[group_cols...] — group variable columns (when active), first.
var1, var2 — factor columns identifying the pair (levels in
x-supply order).
covariance — design-based Pearson covariance estimate
(Kish-corrected). NaN for degenerate cells; 0 for pairs where
at least one variable is constant on the active domain.
Uncertainty columns (se, var, cv, ci_low, ci_high,
moe, deff) — only those requested via variance.
n — pairwise unweighted count.
n_weighted — pair's sum of weights (only when requested).
Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.
Demnati, A., & Rao, J. N. K. (2004). Linearization variance estimators for survey data. Survey Methodology, 30, 17–26.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
get_covariance(d, x = c(ridageyr, bpxsy1))

# Include the diagonal (self-pairs return Var(x), not 1)
get_covariance(d, x = c(ridageyr, bpxsy1), diagonal = TRUE)

# With grouping
get_covariance(d, x = c(ridageyr, bpxsy1), group = riagendr)
Estimates treatment effects (differences from a reference group) via survey-weighted regression. Supports bivariate and multivariate models, Gaussian and non-Gaussian families, and optional subgroup analysis.
get_diffs(
  design,
  x,
  treats,
  group = NULL,
  covariates = NULL,
  ref_level = NULL,
  pval_adj = NULL,
  show_means = TRUE,
  show_pct_change = FALSE,
  scale = c("ame", "link"),
  variance = "ci",
  conf_level = 0.95,
  alpha = 0.05,
  show_favorability = FALSE,
  min_cell_n = 30L,
  n_weighted = FALSE,
  decimals = NULL,
  na.rm = TRUE,
  label_values = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
treats |
< |
group |
< |
covariates |
Character vector of additional model terms as strings.
Supports interactions ( |
ref_level |
Character(1). Reference level of |
pval_adj |
Character(1) or |
show_means |
Logical. If |
show_pct_change |
Logical. If |
scale |
Character(1). |
variance |
|
conf_level |
Numeric(1) in (0, 1). Confidence level. Default
|
alpha |
Numeric(1) strictly between 0 and 1. Significance threshold
used when |
show_favorability |
Logical. If |
min_cell_n |
Integer(1). Minimum unweighted cell size before
|
n_weighted |
Logical. If |
decimals |
Integer(1) or |
na.rm |
Logical. If |
label_values |
Logical. If |
name_style |
|
... |
Passed to |
.id |
Character(1) or |
.if_missing_var |
|
get_diffs() uses two estimation paths:
Clean path (bivariate Gaussian with no covariates and no group, OR
any family with scale = "link"): extracts coefficients directly from
clean(). The intercept is the reference group mean; treatment
coefficients are differences from reference. When scale = "link" and
the family is non-Gaussian, mean and pct_change are suppressed.
Marginaleffects path (covariates, non-Gaussian with
scale = "ame", or group): uses avg_slopes() for estimates and
avg_predictions() for means.
When scale = "link" and the family is non-Gaussian, the mean and
pct_change columns are suppressed (omitted entirely). Link-scale
means are not substantively meaningful.
When group is active, p-value adjustment is applied independently
within each group. For global adjustment across all comparisons,
apply stats::p.adjust() to the result manually. Confidence intervals
reflect the specified conf_level and are not affected by p-value
adjustment.
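The difference between within-group and global adjustment can be illustrated with stats::p.adjust() on a plain data frame. The data frame below is a made-up stand-in for a grouped result, and its column names are assumptions. The sketch also assumes the p-values are unadjusted (for example, obtained with pval_adj = NULL) before the global adjustment is applied.

```r
# Made-up stand-in for a grouped result with unadjusted p-values.
res <- data.frame(
  group = rep(c("g1", "g2"), each = 3),
  p_raw = c(0.010, 0.030, 0.200, 0.004, 0.040, 0.500)
)

# Within-group adjustment: each group's p-values are adjusted separately.
res$p_within <- ave(
  res$p_raw,
  res$group,
  FUN = function(p) stats::p.adjust(p, method = "BH")
)

# Global adjustment: all comparisons are adjusted together.
res$p_global <- stats::p.adjust(res$p_raw, method = "BH")

res
```

Global adjustment treats all six comparisons as one family, so it is the more conservative choice when conclusions are drawn across groups.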
All p-values and confidence intervals use the t-distribution with design-based residual degrees of freedom, regardless of estimation path.
By default, non-Gaussian models report average marginal effects on
the response scale. Set scale = "link" for coefficients on the link
scale (e.g., log-odds for logistic regression).
A survey_diffs tibble (also inheriting survey_result).
Columns (in order): group columns (when active), treatment variable,
estimate, pct_change (optional), mean (optional), n,
n_weighted (optional), se (optional), ci_low (optional),
ci_high (optional), p_value, stars, favorable (optional),
backlash (optional). Use meta() to access design type, family,
reference level, and other metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
# Create survey design with treatment groups
set.seed(42)
df <- data.frame(
  id = 1:200,
  wt = runif(200, 0.5, 2),
  dv = rnorm(200, 50, 10),
  arm = factor(sample(c("Control", "A", "B"), 200, TRUE))
)
d <- as_survey(df, weights = wt)

# Basic treatment effect
get_diffs(d, dv, arm)

# With percentage change and p-value adjustment
get_diffs(d, dv, arm, show_pct_change = TRUE, pval_adj = "BH")
Computes the effective sample size of a survey design using either the
Kish (1965) weight-only approximation (method = "kish") or the full
design-effect-based formula for a specified variable (method = "deff").
get_effective_n(
  design,
  x = NULL,
  group = NULL,
  method = c("kish", "deff"),
  na.rm = TRUE,
  decimals = NULL,
  min_cell_n = 30L,
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
group |
< |
method |
Character(1). |
na.rm |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
The Kish method (method = "kish") computes effective N from survey
weights alone: n_eff = sum(w)^2 / sum(w^2). It captures only weight variation.
For clustered designs with equal weights, deff_kish = 1.0 even when the
true design effect is substantially greater due to clustering. Use
method = "deff" to capture the full design effect for a specific
analysis variable.
The DEFF method (method = "deff") computes effective N as
n_eff = n / DEFF, where DEFF = Var_design / Var_SRS for variable x.
It captures clustering, stratification, and weight variation jointly.
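The two formulas above are simple enough to check by hand. The base R sketch below uses made-up weights, and kish_n_eff() is a hypothetical helper rather than a surveycore function. It shows that constant weights give n_eff = n (so deff_kish = 1) and that unequal weights reduce n_eff.

```r
# Hypothetical helper: Kish effective sample size from weights alone.
kish_n_eff <- function(w) sum(w)^2 / sum(w^2)

# Made-up weights: constant versus two-valued.
w_equal <- rep(1.5, 100)
w_varied <- c(rep(1, 50), rep(3, 50))

n_eff_equal <- kish_n_eff(w_equal)               # equals n when weights are constant
n_eff_varied <- kish_n_eff(w_varied)             # smaller than n when weights vary
deff_kish <- length(w_varied) / n_eff_varied     # weight-based design effect, n / n_eff

# DEFF-based effective n for an assumed full design effect of 2.
n_eff_deff <- length(w_varied) / 2

c(n_eff_equal = n_eff_equal, n_eff_varied = n_eff_varied,
  deff_kish = deff_kish, n_eff_deff = n_eff_deff)
```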
A survey_effective_n tibble (also inheriting survey_result).
Columns, in order:
[.id] — survey identifier column (when design is a collection).
[group_cols...] — group variable columns (when grouping is active).
n — integer. Unweighted count of observations.
n_eff — numeric. Effective sample size.
deff_kish — numeric. Weight-based design effect (n / n_eff).
Present when method = "kish" only.
deff — numeric. Full design effect (Var_design / Var_SRS).
Present when method = "deff" only.
Use meta(result)$method to retrieve the formula used. For DEFF,
meta(result)$x is a named list with variable metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)

# Kish effective N (weight-only approximation)
get_effective_n(d)

# Full DEFF effective N for a specific variable
get_effective_n(d, ridageyr, method = "deff")
Compute weighted proportions (percentages) for one or more categorical variables in a survey design, with optional grouping, uncertainty quantification, and metadata-driven labelling.
get_freqs(
  design,
  x,
  ...,
  group = NULL,
  names_to = "name",
  values_to = "value",
  variance = NULL,
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
... |
Additional arguments forwarded to |
group |
< |
names_to |
Character(1). Column name for the variable identifier in
multi-variable mode. Default |
values_to |
Character(1). Column name for the response value in
multi-variable mode. Default |
variance |
|
conf_level |
Numeric scalar in (0, 1). Confidence level for intervals.
Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
na.rm |
Logical. If |
label_values |
Logical. If |
label_vars |
Logical. If |
name_style |
|
.id |
Character(1) or |
.if_missing_var |
|
Single-variable mode (when x resolves to exactly one variable):
The focal variable name becomes the first column. Rows follow the factor
level order (if the variable is a factor) or ascending sort order otherwise.
Multi-variable mode (when x resolves to two or more variables):
Results are stacked in long format. The names_to column contains the
variable label (when label_vars = TRUE) or the raw variable name as
fallback. The values_to column contains the response values.
Domain estimation: Proportions use the ratio linearization approach,
equivalent to survey::svymean() on a binary indicator within the active
domain. The full design structure is used for variance estimation — rows are
not physically removed for domain/group subsets.
na.rm = FALSE: NA is appended as the last level. All proportions
(including non-NA levels) have their denominator inflated to include
NA rows, so the pct column sums to 1.
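The effect of na.rm on the denominator can be seen with a few made-up weights in base R. This sketch covers point estimates only and ignores the design structure; the vectors x and w are illustrative assumptions, not package data.

```r
# Made-up categorical responses (one missing) and weights.
x <- factor(c("a", "b", "a", NA, "b", "a"))
w <- c(1, 2, 1, 2, 1, 1)

# Analogue of na.rm = TRUE: NA rows are excluded from the denominator.
keep <- !is.na(x)
pct_drop <- tapply(w[keep], x[keep], sum) / sum(w[keep])

# Analogue of na.rm = FALSE: NA becomes the last level and its weight
# is included in the denominator for every level.
x_na <- addNA(x)
pct_keep <- tapply(w, x_na, sum) / sum(w)

# A proportion is the weighted mean of a binary indicator.
pct_a_indicator <- stats::weighted.mean(x[keep] == "a", w[keep])

pct_drop
pct_keep
```

Both versions sum to 1, but the non-NA proportions are smaller in the second because the NA rows contribute to the denominator.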
A survey_freqs tibble (also inheriting survey_result). Columns:
[group_cols...] — group variable columns (when active), first.
[variable_name] (single) or [names_to] + [values_to] (multi).
pct — weighted proportion (0–1).
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance.
n — unweighted cell count (sample basis of each estimate).
n_weighted — estimated population count (only when requested).
Use meta(result) to access design type, variable labels, value labels,
and other metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
# NHANES exam weights are 0 for non-examined participants; filter first
nhanes_sub <- nhanes_2017[nhanes_2017$wtmec2yr > 0, ]
d <- as_survey(
  nhanes_sub,
  ids = sdmvpsu,
  weights = wtmec2yr,
  strata = sdmvstra,
  nest = TRUE
)

# Single variable
get_freqs(d, riagendr)

# With confidence intervals
get_freqs(d, riagendr, variance = "ci")

# Grouped
get_freqs(d, riagendr, group = sdmvstra)

# Multi-variable (stacked)
get_freqs(d, c(riagendr, ridreth3), names_to = "item", values_to = "value")
Compute the weighted mean of a single numeric variable in a survey design, with optional grouping, uncertainty quantification, and metadata-driven labelling.
get_means(
  design,
  x,
  group = NULL,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
group |
< |
variance |
|
conf_level |
Numeric scalar in (0, 1). Confidence level for intervals.
Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
na.rm |
Logical. If |
label_values |
Logical. Accepted for API consistency across |
label_vars |
Logical. Accepted for API uniformity; has no visible
effect since |
name_style |
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
A survey_means tibble (also inheriting survey_result). Columns:
[group_cols...] — group variable columns (when active), first.
mean — weighted mean estimate.
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance.
df — degrees of freedom used for CI calculation. Present only for
survey_taylor designs with an active @calibration object
(GREG-corrected SE). For all other designs the normal approximation
(Inf) is used and df is not included.
n — unweighted count of non-NA observations used in the estimate.
n_weighted — sum of weights (only when requested).
The variable name is stored in meta(result)$x, not as a column.
Use meta(result) to access design type, variable labels, and other
metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
get_means(d, ridageyr)

# With grouped estimate
get_means(d, ridageyr, group = riagendr)

# AAPOR-compliant
get_means(d, ridageyr, variance = c("ci", "moe"), n_weighted = TRUE)
Runs all k(k-1)/2 pairwise two-sample t-tests for a grouping variable
with k levels and applies multiple-comparison p-value adjustment.
Delegates pair-level computations to get_t_test().
get_pairwise(
  design,
  x,
  by,
  group = NULL,
  pval_adj = "holm",
  conf_level = 0.95,
  variance = "ci",
  na.rm = TRUE,
  min_cell_n = 30L,
  decimals = NULL,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
by |
< |
group |
< |
pval_adj |
Character(1). P-value adjustment method passed to
|
conf_level |
Numeric(1). Confidence level strictly in (0, 1).
Default |
variance |
Character. Which uncertainty columns to include.
Valid values: |
na.rm |
Logical(1). Accepted for API uniformity. Default |
min_cell_n |
Integer(1). Warn for small cells. Default |
decimals |
Integer(1) or |
label_values |
Logical(1). Convert |
label_vars |
Logical(1). Accepted for API uniformity; no visible
effect. Default |
name_style |
Character(1). |
... |
Additional arguments forwarded to |
.id |
Character(1) or |
.if_missing_var |
|
A survey_pairwise tibble (also inheriting survey_result).
Columns: group columns (when active), level_a, level_b,
estimate, mean_a, mean_b, n_a, n_b, se (optional),
ci_low (optional), ci_high (optional), t_stat, df,
p_value (adjusted), stars. Use meta() to access the
adjustment method and other metadata.
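The pair enumeration and the default Holm adjustment can be sketched in base R as follows. The level names and unadjusted p-values are made up, and the data frame only mimics the level_a, level_b, and p_value columns described above; it is not produced by get_pairwise().

```r
# Made-up grouping levels (k = 4).
lvls <- c("A", "B", "C", "D")
k <- length(lvls)

# All unordered pairs of levels: k * (k - 1) / 2 rows.
pairs <- t(utils::combn(lvls, 2))
n_pairs <- nrow(pairs)

# Made-up unadjusted p-values, one per pair.
p_raw <- c(0.001, 0.020, 0.030, 0.040, 0.300, 0.800)

# Holm adjustment across all pairs.
out <- data.frame(
  level_a = pairs[, 1],
  level_b = pairs[, 2],
  p_raw = p_raw,
  p_value = stats::p.adjust(p_raw, method = "holm")
)
out
```

With four levels there are six comparisons, and every Holm-adjusted p-value is at least as large as its unadjusted counterpart.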
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
gss_sub <- gss_2024[gss_2024$sex %in% c(1L, 2L) & !is.na(gss_2024$age), ]
gss_sub$sex <- factor(
  gss_sub$sex,
  levels = c(1, 2),
  labels = c("Male", "Female")
)
gss_design <- as_survey(
  gss_sub,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
get_pairwise(gss_design, age, by = sex)
Compute survey-weighted quantiles (including the median) for a single numeric variable using the Woodruff (1952) confidence interval method. Supports optional grouping, domain estimation, and all five survey design classes.
get_quantiles(
  design,
  x,
  probs = c(0.25, 0.5, 0.75),
  group = NULL,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
probs |
Numeric vector of probabilities in (0, 1). Default
|
group |
< |
variance |
|
conf_level |
Numeric scalar in (0, 1). Confidence level for Woodruff
intervals. Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
na.rm |
Logical. If |
label_values |
Logical. Accepted for API consistency across |
label_vars |
Logical. Accepted for API uniformity; has no visible
effect on |
name_style |
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
A survey_quantiles tibble (also inheriting survey_result).
[group_cols...] — group variable columns (when active), first.
quantile — probability label: "p25", "p50", etc.
estimate — weighted quantile estimate.
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance. CIs are Woodruff
intervals and are generally asymmetric around estimate. deff is
always NA for quantile estimates: computing it requires a kernel
density estimate at the quantile point (the Woodruff SRS approximation
used by survey::svyquantile(deff = TRUE)), which is not implemented.
n — unweighted count of observations in the active domain used
in the estimate. When na.rm = TRUE, counts only non-NA observations;
when na.rm = FALSE, counts all active-domain rows (including NAs,
though the estimate will be NA_real_).
n_weighted — sum of weights (only when requested).
One row per (group combination × quantile probability). The variable name
and probs vector are stored in meta(result).
Woodruff, R. S. (1952). Confidence intervals for medians and other position measures. Journal of the American Statistical Association, 47(260), 635–646.
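As a rough illustration of the point estimate only, the sketch below inverts a weighted empirical CDF in base R. The helper name weighted_quantile() and the rule "smallest value whose cumulative weight share reaches the probability" are assumptions made for this example. get_quantiles() may use a different interpolation rule, and the Woodruff interval itself is not reproduced here.

```r
# Hypothetical helper (not part of surveycore): weighted quantile obtained by
# inverting the weighted empirical CDF, with no interpolation.
weighted_quantile <- function(x, w, probs = c(0.25, 0.5, 0.75)) {
  keep <- !is.na(x) & !is.na(w)      # drop rows with a missing value or weight
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)    # weighted ECDF at each sorted value
  out <- vapply(
    probs,
    function(p) x[which(cum_share >= p)[1]],
    numeric(1)
  )
  names(out) <- paste0("p", round(probs * 100))
  out
}

x <- c(10, 20, 30, 40, NA)
w <- c(1, 1, 1, 5, 2)
q <- weighted_quantile(x, w)
q
```

Because the heaviest observation (40) carries more than half of the total weight, both the median and the upper quartile land on it, which is the kind of asymmetry that motivates interval methods based on the CDF rather than on a symmetric standard error.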
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
# IQR + median (default)
get_quantiles(d, ridageyr)
# Median only with SE
get_quantiles(d, ridageyr, probs = 0.5, variance = c("ci", "se"))
# Grouped quartiles
get_quantiles(d, ridageyr, group = riagendr)
Estimate the ratio of two survey-weighted totals (numerator / denominator)
for a survey design object. Uses the delta method (linearization) for
variance estimation for Taylor, SRS, calibrated, and two-phase designs, and
direct per-replicate computation for replicate-weight designs. Both
approaches are equivalent to survey::svyratio() for their respective
design types.
Supports optional grouping, domain estimation, and all five survey design
classes.
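The delta-method idea can be sketched in base R on toy data. The design assumed here (a single stratum, every row its own PSU, sampling with replacement) and all variable names are assumptions for illustration only. The sketch is not the package's internal code and will not match get_ratios() for clustered or stratified designs.

```r
# Toy data: y is the numerator variable, x the denominator, w the weights.
y <- c(2, 4, 6, 8, 10)
x <- c(1, 2, 2, 4, 6)
w <- c(10, 20, 10, 30, 30)
n <- length(y)

# Point estimate: ratio of two weighted totals.
ratio <- sum(w * y) / sum(w * x)

# Linearized (influence) values for the ratio.
z <- (y - ratio * x) / sum(w * x)

# Assumed design for this sketch only: one stratum, each row its own PSU,
# sampled with replacement. The variance of the weighted total of z is then
# the usual between-PSU sum of squares with an n / (n - 1) factor.
wz <- w * z
var_ratio <- n / (n - 1) * sum((wz - mean(wz))^2)
se_ratio <- sqrt(var_ratio)

c(ratio = ratio, se = se_ratio)
```

The weighted linearized values sum to zero by construction, which is a quick sanity check that the influence values were formed around the estimated ratio.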
get_ratios(
  design,
  numerator,
  denominator,
  group = NULL,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
numerator |
< |
denominator |
< |
group |
< |
variance |
|
conf_level |
Numeric scalar in (0, 1). Confidence level for
confidence intervals. Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
na.rm |
Logical. If |
label_values |
Logical. Accepted for API consistency across |
label_vars |
Logical. Accepted for API uniformity; has no visible
effect on |
name_style |
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
A survey_ratios tibble (also inheriting survey_result).
[group_cols...] — group variable columns (when active), first.
ratio — estimated ratio (weighted total of numerator / weighted
total of denominator).
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance.
n — unweighted count of rows where both numerator and denominator
are non-NA.
n_weighted — sum of weights (only when requested).
Numerator and denominator variable names are stored in meta(result), not
as output columns. meta(result)$numerator and meta(result)$denominator are
named lists; the variable name is in the name element of each, for example
meta(result)$numerator$name.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(pew_npors_2025, weights = weight, strata = stratum)
# Ratio of prayer frequency to in-person attendance frequency
get_ratios(d, numerator = pray, denominator = attendper)
# With grouped estimates
get_ratios(d, pray, attendper, group = gender)
# AAPOR-compliant output
get_ratios(d, pray, attendper, variance = c("ci", "moe"), n_weighted = TRUE)
Compares the weighted means of two groups using a design-based t-test.
Follows the mathematical model of survey::svyttest() but uses
surveycore's own variance machinery (survey_glm()). Supports all four
survey design classes and optional subgroup analysis via group.
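The regression formulation behind this kind of test can be illustrated in base R: the difference in weighted group means equals the slope from a weighted regression of x on a two-level group indicator. The sketch uses stats::lm() as a stand-in and toy data invented for the example. It reproduces the point estimate only; the standard errors from lm() are not design-based and are not what get_t_test() reports, and the sign convention of the package's estimate column is not assumed here.

```r
# Toy data (illustrative values only).
x  <- c(23, 35, 41, 52, 60, 29, 47, 38)
by <- factor(
  c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  levels = c("Male", "Female")
)
w  <- c(1.2, 0.8, 1.5, 1.0, 0.7, 1.3, 0.9, 1.1)

# Weighted mean of x within each group.
mean_a <- weighted.mean(x[by == "Male"], w[by == "Male"])
mean_b <- weighted.mean(x[by == "Female"], w[by == "Female"])

# Weighted least squares on the group indicator: the slope is the
# difference in weighted means (second level minus first level).
fit <- lm(x ~ by, weights = w)
slope <- unname(coef(fit)[2])

c(diff_in_means = mean_b - mean_a, regression_slope = slope)
```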
get_t_test(
  design,
  x,
  by,
  group = NULL,
  conf_level = 0.95,
  variance = "ci",
  na.rm = TRUE,
  min_cell_n = 30L,
  decimals = NULL,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
by |
< |
group |
< |
conf_level |
Numeric(1). Confidence level strictly in (0, 1).
Default |
variance |
Character. Which uncertainty columns to include.
Valid values: |
na.rm |
Logical(1). Accepted for API uniformity with other
|
min_cell_n |
Integer(1). Warn when either group has fewer than
this many unweighted observations. Default |
decimals |
Integer(1) or |
label_values |
Logical(1). When |
label_vars |
Logical(1). Accepted for API uniformity; has no
visible effect because column names are fixed. Default |
name_style |
Character(1). Output column naming style.
|
... |
Additional arguments forwarded to |
.id |
Character(1) or |
.if_missing_var |
|
A survey_t_test tibble (also inheriting survey_result).
Columns: group columns (when active), level_a, level_b,
estimate, mean_a, mean_b, n_a, n_b, se (optional),
ci_low (optional), ci_high (optional), t_stat, df,
p_value, stars. Use meta() to access design type, conf_level,
and variable metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_totals(),
get_variance(),
meta()
gss_sub <- gss_2024[gss_2024$sex %in% c(1L, 2L) & !is.na(gss_2024$age), ]
gss_sub$sex <- factor(
  gss_sub$sex,
  levels = c(1, 2),
  labels = c("Male", "Female")
)
gss_design <- as_survey(
  gss_sub,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
get_t_test(gss_design, age, by = sex)
Compute the estimated population total of a numeric variable in a survey design, or the estimated population size when no variable is supplied. Supports optional grouping, uncertainty quantification, and metadata-driven labelling.
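The two point estimates described above reduce to simple weighted sums. The base R sketch below uses made-up weights and values; dropping rows with a missing x before summing is an assumption about how na.rm = TRUE behaves, and no variance estimation is attempted.

```r
w <- c(150, 200, 125, 300)   # sampling weights (illustrative)
x <- c(34, 51, NA, 27)       # numeric variable with one missing value

# No-variable mode: the estimated population size is the sum of the weights.
pop_size <- sum(w)

# Variable mode: weighted total over rows where x is observed.
keep <- !is.na(x)
total_x <- sum(w[keep] * x[keep])

c(pop_size = pop_size, total_x = total_x)
```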
get_totals(
  design,
  x = NULL,
  group = NULL,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
group |
< |
variance |
|
conf_level |
Numeric scalar in (0, 1). Default |
n_weighted |
Logical. For |
decimals |
Integer or |
min_cell_n |
Integer. Default |
na.rm |
Logical. If |
label_values |
Logical. Accepted for API consistency across |
label_vars |
Logical. Accepted for API uniformity. Default |
name_style |
|
... |
Additional arguments forwarded to |
.id |
Character(1) or |
.if_missing_var |
|
A survey_totals tibble (also inheriting survey_result). Columns:
[group_cols...] — group variable columns (when active), first.
total — the weighted sum estimate.
Variance columns — only those requested via variance.
n — unweighted count (omitted in no-variable mode).
n_weighted — sum of weights (only when requested).
The variable name (or NULL for no-variable mode) is in
meta(result)$x. Use meta(result) for additional metadata.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_variance(),
meta()
d <- as_survey_replicate(
  acs_pums_wy,
  weights = pwgtp,
  repweights = pwgtp1:pwgtp80,
  type = "successive-difference"
)
# Population size
get_totals(d)
# Total for a variable
get_totals(d, agep)
# Grouped
get_totals(d, agep, group = sex)
Compute the design-based estimate of the finite-population variance for one
or more numeric variables in a survey design, with optional grouping,
uncertainty quantification, and metadata-driven labelling. Matches
survey::svyvar() numerically (Kish n/(n-1) correction) on Taylor,
replicate, twophase, and nonprob designs.
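A minimal base R sketch of a weighted variance point estimate with an n/(n-1) factor is shown below. The helper weighted_pop_variance() is hypothetical, and the exact place where the correction is applied is an assumption; the sketch is meant to convey the shape of the estimator, not to reproduce the package's code or its standard errors.

```r
# Hypothetical helper (not part of surveycore): weighted mean of squared
# deviations from the weighted mean, scaled by n / (n - 1).
weighted_pop_variance <- function(x, w) {
  keep <- !is.na(x) & !is.na(w)
  x <- x[keep]
  w <- w[keep]
  n <- length(x)
  xbar <- sum(w * x) / sum(w)
  (n / (n - 1)) * sum(w * (x - xbar)^2) / sum(w)
}

# With equal weights the estimate reduces to the ordinary sample variance.
v_equal <- weighted_pop_variance(c(1, 2, 3, 4), rep(1, 4))

# A variable that is constant gives exactly zero.
v_const <- weighted_pop_variance(rep(7, 5), c(1, 2, 3, 4, 5))

c(v_equal = v_equal, v_const = v_const)
```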
get_variance(
  design,
  x,
  group = NULL,
  variance = "ci",
  conf_level = 0.95,
  n_weighted = FALSE,
  decimals = NULL,
  min_cell_n = 30L,
  na.rm = TRUE,
  na_handling = c("pairwise", "listwise"),
  label_values = TRUE,
  label_vars = TRUE,
  name_style = "surveycore",
  ...,
  .id = NULL,
  .if_missing_var = NULL
)
design |
A survey design object: |
x |
< |
group |
< |
variance |
|
conf_level |
Numeric scalar in (0, 1). Confidence level for intervals.
Default |
n_weighted |
Logical. If |
decimals |
Integer or |
min_cell_n |
Integer. Minimum unweighted cell count before
|
na.rm |
Logical. If |
na_handling |
|
label_values |
Logical. Accepted for API consistency across |
label_vars |
Logical. If |
name_style |
|
... |
Unused. Reserved so that |
.id |
Character(1) or |
.if_missing_var |
|
Confidence intervals use the normal-Wald approximation on the SE of the
variance estimate: ci_low = variance - z * se,
ci_high = variance + z * se, where z = qnorm((1 + conf_level) / 2).
The bounds are not clamped. When the true variance is near zero with
wide SE, ci_low may be negative. Users who want non-negative lower
bounds can clamp at 0 post-hoc. This behaviour matches
survey::svyvar().
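The interval formula above, and the optional post-hoc clamp, can be written out in a few lines of base R. The point estimate and standard error are invented values chosen so that the unclamped lower bound goes negative.

```r
variance_est <- 0.8   # illustrative variance point estimate
se <- 0.6             # illustrative SE of the variance estimate
conf_level <- 0.95

z <- qnorm((1 + conf_level) / 2)
ci_low <- variance_est - z * se
ci_high <- variance_est + z * se

# Optional: clamp the lower bound at zero after the fact.
ci_low_clamped <- max(ci_low, 0)

c(ci_low = ci_low, ci_high = ci_high, ci_low_clamped = ci_low_clamped)
```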
Under na_handling = "pairwise" (the default), each focal variable
contributes its own per-variable complete-case count to n. Under
na_handling = "listwise", every output row shares the intersection
complete-case count — rows with NA in any selected variable are
excluded from every variable's calculation.
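The difference between the two counting rules is easy to see on a small data frame. This base R sketch only counts rows; the column names are made up and no survey weights are involved.

```r
df <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(NA, 2, 3, 4, NA)
)

# Pairwise: each variable keeps its own complete-case count.
n_pairwise <- colSums(!is.na(df))

# Listwise: one shared count, using only rows complete on every variable.
n_listwise <- sum(complete.cases(df))

n_pairwise
n_listwise
```

Here a keeps 4 rows and b keeps 3 under pairwise handling, while listwise handling leaves only the 2 rows observed on both.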
A survey_variance tibble (also inheriting survey_result).
Columns, in order:
[.id] — survey identifier column, only when design is a
survey_collection.
[group_cols...] — group variable columns (when active), first.
name — focal variable name (or its label when label_vars = TRUE).
variance — design-based point estimate of the finite-population
variance. Note: the column is always named variance regardless of the
variance parameter (which controls uncertainty columns, not this
column). NaN for degenerate cells; exact 0 for constant-in-domain
variables.
Uncertainty columns (se, var, cv, ci_low, ci_high,
moe, deff) — only those requested via the variance parameter.
The var uncertainty column is the variance of the estimated variance,
distinct from the variance point estimate column.
n — unweighted count of non-NA observations used.
n_weighted — sum of weights (only when n_weighted = TRUE).
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
meta()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
get_variance(d, ridageyr)
# Multiple variables
get_variance(d, c(ridageyr, bpxsy1))
# With grouping
get_variance(d, ridageyr, group = riagendr)
A 27-variable extract from the 2024 General Social Survey (GSS), one of the longest-running sociological surveys in the United States (fielded annually or biennially since 1972). All 3,309 respondents from the 2024 cross-section are included.
gss_2024
A data frame with 3,309 rows and 27 variables:
Variance primary sampling unit. Use as the cluster ID for variance estimation.
Variance stratum. Use as the stratification variable.
Person post-stratification weight. Standard analysis weight.
Person post-stratification weight adjusted for differential non-response. Preferred when non-response bias is a concern.
Respondent ID. Unique case identifier.
Survey year (all 2024 in this extract).
Ballot form (A, B, C, or D). The GSS uses a
split-ballot design; not all questions appear on every ballot.
Inapplicable items are coded -100.
Age in years (89 = 89 or older).
Sex: 1 = male, 2 = female.
Race: 1 = white, 2 = black, 3 = other.
Hispanic origin: 1 = not Hispanic; 2–50 = specific
Hispanic origin.
Highest year of school completed (0–20 years).
Highest degree: 0 = less than HS, 1 = high school,
2 = associate, 3 = bachelor's, 4 = graduate.
Total family income (26 categories from < $1,000 to $170,000+).
Marital status: 1 = married, 2 = widowed,
3 = divorced, 4 = separated, 5 = never married.
Labor force status: 1 = full time, 2 = part time,
3 = temporarily not working, 4 = unemployed, 5 = retired,
6 = in school, 7 = keeping house, 8 = other.
Hours worked last week (for employed respondents only).
Number of adults in household (8 = 8 or more).
Party identification: 0 = strong Democrat,
3 = Independent, 6 = strong Republican, 7 = other party.
Political views: 1 = extremely liberal,
7 = extremely conservative.
General happiness: 1 = very happy, 2 = pretty happy,
3 = not too happy.
Self-rated health: 1 = excellent, 2 = good,
3 = fair, 4 = poor.
Social trust: 1 = most people can be trusted,
2 = can't be too careful, 3 = depends.
Government spending on welfare: 1 = too little,
2 = about right, 3 = too much.
Abortion for any reason: 1 = yes, 2 = no.
Religious service attendance: 0 = never,
8 = several times a week.
Religious preference: 1 = Protestant, 2 = Catholic,
3 = Jewish, 4 = none, and others.
Survey design: Stratified multi-stage cluster — use Taylor series linearization:
svy <- as_survey(
  gss_2024,
  ids = vpsu,
  strata = vstrat,
  weights = wtssps,  # or wtssnrps for non-response-adjusted weight
  nest = TRUE
)
Missing value codes: The GSS uses a consistent system of negative integer codes for missing data across all variables:
| Code | Meaning |
|---|---|
| -100 | Inapplicable (question not asked of this respondent) |
| -99 | No answer |
| -98 | Don't know |
| -97 | Skipped on web |
| -90 | Refused |
These codes are stored as value labels on every column (check
attr(gss_2024$happy, "labels")). Recode them to NA before analysis.
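One way to do the recode in base R is sketched below on a toy vector that stands in for a GSS column; the values are invented. The package also lists a set_missing_codes() helper, but its signature is not shown in this section, so the sketch does not use it.

```r
# The five GSS missing-value codes from the table above.
gss_missing_codes <- c(-100, -99, -98, -97, -90)

# Toy stand-in for a column such as gss_2024$happy.
happy <- c(1, 2, -100, 3, -98, 2)

# Replace every missing-value code with NA; substantive codes are untouched.
happy[happy %in% gss_missing_codes] <- NA
happy
```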
Split-ballot design: The ballot variable indicates which question
module a respondent received. Variables asked only on some ballots will
have -100 (Inapplicable) for respondents on other ballots.
Metadata:
All columns carry variable labels and value labels as R attributes from the
original SPSS file, automatically extracted into surveycore's metadata
system when you call as_survey().
Variable labels ("label" attribute): A human-readable description of
each column. Example: attr(gss_2024$happy, "label") returns
"GENERAL HAPPINESS".
Value labels ("labels" attribute): A named numeric vector mapping
each code to its meaning, including all missing-value codes. Example:
attr(gss_2024$happy, "labels") returns entries for Very happy,
Pretty happy, Not too happy, and the negative missing codes.
NORC at the University of Chicago. General Social Survey 2024.
https://gss.norc.org (free account required to download raw data;
the processed .rda is included in the package).
Prepared by data-raw/prepare-gss-2024.R.
# Variables in the dataset
names(gss_2024)
# Create survey design
svy <- as_survey(
  gss_2024,
  ids = vpsu,
  strata = vstrat,
  weights = wtssps,
  nest = TRUE
)
# Inspect variable label
attr(gss_2024$happy, "label")
# Inspect value labels (includes GSS missing-value codes)
attr(gss_2024$happy, "labels")
# Split-ballot: how many respondents per ballot form?
table(gss_2024$ballot)
Scans variable labels in a survey design object or labelled data frame for
groups of variables sharing a common preface (via separator or longest
common prefix). For survey design objects, detected prefaces are written to
@metadata@question_prefaces. For data frames, prefaces are written to
attr(col, "question_preface") on each column (no metadata object exists
until as_survey() is called). The shared text is trimmed from each
variable label, leaving only the unique suffix.
infer_question_prefaces(
  x,
  sep = c(" - ", "- ", " – ", ": ", " | "),
  min_vars = 2L,
  lcp_min = 20L,
  overwrite = FALSE,
  verbose = TRUE
)
x |
A survey design object ( |
sep |
Character vector of literal separator strings to try, in
priority order. Default: |
min_vars |
Minimum number of variables that must share a candidate
preface to trigger extraction. Default |
lcp_min |
Minimum character length (after trimming to a word boundary)
for an LCP-derived preface to be accepted. Default |
overwrite |
If |
verbose |
If |
Detection algorithm (two passes):
Separator pass — for each separator in sep (tried in order):
Variables whose label contains the separator are grouped by their candidate preface (text before the first occurrence of the separator, trimmed).
Any group with ≥ min_vars members is recorded; those
variables are excluded from all subsequent passes.
LCP pass — for remaining labelled variables (≥ 2):
The character-level longest common prefix (LCP) of all remaining labels is computed and trimmed to the last word boundary.
If the trimmed LCP is ≥ lcp_min characters, the group
is recorded.
Apply step:
Variables with an existing question_preface are skipped when
overwrite = FALSE (default); a warning is emitted listing the count
of skipped variables.
Variables whose unique suffix would be empty after trimming are always skipped with a per-variable warning.
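The two detection passes can be sketched in base R as follows. This is an illustration of the described logic under simplifying assumptions (a single separator, no overwrite or warning handling); the labels, variable names, and the longest_common_prefix() helper are hypothetical and are not the package's internals.

```r
labels <- c(
  q1a = "Please rate discrimination - Evangelical Christians",
  q1b = "Please rate discrimination - Muslims",
  q2  = "Age of respondent"
)
sep <- " - "
min_vars <- 2L

# Separator pass: the text before the first occurrence of sep is the
# candidate preface; the text after it is the unique suffix.
pos <- as.integer(regexpr(sep, labels, fixed = TRUE))
has_sep <- pos > 0
preface <- ifelse(has_sep, trimws(substr(labels, 1, pos - 1)), NA_character_)
suffix <- ifelse(
  has_sep,
  trimws(substr(labels, pos + nchar(sep), nchar(labels))),
  labels
)

# Keep only candidates shared by at least min_vars variables.
counts <- table(preface[has_sep])
accepted <- names(counts)[counts >= min_vars]
is_grouped <- preface %in% accepted
preface[!is_grouped] <- NA_character_
suffix[!is_grouped] <- labels[!is_grouped]

# LCP pass (sketch): character-level longest common prefix of several labels,
# trimmed back to the last word boundary.
longest_common_prefix <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)
  len <- min(lengths(chars))
  i <- 0L
  while (i < len &&
         length(unique(vapply(chars, `[`, character(1), i + 1L))) == 1L) {
    i <- i + 1L
  }
  lcp <- substr(x[1], 1, i)
  sub("\\s+\\S*$", "", lcp)   # drop trailing whitespace and any partial word
}

lcp <- longest_common_prefix(c(
  "How much do you trust the police",
  "How much do you trust the press"
))

preface
suffix
lcp
```

In the LCP example the raw common prefix ends in a partial word ("the p"), so trimming to the word boundary leaves "How much do you trust the", which is 25 characters and would pass the default lcp_min of 20.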
Data frame integration:
When called on a data frame, the detected preface is written to
attr(col, "question_preface"). Passing the result to as_survey()
automatically picks up both the trimmed label and the preface via the
internal haven metadata extraction step.
The modified x, invisibly.
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
# Data frame with haven-style labels (Qualtrics / SPSS export pattern)
df <- data.frame(
  discrim_a = 1:5,
  discrim_b = 2:6,
  discrim_c = 3:7
)
attr(df$discrim_a, "label") <- "Please rate discrimination - Evangelical Christians"
attr(df$discrim_b, "label") <- "Please rate discrimination - Muslims"
attr(df$discrim_c, "label") <- "Please rate discrimination - Jews"
df <- infer_question_prefaces(df, verbose = FALSE)
attr(df$discrim_a, "label")            # "Evangelical Christians"
attr(df$discrim_a, "question_preface") # "Please rate discrimination"
Retrieves the structured metadata list attached to a survey result object
returned by any get_*() analysis function.
meta(x, ...)
## S3 method for class 'survey_result'
meta(x, ...)
x |
A |
... |
Currently unused. Reserved for future extensions. |
This is the only supported way to access result metadata — do not use
attr(result, ".meta") directly.
A named list. Common fields present on every result:
design_type: Character(1). Design class: "taylor",
"replicate", "twophase", or "nonprob". SRS designs are
represented as survey_taylor (no IDs/strata) and report
"taylor".
conf_level: Numeric(1). Confidence level used (e.g. 0.95).
call: Language. Matched call to the get_*() function.
n_respondents: Integer(1). Total rows in the design, regardless of groups, domain status, or weights.
group: Named list. One entry per grouping variable; empty list
(list()) when no groups are active. Each entry is a named list with:
variable_label (character or NULL), question_preface (character
or NULL), value_labels (named vector or NULL).
x: Named list. One entry per focal variable. Length 1 for
single-x functions (get_means, get_totals, get_quantiles);
length N for multi-x functions (get_freqs, get_corr). Each entry
has the same sub-structure as group entries. NULL for
get_totals() when called without an x argument.
Function-specific additional fields:
probs: (get_quantiles only) Numeric vector of quantile
probabilities.
method: (get_corr only) Character(1) correlation method.
numerator, denominator: (get_ratios only) Flat named lists
with keys name, variable_label, question_preface, value_labels.
Other analysis:
clean(),
get_anova(),
get_corr(),
get_covariance(),
get_diffs(),
get_effective_n(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance()
# Construct a minimal survey_result to illustrate meta():
result <- structure(
  tibble::tibble(mean = 42.0, se = 1.5, n = 100L),
  .meta = list(
    design_type = "taylor",
    conf_level = 0.95,
    call = quote(get_means(d, x)),
    n_respondents = 100L,
    group = list(),
    x = list(
      x = list(
        variable_label = NULL,
        question_preface = NULL,
        value_labels = NULL
      )
    )
  ),
  class = c("survey_means", "survey_result", "tbl_df", "tbl", "data.frame")
)
meta(result)$design_type    # "taylor"
meta(result)$n_respondents  # 100L
meta(result)$conf_level     # 0.95
A merged dataset from the National Health and Nutrition Examination Survey
(NHANES) 2017-2018 cycle, combining demographic characteristics with blood
pressure measurements. Covers all 9,254 sampled participants; blood pressure
variables are NA for the 550 interview-only participants (ridstatr == 1).
nhanes_2017
A data frame with 9,254 rows and 14 variables:
Respondent sequence number (unique identifier, join key).
Masked variance pseudo-PSU. Use as the cluster ID for variance estimation. See Details.
Masked variance pseudo-stratum. Use as the stratification variable for variance estimation. See Details.
Full-sample 2-year MEC examination weight. Use for any analysis involving examination measurements (e.g., blood pressure).
Full-sample 2-year interview weight. Use for analyses based on interview data only.
Interview/examination status: 1 = interview only,
2 = both interview and MEC examination.
Gender: 1 = male, 2 = female.
Age in years at screening, top-coded at 80.
Race/Hispanic origin (6 categories): 1 = Mexican
American, 2 = Other Hispanic, 3 = Non-Hispanic White,
4 = Non-Hispanic Black, 6 = Non-Hispanic Asian,
7 = Other/Multiracial.
Ratio of family income to the federal poverty level (continuous, 0–5; values >5 are top-coded at 5).
Education level for adults 20+: 1 = Less than 9th grade,
2 = 9th–11th grade, 3 = High school graduate/GED, 4 = Some
college/AA, 5 = College graduate or above.
Systolic blood pressure, 1st reading (mm Hg). NA if not
examined.
Diastolic blood pressure, 1st reading (mm Hg). NA if not
examined.
60-second pulse rate (beats per minute). NA if not
examined.
Survey design: Taylor series linearization. When creating a survey
design object, use sdmvpsu as the cluster ID, sdmvstra as the stratum,
and wtmec2yr as the weight for examination-based analyses:
svy <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  strata = sdmvstra,
  weights = wtmec2yr,
  nest = TRUE  # PSU codes repeat across strata, as in the other nhanes_2017 examples
)
Use wtint2yr instead of wtmec2yr for interview-only variables
(e.g., income, education).
Metadata:
All columns carry variable labels and value labels as R attributes,
automatically extracted into surveycore's metadata system when you call
as_survey().
Variable labels ("label" attribute): A human-readable description of
each column. Example: attr(nhanes_2017$riagendr, "label") returns
"Gender".
Value labels ("labels" attribute): A named numeric vector mapping
each code to its meaning. Example: attr(nhanes_2017$riagendr, "labels")
returns c(Male = 1, Female = 2).
Source files: DEMO_J.xpt (demographics) merged with BPX_J.xpt (blood
pressure) on seqn. Prepared by data-raw/download-nhanes.R.
National Center for Health Statistics, CDC. NHANES 2017-2018 Continuous Survey. https://www.cdc.gov/nchs/nhanes/
# All 9,254 participants (interview + exam)
head(nhanes_2017)
# Restrict to exam participants for blood pressure analysis
exam_only <- nhanes_2017[nhanes_2017$ridstatr == 2, ]
# Inspect variable label
attr(nhanes_2017$riagendr, "label")
# Inspect value labels
attr(nhanes_2017$riagendr, "labels")
# Inspect value labels for race/ethnicity
attr(nhanes_2017$ridreth3, "labels")
The first weekly wave of the Democracy Fund + UCLA Nationscape survey, fielded July 18–24, 2019. Approximately 6,250 completed online interviews drawn from the Lucid respondent exchange platform using a non-probability quota design, with raking weights calibrated to ACS demographic targets and 2016 presidential vote choice.
ns_wave1
A data frame with approximately 6,250 rows and 171 variables
(170 survey variables plus wave_id added by the prepare script).
Unique respondent ID (integer).
Interview date (character, "YYYY-MM-DD" format).
Wave identifier: "ns20190718" for all rows in this
dataset.
Raking weight calibrated to ACS demographic targets and 2016 presidential vote choice. Use for all population-level estimates.
Country direction: 1 = Right direction,
2 = Wrong track, 3 = Not sure.
Economy outlook: 1 = Better, 2 = Worse,
3 = Same, 4 = Not sure.
Political interest (4-pt): 1 = Very interested,
4 = Not at all interested.
Voter registration: 1 = Registered,
2 = Not registered, 3 = Not eligible.
Trump presidential approval: 1 = Strongly approve,
2 = Somewhat approve, 3 = Somewhat disapprove,
4 = Strongly disapprove.
2020 vote intention: 1 = Trump,
2 = Democratic candidate, 3 = Other, 4 = Don't plan to vote,
5 = Not sure.
2016 presidential vote. See labels.
Write-in for vote_2016 "other" choice.
Would consider voting for Trump: 1 = Yes,
2 = No.
Reason for not considering Trump (open text).
Primary vote party: 1 = Democratic,
2 = Republican, 3 = Other.
Democratic primary vote intention. See labels.
Write-in for dem_vote_intent "other".
Top-ranked Democratic presidential candidate. See labels.
Second-ranked Democratic candidate. See labels.
Third-ranked Democratic candidate. See labels.
Wants non-Trump Republican nominee: 1 = Yes,
2 = No, 3 = Not sure.
U.S. House vote intention: 1 = Democrat,
2 = Republican, 3 = Other, 4 = Won't vote, 5 = Not sure.
U.S. Senate vote intention. Same codes as
house_intent.
Governor vote intention. Same codes as
house_intent.
Used social media for political news in past
week: 1 = Selected, 2 = Not selected. See "question_preface"
attribute for shared question stem. Same coding for all news_sources_*
variables.
Used CNN for political news.
Used MSNBC for political news.
Used Fox News for political news.
Used network news (ABC/CBS/NBC/PBS).
Used local TV news.
Used Telemundo or Univision.
Used NPR.
Used AM talk radio.
Used a national newspaper.
Used a local newspaper.
Used another news source: 1 = Selected,
2 = Not selected.
Write-in for news_sources_other.
Favorability toward Whites: 1 = Very
favorable, 2 = Somewhat favorable, 3 = Somewhat unfavorable,
4 = Very unfavorable, 5 = Not sure. Same coding for all
group_favorability_* variables.
Favorability toward Blacks.
Favorability toward Latinos.
Favorability toward Asians.
Favorability toward Christians.
Favorability toward Socialists.
Favorability toward Muslims.
Favorability toward labor unions.
Favorability toward the police.
Favorability toward undocumented immigrants.
Favorability toward gays and lesbians.
Favorability toward Republicans.
Favorability toward Democrats.
Favorability toward Donald Trump. Same
5-point scale as group_favorability_* variables.
Favorability toward Barack Obama.
Favorability toward Alexandria Ocasio-Cortez.
Favorability toward Joe Biden.
Favorability toward Kamala Harris.
Favorability toward Pete Buttigieg.
Favorability toward Elizabeth Warren.
Favorability toward Bernie Sanders.
Favorability toward Mike Pence.
Trump vs. Biden head-to-head: 1 = Trump, 2 = Biden,
3 = Not sure. Same coding for all trump_* matchup variables.
Trump vs. Sanders.
Trump vs. Harris.
Trump vs. Warren.
Trump vs. Buttigieg.
Trump vs. Cory Booker.
Trump vs. Julian Castro.
Trump vs. Tulsi Gabbard.
Trump vs. Kirsten Gillibrand.
Trump vs. Beto O'Rourke.
Pence vs. Biden head-to-head: 1 = Pence, 2 = Biden,
3 = Not sure. Same coding for all pence_* matchup variables.
Pence vs. Buttigieg.
Pence vs. Harris.
Pence vs. Sanders.
Pence vs. Warren.
Whether Donald Trump cares about telling
the truth: 1 = Yes, 2 = No, 3 = Not sure. Same coding for all
cand_truth_* variables.
Whether Elizabeth Warren cares about the truth.
Whether Joe Biden cares about the truth.
Whether Bernie Sanders cares about the truth.
Whether Pete Buttigieg cares about the truth.
Whether Kamala Harris cares about the truth.
Whether Donald Trump relies on facts vs.
hunches: 1 = Facts and evidence, 2 = Hunches, 3 = Not sure. Same
coding for all cand_facts_* variables.
Whether Elizabeth Warren relies on facts.
Whether Joe Biden relies on facts.
Whether Bernie Sanders relies on facts.
Whether Pete Buttigieg relies on facts.
Whether Kamala Harris relies on facts.
Agree/disagree: minorities should work
their way up without special favors. 1 = Strongly agree, 2 = Agree,
3 = Neither, 4 = Disagree, 5 = Strongly disagree. Same scale for
all racial_attitudes_* and gender_attitudes_* variables.
Agree/disagree: generations of slavery make it difficult for Blacks to work out of the lower class.
Agree/disagree: I prefer close relatives marry someone from the same race.
Agree/disagree: it's alright for Blacks and Whites to date.
Agree/disagree: more comfortable with a male boss than female boss.
Agree/disagree: women are just as capable of thinking logically as men.
Agree/disagree: increased opportunities for women have improved quality of life.
Agree/disagree: women who complain about harassment cause more problems than they solve.
Perceived discrimination against Blacks:
1 = A great deal, 2 = A lot, 3 = A little, 4 = None at all,
5 = Not sure. Same scale for all discrimination_* variables.
Perceived discrimination against Whites.
Perceived discrimination against Muslims.
Perceived discrimination against Christians.
Perceived discrimination against Women.
Perceived discrimination against Men.
U.S. Senate knowledge question. See labels.
U.S. Supreme Court knowledge question. See labels.
3-category party ID: 1 = Democrat, 2 = Republican,
3 = Independent, 4 = Something else.
7-point party ID (legacy coding). See labels.
Strength of Democratic ID (conditional on
pid3 == 1). See labels.
Strength of Republican ID (conditional on
pid3 == 2). See labels.
Partisan lean of Independents (conditional on
pid3 == 3). See labels.
5-point ideological self-placement: 1 = Very liberal,
5 = Very conservative.
Employment status (selected choice). See labels.
Write-in for employment "other".
Born outside the U.S.: 1 = Yes, 2 = No.
Primary language at home. See labels.
Religious affiliation (selected choice). See labels.
Write-in for religion "other".
Born-again or evangelical Christian: 1 = Yes,
2 = No.
Sexual orientation. See labels.
Labor union membership: 1 = Yes, 2 = No,
3 = Non-union household, 4 = Not sure.
Household gun ownership: 1 = Yes, 2 = No,
3 = Not sure.
Support building a wall on the southern U.S. border:
1 = Strongly support, 2 = Somewhat support, 3 = Somewhat oppose,
4 = Strongly oppose, 5 = Not sure. Same scale for all policy items
through limit_magazines. See "question_preface" attribute on each
variable for the exact shared question stem.
Support capping carbon emissions.
Support large-scale government investment in environmental technology.
Support requiring background checks for all gun purchases.
Support cutting taxes for families making < $100K/year.
Support eliminating the estate tax.
Support raising taxes on families making > $600K.
Support ensuring all students can graduate from state colleges debt-free.
Support requiring a waiting period and ultrasound before an abortion.
Support never permitting abortion.
Support permitting abortion in cases other than rape/incest/life at risk.
Support permitting late-term abortion.
Support allowing employers to decline abortion coverage.
Support guaranteeing jobs for all Americans.
Support enacting a Green New Deal.
Support creating a public registry of gun ownership.
Support separating children from parents prosecuted for illegal border crossing.
Support shifting to a merit-based immigration system.
Support requiring proof of citizenship to wire money internationally.
Support impeaching President Trump.
Support withdrawing military support for Israel.
Support legalizing marijuana.
Support requiring 12 weeks of paid maternity leave.
Support Medicare-for-All.
Support reducing the size of the U.S. military.
Support raising the minimum wage to $15/hour.
Support banning people from predominantly Muslim countries.
Support removing barriers to domestic oil and gas drilling.
Support granting reparations to descendants of slaves.
Support allowing people to work in unionized workplaces without paying union dues.
Support displaying the Ten Commandments in public schools and courthouses.
Support limiting trade with other countries.
Support allowing transgender people to serve in the military.
Support raising taxes on families making > $250K.
Support providing tax-funded vouchers for private or religious schools.
Support providing government-run health insurance to all Americans.
Support providing the option to purchase government-run insurance.
Support subsidizing health insurance for lower income people not on Medicaid.
Support creating a path to citizenship for all undocumented immigrants.
Support a path to citizenship for DREAMers.
Support deporting all undocumented immigrants.
Support banning all guns.
Support banning assault rifles.
Support limiting gun magazines to 10 bullets.
Respondent age in years.
Gender: 1 = Male, 2 = Female, 3 = Other.
Census region: 1 = Northeast, 2 = Midwest,
3 = South, 4 = West.
Hispanic or Latino origin: 1 = Yes, 2 = No.
Race/ethnicity (6 categories). See labels.
Household income (7 brackets). See labels.
Educational attainment (6 categories). See labels.
U.S. state of residence (2-letter abbreviation).
Congressional district.
This dataset is the first of 77 weekly waves collected from July 2019 through January 2021. The full survey ran in three phases:
| Phase | Weeks | Dates | Approx. N |
|---|---|---|---|
| Phase 1 | 1–24 | Jul 18, 2019 – Dec 26, 2019 | 150,000 |
| Phase 2 | 25–50 | Jan 2, 2020 – Jun 25, 2020 | 162,500 |
| Phase 3 | 51–77 | Jul 2, 2020 – Jan 12, 2021 | 168,750 |
Only Wave 1 is bundled in the package because 77 waves × ~6,250 rows
would be prohibitively large. To obtain the full dataset by phase, use the
prepare scripts in data-raw/ (see the Source section).
Survey design:
The Nationscape is a calibrated non-probability sample (quota design with
raking weights). Use as_survey_nonprob() — it is designed specifically
for this use case and will gain bootstrap re-calibration variance in Phase
2.5:
svy <- as_survey_nonprob(ns_wave1, weights = weight)
Metadata:
All substantive columns carry variable labels ("label" attribute) set
during data preparation. Battery items additionally carry a
"question_preface" attribute with the shared question stem. Value
labels ("labels" attribute) are present for all coded response items.
Battery structure:
Most multi-item question groups follow a {battery}_{item} naming
convention. All items within a battery share an identical
"question_preface" attribute:
| Battery prefix | Preface summary | N items |
|---|---|---|
| news_sources_* | News sources used in past week | 13 |
| group_favorability_* | Favorability toward named groups | 13 |
| cand_favorability_* | Favorability toward named candidates | 9 |
| trump_* | Trump head-to-head matchups | 10 |
| pence_* | Pence head-to-head matchups | 5 |
| cand_truth_* | Whether each candidate tells the truth | 6 |
| cand_facts_* | Whether each candidate relies on facts | 6 |
| racial_attitudes_* | Agree/disagree racial attitude items | 4 |
| gender_attitudes_* | Agree/disagree gender attitude items | 4 |
| discrimination_* | Perceived discrimination by group | 6 |
Three policy batteries share the same Agree/Disagree/Neither scale:
wall, cap_carbon, environment, guns_bg, mctaxes, estate_tax,
raise_upper_tax, college, abortion_waiting, abortion_never,
abortion_conditions, late_term_abortion, abortion_insurance,
guaranteed_jobs, green_new_deal, gun_registry,
immigration_separation, immigration_system, immigration_wire,
impeach_trump, israel, marijuana, maternityleave,
medicare_for_all, military_size, minwage, muslimban,
oil_and_gas, reparations, right_to_work, ten_commandments,
trade, trans_military, uctaxes2, vouchers, gov_insurance,
public_option, health_subsidies, path_to_citizenship, dreamers,
deportation, ban_guns, ban_assault_rifles, limit_magazines.
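Because every item in a battery carries the same "question_preface" attribute, batteries can be recovered by grouping columns on that attribute. The following is a minimal base-R sketch on a made-up data frame; the column values and the two stem strings are hypothetical placeholders, not the actual Nationscape question text.

```r
# Toy stand-in for a labelled survey data frame (all values are made up)
toy <- data.frame(
  news_sources_cnn = c(1, 2),
  news_sources_npr = c(2, 2),
  cand_truth_biden = c(1, 3),
  age              = c(34, 61)
)

# Hypothetical stems; the real stems live in the packaged data's attributes
stem_news  <- "hypothetical news stem"
stem_truth <- "hypothetical truth stem"
attr(toy$news_sources_cnn, "question_preface") <- stem_news
attr(toy$news_sources_npr, "question_preface") <- stem_news
attr(toy$cand_truth_biden, "question_preface") <- stem_truth

# Collect the preface of every column (NA when the attribute is absent)
prefaces <- vapply(toy, function(col) {
  p <- attr(col, "question_preface")
  if (is.null(p)) NA_character_ else p
}, character(1))

# Group column names by their shared stem
has_preface <- !is.na(prefaces)
batteries <- split(names(prefaces)[has_preface], prefaces[has_preface])
lengths(batteries)
```

The same grouping could be applied to ns_wave1 itself to cross-check the item counts in the table above.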
Democracy Fund Voter Study Group / UCLA. Nationscape Data Set, version
December 2021. https://www.voterstudygroup.org/data/nationscape
(free download; academic research use). Prepared by
data-raw/prepare-nationscape-phase1.R.
For full methodology, see the Nationscape User Guide and the
Representativeness Assessment report in
data-raw/nationscape/Nationscape-User-Guide-2021Dec.pdf.
Tausanovitch, Chris and Lynn Vavreck. 2021. Democracy Fund + UCLA Nationscape, July 18–24, 2019 — Wave 1 (version 20210301). Retrieved from voterstudygroup.org/data/nationscape.
Rivers, Douglas and Delia Bailey. 2009. "Inference from matched samples in the 2008 U.S. national elections." Proceedings of the Joint Statistical Meetings, Social Statistics Section.
# Design variables
head(ns_wave1[, c("response_id", "weight", "age", "gender")])

# Inspect a battery item's metadata
attr(ns_wave1$group_favorability_blacks, "label")
attr(ns_wave1$group_favorability_blacks, "question_preface")
attr(ns_wave1$news_sources_cnn, "labels")

# Create a calibrated survey design (correct approach for raked
# non-prob samples)
svy <- as_survey_nonprob(ns_wave1, weights = weight)
get_freqs(svy, pres_approval)

# Party identification distribution
table(ns_wave1$pid3)
The extended survey dataset from Pew Research Center's 2019-2020 Survey of U.S. Jews, fielded November 19, 2019 – June 3, 2020 (n = 5,881). Respondents were drawn from a national, stratified random sample of residential mailing addresses with oversampling of households likely to contain Jewish respondents. The dataset carries 100 jackknife replicate weights alongside the main weight.
pew_jewish_2020
A data frame with 5,881 rows and 130 variables. Variables
extweight1–extweight100 are jackknife replicate weights; the remaining
30 variables are:
Full-sample base weight. Use for all estimates.
Jackknife replicate weight 1 of 100.
Jackknife replicate weight 2 of 100.
Jackknife replicate weight 3 of 100.
Jackknife replicate weight 4 of 100.
Jackknife replicate weight 5 of 100.
Jackknife replicate weight 6 of 100.
Jackknife replicate weight 7 of 100.
Jackknife replicate weight 8 of 100.
Jackknife replicate weight 9 of 100.
Jackknife replicate weight 10 of 100.
Jackknife replicate weight 11 of 100.
Jackknife replicate weight 12 of 100.
Jackknife replicate weight 13 of 100.
Jackknife replicate weight 14 of 100.
Jackknife replicate weight 15 of 100.
Jackknife replicate weight 16 of 100.
Jackknife replicate weight 17 of 100.
Jackknife replicate weight 18 of 100.
Jackknife replicate weight 19 of 100.
Jackknife replicate weight 20 of 100.
Jackknife replicate weight 21 of 100.
Jackknife replicate weight 22 of 100.
Jackknife replicate weight 23 of 100.
Jackknife replicate weight 24 of 100.
Jackknife replicate weight 25 of 100.
Jackknife replicate weight 26 of 100.
Jackknife replicate weight 27 of 100.
Jackknife replicate weight 28 of 100.
Jackknife replicate weight 29 of 100.
Jackknife replicate weight 30 of 100.
Jackknife replicate weight 31 of 100.
Jackknife replicate weight 32 of 100.
Jackknife replicate weight 33 of 100.
Jackknife replicate weight 34 of 100.
Jackknife replicate weight 35 of 100.
Jackknife replicate weight 36 of 100.
Jackknife replicate weight 37 of 100.
Jackknife replicate weight 38 of 100.
Jackknife replicate weight 39 of 100.
Jackknife replicate weight 40 of 100.
Jackknife replicate weight 41 of 100.
Jackknife replicate weight 42 of 100.
Jackknife replicate weight 43 of 100.
Jackknife replicate weight 44 of 100.
Jackknife replicate weight 45 of 100.
Jackknife replicate weight 46 of 100.
Jackknife replicate weight 47 of 100.
Jackknife replicate weight 48 of 100.
Jackknife replicate weight 49 of 100.
Jackknife replicate weight 50 of 100.
Jackknife replicate weight 51 of 100.
Jackknife replicate weight 52 of 100.
Jackknife replicate weight 53 of 100.
Jackknife replicate weight 54 of 100.
Jackknife replicate weight 55 of 100.
Jackknife replicate weight 56 of 100.
Jackknife replicate weight 57 of 100.
Jackknife replicate weight 58 of 100.
Jackknife replicate weight 59 of 100.
Jackknife replicate weight 60 of 100.
Jackknife replicate weight 61 of 100.
Jackknife replicate weight 62 of 100.
Jackknife replicate weight 63 of 100.
Jackknife replicate weight 64 of 100.
Jackknife replicate weight 65 of 100.
Jackknife replicate weight 66 of 100.
Jackknife replicate weight 67 of 100.
Jackknife replicate weight 68 of 100.
Jackknife replicate weight 69 of 100.
Jackknife replicate weight 70 of 100.
Jackknife replicate weight 71 of 100.
Jackknife replicate weight 72 of 100.
Jackknife replicate weight 73 of 100.
Jackknife replicate weight 74 of 100.
Jackknife replicate weight 75 of 100.
Jackknife replicate weight 76 of 100.
Jackknife replicate weight 77 of 100.
Jackknife replicate weight 78 of 100.
Jackknife replicate weight 79 of 100.
Jackknife replicate weight 80 of 100.
Jackknife replicate weight 81 of 100.
Jackknife replicate weight 82 of 100.
Jackknife replicate weight 83 of 100.
Jackknife replicate weight 84 of 100.
Jackknife replicate weight 85 of 100.
Jackknife replicate weight 86 of 100.
Jackknife replicate weight 87 of 100.
Jackknife replicate weight 88 of 100.
Jackknife replicate weight 89 of 100.
Jackknife replicate weight 90 of 100.
Jackknife replicate weight 91 of 100.
Jackknife replicate weight 92 of 100.
Jackknife replicate weight 93 of 100.
Jackknife replicate weight 94 of 100.
Jackknife replicate weight 95 of 100.
Jackknife replicate weight 96 of 100.
Jackknife replicate weight 97 of 100.
Jackknife replicate weight 98 of 100.
Jackknife replicate weight 99 of 100.
Jackknife replicate weight 100 of 100.
Unique respondent identifier.
Jewish identity category: 1 = Jews By Religion,
2 = Jews Of No Religion, 3 = Jewish Background,
4 = Jewish Affinity, 5 = Respondent Not Jewish In Any Way.
Collection mode: 1 = Screener And Extended Survey
Via Cawi, 2 = Screener And Extended Survey Via Teleform,
3 = Screener Via Cawi, Extended Survey Via Teleform.
Census region: 1 = Northeast, 2 = Midwest,
3 = South, 4 = West.
Sex: 1 = Male, 2 = Female, 99 = Not Answered.
Age: 1 = 18-29, 2 = 30-49, 3 = 50-64, 4 = 65+;
999 = No Answer.
Education: 1 = High School Or Less,
2 = Some College, 3 = College Graduate, 4 = Postgrad Degree;
99 = No Answer.
Current religion (24 categories including Jewish subgroups and combinations).
Hispanic origin: 1 = Yes, 2 = No, 99 = Not Answered.
Race (5 categories).
Race-ethnicity (4 categories).
Presidential approval (Trump): 1 = Strongly Approve,
2 = Somewhat Approve, 3 = Somewhat Disapprove,
4 = Strongly Disapprove, 99 = Not Answered.
Right track/wrong track:
1 = Generally Headed In The Right Direction,
2 = Off On The Wrong Track, 99 = Not Answered.
Personal life satisfaction: 1 = Excellent,
2 = Good, 3 = Only Fair, 4 = Poor, 99 = Not Answered.
Community as a place to live: 1 = Excellent,
2 = Good, 3 = Only Fair, 4 = Poor, 99 = Not Answered.
Jewish. Battery 1: religious identity (select-all-that-apply). See Details for question text.
Catholic. Battery 1: religious identity.
Mormon. Battery 1: religious identity.
Muslim. Battery 1: religious identity.
Jewish. Battery 2: religious background (select-all-that-apply). See Details for question text.
Catholic. Battery 2: religious background.
Mormon. Battery 2: religious background.
Muslim. Battery 2: religious background.
Evangelical Christians. Battery 3: discrimination perceptions (rating scale). See Details for question text.
Muslims. Battery 3: discrimination perceptions.
Jews. Battery 3: discrimination perceptions.
Blacks. Battery 3: discrimination perceptions.
Hispanics. Battery 3: discrimination perceptions.
Gays and lesbians. Battery 3: discrimination perceptions.
Survey design: Jackknife replication. Use as_survey_replicate() with all
100 replicate weights:
svy <- as_survey_replicate(
  pew_jewish_2020,
  weights = extweight,
  repweights = extweight1:extweight100,
  type = "JK1"
)
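To make the role of the replicate weights concrete, here is a minimal base-R sketch of the textbook JK1 variance formula, var = (R - 1) / R * sum((theta_r - theta)^2), applied to a weighted mean. The data are made up and the replicates are built by deleting one unit at a time; the Pew replicates were constructed by Pew and may follow a different grouping, and surveycore's internal computation may differ in detail.

```r
# Toy data: 6 respondents with equal full-sample weights (values are made up)
y <- c(2, 4, 4, 5, 7, 8)
w <- rep(10, length(y))
R <- length(y)  # one replicate per deleted unit

# Delete-one replicate weights: replicate r zeroes out unit r and
# rescales the remaining weights by R / (R - 1)
repw <- sapply(seq_len(R), function(r) {
  wr <- w * R / (R - 1)
  wr[r] <- 0
  wr
})

# Weighted mean under the full-sample weight and under each replicate
wmean <- function(y, w) sum(w * y) / sum(w)
theta_full <- wmean(y, w)
theta_rep  <- apply(repw, 2, function(wr) wmean(y, wr))

# JK1 variance: scaled sum of squared deviations from the full-sample estimate
var_jk1 <- (R - 1) / R * sum((theta_rep - theta_full)^2)
se_jk1  <- sqrt(var_jk1)
```

In this equal-weight, delete-one setting the result coincides with the usual simple-random-sample variance of the mean, var(y) / length(y), which is a handy sanity check.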
Jewish identity classification: The jewishcat variable classifies
respondents into five mutually exclusive categories used in the published
Pew report. Use jewishcat rather than constructing your own
classification from the raw religion variables.
Battery question stems:
Battery 1 (relconsider_a–relconsider_d): "ASIDE from religion, do you consider yourself to be any of the following in any way (for example ethnically, culturally or because of your family's background)?"
Values: 1 = Yes, Consider Myself This, 2 = No, Do Not Consider
Myself This, 99 = Refused.
Battery 2 (relraised_a–relraised_d): "Please indicate whether you were raised in any of the following traditions or had a parent from any of the following backgrounds." Values: 1 = Yes, Was Raised In
This Tradition Or Had A Parent From This Background, 2 = No, Was Not
Raised In This Tradition And Did Not Have A Parent From This Background,
99 = Refused.
Battery 3 (discrim_a–discrim_f): "Please tell us how much discrimination there is against each of these groups in our society today." Values: 1 = A Lot, 2 = Some, 3 = Not Much,
4 = None At All, 99 = Not Answered.
Metadata:
All columns carry variable labels and value labels as R attributes from the
original Stata file. The three battery variable groups additionally carry a
"question_preface" attribute with the shared question stem. All three
attribute types are automatically extracted into surveycore's metadata
system when you call as_survey_replicate().
Variable labels ("label" attribute): A human-readable description of
each column — for battery items this is the unique item text (e.g.,
"Jewish"). Example: attr(pew_jewish_2020$relconsider_a, "label")
returns "Jewish".
Value labels ("labels" attribute): A named numeric vector mapping
each code to its meaning. Example:
attr(pew_jewish_2020$relconsider_a, "labels") returns
c("Yes, Consider Myself This" = 1, "No, Do Not Consider Myself This" = 2, Refused = 99).
Question preface ("question_preface" attribute): The shared question
stem for each battery group. Example:
attr(pew_jewish_2020$discrim_a, "question_preface") returns
"Please tell us how much discrimination there is against each of these groups in our society today.".
Pew Research Center. Jewish Americans in 2020 (Extended Dataset).
https://www.pewresearch.org/datasets/ (free account required to
download raw data; the processed .rda is included in the package).
Prepared by data-raw/prepare-pew-jewish-2020.R.
# Design variables
head(pew_jewish_2020[, c("qkey", "extweight", "jewishcat")])

# Confirm 100 replicate weights are present
sum(grepl("^extweight[0-9]", names(pew_jewish_2020)))

# Inspect variable label (unique item text for battery variable)
attr(pew_jewish_2020$discrim_a, "label")

# Inspect value labels
attr(pew_jewish_2020$discrim_a, "labels")

# Inspect question preface (shared stem across the battery)
attr(pew_jewish_2020$discrim_a, "question_preface")

# Jewish identity distribution (use jewishcat, not raw religion vars)
table(pew_jewish_2020$jewishcat)
The 2025 National Public Opinion Reference Survey (NPORS), conducted February 5 – June 18, 2025, by Pew Research Center (n = 5,022). An address-based sample (ABS) drawn from the USPS Computerized Delivery Sequence File, with respondents completing the survey online, by paper, or by telephone in English or Spanish. All 65 columns from the public release file are retained.
pew_npors_2025
A data frame with 5,022 rows and 65 variables. The 11 smuse_*
variables form a battery asking about social media platform use and share a
"question_preface" attribute. All other variables are documented
individually below:
Case ID. Unique respondent identifier.
Sampling stratum (10 levels, defined by census block group demographics).
Base weight — inverse probability of selection, with adaptive mode adjustment.
Final weight — basewt after raking to Census population
targets. Use for all population-level estimates.
Data collection mode: 1 = Online, 2 = Paper,
3 = Phone.
Language interview completed in: 1 = English,
2 = Spanish.
Language interview started in.
Interview start timestamp.
Interview end timestamp.
Economic conditions in your community today (Excellent / Good / Fair / Poor).
Economic conditions one year from now (Better / Worse / Same).
Community type: Urban / Suburban / Rural.
Americans united vs. divided on values.
Area safety in terms of crime (Extremely safe – Not at all safe).
Government's role in protecting people from themselves.
Impact of more gun ownership on crime.
Household financial situation (Comfortable – Can't meet basics).
Military service in household.
Volunteered for any organization in past 12 months.
Uses internet or email at least occasionally.
Accesses internet on a mobile device.
Internet use frequency (6 categories).
Internet use frequency (4 categories, derived).
Subscribes to home internet service.
Home internet type (dial-up, broadband, etc.).
Facebook. Part of social media use battery (see Details).
YouTube. Part of social media use battery (see Details).
X (formerly Twitter). Part of social media use battery.
Instagram. Part of social media use battery.
Snapchat. Part of social media use battery.
WhatsApp. Part of social media use battery.
TikTok. Part of social media use battery.
Reddit. Part of social media use battery.
Bluesky. Part of social media use battery.
Threads. Part of social media use battery.
Truth Social. Part of social media use battery.
Listens to radio.
Has a cell phone.
Cell phone is a smartphone.
Has a working landline telephone at home.
Current religion (12 categories).
Religion (4 categories: Protestant, Catholic, Unaffiliated, Other).
Born-again or evangelical Christian.
In-person religious service attendance (6 categories).
Online/TV religious service participation (6 categories).
Importance of religion in life (Very – Not at all).
Prayer frequency outside of services (7 categories).
Education level (categorical).
Hispanic origin.
Race (5 categories).
Race-ethnicity (5 categories including Asian non-Hispanic).
Age in 13 five-year groups.
Age (4 categories: 18-29, 30-49, 50-64, 65+).
U.S. born vs. foreign born.
Gender (man / woman / other).
Number of adults in household.
Total family income (8 categories from < $30,000 to $150,000+).
Census region (NE / MW / S / W).
Metropolitan area indicator.
Registered to vote at current address.
Party affiliation (Rep / Dem / Ind / Other).
Party lean for Independents (Rep / Dem).
Party summary (Rep+Lean Rep / Dem+Lean Dem / No lean).
Voted in the 2024 presidential election.
2024 presidential vote choice (Trump / Harris / Other).
Survey design: Stratified address-based sample with raking post-stratification; use Taylor series linearization. NPORS has no PSU (each address is its own unit, effectively a stratified SRS):
svy <- as_survey(
  pew_npors_2025,
  strata = stratum,
  weights = weight
)
Use basewt instead of weight for sensitivity analyses comparing
pre- and post-raking estimates.
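A sensitivity check of this kind amounts to computing the same estimate under both weights. The base-R sketch below uses a made-up six-row data frame that borrows the column names basewt, weight, and smuse_fb from this dataset; the values are hypothetical, and a real analysis would build two survey designs rather than compute point estimates by hand.

```r
# Toy data with the dataset's column names; all values are made up
toy <- data.frame(
  smuse_fb = c(1, 1, 2, 2, 1, 2),          # 1 = Selected, 2 = Not selected
  basewt   = c(1.0, 1.0, 1.0, 2.0, 2.0, 2.0),
  weight   = c(0.8, 1.2, 1.0, 2.5, 1.5, 2.0)
)

# Weighted share of respondents coded 1 (Selected)
wprop <- function(x, w) sum(w[x == 1]) / sum(w)

p_base  <- wprop(toy$smuse_fb, toy$basewt)  # pre-raking estimate
p_final <- wprop(toy$smuse_fb, toy$weight)  # post-raking estimate

# Shift in the estimate attributable to raking
p_final - p_base
```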
Social media battery: All 11 smuse_* variables share the question
stem "Please indicate whether or not you ever use the following websites or apps." Values: 1 = Selected, 2 = Not selected, 99 = Refused.
Each variable additionally carries a "question_preface" attribute with
this shared stem.
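For readers unfamiliar with haven-style attributes, the following base-R sketch shows how such a labelled column is structured and how the value labels can be used to decode it. The response vector is made up; the label, value labels, and stem are the ones quoted in this documentation.

```r
# Made-up responses for one battery item
smuse_fb <- c(1, 2, 2, 1, 99)

# Attach the three attribute types described in this help page
attr(smuse_fb, "label") <- "Facebook"
attr(smuse_fb, "labels") <- c(Selected = 1, "Not selected" = 2, Refused = 99)
attr(smuse_fb, "question_preface") <-
  "Please indicate whether or not you ever use the following websites or apps."

# Decode numeric codes into a factor using the value labels
labs <- attr(smuse_fb, "labels")
decoded <- factor(smuse_fb, levels = unname(labs), labels = names(labs))
table(decoded)
```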
Metadata:
All columns carry variable labels and value labels as R attributes from the
original SPSS file. The 11 smuse_* battery variables additionally carry
a "question_preface" attribute with the shared question stem. All three
attribute types are automatically extracted into surveycore's metadata
system when you call as_survey().
Variable labels ("label" attribute): A human-readable description of
each column — for smuse_* variables this is just the platform name
(e.g., "Facebook"). Example: attr(pew_npors_2025$smuse_fb, "label")
returns "Facebook".
Value labels ("labels" attribute): A named numeric vector mapping
each code to its meaning. Example:
attr(pew_npors_2025$smuse_fb, "labels") returns
c(Selected = 1, "Not selected" = 2, Refused = 99).
Question preface ("question_preface" attribute): The shared question
stem for battery items, set on all smuse_* columns. Example:
attr(pew_npors_2025$smuse_fb, "question_preface") returns
"Please indicate whether or not you ever use the following websites or apps.".
Pew Research Center. 2025 National Public Opinion Reference Survey.
https://www.pewresearch.org/datasets/ (free account required to
download raw data; the processed .rda is included in the package).
Prepared by data-raw/prepare-pew-npors-2025.R.
# Variables in the dataset
names(pew_npors_2025)

# Create survey design (no PSU for ABS design)
svy <- as_survey(
  pew_npors_2025,
  strata = stratum,
  weights = weight
)

# Inspect variable label
attr(pew_npors_2025$smuse_fb, "label")

# Inspect value labels
attr(pew_npors_2025$smuse_fb, "labels")

# Inspect question preface (shared stem for all smuse_* battery items)
attr(pew_npors_2025$smuse_fb, "question_preface")
Print Method for survey_anova Objects
## S3 method for class 'survey_anova'
print(x, ...)
x |
A |
... |
Additional arguments (currently unused). |
x, invisibly.
Prints a structured header showing design type, family, dependent variable, treatment variable with reference level, and estimation method, then delegates to the tibble print method for the body.
## S3 method for class 'survey_diffs'
print(x, ...)
x |
A |
... |
Passed to the tibble print method. |
x, invisibly.
Print method for survey_pairwise objects.
## S3 method for class 'survey_pairwise'
print(x, ...)
x |
A |
... |
Additional arguments (unused). |
x, invisibly.
Prints a labelled header showing the specific result class and dimensions, then delegates to the tibble print method for the tabular content.
## S3 method for class 'survey_result'
print(x, ...)
x |
A |
... |
Passed to the tibble print method. |
x, invisibly.
result <- structure(
  tibble::tibble(mean = 42.0, se = 1.5, n = 100L),
  .meta = list(
    design_type = "taylor",
    conf_level = 0.95,
    call = quote(get_means(d, x)),
    n_respondents = 100L,
    group = list(),
    x = list(
      x = list(
        variable_label = NULL,
        question_preface = NULL,
        value_labels = NULL
      )
    )
  ),
  class = c("survey_means", "survey_result", "tbl_df", "tbl", "data.frame")
)
print(result)
Print method for survey_t_test objects.
## S3 method for class 'survey_t_test'
print(x, ...)
x |
A |
... |
Additional arguments (unused). |
x, invisibly.
survey_collection
Drops one or more named surveys from a collection and returns a new
survey_collection. Errors if any requested name is not present.
remove_survey(x, name)
x |
A |
name |
Character vector of survey names to drop. All names must be
present in |
A new survey_collection without the dropped surveys. Raises a
surveycore_error_collection_empty error if the removal would leave the
collection empty. This error is raised by the S7 class validator, not
by remove_survey() itself.
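Because the empty-collection failure is a classed error, callers can handle it selectively with tryCatch(). The sketch below is self-contained and does not call surveycore: fake_remove_survey() is a hypothetical stand-in operating on a plain named list, written only to signal a condition with the class name documented above.

```r
# Hypothetical stand-in for remove_survey(); NOT the package implementation.
# It drops names from a plain list and signals the documented error class
# when nothing would remain.
fake_remove_survey <- function(surveys, name) {
  remaining <- surveys[setdiff(names(surveys), name)]
  if (length(remaining) == 0L) {
    stop(structure(
      list(message = "Collection would be empty.", call = NULL),
      class = c("surveycore_error_collection_empty", "error", "condition")
    ))
  }
  remaining
}

# Handle only the empty-collection case; other errors still propagate
result <- tryCatch(
  fake_remove_survey(list(a = 1), "a"),
  surveycore_error_collection_empty = function(e) "kept original collection"
)
result
```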
as_survey_collection(), add_survey()
Other collections:
add_survey(),
as_survey_collection(),
set_collection_id(),
set_collection_if_missing_var(),
survey_collection()
d1 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
d2 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
coll <- as_survey_collection(a = d1, b = d2)
coll2 <- remove_survey(coll, "a")
names(coll2)
survey_collection
Updates the @id property of a survey_collection. The new value is
the column name .dispatch_over_collection() injects when an analysis
function (get_means(), get_freqs(), etc.) is dispatched across the
collection without an explicit per-call .id.
set_collection_id(x, id)
x |
|
id |
Character(1). The new identifier column name. Must be
non- |
Setting the same value as the existing @id returns the collection
unchanged (no error, no warning). All other invariants on the
collection (@surveys, @groups, @if_missing_var) are preserved.
Pipes naturally with the rest of the collection API:
coll |> set_collection_id("wave") |> get_means(y1)
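Conceptually, the identifier column records which member survey each result row came from. The base-R sketch below imitates that idea on a plain named list of data frames; it is an illustration of the stacking pattern only, under the assumption that results are row-bound with the survey name in the id column, and it does not reproduce .dispatch_over_collection().

```r
# Two made-up "surveys" held in a named list
surveys <- list(
  a = data.frame(y1 = c(1, 2, 3)),
  b = data.frame(y1 = c(4, 6))
)

# The identifier column name, analogous to the collection's @id
id_col <- "wave"

# Run the same analysis on each member and tag rows with the survey name
per_survey <- lapply(names(surveys), function(nm) {
  out <- data.frame(mean = mean(surveys[[nm]]$y1))
  out[[id_col]] <- nm
  out[, c(id_col, "mean")]
})

# Stack the per-survey results into one table
stacked <- do.call(rbind, per_survey)
stacked
```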
The modified survey_collection, invisibly.
Other collections:
add_survey(),
as_survey_collection(),
remove_survey(),
set_collection_if_missing_var(),
survey_collection()
d1 <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
coll <- as_survey_collection(a = d1)
coll <- set_collection_id(coll, "wave")
coll@id
survey_collection
Updates the @if_missing_var property of a survey_collection. The
new value is the per-call default .dispatch_over_collection() uses
when an analysis function (get_means(), get_freqs(), etc.) is
dispatched across the collection without an explicit per-call
.if_missing_var.
set_collection_if_missing_var(x, if_missing_var)
x |
|
if_missing_var |
Character(1), one of |
Setting the same value as the existing @if_missing_var returns the
collection unchanged (no error, no warning). All other invariants on
the collection (@surveys, @groups, @id) are preserved.
Pipes naturally with the rest of the collection API:
coll |> set_collection_if_missing_var("skip") |> get_means(y1)
The modified survey_collection, invisibly.
Other collections:
add_survey(),
as_survey_collection(),
remove_survey(),
set_collection_id(),
survey_collection()
d1 <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
coll <- as_survey_collection(a = d1)
coll <- set_collection_if_missing_var(coll, "skip")
coll@if_missing_var
Records whether a higher value is better ("better") or worse ("worse")
for one or more variables in a survey design object or data frame. This
metadata is used by get_diffs() when show_favorability = TRUE.
set_higher_is(x, ..., variable = NULL, direction = NULL)
x |
A survey design object or |
... |
Named arguments where the name is the variable and the value is
the direction ( |
variable |
|
direction |
|
The modified object, invisibly.
extract_higher_is() to retrieve direction attributes
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
d <- set_higher_is(d, bpxsy1 = "worse")
extract_higher_is(d, bpxsy1)
Sets missing-value codes for one or more variables. Missing codes are atomic
vectors documenting which data values represent missing data
(e.g., c(Refused = -2L, DontKnow = -1L)).
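To make the shape of such a vector concrete, here is a small base-R sketch of how an analyst might use documented missing codes. It is an assumption about downstream use, not a description of what set_missing_codes() does to the data; the `happy` vector is made up for illustration.

```r
# Named vector of missing-value codes, in the form described above
codes <- c(Refused = -2L, DontKnow = -1L)

# Made-up responses: substantive values 1-3 plus two missing codes
happy <- c(1L, 2L, -1L, 3L, -2L, 2L)

# Which documented reason applies to each response (NA if substantive)
reason <- names(codes)[match(happy, codes)]

# Recode every documented missing code to NA before analysis
happy_clean <- replace(happy, happy %in% codes, NA_integer_)

reason
happy_clean
```

Keeping the codes as a named vector means the reason for missingness stays recoverable even after the values are set to NA.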
set_missing_codes(x, ..., variable = NULL, codes = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
a named atomic vector of missing codes. Supports |
variable |
A character vector of variable names. Use with |
codes |
A list of named atomic vectors, one per element of |
Supports Conventions 1, 2, and 3 — see set_var_label() for details on
the calling conventions. For Convention 3 with a single variable, a bare
named atomic vector is accepted in addition to a list.
The modified object, invisibly.
extract_missing_codes() to retrieve missing value codes
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
d <- set_missing_codes(d, happy = c(Refused = -1L, DK = -2L))
extract_missing_codes(d, happy)
Sets the question preface string for one or more variables. Question prefaces are the shared introductory text for a battery of related questions.
set_question_preface(x, ..., variable = NULL, preface = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
the preface string. Supports |
variable |
A character vector of variable names. Use with |
preface |
A character vector of preface strings, one per element of
|
Supports Conventions 1, 2, and 3 — see set_var_label() for details.
The modified object, invisibly.
extract_question_preface() to retrieve a preface
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
d <- set_question_preface(d, happy = "Taken all together...")
extract_question_preface(d, happy)
Marks one or more variables as reverse-coded in a survey design object or
data frame. Uses the same two-convention pattern as set_sata().
set_reverse_coded(x, ..., variable = NULL, reverse_coded = TRUE)
x |
A survey design object or |
... |
< |
variable |
|
reverse_coded |
|
Convention A (tidy-select ...) — recommended:
design |> set_reverse_coded(anxiety, worry)
Convention B (variable = character vector) — programmatic:
vars <- c("anxiety", "worry")
design |> set_reverse_coded(variable = vars)
Setting reverse_coded = FALSE removes the flag.
The modified object, invisibly.
extract_reverse_coded() to retrieve reverse-coded flags
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
d <- set_reverse_coded(d, bpxsy1, ridageyr)
d <- set_reverse_coded(d, bpxsy1, reverse_coded = FALSE)
Marks one or more variables as select-all-that-apply (SATA) in a survey
design object or a data frame. Unlike the other unified setters (which map
variable names to heterogeneous content), set_sata() applies a single
logical flag to all listed variables, so it uses a simplified two-convention
pattern.
set_sata(x, ..., variable = NULL, sata = TRUE)
x |
A survey design object or |
... |
< |
variable |
|
sata |
|
Convention A (tidy-select ...) — recommended:
design |> set_sata(news_tv, news_online, news_radio)
design |> set_sata(starts_with("news_"))
Convention B (variable = character vector) — programmatic:
sata_vars <- c("news_tv", "news_online", "news_radio")
design |> set_sata(variable = sata_vars)
Setting sata = FALSE unmarks the listed variables.
The modified object, invisibly.
extract_sata() to retrieve SATA flags
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
d <- set_sata(d, riagendr, ridageyr)
d <- set_sata(d, riagendr, sata = FALSE)
Sets the universe description for one or more variables. The universe
describes the population to which a variable applies
(e.g., "Adults 18+").
set_universe(x, ..., variable = NULL, universe = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
the universe description string. Supports |
variable |
A character vector of variable names. Use with |
universe |
A character vector of universe description strings, one per
element of |
Supports Conventions 1, 2, and 3 — see set_var_label() for details.
The modified object, invisibly.
extract_universe() to retrieve universe descriptions
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
d <- set_universe(d, age = "All respondents 18+")
extract_metadata(d, age)
Sets value labels for one or more variables using one of three conventions.
set_val_labels(x, ..., variable = NULL, labels = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
a fully named vector of value labels. Supports |
variable |
A character vector of variable names. |
labels |
A list of named vectors, one per element of |
Convention 1 (named ...) — recommended:
set_val_labels(x, sex = c(Male = 1L, Female = 2L))
Convention 2 (single named list in ...):
set_val_labels(x, list(sex = c(Male = 1L, Female = 2L)))
Convention 3 (variable + labels):
set_val_labels(x, variable = "sex", labels = c(Male = 1L, Female = 2L))
The modified object, invisibly.
extract_val_labels() to retrieve value labels
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_var_label(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
d <- set_val_labels(d, riagendr = c(Male = 1L, Female = 2L))
Sets variable labels using one of three conventions.
set_var_label(x, ..., variable = NULL, label = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
the label string. Supports |
variable |
A character vector of variable names. Use with |
label |
A character vector of label strings, one per element of
|
Convention 1 (named ...) — recommended for interactive use:
set_var_label(x, age = "Age in years", income = "Annual income")
set_var_label(x, !!!labels_list) # list splicing
Convention 2 (named vector in ...) — useful for programmatic use:
set_var_label(x, c(age = "Age in years", income = "Annual income"))
Convention 3 (variable + label arguments) — for vector input:
vars <- c("age", "income")
lbls <- c("Age in years", "Annual income")
set_var_label(x, variable = vars, label = lbls)
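All three conventions describe the same thing: a mapping from variable names to labels. The base-R sketch below shows one way such inputs could be collapsed into a single named character vector. It is an illustration only, not surveycore's implementation; `normalize_labels()` is a hypothetical helper, and `!!!` splicing is not covered because it needs rlang.

```r
# Hypothetical helper: collapse the three calling conventions into one
# named character vector (variable name -> label). Illustration only.
normalize_labels <- function(..., variable = NULL, label = NULL) {
  dots <- list(...)
  if (!is.null(variable)) {
    # Convention 3: parallel `variable` and `label` vectors
    stopifnot(length(variable) == length(label))
    return(stats::setNames(as.character(label), variable))
  }
  if (length(dots) == 1L && is.null(names(dots))) {
    # Convention 2: a single unnamed argument holding a named vector
    return(unlist(dots[[1L]]))
  }
  # Convention 1: each argument is `name = "label"`
  unlist(dots)
}

a <- normalize_labels(age = "Age in years", income = "Annual income")
b <- normalize_labels(c(age = "Age in years", income = "Annual income"))
c3 <- normalize_labels(
  variable = c("age", "income"),
  label = c("Age in years", "Annual income")
)

# All three calls yield the same mapping
identical(a, b) && identical(b, c3)
```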
The modified object, invisibly.
extract_var_label() to retrieve a label
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_note(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
d <- set_var_label(d, indfmpir = "Income-to-poverty ratio")

# Multiple variables
d <- set_var_label(
  d,
  bpxsy1 = "Systolic BP (1st reading)",
  bpxdi1 = "Diastolic BP (1st reading)"
)
Sets an analyst note for one or more variables. Notes are free-text annotations for documenting processing decisions, data quality concerns, or other context.
set_var_note(x, ..., variable = NULL, note = NULL)
x |
A survey design object or a data frame. |
... |
Named arguments where the name is the variable and the value is
the note string. Supports |
variable |
A character vector of variable names. Use with |
note |
A character vector of note strings, one per element of
|
Supports Conventions 1, 2, and 3 — see set_var_label() for details.
The modified object, invisibly.
extract_var_note() to retrieve a note
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
survey_metadata(),
survey_weighting_history()
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
d <- set_var_note(d, age = "Top-coded at 89")
extract_var_note(d, age)
An S7 container that holds multiple independent survey_base objects
(e.g., multiple waves of a panel or cross-sectional series) for
comparative analysis. Create with as_survey_collection().
survey_collection(
  surveys = list(),
  groups = character(0),
  id = ".survey",
  if_missing_var = "error"
)
surveys |
A named list of |
groups |
Character vector of grouping variable names. Every member's
|
id |
Character(1). Identifier column name used when dispatching
analysis functions across the collection. Default |
if_missing_var |
Character(1), one of |
survey_collection deliberately does not inherit from
survey_base. This prevents collection-of-collections nesting: a
survey_collection passed as an element of another collection fails
the element-type check automatically.
Each element of @surveys is an independent survey_base subclass
object (e.g., survey_taylor, survey_replicate, survey_twophase,
survey_nonprob). Mixed-type collections are allowed — the collection
never combines designs, so heterogeneous classes cannot produce an
invalid state.
A survey_collection object.
surveys: A fully named list of survey_base objects.
Length . Names are unique, non-NA, and non-empty.
groups: A character vector of grouping variable names
applied uniformly across every member survey. Default
character(0) (ungrouped). When non-empty, every member's
@groups is asserted identical() to this value.
id: Character(1). Identifier column name injected by
.dispatch_over_collection() when a get_*() is called on the
collection. Default ".survey". Stored on the collection and
consumed as the per-call default; a non-NULL .id at the
analysis-function call site overrides this stored value.
Mutate via set_collection_id().
if_missing_var: Character(1), one of c("error", "skip").
Default "error". Controls how dispatched get_*() functions
behave when a member is missing a requested variable. Stored on
the collection and consumed as the per-call default; a non-NULL
.if_missing_var at the analysis-function call site overrides
this stored value. Mutate via set_collection_if_missing_var().
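The roles of the stored id and if_missing_var values can be sketched in base R. The code below is a toy stand-in, not the package's .dispatch_over_collection(): `dispatch_means()` is a hypothetical function, the "surveys" are plain data frames, and the estimate is an unweighted mean. It only shows how an identifier column might be injected and how "error" versus "skip" could differ.

```r
# Hypothetical dispatcher over a named list of data frames (illustration only)
dispatch_means <- function(surveys, var, id = ".survey",
                           if_missing_var = c("error", "skip")) {
  if_missing_var <- match.arg(if_missing_var)
  out <- list()
  for (nm in names(surveys)) {
    df <- surveys[[nm]]
    if (!var %in% names(df)) {
      if (if_missing_var == "error") {
        stop("Variable '", var, "' is missing from member '", nm, "'.")
      }
      next  # "skip": drop this member from the result
    }
    row <- data.frame(survey = nm, mean = mean(df[[var]]))
    names(row)[1L] <- id  # inject the identifier column under the chosen name
    out[[nm]] <- row
  }
  do.call(rbind, out)
}

waves <- list(
  w1 = data.frame(y1 = c(1, 2, 3)),
  w2 = data.frame(y2 = c(4, 5))      # y1 is absent here
)

# "skip" returns a row for w1 only, labelled by the `wave` column
res <- dispatch_means(waves, "y1", id = "wave", if_missing_var = "skip")
res
```

With the default "error" setting the same call stops at the first member that lacks the variable, which is the safer choice when every wave is expected to carry it.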
as_survey_collection() to build a collection from survey
objects; add_survey() / remove_survey() to mutate an existing
collection.
Other collections:
add_survey(),
as_survey_collection(),
remove_survey(),
set_collection_id(),
set_collection_if_missing_var()
d1 <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
coll <- survey_collection(surveys = list(gss = d1))
length(coll)
names(coll)
Returns the underlying data frame stored in a survey design object.
This is a thin accessor for x@data that provides a stable public name
independent of the S7 property structure.
survey_data(x)
x |
A |
A data.frame with all variables, including design variables.
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu, weights = wtint2yr, strata = sdmvstra, nest = TRUE
)
head(survey_data(d))
Fits a GLM to survey data, producing design-based coefficient estimates
and a variance-covariance matrix via the Binder (1983) sandwich estimator.
All four concrete surveycore design classes are supported
(survey_taylor, survey_replicate, survey_twophase, survey_nonprob).
survey_collection inputs are rejected; call survey_glm() on each
element individually.
survey_glm(
  design,
  formula = NULL,
  response = NULL,
  predictors = NULL,
  family = stats::gaussian(),
  na.action = stats::na.omit,
  start = NULL,
  etastart = NULL,
  mustart = NULL,
  control = list(),
  quiet = FALSE
)
design |
A survey design object created by |
formula |
A model formula in standard R notation
(e.g. |
response |
Character string naming the outcome variable.
Programmatic alternative to |
predictors |
Character vector of predictor variable names. Used with
|
family |
A GLM family object specifying the error distribution and
link function. Default |
na.action |
How to handle |
start |
Starting values for the coefficient vector. |
etastart |
Starting values for the linear predictor. |
mustart |
Starting values for the mean. |
control |
A list of GLM control parameters passed to
|
quiet |
Logical. If |
Variance estimation: Uses the Binder (1983) sandwich estimator, which
decomposes into per-observation score vectors passed to the Phase 0
variance machinery. The bread (X'WX)^(-1) accounts for IRLS working
weights and is correct for all GLM families including binomial and
Poisson.
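The sandwich construction can be sketched in base R. The code below assumes a much simpler design than surveycore supports: a single stratum, with-replacement sampling, each row treated as its own PSU, and a canonical-link family. The simulated data and all variable names are illustrative, and the package's internal variance machinery is not reproduced here.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
w <- runif(n, 0.5, 2)                    # stand-in survey weights
y <- rpois(n, exp(0.3 + 0.5 * x))

fit <- stats::glm(y ~ x, family = stats::poisson(), weights = w)
X   <- stats::model.matrix(fit)
mu  <- stats::fitted(fit)

# Bread: (X' W X)^(-1). fit$weights holds the final IRLS working weights,
# which already include the prior (survey) weights.
bread <- solve(crossprod(X, X * fit$weights))

# Per-observation weighted score vectors. For a canonical link (as here)
# the score for row i is w_i * (y_i - mu_i) * x_i.
scores <- X * (w * (y - mu))

# Influence values: one row per observation
infl <- scores %*% bread

# Variance of the total of the influence rows under the assumed design
# (independent PSUs, one stratum, with replacement)
centered <- sweep(infl, 2, colMeans(infl))
V  <- crossprod(centered) * n / (n - 1)
se <- sqrt(diag(V))
se
```

In a stratified or clustered design the last step changes (scores are summed within PSUs and centered within strata), while the bread and the score vectors stay the same.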
binomial() family: Wraps the stats::glm() call in
suppressWarnings() to suppress the "non-integer #successes" warning
that fires for every survey-weighted binomial model.
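The warning in question can be reproduced with stats::glm() alone. The following base-R sketch uses simulated data and made-up weights; it shows that the warning comes from non-integer weights and that suppressing it leaves the coefficients unchanged.

```r
set.seed(2)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.2 + 0.8 * x))
w <- runif(n, 0.5, 2)   # non-integer weights, as survey weights usually are

# Collect any warnings raised by the weighted fit instead of printing them
msgs <- character(0)
fit_noisy <- withCallingHandlers(
  stats::glm(y ~ x, family = stats::binomial(), weights = w),
  warning = function(cnd) {
    msgs <<- c(msgs, conditionMessage(cnd))
    invokeRestart("muffleWarning")
  }
)
msgs

# Same fit with the warning suppressed
fit_quiet <- suppressWarnings(
  stats::glm(y ~ x, family = stats::binomial(), weights = w)
)
all.equal(coef(fit_noisy), coef(fit_quiet))
```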
Domain estimation: Use surveytidy::filter() before calling
survey_glm(). The GLM is fit on in-domain rows only; variance
estimation uses the full design for correct design-based SEs.
Multinomial response: cbind() on the LHS of formula is not
supported. Multinomial logistic regression is deferred to a later phase.
Formula to model matrix: survey_glm() passes the formula to
stats::model.matrix() via stats::glm(). Factor and character predictors
are dummy-coded using model.matrix() default contrasts (treatment coding:
first level as reference). Numeric predictors enter as-is. Interaction
terms (:, *) and inline transformations (log(), I()) are supported
as in any standard R formula. The resulting model matrix is n x p where
p is the number of coefficients including the intercept.
Predictor variable types: Predictors may be numeric, integer, logical,
factor, or character. Character predictors are coerced to factor by
stats::model.matrix(). Ordered factors use polynomial contrasts by
default. All other R types (list columns, complex, raw) will produce an
error from stats::model.matrix().
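The encoding rules above can be checked directly with stats::model.matrix(). This sketch uses a made-up four-row data frame and assumes R's default contrasts option (treatment contrasts for unordered factors, polynomial contrasts for ordered ones).

```r
df <- data.frame(
  age  = c(25, 40, 61, 33),
  sex  = c("Male", "Female", "Female", "Male"),   # character predictor
  educ = factor(
    c("HS", "BA", "HS", "Grad"),
    levels = c("HS", "BA", "Grad"),
    ordered = TRUE                                # ordered factor
  )
)

mm <- stats::model.matrix(~ age + sex + educ + age:sex, data = df)

# Character `sex` is dummy-coded with the first level (Female) as reference;
# ordered `educ` gets linear (.L) and quadratic (.Q) polynomial columns.
colnames(mm)
dim(mm)   # n x p, with p counting the intercept
```

To get treatment-coded dummies for an ordered variable instead, convert it to an unordered factor before fitting.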
Input assumptions: surveycore assumes (1) each row of design@data
represents one sampled unit; (2) survey weights are positive and finite
for all rows (validated at construction time); (3) the model formula
variables are columns of design@data; (4) the design is correctly
specified before calling survey_glm(). No centering, scaling, or
other pre-processing is applied to predictor variables beyond what the
formula specifies.
Data transformations: No automatic transformation is applied to
predictor or response variables. Factor encoding is handled by
stats::model.matrix() using the active contrasts. Link function
transformations (e.g. log link in poisson()) are applied by the
family object, not by surveycore. To apply custom transformations, use
I() or log() etc. inside the formula.
Row and column names: The coefficient vector returned in
fit@coefficients carries the names produced by stats::model.matrix()
(e.g. "(Intercept)", "sexFemale", "age"). fit@vcov carries the
same names on rows and columns. model.frame.survey_glm_fit() returns the
model frame with row names matching the rows used in fitting (i.e. the
row names of design@data after applying na.action). Rows excluded by
na.action = na.omit do not appear in the model frame.
Missing values: na.action controls handling of NA in model frame
variables (predictors and response). na.omit (default) silently drops
rows with any NA; the variance estimator uses the full design for
correct sandwich SEs. na.fail stops with an informative error listing
all variables containing NA and the row count for each. Survey weights
are validated separately at construction time and must not contain NA.
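The row-name and na.action behaviour described above mirrors what stats::model.frame() does in base R. The sketch below uses a made-up five-row data frame; the per-variable NA count is computed by hand as an illustration of the kind of summary an error message could report, not as the package's own message.

```r
df <- data.frame(
  y = c(1.2, NA, 3.1, 4.8, 2.5),
  x = c(0.5, 1.0, NA, 2.0, 1.5),
  row.names = c("r1", "r2", "r3", "r4", "r5")
)

# na.omit: incomplete rows are dropped, surviving rows keep their names
mf <- stats::model.frame(y ~ x, data = df, na.action = stats::na.omit)
rownames(mf)
dropped <- attr(mf, "na.action")   # positions (and names) of dropped rows
dropped

# Count of NA values per model variable
vars <- all.vars(y ~ x)
na_counts <- colSums(is.na(df[vars]))
na_counts

# na.fail: the same call stops instead of dropping rows
failed <- inherits(
  try(
    stats::model.frame(y ~ x, data = df, na.action = stats::na.fail),
    silent = TRUE
  ),
  "try-error"
)
failed
```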
Performance: Runtime scales as O(n · p²) for the score matrix
computation and O(p³) for the bread matrix (solve). For Taylor designs,
variance estimation adds O(n · H · p²) where H is the number of
strata. For replicate designs it adds O(R · n · p) where R is the
number of replicates. The dominant cost for large n is typically the
stats::glm() IRLS fit: O(n · p²) per IRLS iteration, or O(n · p² · I)
over I iterations.
A survey_glm_fit S7 object.
Binder, D.A. (1983) On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 51(3), 279–292.
Binder, D.A. (1991) Use of estimating functions for interval estimation from complex surveys. Proceedings of the American Statistical Association, Section on Survey Research Methods, 34–42.
Lumley, T. and Scott, A. (2014) Tests in surveys with complex sampling. Journal of the Royal Statistical Society: Series B 76(2), 431–452.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)

# Linear model: respondent age predicted by education and sex
fit <- survey_glm(d, age ~ educ + sex)
fit@coefficients
fit@vcov

# Programmatic interface, suitable for lapply()
results <- lapply(c("age", "educ"), function(v) {
  survey_glm(d, response = v, predictors = "sex")
})
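For readers who want to see how a response/predictors pair relates to a formula, base R's stats::reformulate() builds the equivalent formula from character vectors. Whether survey_glm() uses this function internally is not stated in this manual, so treat the sketch as an analogy only.

```r
response   <- "age"
predictors <- c("educ", "sex")

# Build `age ~ educ + sex` from character inputs
f <- stats::reformulate(predictors, response = response)
f

# Same text as the hand-written formula
identical(deparse(f), deparse(age ~ educ + sex))
```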
S7 class produced by survey_glm(). Holds all regression output from a
survey-weighted generalised linear model: design-based coefficient
estimates, variance-covariance matrix, fitted values, residuals, and
model metadata.
survey_glm_fit(
  coefficients = integer(0),
  vcov = NULL,
  fitted_values = integer(0),
  residuals = integer(0),
  weights = integer(0),
  design = survey_base(),
  degf = integer(0),
  family = list(),
  formula = NULL,
  null_deviance = integer(0),
  deviance = integer(0),
  df_null = integer(0),
  df_residual = integer(0),
  converged = logical(0),
  call = NULL,
  fit_ = NULL,
  term_assign = integer(0)
)
coefficients |
Named numeric vector of length |
vcov |
|
fitted_values |
Numeric vector of length |
residuals |
Working residuals from IRLS, length |
weights |
Survey weights used in fitting, length |
design |
The original survey_base survey design object. |
degf |
Raw design degrees of freedom (positive scalar). For
|
family |
GLM family object (e.g. |
formula |
Model formula. |
null_deviance |
Null model deviance. |
deviance |
Residual deviance. |
df_null |
Classical null df ( |
df_residual |
Classical residual df ( |
converged |
Logical; whether IRLS converged. |
call |
The |
fit_ |
Internal raw |
term_assign |
Integer vector: |
A survey_glm_fit object.
survey_glm() to create a survey_glm_fit.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_nonprob(),
survey_replicate(),
survey_taylor(),
survey_twophase()
# survey_glm_fit objects are created by survey_glm(), not directly
d <- as_survey(
  gss_2024,
  ids = vpsu, weights = wtssps, strata = vstrat, nest = TRUE
)
fit <- survey_glm(d, age ~ sex)
fit@coefficients
Stores variable labels, value labels, question prefaces, notes, and
transformation history for variables in a survey design object.
Automatically populated from haven-style attributes when
as_survey() or related constructors are called.
survey_metadata(
  variable_labels = list(),
  value_labels = list(),
  question_prefaces = list(),
  notes = list(),
  universe = list(),
  missing_codes = list(),
  sata = list(),
  higher_is = list(),
  reverse_coded = list(),
  transformations = list(),
  weighting_history = list()
)
variable_labels |
A named list mapping variable names to character
labels (e.g., |
value_labels |
A named list mapping variable names to named vectors
of value labels (e.g., |
question_prefaces |
A named list mapping variable names to shared question battery preface text. |
notes |
A named list mapping variable names to analyst notes. |
universe |
A named list mapping variable names to universe
descriptions (e.g., |
missing_codes |
A named list mapping variable names to atomic
vectors of missing-value codes
(e.g., |
sata |
A named list mapping variable names to |
higher_is |
A named list mapping variable names to |
reverse_coded |
A named list mapping variable names to |
transformations |
A named list tracking variable transformation history (populated automatically during operations). |
weighting_history |
A list recording weighting operations applied to
the survey object (e.g., raking, trimming). Each entry is written by
a surveywts function and contains the operation name,
parameters, effective sample size before/after, and design effect.
Always |
A survey_metadata object.
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_weighting_history()
# Empty metadata (default)
m <- survey_metadata()
m@variable_labels

# Pre-populated metadata
m <- survey_metadata(
  variable_labels = list(age = "Respondent age", income = "Annual income"),
  value_labels = list(sex = c(Male = 1L, Female = 2L))
)
m@variable_labels$age
m@value_labels$sex
A survey design object for non-probability samples (e.g., online panels,
quota samples, volunteer panels) with calibration weights (including raking
and post-stratification) or inverse probability weighting (IPW)
pseudo-weights. Create with
as_survey_nonprob().
survey_nonprob(
  data = data.frame(),
  metadata = survey_metadata(),
  variables = list(),
  groups = character(0),
  call = NULL,
  calibration = NULL,
  reference_sample = NULL
)
data |
A |
metadata |
A survey_metadata object. Created automatically by
|
variables |
A named list of design specification ( |
groups |
Set by surveytidy's |
call |
Language object capturing the construction call. |
calibration |
The calibration provenance object returned by a
surveywts calibration function (e.g., |
reference_sample |
Optional survey_taylor object representing the
probability-based reference sample used to estimate propensity scores or
calibration targets. Stored for reproducibility. Default |
A survey_nonprob object.
Two modes are available, selected by whether @variables$repweights is
NULL:
When @variables$repweights is NULL: standard errors treat the calibrated weights as fixed and assume simple random sampling. This understates calibration uncertainty and should only be used when replicate weights are unavailable.
When @variables$repweights is non-NULL: bootstrap or jackknife replicate weights propagate calibration uncertainty into the variance estimate. Each replicate column must contain calibrated weights re-estimated on one replicate draw. This is the recommended approach.
See as_survey_nonprob() for the full parameter interface, including
type, scale, rscales, and mse.
Unlike as_survey(), as_survey_replicate(), and as_survey_twophase(),
this class does not assume a probability sampling design. When no
replicate weights are supplied, standard errors rest on a model-assisted SRS
assumption, which is consistent with common practice for calibrated
non-probability samples (e.g., raked online panels). When replicate weights
are supplied, bootstrap or jackknife variance is used instead. See
vignette("creating-survey-objects") for guidance on choosing between these
modes and the limitations of each.
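To make the no-replicate mode concrete, the following is a minimal base-R sketch of a standard error for a weighted mean that treats the weights as fixed and each respondent as an independent draw. The helper name srs_weighted_mean_se() and the with-replacement linearization formula are assumptions chosen for illustration; the package's internal computation may differ.

```r
# Hypothetical helper, not part of surveycore: linearization SE of a
# weighted mean, treating weights as fixed and rows as independent draws.
srs_weighted_mean_se <- function(y, w) {
  stopifnot(length(y) == length(w), length(y) > 1, all(w > 0))
  n <- length(y)
  est <- sum(w * y) / sum(w)
  # Linearized (influence) values of the ratio sum(w * y) / sum(w);
  # they sum to zero by construction
  z <- w * (y - est) / sum(w)
  # With-replacement variance of the total of z
  v <- n / (n - 1) * sum(z^2)
  c(estimate = est, se = sqrt(v))
}

set.seed(1)
y <- rnorm(50)
w <- runif(50, 1, 3)
srs_weighted_mean_se(y, w)
```

With equal weights this reduces to the familiar sd(y) / sqrt(n). Because the weights are treated as known constants, any variability introduced by estimating them through calibration is not reflected in the result.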
Properties stored in @variables:
weights: Character string naming the (calibrated) weight column.
repweights: Character vector of bootstrap replicate weight column names, or NULL when no replicate weights are present.
type: Replicate type, or NULL when no replicate weights are present. Only "bootstrap" is supported for non-probability samples ("JK1", "JK2", and "JKn" are not accepted).
scale: Numeric scale factor for the variance formula, or NULL.
rscales: Per-replicate scale factors, or NULL.
mse: Logical. TRUE for MSE form of variance, or NULL.
probs_provided: Always FALSE for calibrated designs.
Calibration provenance (@calibration):
When calibration is performed via surveywts, the returned calibration
object is stored here. It contains the calibration targets, variables used,
trimming cap, effective sample size before and after, and design effect.
NULL when calibration was performed externally (e.g., via anesrake).
as_survey_nonprob() to create a survey_nonprob object.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_replicate(),
survey_taylor(),
survey_twophase()
A survey design object using replicate weights for variance estimation.
Create with as_survey_replicate().
survey_replicate( data = data.frame(), metadata = survey_metadata(), variables = list(), groups = character(0), call = NULL, calibration = NULL )
data |
A |
metadata |
A survey_metadata object. Created automatically by
|
variables |
A named list of design specification (weights,
repweights, type, scale, rscales, fpc, fpctype, mse). Set
automatically by |
groups |
Set by surveytidy's |
call |
Language object capturing the construction call. |
calibration |
A list of calibration data elements produced by
|
A survey_replicate object.
Properties stored in @variables:
weights: Character string naming the weight column.
repweights: Character vector of replicate weight column names. The replicate weight matrix is computed on demand from design@data[, design@variables$repweights]; it is not stored as a property.
type: Replicate weight method: one of "JK1", "JK2", "JKn", "BRR", "Fay", "bootstrap", "ACS", "successive-difference", or "other".
scale: Numeric scaling factor for variance estimation.
rscales: Numeric vector of replicate-specific scales, or NULL.
fpc: FPC column name or NULL.
fpctype: "fraction" or "correction".
mse: Logical. Use MSE estimates?
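As a rough illustration of how scale, rscales, and mse typically enter a replicate-weight variance, here is a base-R sketch for a weighted mean. The helper rep_mean_variance() is hypothetical, and the formula scale * sum(rscales * (theta_r - center)^2) is the generic replicate-variance form, assumed here rather than taken from the package source.

```r
# Hypothetical helper, not part of surveycore: generic replicate-weight
# variance of a weighted mean computed from a plain data frame.
rep_mean_variance <- function(data, y, weights, repweights,
                              scale = 1, rscales = NULL, mse = TRUE) {
  wmean <- function(w) sum(w * data[[y]]) / sum(w)
  theta_hat <- wmean(data[[weights]])
  # Build the replicate weight matrix on demand from the named columns
  rep_mat <- as.matrix(data[, repweights, drop = FALSE])
  theta_reps <- apply(rep_mat, 2, wmean)
  if (is.null(rscales)) rscales <- rep(1, length(theta_reps))
  # MSE form centers on the full-sample estimate;
  # otherwise center on the mean of the replicate estimates
  center <- if (mse) theta_hat else mean(theta_reps)
  variance <- scale * sum(rscales * (theta_reps - center)^2)
  list(estimate = theta_hat, variance = variance)
}

set.seed(1)
df <- data.frame(
  y = rnorm(20),
  wt = runif(20, 1, 3),
  rep1 = runif(20, 0.5, 2),
  rep2 = runif(20, 0.5, 2)
)
# scale = 1 / (number of replicates) is a common choice for BRR-style weights
rep_mean_variance(df, "y", "wt", c("rep1", "rep2"), scale = 1 / 2)
```

The two centering choices agree only when the replicate estimates average to the full-sample estimate; otherwise the MSE form is the larger of the two.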
Canty, A.J. and Davison, A.C. (1999) Resampling-based variance estimation for labour force surveys. The Statistician 48(3), 379–391.
Deville, J.-C. and Sarndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., Sarndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Judkins, D.R. (1990) Fay's method for variance estimation. Journal of the American Statistical Association 85(410), 895–904.
Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992) Some recent work on resampling methods for complex surveys. Survey Methodology 18(2), 209–217.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. Springer.
as_survey_replicate() to create a survey_replicate object.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_taylor(),
survey_twophase()
# Prefer as_survey_replicate() over calling survey_replicate() directly
set.seed(1)
df <- data.frame(
  y = rnorm(20),
  wt = runif(20, 1, 3),
  rep1 = runif(20, 0.5, 2),
  rep2 = runif(20, 0.5, 2)
)
d <- as_survey_replicate(
  df,
  weights = wt,
  repweights = starts_with("rep"),
  type = "BRR"
)
class(d)
A survey design object using Taylor series (linearization) for variance
estimation. Create with as_survey().
survey_taylor( data = data.frame(), metadata = survey_metadata(), variables = list(), groups = character(0), call = NULL, calibration = NULL )
data |
A |
metadata |
A survey_metadata object. Created automatically by
|
variables |
A named list of design specification (ids, weights,
strata, fpc, nest, probs_provided). Set automatically by |
groups |
Set by surveytidy's |
call |
Language object capturing the construction call. |
calibration |
A list of calibration data elements produced by
|
A survey_taylor object.
Properties stored in @variables:
ids: Character vector of cluster ID column names, or NULL for simple random sampling.
weights: Character string naming the weight column.
strata: Character string naming the strata column, or NULL.
fpc: Character string naming the finite population correction column, or NULL.
nest: Logical. TRUE if cluster IDs are nested within strata (i.e., the same ID value in two strata refers to two distinct PSUs).
probs_provided: Logical. TRUE if the user supplied probs rather than weights to as_survey().
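The effect of nest can be seen with a toy data frame in base R. This sketch only counts distinct PSUs under the two interpretations; the column names and the use of interaction() are illustrative assumptions, not the package's internal code.

```r
# Toy data: PSU labels 1 and 2 are reused in both strata
psu_df <- data.frame(
  stratum = c("A", "A", "B", "B"),
  psu     = c(1, 2, 1, 2)
)

# nest = FALSE reading: labels are taken at face value
n_psu_unnested <- length(unique(psu_df$psu))

# nest = TRUE reading: a PSU is identified by the (stratum, label) pair
nested_id <- interaction(psu_df$stratum, psu_df$psu, drop = TRUE)
n_psu_nested <- length(unique(nested_id))

c(unnested = n_psu_unnested, nested = n_psu_nested)
```

Public-use files often number PSUs from 1 within each stratum, which is the situation where the nested reading is the appropriate one.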
Deville, J.-C. and Sarndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., Sarndal, C.-E. and Sautory, O. (1993) Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Lumley, T. (2010) Complex Surveys: A Guide to Analysis Using R. John Wiley and Sons.
Rao, J.N.K., Yung, W. and Hidiroglou, M.A. (2002) Estimating equations for the analysis of survey data using poststratification information. Sankhya 64-A, 22–36.
Sarndal, C-E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Springer.
as_survey() to create a survey_taylor object.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_twophase()
# Prefer as_survey() over calling survey_taylor() directly
d <- as_survey(
  gss_2024,
  ids = vpsu,
  weights = wtssps,
  strata = vstrat,
  nest = TRUE
)
class(d)
A survey design object for two-phase (double) sampling. Create with
as_survey_twophase().
survey_twophase( data = data.frame(), metadata = survey_metadata(), variables = list(), groups = character(0), call = NULL )
data |
A |
metadata |
A survey_metadata object. Inherited from the Phase 1
design when using |
variables |
A named list of design specification (phase1, phase2,
subset, method). Set automatically by |
groups |
Set by surveytidy's |
call |
Language object capturing the construction call. |
A survey_twophase object.
Properties stored in @variables:
phase1: Named list containing the Phase 1 design specification (from a survey_taylor object's @variables).
phase2: Named list with optional Phase 2 design columns: ids, strata, probs, fpc. Each is NULL or a character vector of column names.
subset: Character string naming the logical column that indicates Phase 2 membership (TRUE = selected into Phase 2).
method: "full", "approx", or "simple".
as_survey_twophase() to create a survey_twophase object.
Other constructors:
as_caldata(),
as_survey(),
as_survey_nonprob(),
as_survey_replicate(),
as_survey_twophase(),
survey_glm(),
survey_glm_fit(),
survey_nonprob(),
survey_replicate(),
survey_taylor()
# Prefer as_survey_twophase() over calling survey_twophase() directly
set.seed(1)
df <- data.frame(
  id = 1:100,
  y = rnorm(100),
  x = rnorm(100),
  wt = runif(100, 1, 3),
  in_phase2 = c(rep(TRUE, 40), rep(FALSE, 60))
)
phase1 <- as_survey(df, weights = wt)
d <- as_survey_twophase(phase1, subset = in_phase2)
class(d)
Returns the list of weighting operations recorded on a survey design object. Each entry is appended by surveywts after a calibration or nonresponse adjustment step. Returns an empty list when no history has been recorded.
survey_weighting_history(x)
x |
A survey design object (any class inheriting from |
A list of history entries, or list() if no history is present.
Other metadata:
classify_question_type(),
extract_higher_is(),
extract_metadata(),
extract_missing_codes(),
extract_question_preface(),
extract_reverse_coded(),
extract_sata(),
extract_universe(),
extract_val_labels(),
extract_var_label(),
extract_var_note(),
infer_question_prefaces(),
set_higher_is(),
set_missing_codes(),
set_question_preface(),
set_reverse_coded(),
set_sata(),
set_universe(),
set_val_labels(),
set_var_label(),
set_var_note(),
survey_metadata()
d <- as_survey(
  nhanes_2017,
  ids = sdmvpsu,
  weights = wtint2yr,
  strata = sdmvstra,
  nest = TRUE
)
survey_weighting_history(d) # list(): no weighting history
The name of the logical column added to @data by filter() (from
surveytidy) to mark domain membership. Exposed here so that sibling
packages (surveytidy, surveywts) can reference it without
using :::.
SURVEYCORE_DOMAIN_COL
An object of class character of length 1.
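The idea of marking domain membership in a column, rather than dropping rows, can be sketched in base R. The column name ".in_domain" below is a hypothetical stand-in (the actual string held by SURVEYCORE_DOMAIN_COL is not shown here), and the data are made up for illustration.

```r
# Hypothetical stand-in for the value of SURVEYCORE_DOMAIN_COL
domain_col <- ".in_domain"

resp <- data.frame(
  y   = c(10, 20, 30, 40),
  wt  = c(1, 1, 2, 2),
  age = c(25, 70, 30, 80)
)

# Mark domain membership instead of removing rows
resp[[domain_col]] <- resp$age >= 65

# The domain estimate uses only marked rows, but every row stays in the data
in_dom <- resp[[domain_col]]
domain_mean <- sum(resp$wt[in_dom] * resp$y[in_dom]) / sum(resp$wt[in_dom])
domain_mean
nrow(resp)
```

Keeping the out-of-domain rows is the usual approach to domain estimation in survey analysis, since the variance of a domain estimate generally depends on the full sample design and not only on the rows inside the domain.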
Updates one or more design variables (weights, cluster IDs, strata, FPC, or replicate weights) on an existing survey design object. Use this after modifying the underlying data — for example, after recalibrating weights or adding a stratification variable. Emits an informational message listing changed variables.
update_design( x, ids = NULL, weights = NULL, strata = NULL, fpc = NULL, repweights = NULL, validate = TRUE )
x |
A |
ids |
< |
weights |
< |
strata |
< |
fpc |
< |
repweights |
< |
validate |
Logical. If |
The modified survey object, invisibly. As a side effect, a
cli_inform() message is printed listing each changed design variable
(old name → new name).
as_survey() to create a survey_taylor object,
as_survey_replicate() to create a survey_replicate object,
as_survey_twophase() to create a survey_twophase object
# NHANES has two weight columns for different analysis types;
# start with the MEC examination weight for exam participants
exam <- nhanes_2017[nhanes_2017$ridstatr == 2, ]
d <- as_survey(
  exam,
  ids = sdmvpsu,
  weights = wtmec2yr,
  strata = sdmvstra,
  nest = TRUE
)
# Switch to interview weight for interview-based variables
d_updated <- update_design(d, weights = wtint2yr)