mutate() now syncs @metadata (variable labels, value labels, and
transformation log) for recode columns produced by dplyr::across().
Previously, only explicitly-named left-hand-side mutations were tracked (#41).
Collection-aware methods for all standard dplyr/tidyr verbs are now
dispatched per-survey when called on a survey_collection. The result
is a new survey_collection whose @id, @if_missing_var, and
@groups properties are preserved.
filter(), filter_out(), mutate(), arrange()
select(), relocate(), rename(), rename_with(), drop_na(), distinct(), rowwise()
group_by(), ungroup(), group_vars(), is_rowwise()
slice(), slice_head(), slice_tail(), slice_min(), slice_max(), slice_sample()
pull() (returns a vector via vctrs::vec_c()), glimpse() (default mode binds members; .by_survey = TRUE per-member)
The .if_missing_var argument on each verb ("error" (default) or
"skip") lets you override the collection's stored missing-variable
behaviour for a single call. Skipped surveys are reported via the typed
message class surveytidy_message_collection_skipped_surveys.
The six join verbs (left_join(), right_join(), inner_join(),
full_join(), semi_join(), anti_join()) error with
surveytidy_error_collection_verb_unsupported when called on a
survey_collection. Apply joins inside a per-survey pipeline before
constructing the collection.
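The per-survey dispatch and the "error" versus "skip" behaviour for missing variables described above can be pictured with a minimal base R sketch. It assumes a collection is nothing more than a named list of data frames; the helper `apply_per_survey()` and its arguments are hypothetical and are not part of the surveytidy API.

```r
# Hypothetical helper: apply `fn` to each member of a named list of data
# frames, handling members that lack a required column.
apply_per_survey <- function(surveys, required, fn,
                             if_missing_var = c("error", "skip")) {
  if_missing_var <- match.arg(if_missing_var)
  has_vars <- vapply(
    surveys,
    function(d) all(required %in% names(d)),
    logical(1)
  )
  if (!all(has_vars)) {
    missing_ids <- names(surveys)[!has_vars]
    if (if_missing_var == "error") {
      stop("Missing variable(s) in: ", paste(missing_ids, collapse = ", "))
    }
    # Report skipped members instead of failing the whole call.
    message("Skipped surveys: ", paste(missing_ids, collapse = ", "))
  }
  lapply(surveys[has_vars], fn)
}

surveys <- list(
  wave1 = data.frame(age = c(30, 41), wt = c(1.2, 0.8)),
  wave2 = data.frame(wt = c(1.0, 1.1))  # no `age` column
)

# A mutate-like step that needs `age`; wave2 is skipped, not an error.
result <- apply_per_survey(
  surveys,
  required = "age",
  fn = function(d) transform(d, age_decade = age %/% 10),
  if_missing_var = "skip"
)
names(result)
```

With the default `if_missing_var = "error"`, the same call stops as soon as one member lacks the required column.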
library(surveytidy) is now sufficient to use the collection
construction and setter API. The following surveycore symbols are
re-exported:
as_survey_collection()
set_collection_id()
set_collection_if_missing_var()
add_survey()
remove_survey()
replace_when() and replace_values() no longer retain stale value labels
for values absent from the recoded result. Previously, collapsing or replacing
values left label entries (e.g. "High" = 4) attached to a vector that
contained no such values. User-supplied .value_labels are always preserved.
surveytidy_error_collection_verb_emptied
surveytidy_error_collection_verb_failed
surveytidy_error_collection_by_unsupported
surveytidy_error_collection_select_group_removed
surveytidy_error_collection_rename_group_partial
surveytidy_error_collection_slice_zero
surveytidy_error_collection_pull_incompatible_types
surveytidy_error_collection_glimpse_id_collision
surveytidy_error_collection_verb_unsupported
surveytidy_warning_collection_rowwise_mixed
surveytidy_message_collection_skipped_surveys
vctrs (>= 0.6.0) to Imports (used by pull.survey_collection
and glimpse.survey_collection).
surveycore minimum-version pin to (>= 0.8.2).
Eight join and binding functions are now available for combining survey design
objects with plain data frames. All join functions protect design variable
columns (weights, strata, PSU, etc.) from being overwritten by y and append
a typed sentinel to @variables$domain for traceability by Phase 1 estimation
functions.
left_join() — appends lookup columns from y without removing any rows.
Errors if duplicate keys in y would expand rows.
inner_join() — two modes: domain-aware (.domain_aware = TRUE, default)
marks unmatched rows out-of-domain without removing them; physical mode
(.domain_aware = FALSE) removes rows and issues a warning. Physical mode
is not supported for survey_twophase designs.
semi_join() — marks rows matching y as in-domain; unmatched rows are
marked out-of-domain. No rows are physically removed.
anti_join() — inverse of semi_join(); marks rows that do NOT match y
as in-domain.
right_join() and full_join() — always error when called on a survey
object. Both would add rows with NA design variable values, which would
invalidate variance estimation.
bind_cols() — appends columns from ... to a survey object. Validates
that all inputs have the same row count. Passes through to
dplyr::bind_cols() for non-survey inputs.
bind_rows() — always errors when called with a survey object. Combining
two survey designs has undefined variance structure. Passes through to
dplyr::bind_rows() for non-survey inputs.
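The domain-aware behaviour of semi_join() and anti_join() described above can be illustrated with a short base R sketch. A plain data frame stands in for the survey design, and the `in_domain` column name is an assumption made for illustration; it is not the column surveytidy writes.

```r
# Survey rows with a design weight; a plain data frame stands in for the design.
design_data <- data.frame(
  id     = 1:5,
  region = c("N", "S", "E", "W", "N"),
  wt     = c(1.5, 0.9, 1.1, 1.0, 1.3)
)
lookup <- data.frame(region = c("N", "E"))

# Domain-aware semi-join: flag matching rows instead of dropping the others.
design_data$in_domain <- design_data$region %in% lookup$region

# The anti-join flag is the complement of the semi-join flag.
anti_domain <- !(design_data$region %in% lookup$region)

nrow(design_data)           # every row is retained
sum(design_data$in_domain)  # number of rows inside the domain
```

Because no rows are removed, the weights and other design columns of the unmatched rows remain available for variance estimation.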
Two helper functions are now available for computing row-wise statistics
inside mutate(). Both propagate metadata automatically and accept .label
and .description to document the new variable in a single step.
row_means() — computes the row mean across selected columns. Accepts
na.rm to control NA handling. Issues a warning if any selected column
is a design variable.
row_sums() — computes the row sum across selected columns. Accepts
na.rm to control NA handling. Issues a warning if any selected column
is a design variable.
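As a rough picture of what the row-wise helpers compute, the sketch below uses base R's rowMeans() and rowSums() on a plain data frame. It is not the surveytidy implementation, and the "label" attribute is only a stand-in for the metadata that .label would record.

```r
df <- data.frame(
  q1 = c(1, 2, NA),
  q2 = c(3, 4, 5),
  q3 = c(5, NA, 1)
)
items <- c("q1", "q2", "q3")

# Row mean and row sum across the selected columns, ignoring NAs.
df$scale_mean <- rowMeans(df[items], na.rm = TRUE)
df$scale_sum  <- rowSums(df[items], na.rm = TRUE)

# A label kept alongside the data, standing in for metadata propagation.
attr(df$scale_mean, "label") <- "Mean of q1 to q3"
```

With `na.rm = FALSE`, any row containing an NA in the selected columns would yield NA instead.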
mutate() — transformation metadata is now correctly written back to the
design object. Previously, [[<- assignment on S7 properties silently
failed; the fix uses S7::prop<-() to sync all six metadata properties
correctly (#25).
Five vector-level transformation functions are now available for converting,
collapsing, and reversing variables inside mutate(). All five propagate value
labels automatically and accept .label and .description arguments to
attach metadata in a single step.
make_factor() — converts labelled, numeric, character, or factor vectors
to an R factor. Levels are ordered by the numeric value of each value label.
Accepts ordered, drop_levels, force, and na.rm to control level
creation.
make_dicho() — collapses a multi-level factor to two levels by stripping
the first word of each label and merging labels that reduce to the same
stem. Accepts .exclude to keep specific levels as NA, and flip_levels
to reverse the resulting order.
make_binary() — converts a dichotomous variable to a 0/1 integer. Thin
wrapper around make_dicho(); accepts flip_values to control which level
maps to 1.
make_rev() — reverses a numeric scale using min + max - x and remaps
value labels to match. Issues a warning when all values are NA.
make_flip() — reverses the semantic valence of a variable by reversing the
label strings while keeping the underlying values unchanged. Requires a
label argument to document the new meaning.
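The min + max - x rule used by make_rev() can be worked through in base R. The sketch assumes that "remapping value labels" means each label string follows its value to the reversed position; the variable names are illustrative only.

```r
x <- c(1, 2, 4, NA, 3)
value_labels <- c("Low" = 1, "Medium-low" = 2, "Medium-high" = 3, "High" = 4)

lo <- min(x, na.rm = TRUE)
hi <- max(x, na.rm = TRUE)

# Reverse the scale: 1 <-> 4 and 2 <-> 3; NA stays NA.
x_rev <- lo + hi - x

# Apply the same rule to the label values so each label string follows
# its value, then sort by the new values.
labels_rev <- sort(lo + hi - value_labels)
```

By contrast, make_flip() as described above would leave `x` untouched and change only the label strings.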
Six vector-level recoding functions are now available. Each shadows its dplyr
equivalent and adds optional arguments for attaching variable labels, value
labels, and transformation notes directly inside mutate(). Without any of
those arguments, output is identical to dplyr.
case_when() — a survey-aware dplyr::case_when(). Evaluates a sequence
of condition ~ value formulas and uses the first match for each element.
Use this to create an entirely new vector from conditions. Accepts .label
to set a variable label, .value_labels to attach a named vector of value
labels, .factor = TRUE to return an ordered factor (levels follow formula
order), and .description to record a plain-language note about the
transformation.
if_else() — a survey-aware dplyr::if_else(). Applies a single binary
condition element-wise (true/false/missing). Stricter than base
ifelse(): true, false, and missing are cast to a common type.
Accepts .label, .value_labels, and .description.
na_if() — a survey-aware dplyr::na_if(). Converts specific values to
NA. Unlike dplyr's scalar-only y, this version accepts a vector y and
replaces all matching values in a single call. When the input carries value
labels, they are inherited automatically; .update_labels = TRUE (the
default) removes label entries for the NA'd values, while
.update_labels = FALSE retains them (useful for documenting what was set
to missing). Also accepts .description.
recode_values() — a survey-aware dplyr::recode_values(). Replaces values
found in from with the corresponding value from to; values not in from
are kept unchanged or trigger an error (.unmatched = "error"). Intended for
full remapping of every value in a vector. Set .use_labels = TRUE to build
the from/to map automatically from the input's existing value labels
(codes become from; label strings become to). Also accepts .label,
.value_labels, .factor, and .description.
replace_values() — a survey-aware dplyr::replace_values(). Replaces
values found in from with the corresponding value from to; all other
values are left unchanged. Use this for partial in-place replacement of
specific values in an existing vector. Automatically inherits both the
variable label and value labels from the input; supply .label or
.value_labels to override. Also accepts .description.
replace_when() — a survey-aware dplyr::replace_when(). Like case_when()
but for partial in-place updates: evaluates condition ~ value formulas and
replaces only matching elements, leaving all others at their original value.
Automatically inherits labels from the input; supply .label or
.value_labels to override. Also accepts .description.
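The vector-valued y of na_if() and the default label update can be approximated in base R. This is a sketch of the described behaviour under the assumption that value labels are a named vector (names are label strings, values are codes); it does not call surveytidy.

```r
x <- c(1, 2, 98, 3, 99)
value_labels <- c("Yes" = 1, "No" = 2, "Maybe" = 3,
                  "Refused" = 98, "Don't know" = 99)
na_codes <- c(98, 99)

# Vector `y`: every matching value becomes NA in a single pass.
x_clean <- replace(x, x %in% na_codes, NA)

# Mirrors .update_labels = TRUE: drop label entries for the NA'd codes.
labels_updated <- value_labels[!value_labels %in% na_codes]
```

Keeping `value_labels` unchanged instead would correspond to the `.update_labels = FALSE` case, where the labels document which codes were set to missing.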
All six functions support a common set of label arguments that propagate into
@metadata when used inside mutate():
.label — a character string stored in @metadata@variable_labels as the
human-readable variable label for the new column.
.value_labels — a named vector stored in @metadata@value_labels, where
names are label strings and values are the corresponding data values.
.description — a plain-language string stored in
@metadata@transformations describing how the variable was derived.
case_when() and recode_values() also accept .factor = TRUE, which
returns an ordered factor instead of a character vector (levels follow formula
or to order respectively). .factor and .label cannot be combined.
mutate() enhancements
mutate() now coordinates label propagation automatically: it pre-attaches
label attributes from @metadata before the inner dplyr call so recode
functions can see existing labels, reads the label output back from recoded
columns, and writes it into @metadata — all without extra user steps. The
weight-column warning has also been split into two distinct classes:
surveytidy_warning_mutate_weight_col for the weight column and
surveytidy_warning_mutate_structural_var for strata, PSU, FPC, and
replicate weights.
README now displays the hex logo.
LICENSE.md updated to credit third-party hex sticker icon (Freepik / Flaticon, CC BY 3.0).
DESCRIPTION author entry updated with current email, ORCID, and copyright-holder (cph) role.
merge-main skill to detect commits on main that have not been merged back to develop.
survey-result and dedup-rename-rowwise phases.
filter_out() — the complement of filter(). Marks rows matching the
condition as out-of-domain while leaving all other rows in-domain. Like
filter(), no rows are removed. Chains with filter() via AND-accumulation
on the domain column. filter_out(d, group == "control") is often clearer
than filter(d, group != "control") for exclusion use-cases.
distinct() — removes duplicate rows while always retaining all columns
(design variables are never dropped). With no column arguments, deduplicates
on non-design columns only (survey-safe default). Always issues
surveycore_warning_physical_subset.
rename_with() — function-based column renaming. Applies .fn to columns
selected by .cols and propagates renames to @variables, @metadata,
@groups, and visible_vars. Validates .fn output and errors with
surveytidy_error_rename_fn_bad_output for non-character, wrong-length, or
duplicate output.
rowwise() — enables row-by-row computation in mutate() (e.g.,
max(c_across(...))). Rowwise state is stored in @variables$rowwise —
never in @groups, keeping those clean for estimation functions. group_by()
and ungroup() exit rowwise mode, mirroring dplyr behaviour.
is_rowwise() — returns TRUE when the survey object is in rowwise mode.
is_grouped() — returns TRUE when @groups is non-empty.
group_vars() — returns the current grouping column names from @groups.
filter(), arrange(), mutate(), slice(), slice_head(),
slice_tail(), slice_min(), slice_max(), slice_sample(), and
drop_na() are now registered for survey_result objects (the S3 base class
for surveycore analysis outputs: survey_means, survey_freqs,
survey_totals, survey_quantiles, survey_corr, survey_ratios).
Previously, applying dplyr verbs to these objects could silently strip the
class and .meta attribute. Now both are preserved, and mutate() keeps
meta$group coherent when .keep drops grouping columns.
select(), rename(), and rename_with() are now registered for
survey_result objects with active .meta updates. select() prunes stale
meta$group entries when grouping columns are dropped and handles inline
renames (select(r, grp = group)). rename() and rename_with() propagate
column renames to all .meta key references ($group, $x,
$numerator$name, $denominator$name). rename_with() errors with
surveytidy_error_rename_fn_bad_output if .fn returns non-character,
wrong-length, NA, or duplicate names.
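The kind of validation described for the output of .fn can be sketched as a small base R function. `check_rename_output()` is a hypothetical name, and the plain stop() calls stand in for the typed surveytidy_error_rename_fn_bad_output condition.

```r
# Hypothetical validator for the names returned by a renaming function.
check_rename_output <- function(new_names, n_expected) {
  if (!is.character(new_names)) {
    stop("`.fn` must return a character vector.")
  }
  if (length(new_names) != n_expected) {
    stop("`.fn` must return one name per selected column.")
  }
  if (anyNA(new_names)) {
    stop("`.fn` must not return NA names.")
  }
  if (anyDuplicated(new_names) > 0) {
    stop("`.fn` must not return duplicate names.")
  }
  new_names
}

old <- c("age", "sex")
new <- check_rename_output(toupper(old), length(old))
```

Checking before any rename is applied means a bad .fn leaves the object and its metadata untouched.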
drop_na() now performs domain-aware filtering instead of physically removing
rows. Previously, drop_na() removed rows with NA values, changing which
units contributed to variance estimation and producing incorrect standard
errors. It now marks incomplete rows as out-of-domain — equivalent to the
corresponding filter(!is.na(col1), ...) chain — giving correct variance
estimates for downstream analyses.
filter(): the .by unsupported-argument error was mis-classified as a
surveycore_error_*; corrected to surveytidy_error_filter_by_unsupported.
rename() and rename_with() now update @groups when a grouped column is
renamed, and correctly update twophase design variable references
(@variables$phase1, @variables$phase2, @variables$subset). The domain
column (..surveycore_domain..) is silently protected from renaming.
filter() and filter_out() support if_any() and if_all() in
conditions.
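The domain-aware drop_na() fix described above amounts to flagging incomplete rows rather than deleting them. A base R sketch of that equivalence, with `in_domain` as an assumed stand-in for the real domain column:

```r
d <- data.frame(
  y1 = c(10, NA, 7, 3),
  y2 = c(1, 2, NA, 4),
  wt = c(1.0, 1.2, 0.8, 1.1)
)

# Flag complete rows on the chosen columns instead of deleting the others.
d$in_domain <- complete.cases(d[c("y1", "y2")])

# The same flag written as a chain of !is.na() conditions.
same_flag <- !is.na(d$y1) & !is.na(d$y2)

nrow(d)  # still 4 rows: nothing was physically removed
```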
Roxygen documentation standardised across all verb files to mirror the
dplyr/tidyr reference style, with @details subsections for
surveytidy-specific behaviour and examples using nhanes_2017.
Rd files consolidated from per-method (e.g., arrange.survey_base.Rd) to
per-verb (e.g., arrange.Rd), fixing the "S3 methods shown with full name"
R CMD check NOTE.
First release. Implements a complete set of dplyr and tidyr verbs for survey
design objects created with the surveycore package.
filter() — domain-aware filtering. Marks rows in-domain rather than
removing them, preserving correct variance estimation for subpopulation
analyses. Chained filter() calls AND their conditions together.
select() — column selection. Physically removes non-selected columns while
always retaining design variables (weights, strata, PSU, FPC, replicate
weights). Sets @variables$visible_vars so print() hides design columns
the user did not explicitly request.
relocate() — column reordering. Reorders visible_vars when a prior
select() has been called; reorders @data directly otherwise.
pull() — extract a column as a plain vector (terminal operation).
glimpse() — concise column summary, respecting visible_vars.
mutate() — add or modify columns. Re-attaches design variables dropped by
.keep = "none" or .keep = "used". Issues
surveytidy_warning_mutate_design_var when a mutation's left-hand side names
a design variable. Respects @groups set by group_by().
rename() — rename columns. Automatically keeps @variables (design
specification) and @metadata (variable labels, value labels, etc.) in sync
with the new column names. Issues surveytidy_warning_rename_design_var when
a design variable is renamed.
arrange() — row sorting. The domain column moves correctly with the rows
after sorting. Supports .by_group = TRUE using @groups.
slice(), slice_head(), slice_tail(), slice_min(), slice_max(),
slice_sample() — physical row selection with a
surveycore_warning_physical_subset warning. slice_sample(weight_by = )
additionally issues surveytidy_warning_slice_sample_weight_by to flag that
the weight_by column is independent of the survey design weights.
group_by() — store grouping columns in @groups. Does not attach a
grouped_df attribute to @data; grouping is kept on the survey object.
Supports .add = TRUE for incremental grouping and computed expressions
(e.g., group_by(d, above_median = y1 > median(y1))).
ungroup() — remove all groups (no arguments) or remove specific columns
from @groups (partial ungroup).
drop_na() — domain-aware NA handling. Marks rows with NA in specified
columns (or any column) as out-of-domain without removing them. Equivalent
to filter(!is.na(col1), !is.na(col2), ...) and gives correct variance
estimates for downstream analyses. Successive drop_na() calls AND their
conditions together.
subset() — physical row removal with surveycore_warning_physical_subset.
Prefer filter() for subpopulation analyses.
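The select() behaviour in the list above (drop unselected columns, always keep design variables, remember what the user asked to see) can be sketched with a plain data frame. The `design_vars` and `visible_vars` objects here are ordinary character vectors used for illustration, not the S7 properties themselves.

```r
d <- data.frame(
  y1     = 1:3,
  y2     = 4:6,
  wt     = c(1, 1.5, 2),
  strata = c("a", "a", "b")
)
design_vars <- c("wt", "strata")
requested   <- "y1"

# Keep the requested columns plus every design variable.
keep  <- union(requested, design_vars)
d_sel <- d[keep]

# Only the requested columns are meant to be shown when printing.
visible_vars <- requested
```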
The key design decision in surveytidy is that filter() never removes rows.
Removing rows from a survey design changes which units contribute to variance
estimation and produces incorrect standard errors for subpopulation statistics.
filter() instead writes a logical domain column (..surveycore_domain..) to
@data. Phase 1 estimation functions will read this column to restrict
calculations to the domain while retaining all rows for variance estimation.
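The domain-column idea can be sketched in base R: each filter-style step ANDs a condition into a logical column, and a filter_out-style step ANDs in the negated condition. The column name `in_domain` is an assumption for readability; the text above names the real column as ..surveycore_domain...

```r
d <- data.frame(
  group = c("control", "treat", "treat", "control"),
  age   = c(25, 40, 17, 52)
)

# Start with every row in the domain.
d$in_domain <- TRUE

# filter-style step: keep adults in the domain.
d$in_domain <- d$in_domain & (d$age >= 18)

# filter_out-style step: mark controls as out-of-domain (negated condition).
d$in_domain <- d$in_domain & !(d$group == "control")

nrow(d)  # all four rows are still present
```

An estimator would then restrict point estimates to rows where the flag is TRUE while still using every row for variance calculations.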
dplyr_reconstruct.survey_base() ensures complex dplyr pipelines (joins,
across(), internal slice operations) return survey objects rather than
plain tibbles. Errors with surveycore_error_design_var_removed if a
pipeline drops a design variable.
Invariant 6 added to test_invariants(): every column name listed in
@variables$visible_vars must exist in @data.