Dynamically defining dplyr across statements (was dplyr: summarise across using variable names and a condition)

Hello All,

Thanks, Rui, for your response to my question. I agree that it is possible to use a workaround: get most of what you want and then tidy it up afterwards. I too have a workaround, which I have pasted below. I wanted to avoid that initially, because I felt I was only using a workaround since I hadn't yet figured out how to use dplyr properly.

Over the weekend I worked out how to summarise across conditionally. Learning this changed the nature of the problem, which led me to rename my question.

Below are my "have" and "need" data sets from before, with the columns reordered. After those are some vectors of variable names that are already defined in my code and that I thought might be helpful in producing a solution. Next is dplyr code that summarizes across conditionally. If the data submitted to this code were always going to be the same, this would work perfectly. That's not the case, though, so the across statements that are needed will be data dependent. Last, I've pasted my version of a workaround, which should work for any dataset.

Ideally, I'd like to get a solution that builds on the summarise across code below. It seems likely that would involve dynamically creating the various across statements though, and that might wind up being a lot more complicated and verbose than my workaround. Another possibility might be to do this in one pass using non-dplyr code. If neither of those options works out, it may be that the workaround is actually the way to go.

Thanks,

Paul

#### Have and need data ####

library(magrittr)
library(dplyr) 

have <- structure(list(
  ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
           "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
           73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
  age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
           71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
  gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
                  1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
  gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
                    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
                 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
                 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

have <- have %>%
  select(ptno, age1, gender_male, gender_female, age2, everything())

need <- structure(list(
  age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
  age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
  gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
  race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
  race_other_prop = 0.192307692307692),
  row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

need <- need %>%
  select(age1_mean, age1_std, gender_male_prop, gender_female_prop,
         age2_mean, age2_std, everything())

#### Vectors of variable names ####

vars_num <- c("age1", "age2")
vars_dmy <- c("gender", "race")
vars_all <- c("age1", "age2", "gender", "race")

#### dplyr conditional summarize across ####

have %>%
  summarize(
    across(2:2, list(mean = mean, std = sd)),
    across(3:4, list(prop = mean)),
    across(5:5, list(mean = mean, std = sd)),
    across(6:8, list(prop = mean))
  ) %>%
  all.equal(need)
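One way the hard-coded column positions above might be made data dependent is to select columns by name instead, using the name vectors. The following is only a sketch under stated assumptions: `toy` is a small hypothetical stand-in for `have` (made-up values), and it assumes every dummy column is named `<stem>_<level>`, with the stems listed in `vars_dmy`. Note that the result groups the numeric summaries first, so the column order differs from that of the positional version.

```r
library(dplyr)

# Hypothetical toy data with the same kind of layout as `have`
toy <- tibble(
  ptno          = c("A", "B", "C", "D"),
  age1          = c(74, 70, 78, 79),
  gender_male   = c(1L, 1L, 0L, 1L),
  gender_female = c(0L, 0L, 1L, 0L),
  age2          = c(71, 67, 72, 74),
  race_white    = c(0L, 1L, 0L, 1L),
  race_other    = c(1L, 0L, 1L, 0L)
)

vars_num <- c("age1", "age2")
vars_dmy <- c("gender", "race")

res <- toy %>%
  summarise(
    # numeric variables: matched by exact name
    across(all_of(vars_num), list(mean = mean, std = sd)),
    # dummy variables: matched by the assumed "<stem>_" prefix
    across(starts_with(paste0(vars_dmy, "_")), list(prop = mean))
  )

print(res)
```

Because the selections come from `vars_num` and `vars_dmy`, the same two across statements should apply to any data set that follows the assumed naming convention.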

#### Workaround ####

library(stringr)  # needed for str_replace()

have %>%
  summarise(across(
    .cols = !contains("ptno"),  # exclude the ID column (ptno in the example data)
    .fns = list(mean = mean, std = sd),
    .names = "{col}_{fn}"
  )) %>%
  select(starts_with(vars_num) | ends_with("mean")) %>%
  rename_at(vars(!starts_with(vars_num)), list(~ str_replace(., "mean$", "prop"))) %>%
  all.equal(need)
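For the one-pass, non-dplyr possibility mentioned earlier, a minimal base R sketch might look like the following. It is an illustration rather than a tested solution for the real data: `toy` is a hypothetical stand-in for `have`, `summarise_col` is a helper name invented here, and any column not listed in `vars_num` is assumed to be a 0/1 dummy. Unlike selecting by name in separate across statements, this keeps the summaries in the original column order.

```r
# Hypothetical toy data with the same kind of layout as `have`
toy <- data.frame(
  ptno          = c("A", "B", "C", "D"),
  age1          = c(74, 70, 78, 79),
  gender_male   = c(1L, 1L, 0L, 1L),
  gender_female = c(0L, 0L, 1L, 0L),
  age2          = c(71, 67, 72, 74),
  race_white    = c(0L, 1L, 0L, 1L),
  race_other    = c(1L, 0L, 1L, 0L)
)

vars_num <- c("age1", "age2")

# Hypothetical helper: choose the statistics from the column name
summarise_col <- function(x, nm) {
  if (nm %in% vars_num) {
    c(mean = mean(x), std = sd(x))
  } else {
    c(prop = mean(x))  # assumes a 0/1 dummy column
  }
}

# Visit each non-ID column once, in its original order
cols <- setdiff(names(toy), "ptno")
pieces <- lapply(cols, function(nm) {
  s <- summarise_col(toy[[nm]], nm)
  setNames(s, paste(nm, names(s), sep = "_"))
})

# Flatten the named vectors into a one-row data frame
res <- as.data.frame(as.list(unlist(pieces)))
print(res)
```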
