summary stats across columns, where column names indicate groups

﹥>﹥吖頭↗ submitted on 2021-02-04 16:35:48


The data frame have contains a few thousand columns that follow a naming pattern: each column name is a noun followed by _a, _b, or _c. Below are the first 10 variables and observations:

id  turtle_a   banana_a   castle_a   turtle_b   banana_b   castle_b   turtle_c   banana_c   castle_c
A      -0.58      -0.88      -0.56      -0.53      -0.32      -0.42      -0.52      -0.89      -0.72
B         NA         NA         NA      -0.84      -0.36      -0.26         NA         NA         NA
C       0.00      -0.43      -0.75      -0.35      -0.88      -0.14      -0.26      -0.15      -0.81
D      -0.81      -0.63      -0.77      -0.82      -0.83      -0.50      -0.77      -0.25      -0.07
E      -0.25      -0.33      -0.09      -0.51      -0.27      -0.81      -0.06      -0.23      -0.97
F      -0.80      -0.88      -0.05         NA         NA         NA         NA         NA         NA
G      -0.25      -0.76      -0.21         NA         NA         NA         NA         NA         NA
H      -0.47      -0.10      -0.67      -0.46      -0.71      -0.24      -0.76      -0.04      -0.11
I      -0.15      -0.34      -0.57      -0.40      -0.14      -0.49         NA         NA         NA
J      -0.65      -0.86      -0.37      -0.67      -0.81      -0.63         NA         NA         NA

The data frame want holds, for each id, the row-wise mean across all columns in a noun group, ignoring NAs. For example, averaging turtle_a, turtle_b, and turtle_c for id=A gives -0.54. Here is what want looks like for the handful of noun groups in the example:

id   turtle_m    banana_m    castle_m
A       -0.54       -0.70       -0.57
B       -0.84       -0.36       -0.26
C       -0.20       -0.49       -0.57
D       -0.80       -0.57       -0.45
E       -0.27       -0.28       -0.62
F       -0.80       -0.88       -0.05
G       -0.25       -0.76       -0.21
H       -0.56       -0.29       -0.34
I       -0.27       -0.24       -0.53
J       -0.66       -0.83       -0.50

Options so far:

  1. convert to long, summarize with group_by() in dplyr, and pivot back to wide.
  2. re-sort the columns so the noun groups appear next to each other, and write a loop that computes means across columns, taking three-column steps at each iteration.
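
For reference, option 2 could look roughly like the base R sketch below. It uses a three-row, two-noun subset of the example data, and it assumes every noun has exactly three columns (_a, _b, _c); the names vals, block, and noun are only illustrative.

```r
# Subset of the example data (rows A-C, nouns turtle and banana)
have <- data.frame(
  id       = c("A", "B", "C"),
  turtle_a = c(-0.58, NA, 0.00),
  banana_a = c(-0.88, NA, -0.43),
  turtle_b = c(-0.53, -0.84, -0.35),
  banana_b = c(-0.32, -0.36, -0.88),
  turtle_c = c(-0.52, NA, -0.26),
  banana_c = c(-0.89, NA, -0.15)
)

# Sort the value columns by name so each noun's columns sit side by side
vals <- have[-1]
vals <- vals[order(names(vals))]

# Walk the columns in steps of three (assumption: exactly 3 columns per noun)
want <- have["id"]
for (start in seq(1, ncol(vals), by = 3)) {
  block <- vals[start:(start + 2)]
  noun  <- sub("_.*", "", names(block)[1])
  want[[paste0(noun, "_m")]] <- unname(rowMeans(block, na.rm = TRUE))
}
want
```

The loop breaks silently if any noun is missing one of its three columns, which is part of why a name-based grouping seems preferable.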

It seems like summarize_at or summarize_all could do this more effectively than either of my current options, but I'm not sure how to use them in a way that groups variables dynamically by naming convention.

Any thoughts?


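We can use split.default to split the columns into groups by the noun part of their names (everything before the underscore), apply rowMeans to each group with sapply, and then cbind the result with the first column (df1 here stands for the data frame called have in the question):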

# group the columns by the noun before "_", then take row means within each group
out <- cbind(df1[1], sapply(split.default(df1[-1],
    sub("_.*", "", names(df1)[-1])), rowMeans, na.rm = TRUE))
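
As a rough, self-contained illustration of the split.default approach, the sketch below runs it on a three-row, two-noun subset of the example data and then renames the result to the _m suffix used in the desired output. The intermediate names groups and means are only illustrative, and the rounding is there just to match the two-decimal display in the question.

```r
# Subset of the example data (rows A-C, nouns turtle and banana)
df1 <- data.frame(
  id       = c("A", "B", "C"),
  turtle_a = c(-0.58, NA, 0.00),
  banana_a = c(-0.88, NA, -0.43),
  turtle_b = c(-0.53, -0.84, -0.35),
  banana_b = c(-0.32, -0.36, -0.88),
  turtle_c = c(-0.52, NA, -0.26),
  banana_c = c(-0.89, NA, -0.15)
)

# Noun part of each column name, e.g. "turtle_a" -> "turtle"
groups <- sub("_.*", "", names(df1)[-1])

# One data frame per noun, then row means within each (NAs dropped)
means <- sapply(split.default(df1[-1], groups), rowMeans, na.rm = TRUE)

# Rename to the *_m convention and attach the id column
colnames(means) <- paste0(colnames(means), "_m")
out <- cbind(df1[1], round(means, 2))
out
```

Note that split.default orders the groups alphabetically, so the columns come out as banana_m, turtle_m rather than in their original order. A row whose columns are all NA for some noun would yield NaN, since rowMeans has nothing left to average.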

Or we can reshape to long format with pivot_longer from tidyr and summarise by id:

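library(dplyr)
library(tidyr)

df1 %>%
   # ".value" keeps the noun as the column name; the a/b/c suffix goes to "group"
   pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
   group_by(id) %>%
   summarise(across(turtle:castle, ~ mean(.x, na.rm = TRUE)))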
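
With a few thousand nouns, spelling out a range like turtle:castle only works if those happen to be the first and last noun columns. A possible variant, sketched here on a three-row, two-noun subset of the example data, selects every column except group and adds the _m suffix through the .names argument of across. It assumes no noun itself contains an underscore, since names_sep = "_" would then split the name into more than two pieces.

```r
library(dplyr)
library(tidyr)

# Subset of the example data (rows A-C, nouns turtle and banana)
df1 <- data.frame(
  id       = c("A", "B", "C"),
  turtle_a = c(-0.58, NA, 0.00),
  banana_a = c(-0.88, NA, -0.43),
  turtle_b = c(-0.53, -0.84, -0.35),
  banana_b = c(-0.32, -0.36, -0.88),
  turtle_c = c(-0.52, NA, -0.26),
  banana_c = c(-0.89, NA, -0.15)
)

want <- df1 %>%
  # One row per id and suffix; one column per noun
  pivot_longer(cols = -id, names_sep = "_", names_to = c(".value", "group")) %>%
  group_by(id) %>%
  # Average every noun column (everything but the suffix), naming it <noun>_m
  summarise(across(-group, ~ mean(.x, na.rm = TRUE), .names = "{.col}_m"))
want
```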

