Extracting one set of multiple variables in a list of data.frames in R

Question

Suppose I have a data.frame like THIS. Any columns after the column named autoreg are arbitrary columns defined by the user, so I won't know their names or values in advance. For example, in THIS data.frame the columns named "ESL", "prof", "scope", and "type" are defined by the user.

Question:

How can I write a looping structure (in BASE R) that, at each round, extracts one set of each of these arbitrary columns? My desired output is a list in which the ESL values, prof values, scope values, and type values from each study are put next to each other.

I have tried two nested lapply calls (see below), which extract all values for all sets of these arbitrary columns. But how can I extract one set of each of these arbitrary columns at a time?

D <- read.csv("https://raw.githubusercontent.com/izeh/i/master/i.csv", header = TRUE) ## data.frame
L <- split(D, D$study.name) ; L[[1]] <- NULL  ## drop the group with the empty study name

arb.names <- c("ESL", "prof", "scope", "type") ## arbitrary column names 

## outer loop over column names, inner loop over studies
a <- lapply(seq_along(arb.names), function(j) lapply(seq_along(L), function(i) L[[i]][arb.names[j]]))

Answer 1:

Maybe we need to grep for each of the 'arb.names' to extract the matching set of columns from every element of 'L':

lapply(arb.names, function(nm) lapply(L, function(l1) l1[grep(nm, names(l1))]))
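
One caveat about the grep() call above: it matches by regular expression, so a short name can also pick up longer column names that contain it. The following is only a minimal sketch with made-up column names (`subtype` is hypothetical and not part of the OP's data) to show how exact matching differs:

```r
# Hypothetical column names, for illustration only
cols <- c("autoreg", "ESL", "type", "subtype")

# grep() matches patterns, so "type" also hits "subtype"
partial <- grep("type", cols, value = TRUE)

# Exact matching returns only the column called "type"
exact <- cols[cols %in% "type"]

# Anchoring the pattern gives the same result with grep()
anchored <- grep("^type$", cols, value = TRUE)
```

If the user-defined names can overlap like this, indexing directly with `l1[nm]` or using an anchored pattern may be the safer choice.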

If we want to group the different names across the list as a single list, use transpose from purrr:

library(purrr)
lapply(arb.names, function(nm) transpose(lapply(L, function(l1) l1[grep(nm, names(l1))])))

Or, using base R:

m1 <- simplify2array(lapply(arb.names, function(nm)
      lapply(L, function(l1) l1[grep(nm, names(l1))])))
split(m1, col(m1))
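
To see the "one set at a time" idea in isolation, here is a small base R sketch on toy data. The names `D2`, `L2`, `arb`, and `out`, as well as the values, are made up for illustration. It pulls each arbitrary column as a plain vector rather than as a one-column data.frame, which may or may not be what the OP wants:

```r
# Toy stand-in for the OP's data (values are made up)
D2 <- data.frame(study.name = c("A", "A", "B", "B", "B"),
                 ESL  = c(1, 1, 0, 0, 0),
                 prof = c(2, 2, 1, 1, 1))
L2  <- split(D2, D2$study.name)   # one data.frame per study
arb <- c("ESL", "prof")           # the arbitrary column names

# Outer loop: one arbitrary column at a time;
# inner loop: pull that column (as a vector) from every study
out <- setNames(lapply(arb, function(nm) lapply(L2, function(d) d[[nm]])), arb)

out$ESL   # ESL values of study A and study B, next to each other
```

Each element of `out` is named after one arbitrary column and holds that column's values per study, so `out$prof$B` gives the prof values of study B.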

Answer 2:

Although this question has an accepted answer, I would like to propose a completely different approach.

If I understand correctly, the OP is looking for a way to easily compare the values in the arbitrary columns between the different studies. As additional complexity, the names of the arbitrary columns are not known beforehand.

My suggestion is to reshape the data appropriately:

library(data.table)
library(magrittr)
melt(setDT(D), id.vars = c("study.name", "group.name"), 
     measure.vars = tail(names(D), -grep("autoreg", names(D))), na.rm = TRUE) %>%
  dcast(variable + study.name ~ group.name)

    variable study.name Cont.Long Cont.Long2 Cont.Short DCF.Long DCF.Long2 DCF.Short ME.long ME.long2 ME.short
 1:      ESL  Ellis.sh1         1         NA          1        1        NA         1       1       NA        1
 2:      ESL      Goey1         0         NA          0        0        NA         0       0       NA        0
 3:      ESL      kabla         1          1          1        1         1         1       1        1        1
 4:     prof  Ellis.sh1         2         NA          2        2        NA         2       2       NA        2
 5:     prof      Goey1         1         NA          1        1        NA         1       1       NA        1
 6:     prof      kabla         3          3          3        3         3         3       3        3        3
 7:    scope  Ellis.sh1         0         NA          0        0        NA         0       0       NA        0
 8:    scope      Goey1         1         NA          1        1        NA         1       1       NA        1
 9:    scope      kabla         0          0          0        0         0         0       0        0        0
10:     type  Ellis.sh1         1         NA          1        1        NA         1       1       NA        1
11:     type      Goey1         0         NA          0        0        NA         0       0       NA        0
12:     type      kabla         1          1          1        1         1         1       1        1        1
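
Since the question asks for base R, a similar study-by-group layout can be approximated without data.table. This is only a sketch on toy data (the data.frame `toy` and its values are made up), and it returns one matrix per arbitrary column instead of a single stacked table:

```r
# Toy stand-in for D (values are made up); columns after 'autoreg' are the arbitrary ones
toy <- data.frame(study.name = c("A", "A", "B", "B", "B"),
                  group.name = c("g1", "g2", "g1", "g2", "g3"),
                  autoreg    = FALSE,
                  ESL        = c(1, 1, 0, 0, 0),
                  prof       = c(2, 2, 1, 1, 1))
arb <- tail(names(toy), -grep("autoreg", names(toy)))

# One study-by-group matrix per arbitrary column; empty cells become NA
wide <- setNames(lapply(arb, function(nm)
  tapply(toy[[nm]], list(toy$study.name, toy$group.name), function(v) v[1])), arb)

wide$ESL   # rows are studies, columns are groups
```

The sketch assumes each study/group combination occurs at most once, so taking the first value per cell loses nothing.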

The arbitrary columns (column variable in the reshaped format) are all columns of D that appear after column autoreg, regardless of their names. They are picked by

tail(names(D), -grep("autoreg", names(D)))
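
How this expression works can be shown on a short, made-up vector of names (`nms` is hypothetical): grep() returns the position of autoreg, and tail() with a negative count drops that many leading elements.

```r
# Hypothetical column names, for illustration only
nms <- c("study.name", "group.name", "n", "autoreg", "ESL", "prof")

pos   <- grep("autoreg", nms)   # position of 'autoreg' among the names
after <- tail(nms, -pos)        # drop the first 'pos' names, keep the rest
```

Note that this relies on exactly one column name matching "autoreg"; if other names could contain that string, `match("autoreg", nms)` would be a stricter alternative.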

Addendum

Please note that the column names are taken from group.name and have been ordered alphabetically. If it is a requirement to maintain the original order in which the group.name values appear in D, then the factor levels of group.name need to be adjusted accordingly:

library(data.table)
library(magrittr)
setDT(D)  # D must already be a data.table for the subsetting below
lvls <- D[study.name != "", 1:2] %>% 
  split(drop = TRUE, by = "study.name") %>% 
  .[order(sapply(., nrow), decreasing = TRUE)] %>% # merge longest (most rows) first
  Reduce(function(x, y) merge(x, y, by = "group.name", all = TRUE, sort = FALSE), .) %>% 
  .[, group.name %>% forcats::fct_drop() %>% forcats::fct_inorder()] 
melt(setDT(D), id.vars = c("study.name", "group.name"), 
     measure.vars = tail(names(D), -grep("autoreg", names(D))), na.rm = TRUE) %>%
  .[, group.name := factor(group.name, levels = lvls)] %>% 
  dcast(variable + study.name ~ group.name)

    variable study.name ME.short ME.long ME.long2 DCF.Short DCF.Long DCF.Long2 Cont.Short Cont.Long Cont.Long2
 1:      ESL  Ellis.sh1        1       1       NA         1        1        NA          1         1         NA
 2:      ESL      Goey1        0       0       NA         0        0        NA          0         0         NA
 3:      ESL      kabla        1       1        1         1        1         1          1         1          1
 4:     prof  Ellis.sh1        2       2       NA         2        2        NA          2         2         NA
 5:     prof      Goey1        1       1       NA         1        1        NA          1         1         NA
 6:     prof      kabla        3       3        3         3        3         3          3         3          3
 7:    scope  Ellis.sh1        0       0       NA         0        0        NA          0         0         NA
 8:    scope      Goey1        1       1       NA         1        1        NA          1         1         NA
 9:    scope      kabla        0       0        0         0        0         0          0         0          0
10:     type  Ellis.sh1        1       1       NA         1        1        NA          1         1         NA
11:     type      Goey1        0       0       NA         0        0        NA          0         0         NA
12:     type      kabla        1       1        1         1        1         1          1         1          1
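
The underlying mechanism is that factor() sorts its levels by default, and dcast() orders the new columns by those levels. A minimal base R sketch with a made-up vector `g` shows the difference. Note that `levels = unique(g)` only preserves the order of first appearance, which is simpler than the merge-based approach above and would not place a group such as ME.long2 next to ME.long if it first shows up in a later study:

```r
# Made-up group names, in the order they appear in the data
g <- c("ME.short", "ME.long", "Cont.Short", "ME.short", "ME.long")

# Default: levels are sorted, so "Cont.Short" comes first
lv_sorted <- levels(factor(g))

# Explicit levels: keep the order of first appearance
lv_seen <- levels(factor(g, levels = unique(g)))
```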

Data

As external links may break in the future, here is the OP's dataset from the GitHub link:

D <-
structure(list(study.name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L), .Label = c("", "Ellis.sh1", "Goey1", "kabla"), class = "factor"), 
    group.name = structure(c(10L, 8L, 7L, 5L, 4L, 2L, 1L, 10L, 
    8L, 7L, 5L, 4L, 2L, 1L, 10L, 8L, 9L, 7L, 5L, 6L, 4L, 2L, 
    3L), .Label = c("", "Cont.Long", "Cont.Long2", "Cont.Short", 
    "DCF.Long", "DCF.Long2", "DCF.Short", "ME.long", "ME.long2", 
    "ME.short"), class = "factor"), n = c(13L, 13L, 15L, 15L, 
    16L, 16L, NA, 13L, 13L, 15L, 15L, 16L, 16L, NA, 13L, 13L, 
    13L, 15L, 15L, 15L, 16L, 16L, 16L), mpre = c(0.34, 0.34, 
    0.37, 0.37, 0.32, 0.32, NA, 0.34, 0.34, 0.37, 0.37, 0.32, 
    0.32, NA, 0.34, 0.34, 0.34, 0.37, 0.37, 0.37, 0.32, 0.32, 
    0.32), mpos = c(0.72, 0.39, 0.54, 0.49, 0.28, 0.35, NA, 0.72, 
    0.39, 0.54, 0.49, 0.28, 0.35, NA, 0.72, 0.39, 0.39, 0.54, 
    0.49, 0.49, 0.28, 0.35, 0.35), sdpre = c(0.37, 0.37, 0.38, 
    0.38, 0.37, 0.37, NA, 0.37, 0.37, 0.38, 0.38, 0.37, 0.37, 
    NA, 0.37, 0.37, 0.37, 0.38, 0.38, 0.38, 0.37, 0.37, 0.37), 
    sdpos = c(0.34, 0.36, 0.36, 0.36, 0.36, 0.32, NA, 0.34, 0.36, 
    0.36, 0.36, 0.36, 0.32, NA, 0.34, 0.36, 0.36, 0.36, 0.36, 
    0.36, 0.36, 0.32, 0.32), control = c(FALSE, FALSE, FALSE, 
    FALSE, TRUE, TRUE, NA, FALSE, FALSE, FALSE, FALSE, TRUE, 
    TRUE, NA, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, 
    TRUE, TRUE), post = c(1L, 2L, 1L, 2L, 1L, 2L, NA, 1L, 2L, 
    1L, 2L, 1L, 2L, NA, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), 
    r = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, NA, 0.5, 0.5, 0.5, 0.5, 
    0.5, 0.5, NA, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
    ), autoreg = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
    NA, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, FALSE, 
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), 
    ESL = c(1L, 1L, 1L, 1L, 1L, 1L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 
    NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), prof = c(2L, 2L, 
    2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L), scope = c(0L, 0L, 0L, 0L, 0L, 0L, 
    NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L), type = c(1L, 1L, 1L, 1L, 1L, 1L, NA, 0L, 0L, 0L, 
    0L, 0L, 0L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-23L))

    study.name group.name  n mpre mpos sdpre sdpos control post   r autoreg ESL prof scope type
 1:  Ellis.sh1   ME.short 13 0.34 0.72  0.37  0.34   FALSE    1 0.5   FALSE   1    2     0    1
 2:  Ellis.sh1    ME.long 13 0.34 0.39  0.37  0.36   FALSE    2 0.5   FALSE   1    2     0    1
 3:  Ellis.sh1  DCF.Short 15 0.37 0.54  0.38  0.36   FALSE    1 0.5   FALSE   1    2     0    1
 4:  Ellis.sh1   DCF.Long 15 0.37 0.49  0.38  0.36   FALSE    2 0.5   FALSE   1    2     0    1
 5:  Ellis.sh1 Cont.Short 16 0.32 0.28  0.37  0.36    TRUE    1 0.5   FALSE   1    2     0    1
 6:  Ellis.sh1  Cont.Long 16 0.32 0.35  0.37  0.32    TRUE    2 0.5   FALSE   1    2     0    1
 7:                       NA   NA   NA    NA    NA      NA   NA  NA      NA  NA   NA    NA   NA
 8:      Goey1   ME.short 13 0.34 0.72  0.37  0.34   FALSE    1 0.5   FALSE   0    1     1    0
 9:      Goey1    ME.long 13 0.34 0.39  0.37  0.36   FALSE    2 0.5   FALSE   0    1     1    0
10:      Goey1  DCF.Short 15 0.37 0.54  0.38  0.36   FALSE    1 0.5   FALSE   0    1     1    0
11:      Goey1   DCF.Long 15 0.37 0.49  0.38  0.36   FALSE    2 0.5   FALSE   0    1     1    0
12:      Goey1 Cont.Short 16 0.32 0.28  0.37  0.36    TRUE    1 0.5   FALSE   0    1     1    0
13:      Goey1  Cont.Long 16 0.32 0.35  0.37  0.32    TRUE    2 0.5   FALSE   0    1     1    0
14:                       NA   NA   NA    NA    NA      NA   NA  NA      NA  NA   NA    NA   NA
15:      kabla   ME.short 13 0.34 0.72  0.37  0.34   FALSE    1 0.5   FALSE   1    3     0    1
16:      kabla    ME.long 13 0.34 0.39  0.37  0.36   FALSE    2 0.5   FALSE   1    3     0    1
17:      kabla   ME.long2 13 0.34 0.39  0.37  0.36   FALSE    3 0.5   FALSE   1    3     0    1
18:      kabla  DCF.Short 15 0.37 0.54  0.38  0.36   FALSE    1 0.5   FALSE   1    3     0    1
19:      kabla   DCF.Long 15 0.37 0.49  0.38  0.36   FALSE    2 0.5   FALSE   1    3     0    1
20:      kabla  DCF.Long2 15 0.37 0.49  0.38  0.36   FALSE    3 0.5   FALSE   1    3     0    1
21:      kabla Cont.Short 16 0.32 0.28  0.37  0.36    TRUE    1 0.5   FALSE   1    3     0    1
22:      kabla  Cont.Long 16 0.32 0.35  0.37  0.32    TRUE    2 0.5   FALSE   1    3     0    1
23:      kabla Cont.Long2 16 0.32 0.35  0.37  0.32    TRUE    3 0.5   FALSE   1    3     0    1
    study.name group.name  n mpre mpos sdpre sdpos control post   r autoreg ESL prof scope type

Source: https://stackoverflow.com/questions/56705711/extracting-one-set-of-multiple-variables-in-a-list-of-data-frames-in-r

Tags: function, loops, dataframe, lapply