tidyr::unite across column patterns

Question

I have a dataset that looks something like this:

site <- c("A", "B", "C", "D", "E")
D01_1 <- c(1, 0, 0, 0, 1)
D01_2 <- c(1, 1, 0, 1, 1)
D02_1 <- c(1, 0, 1, 0, 1)
D02_2 <- c(0, 1, 0, 0, 1)
D03_1 <- c(1, 1, 0, 0, 0)
D03_2 <- c(0, 1, 0, 0, 1)
df <- data.frame(site, D01_1, D01_2, D02_1, D02_2, D03_1, D03_2)

I am trying to unite the D0x_1 and D0x_2 columns so that the values in the columns are separated by a slash. I can do this with the following code and it works just fine:

library(dplyr)
library(tidyr)

df.unite <- df %>%
  unite(D01, D01_1, D01_2, sep = "/", remove = TRUE) %>%
  unite(D02, D02_1, D02_2, sep = "/", remove = TRUE) %>%
  unite(D03, D03_1, D03_2, sep = "/", remove = TRUE)

...but this requires me to type out each unite call by hand, which is unwieldy across the large number of columns in my dataset. Is there a way in dplyr or tidyr to unite columns that share a name pattern, looping over each group of columns? A unite_each function doesn't seem to exist.

Answer 1:

Two options, which are really the same thing rearranged.

Option 1: Nested calls

First, you can use lapply to apply unite_ (the standard-evaluation version of unite, which accepts column names as strings) programmatically across columns. To do so, build a vector of the shared column-name prefixes for it to iterate over, wrap the lapply in do.call(cbind, ...) to bind the resulting columns together, and then cbind site back onto the result. Altogether:

cols <- unique(substr(names(df)[-1], 1, 3))
cbind(site = df$site, do.call(cbind,
        lapply(cols, function(x){unite_(df, x, grep(x, names(df), value = TRUE), 
                                        sep = '/', remove = TRUE) %>% select_(x)})
        ))

#   site D01 D02 D03
# 1    A 1/1 1/0 1/0
# 2    B 0/1 0/1 1/1
# 3    C 0/0 1/0 0/0
# 4    D 0/1 0/0 0/0
# 5    E 1/1 1/1 0/1

Option 2: Chained

Alternatively, if you really like pipes, you can hack the whole thing into a single chain (lapply included!), swapping out a few of the base functions for dplyr ones:

df %>% select(-site) %>% names() %>% substr(1,3) %>% unique() %>%
  lapply(function(x){unite_(df, x, grep(x, names(df), value = TRUE), 
                            sep = '/', remove = TRUE) %>% select_(x)}) %>%
  bind_cols() %>% mutate(site = as.character(df$site)) %>% select(site, starts_with('D'))

# Source: local data frame [5 x 4]
# 
#    site   D01   D02   D03
#   (chr) (chr) (chr) (chr)
# 1     A   1/1   1/0   1/0
# 2     B   0/1   0/1   1/1
# 3     C   0/0   1/0   0/0
# 4     D   0/1   0/0   0/0
# 5     E   1/1   1/1   0/1

Check out the intermediate products to see how it fits together, but it's pretty much the same logic as the base approach.
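
Note that unite_ and select_ have since been deprecated in tidyr and dplyr. The same logic can be sketched with the current unite(), assuming a tidyr version (1.0 or later) in which the new column name can be unquoted with !! and tidyselect helpers such as starts_with() are re-exported. The names prefixes, p, and united are illustrative, not part of the original answer:

```r
library(tidyr)

# Example data from the question
df <- data.frame(
  site  = c("A", "B", "C", "D", "E"),
  D01_1 = c(1, 0, 0, 0, 1), D01_2 = c(1, 1, 0, 1, 1),
  D02_1 = c(1, 0, 1, 0, 1), D02_2 = c(0, 1, 0, 0, 1),
  D03_1 = c(1, 1, 0, 0, 0), D03_2 = c(0, 1, 0, 0, 1)
)

# Shared prefixes: "D01", "D02", "D03"
prefixes <- unique(sub("_\\d+$", "", names(df)[-1]))

# Unite one prefix group per iteration; unite() removes the
# input columns by default (remove = TRUE)
united <- df
for (p in prefixes) {
  united <- unite(united, !!p, starts_with(paste0(p, "_")), sep = "/")
}

united
```

This should give the same site/D01/D02/D03 table as above. A plain for loop is used here only to keep the sketch dependency-free; Reduce() or purrr::reduce() would express the same fold.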

Answer 2:

This is a solution using base functions. First, I find the indexes of the columns whose names end in _1. I also build the column names for the final result, using gsub() and unique(). The sapply part pastes two columns together with /: for each index x, it pastes column x with column x + 1, so you always combine two columns that sit next to each other. This assumes every _1 column is immediately followed by its _2 partner. Then I add site with cbind() and create a data frame. The last step is to assign the column names.

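library(magrittr)

# Positions of the first column of each pair (names ending in "_1")
ind <- grep(pattern = "_1$", x = names(df))

# Final column names: strip the "_<digits>" suffix ("site" is left unchanged)
new_names <- unique(gsub(pattern = "_\\d+$",
                         replacement = "", x = names(df)))

# Paste each "_1" column with its right-hand neighbour, then prepend site
out <- sapply(ind, function(x) {
         paste(df[, x], df[, x + 1], sep = "/")
       }) %>%
  cbind(as.character(df$site), .) %>%
  data.frame()

names(out) <- new_names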

#  site D01 D02 D03
#1    A 1/1 1/0 1/0
#2    B 0/1 0/1 1/1
#3    C 0/0 1/0 0/0
#4    D 0/1 0/0 0/0
#5    E 1/1 1/1 0/1
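
Because this approach pairs columns by position, a quick sanity check before pasting may be worthwhile. A minimal sketch, assuming the question's naming scheme of _1 and _2 suffixes (col_names and expected_partner are illustrative names):

```r
# Column names as in the question's data frame
col_names <- c("site", "D01_1", "D01_2", "D02_1", "D02_2", "D03_1", "D03_2")

# Positions of the "_1" columns
ind <- grep("_1$", col_names)

# Each "_1" name, with the suffix swapped to "_2", should match the
# name of the column immediately to its right
expected_partner <- sub("_1$", "_2", col_names[ind])
stopifnot(identical(expected_partner, col_names[ind + 1]))
```

If the columns were ever reordered, stopifnot() would raise an error instead of silently pasting the wrong pairs.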

Answer 3:

You can use an easy base R approach, too:

cols <- split(names(df)[-1], sub("_\\d+", "", names(df)[-1]))

cbind(df[1], sapply(names(cols), function(col) {
  do.call(paste, c(df[cols[[col]]], sep = "/"))
}))
#  site D01 D02 D03
#1    A 1/1 1/0 1/0
#2    B 0/1 0/1 1/1
#3    C 0/0 1/0 0/0
#4    D 0/1 0/0 0/0
#5    E 1/1 1/1 0/1
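
To see how the grouping works, it can help to look at what split() produces on its own. A small sketch using only the question's column names (col_names is an illustrative variable, not part of the answer):

```r
col_names <- c("D01_1", "D01_2", "D02_1", "D02_2", "D03_1", "D03_2")

# Group the column names by prefix, i.e. the name with "_<digits>" removed
cols <- split(col_names, sub("_\\d+", "", col_names))

names(cols)  # "D01" "D02" "D03"
cols$D01     # "D01_1" "D01_2"
```

Each list element holds the columns for one prefix, so do.call(paste, c(df[cols[[col]]], sep = "/")) pastes however many columns share that prefix, not just two, and does not depend on the columns being adjacent.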

Source: https://stackoverflow.com/questions/36002235/tidyrunite-across-column-patterns

Tags

dplyr

tidyr