Question
I've got a dataset that consists of email communication. An example:
library(dplyr)
library(tidyr)
dat <- data_frame('date' = Sys.time(),
                  'from' = c("person1@gmail.com", "person2@yahoo.com",
                             "person3@hotmail.com", "person4@msn.com"),
                  'to' = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com",
                           "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com"))
In the above example it's simple enough to see how many variables I need, so I could just do the following:
dat %>% separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")
#Source: local data frame [4 x 5]
#
# date from to_1 to_2 to_3
# (time) (chr) (chr) (chr) (chr)
#1 2015-10-22 14:52:41 person1@gmail.com person2@yahoo.com person3@hotmail.com NA
#2 2015-10-22 14:52:41 person2@yahoo.com person3@hotmail.com NA NA
#3 2015-10-22 14:52:41 person3@hotmail.com person4@msn.com person1@gmail.com person2@yahoo.com
#4 2015-10-22 14:52:41 person4@msn.com person1@gmail.com NA NA
However, my dataset is 4,000 records long, and I'd rather not go through it by hand to find the row with the most elements just to determine how many variables I need to create. My approach is to split the column myself first, get the length of each split, and then take the max:
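# str_split() comes from stringr, which is not loaded above, so it is namespace-qualified here
n_vars <- dat$to %>% stringr::str_split(",") %>% lapply(function(z) length(z)) %>% unlist() %>% max()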
But that seems inefficient. Is there a better way of doing this?
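The counting step described above can also be written with base R alone and passed straight to separate. This is only a minimal sketch: it assumes a plain data.frame stand-in for the example data (the date column is dropped for brevity) and uses lengths(), which returns the element count of every list entry in a single vectorised call.

```r
library(tidyr)

# Stand-in for the example data above (date column omitted for brevity)
dat <- data.frame(
  from = c("person1@gmail.com", "person2@yahoo.com",
           "person3@hotmail.com", "person4@msn.com"),
  to = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com",
         "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com"),
  stringsAsFactors = FALSE
)

# Split on literal commas, count the pieces per row, keep the largest count
n_vars <- max(lengths(strsplit(dat$to, ",", fixed = TRUE)))

# Build the column names from that count instead of hard-coding 1:3
wide <- separate(dat, to, into = paste0("to_", seq_len(n_vars)),
                 sep = ",", fill = "right")
```

This is still the same split-then-count idea, so it does not avoid splitting the column twice; it only removes the lapply/unlist steps and the dependency on stringr.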
Answer 1:
We could use cSplit from the splitstackshape package, which does not need the number of output columns in advance:
library(splitstackshape)
cSplit(dat, 'to', ',')
Answer 2:
This is a good question. My usual response is to use strsplit, then unnest and spread, which is also not super efficient:
library(dplyr)
library(tidyr)
dat %>%
  mutate(to = strsplit(to, ",")) %>%
  unnest(to) %>%
  group_by(from) %>%
  mutate(row = row_number()) %>%
  spread(row, to)
Source: local data frame [4 x 5]
date from 1 2 3
(time) (chr) (chr) (chr) (chr)
1 2015-10-22 15:03:17 person1@gmail.com person2@yahoo.com person3@hotmail.com NA
2 2015-10-22 15:03:17 person2@yahoo.com person3@hotmail.com NA NA
3 2015-10-22 15:03:17 person3@hotmail.com person4@msn.com person1@gmail.com person2@yahoo.com
4 2015-10-22 15:03:17 person4@msn.com person1@gmail.com NA NA
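In tidyr 1.0.0 and later, spread is superseded by pivot_wider, so the same reshape might be sketched as follows. This assumes a recent tidyr and dplyr, drops the date column for brevity, and uses a "to_" prefix for the new columns, which is an illustrative choice and not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Stand-in for the example data (date column omitted for brevity)
dat <- tibble(
  from = c("person1@gmail.com", "person2@yahoo.com",
           "person3@hotmail.com", "person4@msn.com"),
  to = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com",
         "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com")
)

wide <- dat %>%
  mutate(to = strsplit(to, ",", fixed = TRUE)) %>%  # list column of recipients
  unnest(to) %>%                                    # one row per recipient
  group_by(from) %>%
  mutate(row = row_number()) %>%                    # position of each recipient within a sender
  ungroup() %>%
  pivot_wider(names_from = row, values_from = to, names_prefix = "to_")
```

As with spread, the number of output columns falls out of the data, so no count has to be computed beforehand. Grouping by from alone assumes each sender appears only once, as in the example; with repeated senders a per-message identifier would be needed.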
Source: https://stackoverflow.com/questions/33288695/how-to-use-tidyrseparate-when-the-number-of-needed-variables-is-unknown