How to use tidyr::separate when the number of needed variables is unknown [duplicate]

本小妞迷上赌 提交于 2019-11-26 22:18:24

问题


This question already has an answer here:

  • Splitting a dataframe string column into multiple different columns [duplicate] 4 answers

I've got a dataset that consists of email communication. An example:

library(dplyr)
library(tidyr)

dat <- data_frame('date' = Sys.time(), 
                  'from' = c("person1@gmail.com", "person2@yahoo.com", 
                             "person3@hotmail.com", "person4@msn.com"), 
                  'to' = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com", 
                           "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com"))

In the above example it's simple enough to see how many variables I need, so I could just do the following:

dat %>% separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")

#Source: local data frame [4 x 5]
#
#                 date                from                to_1                to_2              to_3
#               (time)               (chr)               (chr)               (chr)             (chr)
#1 2015-10-22 14:52:41   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
#2 2015-10-22 14:52:41   person2@yahoo.com person3@hotmail.com                  NA                NA
#3 2015-10-22 14:52:41 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
#4 2015-10-22 14:52:41     person4@msn.com   person1@gmail.com                  NA                NA

However, my dataset is 4,000 records long and I'd rather not go through and find the row with the most number of elements in it so that I can determine how many variables I need to create. My approach to handling this is to first split the column myself and get the length of each split and then find the max:

n_vars <- dat$to %>% str_split(",") %>% lapply(function(z) length(z)) %>% unlist() %>% max()

But that seems inefficient. Is there a better way of doing this?


回答1:


We could use cSplit

library(splitstackshape) 
cSplit(dat, 'to', ',')



回答2:


This is a good question - my usual repsonse is to use strsplit, then unnest and spread, which is also not super efficient:

library(dplyr)
library(tidyr)

dat %>% mutate(to = strsplit(to, ",")) %>%
        unnest(to) %>%
        group_by(from) %>%
        mutate(row = row_number()) %>%
        spread(row, to)

Source: local data frame [4 x 5]

                 date                from                   1                   2                 3
               (time)               (chr)               (chr)               (chr)             (chr)
1 2015-10-22 15:03:17   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
2 2015-10-22 15:03:17   person2@yahoo.com person3@hotmail.com                  NA                NA
3 2015-10-22 15:03:17 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
4 2015-10-22 15:03:17     person4@msn.com   person1@gmail.com                  NA                NA


来源:https://stackoverflow.com/questions/33288695/how-to-use-tidyrseparate-when-the-number-of-needed-variables-is-unknown

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!