Spreading a two column data frame with tidyr

渐次进展 2020-11-30 14:47

I have a data frame that looks like this:

  a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6

and I want to turn it into this:

  x y z
1 8 3 5
2 6 4 6
        
5 Answers
  • 2020-11-30 15:28

    Somehow like this?

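    library(dplyr)
    library(tidyr)

    # add a row index per group so that every (ind, a) pair is unique;
    # this assumes rows are grouped by a and all groups have the same size
    df <- data.frame(ind = rep(1:min(table(df$a)), length(unique(df$a))), df)
    df %>% spread(a, b) %>% select(-ind)
      x y z
    1 8 3 5
    2 6 4 6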
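
    The index above only lines up when the rows are already grouped by a and every group has the same size. A minimal sketch of an order-independent index, assuming base R's ave() is acceptable here (the interleaved sample data is made up for illustration):

    ```r
    # Rows deliberately interleaved, so rep(1:2, 3) would not line up with the groups
    df <- data.frame(
      a = c("x", "y", "x", "z", "y", "z"),
      b = c(8, 3, 6, 5, 4, 6)
    )

    # Number the rows within each level of a, regardless of row order
    df$ind <- ave(seq_along(df$a), df$a, FUN = seq_along)
    df$ind
    ```

    The resulting ind column can then take the place of the rep()-based index before calling spread().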
    
  • 2020-11-30 15:30

    While I'm aware you're after tidyr, base R has a solution in this case:

    unstack(df, b~a)
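
    One thing to watch, sketched here on the question's data: unstack() only returns a data frame when every group has the same number of values, otherwise it falls back to a list. The ragged variant below (dropping the last row) is an illustration, not part of the question:

    ```r
    df <- data.frame(
      a = c("x", "x", "y", "y", "z", "z"),
      b = c(8, 6, 3, 4, 5, 6)
    )

    # Equal group sizes: the result is a data frame with columns x, y, z
    wide <- unstack(df, b ~ a)
    is.data.frame(wide)

    # Drop the last row so z has a single value: the result is a plain list
    ragged <- unstack(df[-6, ], b ~ a)
    is.data.frame(ragged)
    ```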
    

    unstack is also roughly twice as fast as spread on this small example:

    Unit: microseconds
    
                    expr     min      lq     mean  median       uq      max neval
     df %>% spread(a, b) 657.699 679.508 717.7725 690.484 724.9795 1648.381   100
      unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565 639.738   100
    

    By popular demand, with something bigger

    I haven't included the data.table solution as I'm not sure if pass by reference would be a problem for microbenchmark.

    library(microbenchmark)
    library(tidyr)
    library(magrittr)
    
    nlevels <- 3
    #Ensure that all levels have the same number of elements
    nrow <- 1e6 - 1e6 %% nlevels
    df <- data.frame(a=sample(rep(c("x", "y", "z"), length.out=nrow)),
                     b=sample.int(9, nrow, replace=TRUE))
    
    microbenchmark(
      df %>% spread(a, b),
      unstack(df, b ~ a),
      data.frame(split(df$b, df$a)),
      do.call(cbind, split(df$b, df$a))
    )
    

    Even with about 1 million rows, unstack is faster than spread. Notably, the two split-based solutions are faster still.

    Unit: milliseconds
                                  expr       min        lq      mean    median       uq       max neval
                   df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
                    unstack(df, b ~ a)  47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
         data.frame(split(df$b, df$a))  19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
     do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
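
    Timings are only comparable if the approaches agree on the result. A quick sanity check on a smaller sample might look like this (the seed and the sample size of 30 are arbitrary choices for illustration):

    ```r
    set.seed(1)  # arbitrary seed, only for reproducibility
    n <- 30
    df <- data.frame(a = sample(rep(c("x", "y", "z"), length.out = n)),
                     b = sample.int(9, n, replace = TRUE))

    via_unstack <- unstack(df, b ~ a)
    via_split   <- data.frame(split(df$b, df$a))
    via_cbind   <- do.call(cbind, split(df$b, df$a))

    # The two data frame results should match
    all.equal(via_unstack, via_split)

    # The cbind variant returns a matrix rather than a data frame
    is.matrix(via_cbind)
    ```

    So the do.call(cbind, ...) timing is for building a matrix, which is worth keeping in mind when reading the table.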
    
  • 2020-11-30 15:30

    Since tidyr 1.0.0 you can use pivot_wider(). Because a doesn't have unique values, each cell has to be collected into a list first, and you then need a call to unchop() to expand those lists back into rows:

    
    library(tidyr)
    df <- data.frame(
      a = c("x", "x", "y", "y", "z", "z"),
      b = c(8, 6, 3, 4, 5, 6)
    )
    
    pivot_wider(df, names_from = "a", values_from = "b", values_fn = list(b = list)) %>%
      unchop(everything())
    #> # A tibble: 2 x 3
    #>       x     y     z
    #>   <dbl> <dbl> <dbl>
    #> 1     8     3     5
    #> 2     6     4     6
    

    Created on 2019-09-14 by the reprex package (v0.3.0)
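
    An alternative sketch that avoids the list-columns: give each value an explicit position within its group, so that pivot_wider() has something unique to key on. This assumes dplyr is available in addition to tidyr, and the helper column name row is a made-up name for illustration:

    ```r
    library(dplyr)
    library(tidyr)

    df <- data.frame(
      a = c("x", "x", "y", "y", "z", "z"),
      b = c(8, 6, 3, 4, 5, 6)
    )

    wide <- df %>%
      group_by(a) %>%
      mutate(row = row_number()) %>%   # position of each value within its group
      ungroup() %>%
      pivot_wider(names_from = a, values_from = b) %>%
      select(-row)                     # drop the helper column again

    wide
    ```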

  • 2020-11-30 15:31

    Another base R answer (which also appears to be fast):

    data.frame(split(df$b,df$a))
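
    Note that data.frame() needs every piece of the split to have the same length. If the groups can differ in size, one possible workaround is to pad the shorter ones with NA first. A rough sketch, using a shortened copy of the question's data as an example:

    ```r
    # z has only one value here, so the groups are unequal
    df <- data.frame(
      a = c("x", "x", "y", "y", "z"),
      b = c(8, 6, 3, 4, 5)
    )

    parts <- split(df$b, df$a)
    n <- max(lengths(parts))

    # Extending a vector's length pads it with NA
    padded <- lapply(parts, function(v) { length(v) <- n; v })

    wide <- data.frame(padded)
    wide
    ```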
    
  • 2020-11-30 15:41

    You can do this with dcast and rowid from the data.table package as well:

    library(data.table)
    dat <- dcast(setDT(df), rowid(a) ~ a, value.var = "b")[, a := NULL]
    

    which gives:

    > dat
       x y z
    1: 8 3 5
    2: 6 4 6
    

    Old solution:

    # create a sequence number by group
    setDT(df)[, r:=1:.N, by = a]
    # reshape to wide format and remove the sequence variable
    dat <- dcast(df, r ~ a, value.var = "b")[,r:=NULL]
    

    which gives:

    > dat
       x y z
    1: 8 3 5
    2: 6 4 6
    