Spread vs dcast

无人共我 2020-12-15 00:48

I have a table like this,

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91         


        
2 answers
  • 2020-12-15 01:22

    Let's say you were starting with data that looked like this:

    mydf
    #   Weight Height  Fitted interval limit    value
    # 1     42  153.4 51.0792     conf   lwr 49.15463
    # 2     42  153.4 51.0792     pred   lwr 32.82122
    # 3     42  153.4 51.0792     conf   upr 53.00376
    # 4     42  153.4 51.0792     pred   upr 69.33717
    # 5     42  153.4 51.0792     conf   lwr 60.00000
    # 6     42  153.4 51.0792     pred   lwr 90.00000
    

    Notice that rows 5 and 6 repeat the identifier columns (columns 1 to 5) of rows 1 and 2: the first and fifth rows share the same identifiers, as do the second and sixth. This is essentially what "tidyr" is telling you.

    tidyr::spread(mydf, limit, value)
    # Error: Duplicate identifiers for rows (1, 5), (2, 6)
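
    To see which identifier combinations are duplicated before reshaping, one option is to count the rows per combination. The following is a minimal sketch, assuming "dplyr" is installed; it uses a compact re-creation of the sample data given at the end of this answer, and the name dupes is only illustrative.

    ```r
    library(dplyr)

    # Compact re-creation of the sample data (length-1 columns are recycled to 6 rows)
    mydf <- data.frame(
      Weight = 42, Height = 153.4, Fitted = 51.0792,
      interval = rep(c("conf", "pred"), 3),
      limit = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
      value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
    )

    # Count rows per combination of identifier columns;
    # any combination with n > 1 would trigger the spread() error
    dupes <- mydf %>%
      count(Weight, Height, Fitted, interval, limit) %>%
      filter(n > 1)
    dupes
    ```

    Here both combinations with limit equal to "lwr" occur twice, which corresponds to the row pairs (1, 5) and (2, 6) named in the error message.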
    

    As suggested by @Jaap, the solution is to "summarise" the data first. "tidyr" only reshapes data (unlike "reshape2", which can aggregate and reshape in one step), so you need to perform the aggregation with "dplyr" before you change the form of the data. Here, I've done that with summarise on the "value" column.

    If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

    library(dplyr)
    library(tidyr)
    
    mydf %>% 
      group_by(Weight, Height, Fitted, interval, limit) %>% 
      summarise(value = mean(value)) %>% 
      spread(limit, value)
    # Source: local data frame [2 x 6]
    # 
    #   Weight Height  Fitted interval      lwr      upr
    #    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
    # 1     42  153.4 51.0792     conf 54.57731 53.00376
    # 2     42  153.4 51.0792     pred 61.41061 69.33717
    

    This matches the expected output from dcast with fun.aggregate = mean.

    reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit,
                    value.var = "value", fun.aggregate = mean)
    #   Weight Height  Fitted interval      lwr      upr
    # 1     42  153.4 51.0792     conf 54.57731 53.00376
    # 2     42  153.4 51.0792     pred 61.41061 69.33717
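
    In more recent versions of "tidyr", spread has been superseded by pivot_wider (introduced in 1.0.0), which can aggregate while reshaping through its values_fn argument. The following is a minimal sketch, assuming such a version is installed; it is not part of the original approach, and the name wide is only illustrative.

    ```r
    library(tidyr)

    # Compact re-creation of the sample data (length-1 columns are recycled to 6 rows)
    mydf <- data.frame(
      Weight = 42, Height = 153.4, Fitted = 51.0792,
      interval = rep(c("conf", "pred"), 3),
      limit = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
      value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
    )

    # All columns other than limit and value act as identifiers;
    # duplicated cells are collapsed with mean() instead of raising an error
    wide <- pivot_wider(
      mydf,
      names_from = limit,
      values_from = value,
      values_fn = list(value = mean)
    )
    wide
    ```

    This should produce one row per interval with lwr and upr columns, in line with the dcast result above.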
    

    Sample data:

     mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
         153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
         51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
         "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
         2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
             value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
             90)), .Names = c("Weight", "Height", "Fitted", "interval",     
         "limit", "value"), row.names = c(NA, 6L), class = "data.frame")   
    
  • 2020-12-15 01:22

    Here are data.table alternatives to dplyr. Use mydf from Ananda's answer.

    library(data.table)
    library(magrittr)
    library(tidyr)
    
    DT <- data.table(mydf)
    

    First, you can use by to compute the mean for each limit within each group.

    DT[, .(lwr = mean(value[limit == "lwr"]), 
           upr = mean(value[limit == "upr"])), 
       by = .(Weight, Height, Fitted, interval)]
    

    If writing out limit == ... looks like too much hard-coding, you can first aggregate in long format, then spread. This works because, once you have aggregated, there are no duplicates left.

    DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
      spread(key = "limit", value = "value")
    

    Both approaches give you

    #   Weight Height  Fitted interval      lwr      upr
    #1:     42  153.4 51.0792     conf 54.57731 53.00376
    #2:     42  153.4 51.0792     pred 61.41061 69.33717
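
    "data.table" also provides its own dcast method for data.tables, which aggregates and reshapes in a single call, much like "reshape2". The following is a minimal sketch, assuming "data.table" is installed; the name res is only illustrative.

    ```r
    library(data.table)

    # Compact re-creation of the sample data (length-1 columns are recycled to 6 rows)
    DT <- data.table(
      Weight = 42, Height = 153.4, Fitted = 51.0792,
      interval = rep(c("conf", "pred"), 3),
      limit = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
      value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
    )

    # Identifier columns on the left of ~, the column to spread on the right;
    # duplicated cells are aggregated with mean()
    res <- dcast(DT, Weight + Height + Fitted + interval ~ limit,
                 value.var = "value", fun.aggregate = mean)
    res
    ```

    This avoids both the hard-coded limit == ... conditions and the separate spread step.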
    