tidyr spread function generates sparse matrix when compact vector expected

醉酒成梦 2020-12-18 23:49

I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.

Short summary: I'm getting

1 Answer
  • 2020-12-19 00:13

    The key here is that spread doesn't aggregate the data.

    Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

    library(dplyr)
    library(tidyr)
    a <- data.frame(P = c(F, T, F, T, F), A = c(F, F, T, T, T), Freq = 1) %>%
        unite(S, A, P)
    a
    ##             S Freq
    ## 1 FALSE_FALSE    1
    ## 2  FALSE_TRUE    1
    ## 3  TRUE_FALSE    1
    ## 4   TRUE_TRUE    1
    ## 5  TRUE_FALSE    1
    
    a %>% spread(S, Freq)
    ##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
    ## 1           1         NA         NA        NA
    ## 2          NA          1         NA        NA
    ## 3          NA         NA          1        NA
    ## 4          NA         NA         NA         1
    ## 5          NA         NA          1        NA
    

    Without aggregation, no other result would make sense: each input row stays its own output row, and every column that row doesn't populate is left as NA.
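
    If a single compact row is what you are after, one option is to collapse the duplicate keys yourself before spreading. The following is a minimal sketch, assuming dplyr and tidyr are installed; the name `compact` is only illustrative:

    ```r
    library(dplyr)
    library(tidyr)

    # Same five raw observations as above, one row each with Freq = 1.
    a <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                    A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                    Freq = 1) %>%
      unite(S, A, P)

    # Sum Freq per key first, so each S value appears exactly once...
    compact <- a %>%
      group_by(S) %>%
      summarize(Freq = sum(Freq)) %>%
      # ...then spread has one value per column and returns a single row.
      spread(S, Freq)
    compact
    ```

    Here TRUE_FALSE should come out as 2, since two of the five observations share that key.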

    The NA padding is predictable from the help file's description of the fill parameter:

    If there isn't a value for every combination of the other variables and the key column, this value will be substituted.

    In your case, there aren't any other variables to combine with the key column. Had there been, then...

    b <- data.frame(P = c(F, T, F, T, F), A = c(F, F, T, T, T), Freq = 1,
                    h = rep(c("foo", "bar"), length.out = 5)) %>%
        unite(S, A, P)
    b
    ##             S Freq   h
    ## 1 FALSE_FALSE    1 foo
    ## 2  FALSE_TRUE    1 bar
    ## 3  TRUE_FALSE    1 foo
    ## 4   TRUE_TRUE    1 bar
    ## 5  TRUE_FALSE    1 foo
    
    b %>% spread(S, Freq)
    ## Error: Duplicate identifiers for rows (3, 5)
    

    ...it would fail, because rows 3 and 5 share the same h and S, and spread isn't designed to aggregate them into one cell.
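
    To see ahead of time which rows would collide, you can count the rows per combination of the remaining column and the key. This is only a sketch, assuming dplyr and tidyr are loaded; `dupes` is an illustrative name:

    ```r
    library(dplyr)
    library(tidyr)

    b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                    A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                    Freq = 1,
                    h = rep(c("foo", "bar"), length.out = 5)) %>%
      unite(S, A, P)

    # Count rows per (h, S) pair; any pair with n > 1 would need
    # aggregating before spread() can place it in a single cell.
    dupes <- b %>%
      count(h, S) %>%
      filter(n > 1)
    dupes
    ```

    For this data the only offending pair should be h = "foo" with S = "TRUE_FALSE", which matches rows 3 and 5 in the error message.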

    The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs. summarize preserves the grouping column, so spread can tell which observations belong in the same row:

    b %>% group_by(h, S) %>%
        summarize(Freq = sum(Freq))
    ## Source: local data frame [4 x 3]
    ## Groups: h
    ## 
    ##     h           S Freq
    ## 1 bar  FALSE_TRUE    1
    ## 2 bar   TRUE_TRUE    1
    ## 3 foo FALSE_FALSE    1
    ## 4 foo  TRUE_FALSE    2
    
    b %>% group_by(h, S) %>%
        summarize(Freq = sum(Freq)) %>%
        spread(S, Freq)
    ## Source: local data frame [2 x 5]
    ## 
    ##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
    ## 1 bar          NA          1         NA         1
    ## 2 foo           1         NA          2        NA
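
    If zeros are preferable to the remaining NA cells, the fill parameter quoted earlier can supply them. The sketch below also shows what I believe is the equivalent call with the newer pivot_wider, which can aggregate by itself through values_fn; that part assumes tidyr 1.1.0 or later, and `wide` and `wide2` are illustrative names:

    ```r
    library(dplyr)
    library(tidyr)

    b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                    A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                    Freq = 1,
                    h = rep(c("foo", "bar"), length.out = 5)) %>%
      unite(S, A, P)

    # Aggregate, then spread; fill = 0 replaces the (h, S) combinations
    # that never occurred with zero instead of NA.
    wide <- b %>%
      group_by(h, S) %>%
      summarize(Freq = sum(Freq)) %>%
      ungroup() %>%
      spread(S, Freq, fill = 0)

    # Assumed equivalent with pivot_wider (tidyr >= 1.1.0): values_fn sums
    # duplicate cells, values_fill supplies the zeros.
    wide2 <- b %>%
      pivot_wider(names_from = S, values_from = Freq,
                  values_fn = sum, values_fill = 0)
    ```

    Both results should have one row per value of h and no NA cells, though the row order may differ between the two.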
    