R data.table: subgroup weighted percent of group

一整个雨季 2020-12-16 19:24

I have a data.table like:

library(data.table)
widgets <- data.table(serial_no=1:100, 
                      color=rep_len(c("red","green"


        
3 Answers
  •  时光取名叫无心
    2020-12-16 19:59

    This is almost a single step:

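    # A: unweighted, share of rows for each style within its color
    widgets[,{
        totwt = .N                        # rows in this color
        .SD[,.(frac=.N/totwt),by=style]   # rows per style / rows per color
    },by=color]
     #    color  style frac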
     # 1:   red  round 0.36
     # 2:   red pointy 0.32
     # 3:   red   flat 0.32
     # 4: green pointy 0.36
     # 5: green   flat 0.32
     # 6: green  round 0.32
     # 7:  blue   flat 0.36
     # 8:  blue  round 0.32
     # 9:  blue pointy 0.32
    # 10: black  round 0.36
    # 11: black pointy 0.32
    # 12: black   flat 0.32
    
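    # B: weighted, share of weight for each style within its color
    widgets[,{
        totwt = sum(weight)                         # total weight of this color
        .SD[,.(frac=sum(weight)/totwt),by=style]    # style weight / color weight
    },by=color]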
     #    color  style      frac
     # 1:   red  round 0.3466667
     # 2:   red pointy 0.3466667
     # 3:   red   flat 0.3066667
     # 4: green pointy 0.3333333
     # 5: green   flat 0.3200000
     # 6: green  round 0.3466667
     # 7:  blue   flat 0.3866667
     # 8:  blue  round 0.2933333
     # 9:  blue pointy 0.3200000
    # 10: black  round 0.3733333
    # 11: black pointy 0.3333333
    # 12: black   flat 0.2933333
    

    How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.
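
The same denominator-first idea can also be written as two flat steps instead of a nested call. This is only a sketch on a small made-up table: `toy`, `res`, `wt` and `nested` are hypothetical names, and the data is not the question's `widgets`. It sums weight per (color, style), then divides by the per-color total, and compares the result with the nested form used above.

```r
library(data.table)

# Hypothetical toy data (not the question's widgets table)
toy <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "round", "flat", "flat", "pointy"),
  weight = c(1, 2, 1, 3, 1)
)

# Step 1: numerator, total weight per (color, style)
res <- toy[, .(wt = sum(weight)), by = .(color, style)]

# Step 2: denominator, total weight per color, applied group by group
res[, frac := wt / sum(wt), by = color]

# Nested form from the answer, for comparison
nested <- toy[, {
  totwt <- sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]

print(res)
```

Both forms should give the same fractions. The flat version also keeps the summed weight (`wt`) in the result, which may or may not be wanted.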


    Alternatives. If styles repeat within each color and this is only for display purposes, try a table:

    # A
    widgets[,
      prop.table(table(color,style),1)
    ]
    #        style
    # color   flat pointy round
    #   black 0.32   0.32  0.36
    #   blue  0.36   0.32  0.32
    #   green 0.32   0.36  0.32
    #   red   0.32   0.32  0.36
    
    # B
    widgets[,rep(1L,sum(weight)),by=.(color,style)][,
      prop.table(table(color,style),1)
    ]
    
    #        style
    # color        flat    pointy     round
    #   black 0.2933333 0.3333333 0.3733333
    #   blue  0.3866667 0.3200000 0.2933333
    #   green 0.3200000 0.3333333 0.3466667
    #   red   0.3066667 0.3466667 0.3466667
    

    For B, this expands the data so that there is one row for each unit of weight. With large data, such an expansion is a bad idea because of the memory it costs. Also, weight has to be an integer; otherwise, each group's sum is silently truncated to a whole number (e.g., try rep(1,2.5) # [1] 1 1).
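
If the expansion is a concern, one possible substitute (not part of the answer above) is base R's `xtabs`, which sums a weight column directly into a contingency table, so no rows are replicated and non-integer weights are kept. A minimal sketch on made-up data; `toy`, `wtab`, `frac` and `n_rep` are hypothetical names:

```r
library(data.table)

# Hypothetical toy data with non-integer weights
toy <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "round", "flat", "flat", "pointy"),
  weight = c(0.5, 1, 0.5, 1.5, 0.5)
)

# Sum weight into a color x style table, without expanding rows
wtab <- xtabs(weight ~ color + style, data = toy)

# Row-wise proportions: each style's share of its color's weight
frac <- prop.table(wtab, 1)
print(frac)

# The truncation mentioned above: a fractional count is cut down
n_rep <- length(rep(1L, 2.5))
```

Like the `table` approach, this is mainly useful for display; styles that never occur with a color show up as 0 in the result.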
