Group/bin/bucket data in R and get count per bucket and sum of values per bucket

前端 未结 1 395
北恋
北恋 2020-12-19 02:46

I wish to bucket/group/bin data :

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.         


        
相关标签:
1条回答
  • 2020-12-19 02:55

    From the comments, "C2" seems to be "character" column with % as suffix. Before, creating a group, remove the % using sub, convert to "numeric" (as.numeric). The variable "group" is created (transform(df,...)) by using the function cut with breaks (group buckets/intervals) and labels (for the desired group labels) arguments. Once the group variable is created, the sum of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate from "base R"

    df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
        breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
          labels=c('<0.005', 0.005, 0.01, 0.014)))
    
     res <- do.call(data.frame,aggregate(C1~group, df1, 
            FUN=function(x) c(Count=length(x), Sum=sum(x))))
    
     dNew <- data.frame(group=levels(df1$group))
     merge(res, dNew, all=TRUE)
     #   group C1.Count    C1.Sum
     #1 <0.005        2 3491509.6
     #2  0.005       NA        NA
     #3   0.01        2  302997.1
     #4  0.014        8  364609.5
    

    or you can use data.table. setDT converts the data.frame to data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within the list(. .N gives the count of elements within each "group".

     library(data.table)
      setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
    

    Or using dplyr. The %>% connect the LHS with RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get summary count and sum of the concerned column. summarise_each would be useful if there are more than one column.

     library(dplyr)
     df1 %>%
          group_by(group) %>% 
          summarise_each(funs(n(), Sum=sum(.)), C1)
    

    Update

    Using the new dataset df

    df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                                 labels=c('<0.005', 0.005, 0.01, 0.014)))
    
    res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
           FUN=function(x) c(Count=length(x), Sum=sum(x))))
    res
    #  group C1.Count    C1.Sum C3.Count C3.Sum
    #1 <0.005        2 3491509.6        2  91233
    #2   0.01        2  302997.1        2  88843
    #3  0.014        8  364609.5        8 268809
    

    and you can do the merge as detailed above.

    The dplyr approach would be the same except specifying the additional variable

     df1%>%
          group_by(group) %>%
           summarise_each(funs(n(), Sum=sum(.)), C1, C3)
     #Source: local data frame [3 x 5]
    
     #  group C1_n C3_n    C1_Sum C3_Sum
     #1 <0.005    2    2 3491509.6  91233
     #2   0.01    2    2  302997.1  88843
     #3  0.014    8    8  364609.5 268809
    

    data

    df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
    13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
    85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
    "0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
    "0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
    "C2"), row.names = c(NA, -12L), class = "data.frame")
    
    0 讨论(0)
提交回复
热议问题