Consolidate duplicate rows

暖寄归人 2020-12-01 02:07

I have a data frame where one column contains species' names and the second column contains abundance values. Due to the sampling procedure, some species appear more than once (i.e.,

6 answers
  • 2020-12-01 02:21

    A MWE to verify that a formula which takes a second grouping variable into account (here "Z", in addition to "X") actually works:

    # Start with one row, then append the remaining rows one at a time.
    example <- data.frame(X = c("x"), Z = c("a"), Y = c(1), stringsAsFactors = FALSE)
    example <- rbind(example, c("y", "b", 1))
    example <- rbind(example, c("z", "a", 0.5))
    example <- rbind(example, c("x", "b", 1))
    example <- rbind(example, c("x", "b", 2))
    example <- rbind(example, c("y", "b", 10))
    # The appended rows are character vectors, so make sure the grouping
    # columns are factors and Y is numeric before aggregating.
    example$X <- as.factor(example$X)
    example$Z <- as.factor(example$Z)
    example$Y <- as.numeric(example$Y)
    # Sum Y within each combination of X and Z.
    example_agg <- aggregate(Y ~ X + Z, data = example, FUN = sum)
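
    The MWE above does not show its result. As a hedged sketch, the same six rows can be built in a single data.frame() call and the group sums checked directly; the lookups at the end are only an illustration and not part of the original answer.

```r
# Same six rows as the MWE, built in one call.
example <- data.frame(
  X = factor(c("x", "y", "z", "x", "x", "y")),
  Z = factor(c("a", "b", "a", "b", "b", "b")),
  Y = c(1, 1, 0.5, 1, 2, 10)
)

# Sum Y within each combination of X and Z that occurs in the data.
example_agg <- aggregate(Y ~ X + Z, data = example, FUN = sum)

# Look up individual group sums.
sum_x_b <- with(example_agg, Y[X == "x" & Z == "b"])  # 1 + 2 = 3
sum_y_b <- with(example_agg, Y[X == "y" & Z == "b"])  # 1 + 10 = 11
print(example_agg)
```

    Only the four X/Z combinations that occur in the data appear in the result; combinations with no rows are dropped.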
    
  • 2020-12-01 02:28

    A dplyr solution:

    library(dplyr)
    df %>% group_by(x) %>% summarise(y = sum(y))
    
  • 2020-12-01 02:28
    > tapply(df$y, df$x, sum)
    sp1 sp2 sp3 sp4 
      2   9   7   3 
    

    If it has to be a data.frame, Ben's answer works great. Or you can coerce the tapply output:

    > out <- tapply(df$y, df$x, sum)
    > data.frame(x=names(out), y=out, row.names=NULL)
        x y
    1 sp1 2
    2 sp2 9
    3 sp3 7
    4 sp4 3
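
    The question's data frame is not reproduced in this thread, so the sketch below uses a hypothetical df whose values are assumptions chosen only so that the per-species sums match the output above. It then repeats the tapply coercion on that data.

```r
# Hypothetical data: the actual values from the question are unknown.
df <- data.frame(
  x = c("sp1", "sp1", "sp2", "sp2", "sp3", "sp4"),
  y = c(1, 1, 4, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Named 1-d array of sums, one entry per species.
out <- tapply(df$y, df$x, sum)

# Coerce to a data frame; as.vector() drops the array attributes.
res <- data.frame(x = names(out), y = as.vector(out), row.names = NULL)
print(res)
```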
    
  • 2020-12-01 02:30

    A data.table solution, for time and memory efficiency:

    library(data.table)
    DT <- as.data.table(df)
    # which columns are numeric 
    numeric_cols <- which(sapply(DT, is.numeric))
    DT[, lapply(.SD, sum), by = x, .SDcols = numeric_cols]
    

    Or, in your case, since you know there is only one column, y, that you wish to sum over:

    DT[, list(y=sum(y)),by=x]
    
  • 2020-12-01 02:31

    This works:

    library(plyr)
    ddply(df,"x",numcolwise(sum))
    

    In words: (1) split the data frame df by the "x" column; (2) for each chunk, take the sum of each numeric-valued column; (3) stick the results back into a single data frame. (The dd in ddply stands for "take a data frame as input, return a data frame".)
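
    The three steps can also be spelled out by hand in base R. This is only a sketch of the split/apply/combine idea, not how plyr is implemented; the df below is hypothetical, and unlike numcolwise(sum) the sketch sums only the single column y.

```r
# Hypothetical data: the actual values from the question are unknown.
df <- data.frame(
  x = c("sp1", "sp1", "sp2", "sp2", "sp3", "sp4"),
  y = c(1, 1, 4, 5, 7, 3),
  stringsAsFactors = FALSE
)

# (1) Split the data frame into one chunk per value of x.
chunks <- split(df, df$x)

# (2) For each chunk, build a one-row data frame holding the sum of y.
sums <- lapply(chunks, function(chunk) {
  data.frame(x = chunk$x[1], y = sum(chunk$y), stringsAsFactors = FALSE)
})

# (3) Stick the one-row results back into a single data frame.
result <- do.call(rbind, sums)
rownames(result) <- NULL
print(result)
```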

    Another, possibly clearer, approach:

    aggregate(y~x,data=df,FUN=sum)
    

    See "quick/elegant way to construct mean/variance summary table" for a related (slightly more complex) question.

  • 2020-12-01 02:43

    Simple as aggregate:

    aggregate(df['y'], by=df['x'], sum)
    