Remove duplicated rows

清酒与你 2020-11-22 00:00

I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in th

11 answers
  •  天命终不由人
    2020-11-22 01:01

    The data.table package has unique and duplicated methods of its own, with some additional features.

    Both unique.data.table and duplicated.data.table take an additional by argument, which accepts either a character vector of column names or an integer vector of column positions. Only those columns are used when checking for duplicates:

    library(data.table)
    DT <- data.table(id = c(1,1,1,2,2,2),
                     val = c(10,20,30,10,20,30))
    
    unique(DT, by = "id")
    #    id val
    # 1:  1  10
    # 2:  2  10
    
    duplicated(DT, by = "id")
    # [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
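
    To tie this back to the question of actually dropping the duplicated rows, here is a minimal sketch. The object names first_rows, last_rows, by_pos and DF_first are only illustrative; the fromLast argument and integer by values are used as described in the data.table documentation, and the last lines show the usual base R idiom for a plain data.frame.

    ```r
    library(data.table)

    DT <- data.table(id  = c(1, 1, 1, 2, 2, 2),
                     val = c(10, 20, 30, 10, 20, 30))

    # Keep the first row for each id (same rows as unique(DT, by = "id"))
    first_rows <- DT[!duplicated(DT, by = "id")]

    # Keep the last row for each id instead
    last_rows <- unique(DT, by = "id", fromLast = TRUE)

    # by also accepts column positions; 1L refers to the first column, id
    by_pos <- unique(DT, by = 1L)

    # Base R equivalent for a plain data.frame
    DF <- as.data.frame(DT)
    DF_first <- DF[!duplicated(DF$id), ]
    ```

    Here first_rows, by_pos and DF_first keep the rows with val 10, while last_rows keeps the rows with val 30.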
    

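    Another important feature of these methods is a large performance gain on bigger data sets. In the benchmarks below, the data.table methods are roughly 60 to 70 times faster at the median than their data.frame counterparts: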

    library(microbenchmark)
    library(data.table)
    set.seed(123)
    DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
    DT <- copy(DF)
    setDT(DT)
    
    microbenchmark(unique(DF), unique(DT))
    # Unit: microseconds
    #       expr       min         lq      mean    median        uq       max neval cld
    # unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
    # unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a 
    
    
    microbenchmark(duplicated(DF), duplicated(DT))
    # Unit: microseconds
    #           expr       min         lq       mean     median        uq        max neval cld
    # duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
    # duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a 
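
    As a quick sanity check that the faster method keeps the same rows, the two results can be compared directly. This is only a sketch: it assumes a much smaller value range than the benchmark above so that duplicate rows actually occur, and the names base_res and dt_res are illustrative.

    ```r
    library(data.table)

    set.seed(123)
    # Two integer columns with values 1..5, so many rows repeat
    DF <- as.data.frame(matrix(sample(5, 1e4, replace = TRUE), ncol = 2))
    DT <- as.data.table(DF)

    base_res <- unique(DF)  # data.frame method
    dt_res   <- unique(DT)  # data.table method, all columns by default

    # Both keep the first occurrence of each row, in the same order
    same_rows <- identical(base_res$V1, dt_res$V1) &&
                 identical(base_res$V2, dt_res$V2)
    same_rows
    ```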
    
