Minus operation of data frames

前端 未结 7 719
忘掉有多难
忘掉有多难 2020-12-05 18:53

I have 2 data frames df1 and df2.

df1 <- data.frame(c1=c(\"a\",\"b\",\"c\",\"d\"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c(\"         


        
相关标签:
7条回答
  • 2020-12-05 19:02

    You can create identifier columnas then subset:

    e.g.

    df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
    df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
    merge(df1,df2)
    #  c1 c2 indf1 indf2
    #1  c  3     Y     Y
    #2  d  4     Y     Y
    
    bigdf <- merge(df1,df2,all=TRUE)
    #  c1 c2 indf1 indf2
    #1  a  1     Y  <NA>
    #2  b  2     Y  <NA>
    #3  c  3     Y     Y
    #4  d  4     Y     Y
    #5  e  5  <NA>     Y
    #6  f  6  <NA>     Y
    

    Then subset how you wish:

     bigdf[is.na(bigdf$indf1) ,]
    #  c1 c2 indf1 indf2
    #5  e  5  <NA>     Y
    #6  f  6  <NA>     Y
    
     bigdf[is.na(bigdf$indf2) ,]  #<- output you requested those not in df2
    #  c1 c2 indf1 indf2
    #1  a  1     Y  <NA>
    #2  b  2     Y  <NA>
    
    0 讨论(0)
  • 2020-12-05 19:03

    I prefer sqldf package:

    require(sqldf)
    sqldf("select * from df1 except select * from df2")
    
    ##   c1 c2
    ## 1  a  1
    ## 2  b  2
    
    0 讨论(0)
  • 2020-12-05 19:05

    One issue with https://stackoverflow.com/a/16144262/2055486 is it assumes neither data frame already has duplicated rows. The following function removes that limitation and also works with arbitrary user defined columns in x or y.

    The implementation uses a similar idea to the implementation of duplicated.data.frame in concatenating the columns together with a separator. duplicated.data.frame uses "\r", which can cause collisions if the entries have embedded "\r" characters. This uses the ASCII record separator "\30" which will have a much lower chance of appearing in input data.

    setdiff.data.frame <- function(x, y,
        by = intersect(names(x), names(y)),
        by.x = by, by.y = by) {
      stopifnot(
        is.data.frame(x),
        is.data.frame(y),
        length(by.x) == length(by.y))
    
      !do.call(paste, c(x[by.x], sep = "\30")) %in% do.call(paste, c(y[by.y], sep = "\30"))
    }
    
    # Example usage
    # remove all 4 or 6 cylinder 4 gear cars or 8 cylinder 3 gear rows
    to_remove <- data.frame(cyl = c(4, 6, 8), gear = c(4, 4, 3))
    mtcars[setdiff.data.frame(mtcars, to_remove), ]
    #>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    #> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    #> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    #> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    #> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    #> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    #> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    #> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    #> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    
    # with differing column names
    to_remove2 <- data.frame(a = c(4, 6, 8), b = c(4, 4, 3))
    mtcars[setdiff.data.frame(mtcars, to_remove2, by.x = c("cyl", "gear"), by.y = c("a", "b")), ]
    #>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    #> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    #> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    #> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    #> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    #> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    #> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    #> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    #> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    
    0 讨论(0)
  • 2020-12-05 19:08

    I think the simplest solution is with dplyr (tidyverse).

    require(tidyverse)
    anti_join(df1, df2)
    
    0 讨论(0)
  • 2020-12-05 19:08

    If you're not planning on using any of the actual data in d2, then you don't need merge at all:

    df1[!(df1$c1 %in% df2$c1), ]
    
    0 讨论(0)
  • 2020-12-05 19:11

    I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.

    Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

    If you don't worry about rownames then you can do:

    df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
    #   c1 c2
    # 1  a  1
    # 2  b  2
    

    Edit: A data.table solution:

    dt1 <- data.table(df1, key="c1")
    dt2 <- data.table(df2)
    dt1[!dt2]
    

    or better one-liner (from v1.9.6+):

    setDT(df1)[!df2, on="c1"]
    

    This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.

    0 讨论(0)
提交回复
热议问题