Minus operation of data frames

前端未结

关注

 7  740

I have 2 data frames df1 and df2.

df1 <- data.frame(c1=c(\"a\",\"b\",\"c\",\"d\"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c(\"


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2020-12-05 19:02
              
            
            
                                                                       
You can create identifier columnas then subset:

e.g.

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
merge(df1,df2)
#  c1 c2 indf1 indf2
#1  c  3     Y     Y
#2  d  4     Y     Y

bigdf <- merge(df1,df2,all=TRUE)
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>
#3  c  3     Y     Y
#4  d  4     Y     Y
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y


Then subset how you wish:

 bigdf[is.na(bigdf$indf1) ,]
#  c1 c2 indf1 indf2
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

 bigdf[is.na(bigdf$indf2) ,]  #<- output you requested those not in df2
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  旧巷少年郎        
                
              
                            
                2020-12-05 19:03
              
            
            
                                                                       
I prefer sqldf package:

require(sqldf)
sqldf("select * from df1 except select * from df2")

##   c1 c2
## 1  a  1
## 2  b  2

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  时光取名叫无心        
                
              
                            
                2020-12-05 19:05
              
            
            
                                                                       
One issue with https://stackoverflow.com/a/16144262/2055486 is it assumes neither data frame already has duplicated rows. The following function removes that limitation and also works with arbitrary user defined columns in x or y.

The implementation uses a similar idea to the implementation of duplicated.data.frame in concatenating the columns together with a separator. duplicated.data.frame uses "\r", which can cause collisions if the entries have embedded "\r" characters. This uses the ASCII record separator "\30" which will have a much lower chance of appearing in input data.
    

setdiff.data.frame <- function(x, y,
    by = intersect(names(x), names(y)),
    by.x = by, by.y = by) {
  stopifnot(
    is.data.frame(x),
    is.data.frame(y),
    length(by.x) == length(by.y))

  !do.call(paste, c(x[by.x], sep = "\30")) %in% do.call(paste, c(y[by.y], sep = "\30"))
}

# Example usage
# remove all 4 or 6 cylinder 4 gear cars or 8 cylinder 3 gear rows
to_remove <- data.frame(cyl = c(4, 6, 8), gear = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

# with differing column names
to_remove2 <- data.frame(a = c(4, 6, 8), b = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove2, by.x = c("cyl", "gear"), by.y = c("a", "b")), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤街浪徒        
                
              
                            
                2020-12-05 19:08
              
            
            
                                                                       
I think the simplest solution is with dplyr (tidyverse).

require(tidyverse)
anti_join(df1, df2)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  再見小時候        
                
              
                            
                2020-12-05 19:08
              
            
            
                                                                       
If you're not planning on using any of the actual data in d2, then you don't need merge at all:

df1[!(df1$c1 %in% df2$c1), ]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天涯浪人        
                
              
                            
                2020-12-05 19:11
              
            
            
                                                                       
I remember coming across this exact issue quite a few months back. Managed to sift through my  Evernote one-liners. 

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment). 

If you don't worry about rownames then you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
#   c1 c2
# 1  a  1
# 2  b  2




Edit: A data.table solution:

dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]


or better one-liner (from v1.9.6+):

setDT(df1)[!df2, on="c1"]


This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复