Replace a subset of a data frame with dplyr join operations

前端未结


Suppose that I gave a treatment to some column values of a data frame like this:

  id animal weight   height ...
  1    dog     23.0
  2    cat     NA
  3


                      
4 Answers

北海茫月 (2020-12-16 14:30)

What you describe is an update join: a join that overwrites some values in the original dataset. data.table handles this efficiently because its joins are fast and the := operator updates columns by reference, without copying the table.

Here's an example for your toy data:

library(data.table)
setDT(df)             # convert to data.table without copy
setDT(sub_df)         # convert to data.table without copy

# join and update "df" by reference, i.e. without copy 
df[sub_df, on = c("id", "animal"), weight := i.weight]


The data is now updated:

#   id animal weight
#1:  1    dog   23.0
#2:  2    cat    2.2
#3:  3   duck    1.2
#4:  4  fairy    0.2
#5:  5  snake    1.3


You can use setDF to switch back to an ordinary data.frame.
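
For readers who want to run the update join end to end, here is a minimal self-contained sketch. The question's table is truncated, so the values in df and sub_df below are assumptions chosen to match the output shown above; only the join line itself comes from this answer. In the join, sub_df plays the role of the "i" table, and the i. prefix refers to its columns.

```r
library(data.table)

# Assumed toy data (illustrative values, not taken from the truncated question)
df <- data.frame(id = 1:5,
                 animal = c("dog", "cat", "duck", "fairy", "snake"),
                 weight = c(23, NA, 1.2, 0.2, NA),
                 stringsAsFactors = FALSE)
sub_df <- data.frame(id = c(2L, 5L),
                     animal = c("cat", "snake"),
                     weight = c(2.2, 1.3),
                     stringsAsFactors = FALSE)

setDT(df)      # convert in place, no copy
setDT(sub_df)

# For every row of sub_df, find the matching rows of df on (id, animal)
# and overwrite df's weight with sub_df's weight (i.weight).
df[sub_df, on = c("id", "animal"), weight := i.weight]

setDF(df)      # back to a plain data.frame, again in place
print(df)
```

Rows of df without a match in sub_df are left untouched, which is why only the cat and snake weights change.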
                                                                        
                                                        
            
            
              
                
悲&欢浪女 (2020-12-16 14:37)

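Remove the rows with NA weight first, then simply stack the tibbles. This assumes the rows being replaced are exactly the ones whose weight is NA:

library(dplyr)

bind_rows(filter(df, !is.na(weight)), sub_df)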
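
When the rows to replace are not exactly the NA rows, dropping them by key is a safer variant of the same idea. The sketch below is one way to do that with anti_join followed by bind_rows; it assumes id uniquely identifies a row, and it reuses the toy values that appear in a later answer (where the snake row holds "BAD" rather than NA).

```r
library(dplyr)

# Toy data as used elsewhere in this thread; weight is character
df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

result <- df %>%
  anti_join(sub_df, by = "id") %>%  # drop rows that sub_df will replace
  bind_rows(sub_df) %>%             # append the replacement rows
  arrange(id)                       # restore the original ordering by id

print(result)
```

Filtering on !is.na(weight) would have kept the "BAD" row and produced a duplicate id 5; matching on the key avoids that.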

                                                                        
                                                        
            
            
              
                
礼貌的吻别 (2020-12-16 14:41)

For anyone looking for a solution to use in a tidyverse pipeline:

I run into this problem a lot, and have written a short function, built mostly from tidyverse verbs, to get around it. It also handles the case where the original df has additional columns.

For example, if the OP's df had an additional 'height' column:

library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))


And the subset of data we wanted to join in was the same:

sub_df <- tibble(id = c(2, 5),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))


If we used the OP's method alone (anti_join %>% bind_rows), the replaced rows would end up with NA in 'height', because sub_df has no such column and bind_rows fills missing columns with NA. An extra step or two is needed.

In this case we could use the following function:

replace_subset <- function(df, df_subset, id_col_names = c()) {

  # columns of df_subset that are not keys, i.e. the ones carrying "new" data
  new_data_col_names <- setdiff(colnames(df_subset), id_col_names)

  # complete df_subset with the remaining columns from df
  # (all_of() makes the use of an external character vector explicit)
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = id_col_names)

  # drop the rows being replaced from df, then append the completed replacements
  df_out <- df %>%
    anti_join(df_sub_to_join, by = id_col_names) %>%
    bind_rows(df_sub_to_join)

  return(df_out)

}


Now for the results:

replace_subset(df = df, df_subset = sub_df, id_col_names = c("id"))

## A tibble: 5 x 4
#     id animal weight height
#  <dbl> <chr>  <chr>  <chr> 
#1     1 dog    23     54    
#2     3 duck   1.2    21    
#3     4 fairy  0.2    50    
#4     2 cat    2.2    45    
#5     5 snake  1.3    42  
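
Note that the replaced rows come back at the bottom of the result. If the original row order matters, one alternative (a different technique from the function above, sketched here under the assumption that id and animal together identify a row) is a left join followed by coalesce, which keeps both the row order and the extra columns:

```r
library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

result <- df %>%
  # bring the new values in alongside the old ones as weight_new
  left_join(sub_df, by = c("id", "animal"), suffix = c("", "_new")) %>%
  # prefer the new value where one exists, otherwise keep the old one
  mutate(weight = coalesce(weight_new, weight)) %>%
  select(-weight_new)

print(result)
```

One limitation of this sketch: a replacement value that is itself NA cannot overwrite an existing value, because coalesce falls back to the old weight.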



And here's an example using the function in a pipeline:

df %>%
  replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
  # across() supersedes mutate_at() in dplyr >= 1.0.0
  mutate(across(c(weight, height), as.numeric)) %>%
  mutate(bmi = weight / (height^2))

## A tibble: 5 x 5
#     id animal weight height      bmi
#  <dbl> <chr>   <dbl>  <dbl>    <dbl>
#1     1 dog      23       54 0.00789 
#2     3 duck      1.2     21 0.00272 
#3     4 fairy     0.2     50 0.00008 
#4     2 cat       2.2     45 0.00109 
#5     5 snake     1.3     42 0.000737



Hope this is helpful :)
                                                                        
                                                        
            
            
              
                
南方客 (2020-12-16 14:46)

Isn't dplyr::rows_update (available from dplyr 1.0.0) exactly what we need here? The following code should work:

df %>% dplyr::rows_update(sub_df, by = "id")

This works as long as the columns given in by (one or more) uniquely identify the rows, and every key in sub_df is also present in df.
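
A minimal self-contained sketch of this approach, assuming dplyr 1.0.0 or later and reusing the toy data from the previous answer (including its extra height column). The ids are written as integers in both tables here so the key types match exactly; that is a choice made for this example, not something the answer requires.

```r
library(dplyr)

df <- tibble(id = 1:5,
             animal = c("dog", "cat", "duck", "fairy", "snake"),
             weight = c("23", NA, "1.2", "0.2", "BAD"),
             height = c("54", "45", "21", "50", "42"))
sub_df <- tibble(id = c(2L, 5L),
                 animal = c("cat", "snake"),
                 weight = c("2.2", "1.3"))

# Overwrite the columns present in sub_df for the rows whose id matches;
# columns that exist only in df (height) are left as they are.
updated <- df %>% rows_update(sub_df, by = "id")

print(updated)
```

Unlike the anti_join and bind_rows approaches, the rows stay in their original order, so no re-sorting is needed afterwards.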
                                                                        
                                                        
            
            
              
                