Replace a subset of a data frame with dplyr join operations

前端 未结 4 483
栀梦
栀梦 2020-12-16 14:05

Suppose that I gave a treatment to some column values of a data frame like this:

  id animal weight   height ...
  1    dog     23.0
  2    cat     NA
  3           


        
相关标签:
4条回答
  • 2020-12-16 14:30

    What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).

    Here's an example for your toy data:

    library(data.table)
    setDT(df)             # convert to data.table without copy
    setDT(sub_df)         # convert to data.table without copy
    
    # join and update "df" by reference, i.e. without copy 
    df[sub_df, on = c("id", "animal"), weight := i.weight]
    

    The data is now updated:

    #   id animal weight
    #1:  1    dog   23.0
    #2:  2    cat    2.2
    #3:  3   duck    1.2
    #4:  4  fairy    0.2
    #5:  5  snake    1.3
    

    You can use setDF to switch back to ordinary data.frame.

    0 讨论(0)
  • 2020-12-16 14:37

    Remove the na's first, then simply stack the tibbles:

     bind_rows(filter(df,!is.na(weight)),sub_df)
    
    0 讨论(0)
  • 2020-12-16 14:41

    For anyone looking for a solution to use in a tidyverse pipeline:

    I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.

    For example, if the OP's df had an additional 'height' column:

    library(dplyr)
    
    df <- tibble(id = seq(1:5),
                     animal = c("dog", "cat", "duck", "fairy", "snake"),
                     weight = c("23", NA, "1.2", "0.2",  "BAD"),
                     height = c("54", "45", "21", "50", "42"))
    

    And the subset of data we wanted to join in was the same:

    sub_df <- tibble(id = c(2, 5),
                         animal = c("cat", "snake"),
                         weight = c("2.2", "1.3"))
    

    If we used the OP's method alone (anti_join %>% bind_rows), this won't work because of the additional 'height' column in df. An extra step or two is needed.

    In this case we could use the following function:

    replace_subset <- function(df, df_subset, id_col_names = c()) {
    
      # work out which of the columns contain "new" data
      new_data_col_names <- colnames(df_subset)[which(!colnames(df_subset) %in% id_col_names)]
    
      # complete the df_subset with the extra columns from df
      df_sub_to_join <- df_subset %>%
        left_join(select(df, -new_data_col_names), by = c(id_col_names))
    
      # join and bind rows
      df_out <- df %>%
        anti_join(df_sub_to_join, by = c(id_col_names)) %>%
        bind_rows(df_sub_to_join)
    
      return(df_out)
    
    }
    

    Now for the results:

    replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
    
    ## A tibble: 5 x 4
    #     id animal weight height
    #  <dbl> <chr>  <chr>  <chr> 
    #1     1 dog    23     54    
    #2     3 duck   1.2    21    
    #3     4 fairy  0.2    50    
    #4     2 cat    2.2    45    
    #5     5 snake  1.3    42  
    
    

    And here's an example using the function in a pipeline:

    df %>%
      replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
      mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
      mutate(bmi = weight / (height^2))
    
    ## A tibble: 5 x 5
    #     id animal weight height      bmi
    #  <dbl> <chr>   <dbl>  <dbl>    <dbl>
    #1     1 dog      23       54 0.00789 
    #2     3 duck      1.2     21 0.00272 
    #3     4 fairy     0.2     50 0.00008 
    #4     2 cat       2.2     45 0.00109 
    #5     5 snake     1.3     42 0.000737
    
    

    hope this is helpful :)

    0 讨论(0)
  • 2020-12-16 14:46

    Isn't dplyr::rows_update exactly what we need here? The following code should work:

    df %>% dplyr::rows_update(sub_df, by = "id")
    

    This should work as long as there is a unique identifier (one or multiple variables) for your datasets.

    0 讨论(0)
提交回复
热议问题