Replace NAs with values from another df

Question

I have two data frames, shown below. The first, df1, has ~15k records of the number of steps taken by time and date; the second, df2, holds the average number of steps per time interval. What I'm trying to do is go through df1 and replace the NA values with the avg.steps value from df2, but I can't seem to figure out how to do it in R. What would be the most efficient way to do this? And is there a way to do it using dplyr?

df1 looks like this:

steps <- c(51, 516, NA, NA, 161, 7)
interval <- c(915, 920, 925, 930, 935, 940)
df1 <- data.frame(steps, interval)

steps  interval
   51       915
  516       920
   NA       925
   NA       930
  161       935
    7       940

df2 looks like this:

avg.steps <- c(51, 516, 245, 0, 161, 7)
interval <- c(915, 920, 925, 930, 935, 940)
df2 <- data.frame(avg.steps, interval)

avg.steps  interval
       51       915
      516       920
      245       925
        0       930
      161       935
        7       940

Answer 1:

Here's how I'd do it using data.table v1.9.6:

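require(data.table) # v1.9.6+, for 'on=' feature
# assumes df1 and df2 are the data frames from the question
dt1 <- as.data.table(df1); dt2 <- as.data.table(df2)
dt1[is.na(steps), steps := dt2[.SD, avg.steps, on="interval"]]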

The first argument, i = is.na(steps), restricts the operation to just those rows where dt1$steps is NA. On those rows, we update dt1$steps by reference with :=. The replacement values come from a join used as a subset: .SD refers to the subset of data, i.e., those rows of dt1 where steps is NA.

For each row where steps is NA, we find the matching row in dt2 by joining on the "interval" column.

As an example, is.na(steps) selects the 3rd row of dt1 as one of the rows. Matching .SD$interval = 925 against dt2$interval returns the index 3 (the 3rd row of dt2). The corresponding avg.steps value is 245, so the 3rd row of dt1 gets updated with 245.

Hope this helps.

If dt2 has multiple matches for any dt1$interval value, you'll have to decide which value to update with. But I'm guessing it is not the case here.
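
If duplicates do turn up, one option is to collapse dt2 to a single row per interval before the update join. The following is only a sketch: the table dt2_dup, its duplicated 925 rows, and the choice of mean() as the tie-breaker are all assumptions made for illustration, not something from the question.

```r
library(data.table)

dt1 <- data.table(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))

# Hypothetical lookup table with two rows for interval 925 (illustration only)
dt2_dup <- data.table(avg.steps = c(51, 516, 240, 250, 0, 161, 7),
                      interval  = c(915, 920, 925, 925, 930, 935, 940))

# Collapse to one row per interval; mean() is an assumed tie-breaking rule
dt2_unique <- dt2_dup[, .(avg.steps = mean(avg.steps)), by = interval]

# Same update join as above, now against the de-duplicated table
dt1[is.na(steps), steps := dt2_unique[.SD, avg.steps, on = "interval"]]
```

Any other summary (first value, median, max) could be substituted for mean() depending on what the duplicates represent.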

Answer 2:

As long as the rows of df1 and df2 correspond one-to-one and are in the same order, it's quite straightforward:
df1$steps[is.na(df1$steps)] <- df2$avg.steps[is.na(df1$steps)]
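
The positional assignment above silently gives wrong values if the two data frames are not in the same row order. A minimal base R sketch that looks values up by interval instead, using match(), might look like this; df2 is deliberately reversed here (an assumption for illustration) to show that row order no longer matters.

```r
df1 <- data.frame(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))

# df2 in reversed row order, to show the lookup does not depend on alignment
df2 <- data.frame(avg.steps = c(7, 161, 0, 245, 516, 51),
                  interval  = c(940, 935, 930, 925, 920, 915))

missing <- is.na(df1$steps)

# match() returns, for each missing row's interval, its row position in df2
idx <- match(df1$interval[missing], df2$interval)

df1$steps[missing] <- df2$avg.steps[idx]
```

If an interval in df1 has no counterpart in df2, match() returns NA for it and that step count stays NA.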

EDIT: If they're non-corresponding, then here's a dplyr solution:

library(dplyr)

df1$steps[is.na(df1$steps)] <- (df1 %>% filter(is.na(steps)) %>% 
    group_by(interval) %>% 
    mutate(steps = rep(df2$avg.steps[df2$interval == interval[1]], length(interval)))$steps
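
An alternative dplyr formulation, not the one given in this answer, is to join the two data frames on interval and then take the first non-missing value with coalesce(). This is a sketch that assumes df2 has exactly one row per interval; with duplicates, left_join() would multiply the rows of df1. The name df1_filled is made up for the example.

```r
library(dplyr)

df1 <- data.frame(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))
df2 <- data.frame(avg.steps = c(51, 516, 245, 0, 161, 7),
                  interval  = c(915, 920, 925, 930, 935, 940))

df1_filled <- df1 %>%
  left_join(df2, by = "interval") %>%             # bring avg.steps alongside steps
  mutate(steps = coalesce(steps, avg.steps)) %>%  # keep steps, fall back to avg.steps when NA
  select(-avg.steps)                              # drop the helper column
```

Unlike the assignment above, this returns a new data frame and leaves df1 untouched.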

Source: https://stackoverflow.com/questions/33072348/replace-nas-with-value-from-another-df

Tags

dplyr