Question
I have the two data frames below. The first, df1, has ~15k records of the number of steps taken by date and time interval; the second, df2, holds the average number of steps per interval. I'm trying to go through df1 and replace the NA values with the avg.steps value from df2 for the same interval, but I can't seem to figure it out in R. What would be the most efficient way to do this? And is there a way to do it using dplyr?
df1 looks like this:
steps <- c(51, 516, NA, NA, 161, 7)
interval <- c(915, 920, 925, 930, 935, 940)
df1 <- data.frame(steps, interval)

steps  interval
   51       915
  516       920
   NA       925
   NA       930
  161       935
    7       940
df2 looks like this:
avg.steps <- c(51, 516, 245, 0, 161, 7)
interval <- c(915, 920, 925, 930, 935, 940)
df2 <- data.frame(avg.steps, interval)

avg.steps  interval
       51       915
      516       920
      245       925
        0       930
      161       935
        7       940
Answer 1:
Here's how I'd do it using data.table v1.9.6:
require(data.table) # v1.9.6+, for 'on=' feature
dt1[is.na(steps), steps := dt2[.SD, avg.steps, on="interval"]]
The first argument, i = is.na(steps), restricts the operation to the rows where dt1$steps is NA. On those rows, we update dt1$steps by performing a join as a subset. .SD refers to the subset of data, i.e., those rows where dt1$steps is NA.
For each row where steps is NA, we find the matching row in dt2 by joining on the "interval" column.
As an example, is.na(steps) would return the 3rd row of dt1 as one of the rows. Matching .SD$interval = 925 against dt2$interval returns the index 3 (the 3rd row of dt2). The corresponding avg.steps value is 245, so the 3rd row of dt1 gets updated with 245.
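For anyone who wants to run this end to end, here is a minimal sketch using the sample data from the question. It assumes dt1 and dt2 are data.tables built from the question's vectors (the answer does not show how they were created) and that data.table v1.9.6 or later is installed.

```r
library(data.table)  # assumes v1.9.6 or later, for the 'on=' argument

# Sample data from the question, built directly as data.tables (an assumption)
dt1 <- data.table(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))
dt2 <- data.table(avg.steps = c(51, 516, 245, 0, 161, 7),
                  interval  = c(915, 920, 925, 930, 935, 940))

# On rows where steps is NA, look up avg.steps in dt2 by interval
# and assign it by reference (dt1 itself is modified, no copy is made)
dt1[is.na(steps), steps := dt2[.SD, avg.steps, on = "interval"]]

print(dt1)
```

With this data, rows 3 and 4 of dt1 should end up holding 245 and 0. Because := updates by reference, there is no need to assign the result back to dt1.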
Hope this helps.
If dt2 has multiple matches for any dt1$interval value, you'll have to decide which value to update with. But I'm guessing that is not the case here.
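One way to check for that situation before updating is sketched below in base R. Collapsing duplicates by averaging is only an illustrative choice, not something the answer prescribes, and has_dupes and df2_unique are hypothetical names.

```r
# Sample lookup table from the question
df2 <- data.frame(avg.steps = c(51, 516, 245, 0, 161, 7),
                  interval  = c(915, 920, 925, 930, 935, 940))

# anyDuplicated() returns 0 when every interval is unique,
# otherwise the index of the first duplicated value
has_dupes <- anyDuplicated(df2$interval) > 0

# If duplicates did exist, one option is to collapse them to a single
# row per interval first, here by taking the mean (an assumption)
df2_unique <- aggregate(avg.steps ~ interval, data = df2, FUN = mean)
```

For the sample data every interval is unique, so has_dupes is FALSE and df2_unique has the same six intervals as df2.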
Answer 2:
As long as the rows of the two data frames correspond (same length, same order), it's quite straightforward:
df1$steps[is.na(df1$steps)] <- df2$avg.steps[is.na(df1$steps)]
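If the rows might not line up, a base R lookup with match() is one alternative that does not depend on row order. This is a sketch rather than part of the original answer; df2 is deliberately reversed here to show that order does not matter, and na_rows and idx are hypothetical helper names.

```r
# Sample data from the question
df1 <- data.frame(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))
df2 <- data.frame(avg.steps = c(51, 516, 245, 0, 161, 7),
                  interval  = c(915, 920, 925, 930, 935, 940))

# Reverse df2 so its rows no longer correspond to df1 (illustration only)
df2 <- df2[rev(seq_len(nrow(df2))), ]

# Logical index of the rows that need filling
na_rows <- is.na(df1$steps)

# For each NA row, find the position of its interval in df2
idx <- match(df1$interval[na_rows], df2$interval)

# Replace the NAs with the matching averages
df1$steps[na_rows] <- df2$avg.steps[idx]
```

match() returns NA for intervals that are missing from df2, so those rows would simply stay NA.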
EDIT: If they're non-corresponding, then here's a dplyr solution:
library(dplyr)
# For each NA row, look up the average for its interval in df2
df1$steps[is.na(df1$steps)] <- (df1 %>% filter(is.na(steps)) %>%
  group_by(interval) %>%
  mutate(steps = rep(df2$avg.steps[df2$interval == interval[1]], length(interval))))$steps
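A different dplyr formulation, offered here as a sketch and not as the answerer's method, is to join the averages onto df1 and take the first non-missing value. It assumes a dplyr version that provides coalesce() and that each interval appears only once in df2; filled is a hypothetical name.

```r
library(dplyr)  # assumes a version that provides coalesce()

# Sample data from the question
df1 <- data.frame(steps    = c(51, 516, NA, NA, 161, 7),
                  interval = c(915, 920, 925, 930, 935, 940))
df2 <- data.frame(avg.steps = c(51, 516, 245, 0, 161, 7),
                  interval  = c(915, 920, 925, 930, 935, 940))

filled <- df1 %>%
  left_join(df2, by = "interval") %>%          # bring avg.steps alongside steps
  mutate(steps = coalesce(steps, avg.steps)) %>%  # keep steps, fall back to avg.steps
  select(-avg.steps)                           # drop the helper column

print(filled)
```

left_join() keeps every row of df1 in its original order, so filled has the same shape as df1 with the NAs replaced.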
Source: https://stackoverflow.com/questions/33072348/replace-nas-with-value-from-another-df