Conditionally duplicating rows in a data frame

Question

This is a sample of my data set:

   day city count
1   1    A    50
2   2    A   100
3   2    B   110
4   2    C    90

Here is the code for reproducing it:

  df <- data.frame(
    day = c(1,2,2,2),
    city = c("A","A","B","C"),
    count = c(50,100,110,90)
    )

As you can see, the count is missing for cities B and C on day 1. What I want to do is use city A's count as an estimate for the other two cities. So the desired output would be:

   day city count
1   1    A    50
2   1    B    50
3   1    C    50
4   2    A   100
5   2    B   110
6   2    C    90

I could write a for loop to do this, but I feel there should be an easier way. My idea is to count the observations for each day and, for any day with fewer observations than there are cities in the data set, replicate the existing row to complete the data for that day. Are there any better ideas, or a more efficient for loop? Thanks.

Answer 1:

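With dplyr and tidyr, we can build every day/city combination with expand(), join the counts back in, and then fill the gaps within each day: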

library(dplyr)
library(tidyr)

df %>% 
  expand(day, city) %>% 
  left_join(df, by = c("day", "city")) %>%
  group_by(day) %>% 
  fill(count, .direction = "up") %>% 
  fill(count, .direction = "down")

Alternatively, complete() does the expand and the join in one step, as thelatemail suggested:

df %>% 
  complete(day, city) %>% 
  group_by(day) %>% 
  fill(count, .direction = "up") %>% 
  fill(count, .direction = "down")

Both return:

# A tibble: 6 x 3
    day city  count
  <dbl> <fct> <dbl>
1    1. A       50.
2    1. B       50.
3    1. C       50.
4    2. A      100.
5    2. B      110.
6    2. C       90.
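
If your tidyr is version 1.0.0 or later (an assumption about your setup), the two fill() calls can probably be collapsed into one by passing .direction = "updown". A minimal sketch, using the modified data from this answer and a result name (filled) that is only illustrative:

```r
library(dplyr)
library(tidyr)

# Modified data from this answer: day 1 is observed only for city B
df <- data.frame(
  day = c(1, 2, 2, 2),
  city = c("B", "A", "B", "C"),
  count = c(50, 100, 110, 90),
  stringsAsFactors = FALSE
)

filled <- df %>%
  complete(day, city) %>%                   # add the missing day/city rows (count = NA)
  group_by(day) %>%                         # fill within each day only
  fill(count, .direction = "updown") %>%    # fill upwards first, then downwards
  ungroup()

filled
```

This should give the same six rows as the two-step version; the ungroup() at the end is only there so that later steps do not inherit the grouping.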

Data (slightly modified so that day 1's only observation belongs to city B, which means count has to be filled in both directions):

df <- data.frame(
  day = c(1,2,2,2),
  city = c("B","A","B","C"),
  count = c(50,100,110,90)
)
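
For anyone who would rather not load dplyr and tidyr, roughly the same idea can be sketched in base R. This is only a sketch under an assumption of mine: each missing count is filled with the first observed count for that day, which for the question's original data happens to be city A's. The names grid and full are just illustrative.

```r
# Original data from the question
df <- data.frame(
  day = c(1, 2, 2, 2),
  city = c("A", "A", "B", "C"),
  count = c(50, 100, 110, 90),
  stringsAsFactors = FALSE
)

# Every day/city combination that should exist
grid <- expand.grid(
  day = unique(df$day),
  city = unique(df$city),
  stringsAsFactors = FALSE
)

# Left join: combinations missing from df get count = NA
full <- merge(grid, df, by = c("day", "city"), all.x = TRUE)

# Within each day, replace NA with the first non-missing count
# (assumption: any observed value for the day is an acceptable estimate)
full$count <- ave(full$count, full$day, FUN = function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
})

# Make the row order explicit and reset the row names
full <- full[order(full$day, full$city), ]
rownames(full) <- NULL
full
```

If a day had no observations at all, its counts would simply stay NA here, so that case would need its own rule.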

Source: https://stackoverflow.com/questions/49184893/conditionally-duplicating-rows-in-a-data-frame

Tags

for-loop

dataframe

dplyr

replicate