I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 joe@email.com
6 3 joe@email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
The reason I expect the latter is group_by's documentation, which says, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE: @aosmith pointed out this open issue on GitHub. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.
df %>% group_by(id) %>% fill(email)
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
Luckily you can still use zoo::na.locf for this:
df %>%
group_by(id) %>%
mutate(email = zoo::na.locf(email, na.rm = FALSE))
# Source: local data frame [6 x 2]
# Groups: id [3]
#
# id email
# (dbl) (fctr)
# 1 1 bob@email.com
# 2 1 bob@email.com
# 3 2 joe@email.com
# 4 2 joe@email.com
# 5 3 NA
# 6 3 NA
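If you would rather not pull in zoo, the same per-group carry-forward can be sketched in base R. This is only a minimal sketch: the helper name locf_vec is made up for illustration, and it assumes email is a character column (hence stringsAsFactors = FALSE), not a factor.

```r
# Hypothetical helper: carry the last non-NA value forward within one vector.
# Leading NAs (nothing to carry yet) are left as NA.
locf_vec <- function(x) {
  seen <- cumsum(!is.na(x))      # how many non-NA values seen so far
  vals <- c(NA, x[!is.na(x)])    # position 1 is the "nothing seen yet" slot
  vals[seen + 1]
}

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c('bob@email.com', NA, 'joe@email.com', NA, NA, NA),
                 stringsAsFactors = FALSE)

# ave() applies the helper separately within each id, so values never
# leak from one group into the next.
df$email <- ave(df$email, df$id, FUN = locf_vec)
df$email
# [1] "bob@email.com" "bob@email.com" "joe@email.com" "joe@email.com" NA NA
```

Group 3 stays NA because there is no earlier observation inside that group to carry forward.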
Another option is to use do from dplyr:
df3 <- df %>% group_by(id) %>% do(fill(.,email))
Two questions: does it have to be duplicated, and do you have to use dplyr and tidyr?
Maybe this could be a solution?
(
bar <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
)
#> id email
#> 1 bob@email.com
#> 1 <NA>
#> 2 joe@email.com
#> 2 <NA>
#> 3 <NA>
#> 3 <NA>
(
foo <- bar[!duplicated(bar$id),]
)
#> id email
#> 1 bob@email.com
#> 2 joe@email.com
#> 3 <NA>
This is kind of ugly, but it is another option that uses dplyr and works with your sample data:
df %>%
group_by(id) %>%
mutate(email = email[ !is.na(email) ][1])
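Note that email[!is.na(email)][1] picks the first non-missing value in each group and recycles it over every row, which matches a carry-forward only when a group holds at most one distinct value and it comes first. A small base R sketch of the difference (the helper names first_non_na and locf_vec are made up for illustration, and the vector x is invented data, not from the question):

```r
# Hypothetical helper: what the mutate() above does inside one group.
first_non_na <- function(x) rep(x[!is.na(x)][1], length(x))

# Hypothetical helper: a true last-observation-carried-forward.
locf_vec <- function(x) {
  seen <- cumsum(!is.na(x))      # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]  # slot 1 covers "nothing seen yet"
}

# One group with a leading NA and two different values.
x <- c(NA, "a", NA, "b", NA)

first_non_na(x)
# [1] "a" "a" "a" "a" "a"
locf_vec(x)
# [1] NA  "a" "a" "b" "b"
```

For the sample data in the question the two give the same answer, so the shortcut is fine there.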
Source: https://stackoverflow.com/questions/34517370/group-by-into-fill-not-working-as-expected