Question
I'm trying to do a Last Observation Carried Forward (LOCF) operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 joe@email.com
6 3 joe@email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
I expect the latter because group_by's documentation says, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, fill is clearly not being applied per group: the value from id 2 is carried into id 3.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE: @aosmith pointed out this open issue on GitHub. I don't think there will be a proper solution to this problem until that issue is resolved; anything else is just a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
Answer 1:
Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.
df %>% group_by(id) %>% fill(email)
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
Answer 2:
Luckily you can still use zoo::na.locf for this:
df %>%
  group_by(id) %>%
  mutate(email = zoo::na.locf(email, na.rm = FALSE))
# Source: local data frame [6 x 2]
# Groups: id [3]
#
# id email
# (dbl) (fctr)
# 1 1 bob@email.com
# 2 1 bob@email.com
# 3 2 joe@email.com
# 4 2 joe@email.com
# 5 3 NA
# 6 3 NA
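If adding zoo as a dependency is not an option, the same per-group carry-forward can be sketched in base R. This is only an illustration: the helper name locf and the data frame df_chr are hypothetical, and the email column is built as character (stringsAsFactors = FALSE) because the c() call inside the helper would not preserve a factor.

```r
# Hypothetical helper (not from zoo or tidyr): carry the last non-NA value forward.
locf <- function(x) {
  # idx counts the non-NA values seen so far; 0 means "nothing to carry yet".
  idx <- cumsum(!is.na(x))
  # Prepend NA so that idx == 0 maps to NA instead of dropping the element.
  c(NA, x[!is.na(x)])[idx + 1]
}

# Same sample data as in the question, but with a character email column.
df_chr <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                     email = c('bob@email.com', NA, 'joe@email.com', NA, NA, NA),
                     stringsAsFactors = FALSE)

# ave() applies locf separately within each id, so nothing leaks across groups.
df_chr$email <- ave(df_chr$email, df_chr$id, FUN = locf)
```

Because the helper runs once per id, the all-NA group for id 3 stays NA rather than inheriting the value from id 2.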
Answer 3:
Another option is to use do from dplyr:
df3 <- df %>% group_by(id) %>% do(fill(.,email))
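In more recent dplyr releases, group_modify() plays a similar role to do() for this kind of per-group data-frame operation. The sketch below assumes a dplyr version that provides group_modify() (it is not available in the version the question was written against), and the names df_chr and df3_alt are hypothetical.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question, with a character email column.
df_chr <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                     email = c('bob@email.com', NA, 'joe@email.com', NA, NA, NA),
                     stringsAsFactors = FALSE)

# group_modify() calls the function once per group; .x is that group's rows
# (without the grouping column), so fill() only ever sees a single id.
df3_alt <- df_chr %>%
  group_by(id) %>%
  group_modify(~ fill(.x, email))
```

As with do(), each group is handed to fill() as its own data frame, so the result does not depend on whether fill() itself respects grouping.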
Answer 4:
Two questions: do the rows have to stay duplicated, and do you have to use dplyr and tidyr?
Maybe this could be a solution?
(
bar <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
)
#> id email
#> 1 bob@email.com
#> 1 <NA>
#> 2 joe@email.com
#> 2 <NA>
#> 3 <NA>
#> 3 <NA>
(
foo <- bar[!duplicated(bar$id),]
)
#> id email
#> 1 bob@email.com
#> 2 joe@email.com
#> 3 <NA>
Answer 5:
This is kind of ugly, but it is another option that uses dplyr and works with your sample data:
df %>%
  group_by(id) %>%
  mutate(email = email[ !is.na(email) ][1])
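Note that this takes the first non-NA value in each group and repeats it for every row, which matches LOCF only when each id has at most one distinct email and it comes first. A small sketch of where the two differ, using made-up data (the changes data frame and the first_value column are hypothetical) and assuming a tidyr version in which fill() respects groups:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: a single id whose email changes, with a leading NA.
changes <- data.frame(id = c(1, 1, 1, 1, 1),
                      email = c(NA, 'old@email.com', NA, 'new@email.com', NA),
                      stringsAsFactors = FALSE)

compared <- changes %>%
  group_by(id) %>%
  # First-non-NA approach: one value recycled across the whole group.
  mutate(first_value = email[!is.na(email)][1]) %>%
  # LOCF: each value is only carried down to the rows that follow it.
  fill(email) %>%
  ungroup()
```

Here first_value is 'old@email.com' on every row, while the filled email column keeps the leading NA and switches to 'new@email.com' from the fourth row on.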
Source: https://stackoverflow.com/questions/34517370/group-by-into-fill-not-working-as-expected