R: Cleaning up a wide and untidy dataframe

问题

I have a data frame that looks like:

d<-data.frame(id=(1:9), 
                  grp_id=(c(rep(1,3), rep(2,3), rep(3,3))), 
                  a=rep(NA, 9), 
                  b=c("No", rep(NA, 3), "Yes", rep(NA, 4)), 
                  c=c(rep(NA,2), "No", rep(NA,6)), 
                  d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)), 
                  e=c(rep(NA, 7), "No", NA), 
                  f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
  id grp_id  a    b    c    d    e    f
1  1      1 NA   No <NA> <NA> <NA> <NA>
2  2      1 NA <NA> <NA> <NA> <NA>   No
3  3      1 NA <NA>   No <NA> <NA> <NA>
4  4      2 NA <NA> <NA>  Yes <NA> <NA>
5  5      2 NA  Yes <NA> <NA> <NA> <NA>
6  6      2 NA <NA> <NA> <NA> <NA>   No
7  7      3 NA <NA> <NA>   No <NA> <NA>
8  8      3 NA <NA> <NA> <NA>   No <NA>
9  9      3 NA <NA> <NA> <NA> <NA>   No

Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.

I'd like to create a single row for each grp_id to get a data frame that looks like the following:

grp_id  a    b    c    d    e    f
     1 NA   No   No <NA> <NA>   No
     2 NA  Yes <NA>  Yes <NA>   No
     3 NA <NA> <NA>   No   No   No

I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be

d %>% 
   group_by(grp_id) %>%
     summarise()

I would appreciate help with the commands within summarise, or any solution really. Thanks.

回答1:

We can use summarise_at and subset the first non-NA element

library(dplyr)
d %>%
   group_by(grp_id) %>%
   summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
#   grp_id     a      b      c      d      e      f
#    <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1      1    NA     No     No   <NA>   <NA>     No
#2      2    NA    Yes   <NA>    Yes   <NA>     No
#3      3    NA   <NA>   <NA>     No     No     No

In the example dataset, columns 'a' to 'f' are all factors with some having only 'No' levels. If it needs to be standardized with all the columns having the same levels, then we may need to call the factor with levels specified as c('Yes', 'No') in the summarise_at i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))

回答2:

We can use aggregate. No packages are used.

 YN <- function(x) c(na.omit(as.character(x)), NA)[1]
 aggregate(d[3:8], d["grp_id"], YN)

giving:

##   grp_id    a    b    c    d    e  f
## 1      1 <NA>   No   No <NA> <NA> No
## 2      2 <NA>  Yes <NA>  Yes <NA> No
## 3      3 <NA> <NA> <NA>   No   No No

The above gives character columns. If you prefer factor columns then use this:

YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)

Note: Other alternate implementations of YN are:

YN <- function(x) sort(as.character(x), na.last = TRUE)[1]

YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]

library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]

回答3:

You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)

In fact, a tidyr-only solution for your problem is very doable.

d %>%
    gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
    na.omit() %>%
    select(-id) %>%
    spread(col, value, fill=NA, drop=FALSE)

The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.

Here's a description of how it works:

gather(col, value, -id, -grp_id, factor_key=TRUE) %>%

This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formated "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.

na.omit() %>%

This removes all the NA values from the long data.

select(-id) %>%

This eliminates the id column.

spread(col, value, fill=NA, drop=FALSE)

This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.

I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)

mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%

anywhere between the gather() and spread() functions to ensure factor output.

来源：https://stackoverflow.com/questions/47380610/r-cleaning-up-a-wide-and-untidy-dataframe

标签

tidyr

data-cleaning

summarize