Collapsing rows where some are all NA, others are disjoint with some NAs

倾然丶 夕夏残阳落幕 提交于 2019-11-26 02:38:46

问题


I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reformat it as such:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I\'ve run across some problems.

complete.cases(DF) returns all FALSE, so I can\'t really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.

STEP TWO: collapse the remaining two rows into one.

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn\'t work at all.

Thanks for any advice.


回答1:


Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 

Edit: To account for the case in which one column has all NAs for a certain ID, we need sum_NA() function which returns NA if all are NAs

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40



回答2:


Here's a data table approach that uses na.omit() across the columns, grouped by ID.

library(data.table)
setDT(df)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40



回答3:


Here's a couple of aggregate attempts:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit )
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40



回答4:


the simple way is:

as.data.frame(lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)[!is.na(x)]))

but if not all columns have the same number of non-NA values then you'll need to trim them like so:

temp  <-  lapply(myData[,c('Col1','Col2','Col3','Col4')],function(x)x[!is.na(x)])
len  <-  min(sapply(temp,length))
as.data.frame(lapply(temp,`[`,seq(len)))


来源:https://stackoverflow.com/questions/28036294/collapsing-rows-where-some-are-all-na-others-are-disjoint-with-some-nas

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!