R: Sum Complete.cases in one column grouped by (or sorted by) a value in another column

Posted by 萝らか妹 on 2019-12-10 09:31:51

Question


I'm using the airquality data set available in R, and attempting to count the number of rows within the data that do not contain any NAs, while aggregating by Month.

The data looks like this:

head(airquality)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

As you can see, I have NAs in columns Ozone and Solar.R. I used the function complete.cases as follows:

x  <- airquality[,1] # for the Ozone
y  <- airquality[,2] # for the Solar.R
ok <- complete.cases(x,y)

And then to check:

nrow(airquality)
# [1] 153
sum(!ok)
# [1] 42
sum(ok)
# [1] 111

which is great.

But now I'd like to break that count down by Month (column 5), and this is where I'm running into problems: I can't get aggregate to group by the value in that column.

I was able to get the following to run. It doesn't group by Month yet (I just wanted to make sure I could get the function to run):

aggregate(x = sum(complete.cases(airquality)), by= list(nrow(airquality)), FUN = sum)
#   Group.1   x
# 1     153 111

OK, so now to group it. I am trying to use the by argument of aggregate to do the grouping. I tried several ways of referring to column 5 of airquality:

- airquality[,5]
- airquality[,"Month"]

I get these errors:

aggregate(x = sum(complete.cases(airquality)), by= list(airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) : 
#   arguments must have same length

aggregate(x = sum(complete.cases(airquality)), by= 
      list(sum(complete.cases(airquality)),airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) : 
#   arguments must have same length

I looked further into the documentation at ?aggregate, specifically the by argument:

by - a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.

I looked up ?factor, but can't see how to apply it (or whether it is even necessary in this case). I also tried passing break = to it, but that didn't work.

None of the "Questions that may already have your answer" seem to apply, many of which give solutions in C# and SQL.

Edit: Expected outcome

Count  Month
  24       5
   9       6
  26       7
  23       8
  29       9

Answer 1:


As an addition to the other answers, you could do it with dplyr.

library(dplyr)

# %>% replaces the older %.% chaining operator, which is no longer part of dplyr
airquality %>%
  group_by(Month) %>%
  summarize(incomplete = sum(!complete.cases(Ozone, Solar.R)),
            complete = sum(complete.cases(Ozone, Solar.R)))

#  Month incomplete complete
#1     5          7       24
#2     6         21        9
#3     7          5       26
#4     8          8       23
#5     9          1       29



Answer 2:


I like data.table for these kinds of problems. It handles grouping with by very well and intuitively.

require( data.table )
dt <- data.table( airquality )
dt[ , list( Count = sum( complete.cases( Ozone , Solar.R ) ) ), by = Month ]

#   Month Count
#1:     5    24
#2:     6  9
#3:     7 26
#4:     8 23
#5:     9 29

Keeping it in base R I'd do...

airquality$ok <- complete.cases( airquality$Ozone , airquality$Solar.R )
aggregate( ok ~ Month , data = airquality , FUN = sum )
#  Month ok
#1     5 24
#2     6  9
#3     7 26
#4     8 23
#5     9 29
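
For comparison, here is a minimal sketch of the by = interface the question was attempting. The errors there arise because x was a single number (the result of sum) while the grouping vector has 153 elements; passing the per-row logical vector makes the lengths match. The names ok_rows, counts and Count are illustrative, not from the original posts.

```r
# One TRUE/FALSE per row (length 153), the same length as airquality$Month
ok_rows <- complete.cases(airquality$Ozone, airquality$Solar.R)

# aggregate() splits ok_rows by Month and sums the TRUEs in each group
counts <- aggregate(x = ok_rows, by = list(Month = airquality$Month), FUN = sum)
names(counts)[2] <- "Count"  # rename the value column by position
counts
#   Month Count
# 1     5    24
# 2     6     9
# 3     7    26
# 4     8    23
# 5     9    29
```

This leaves airquality unmodified, which may be preferable if the extra ok column is not wanted.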

Edit: Another variation of @Simon's solution using data.table:

dt[complete.cases(Ozone, Solar.R), list(count = .N), by=Month]
#    Month count
# 1:     5    24
# 2:     6     9
# 3:     7    26
# 4:     8    23
# 5:     9    29

The variation is that we first filter/subset only those with no NAs and then get the aggregation by Month.

Note: .N is an inbuilt variable in data.table - an integer vector of length 1, which gives the total number of observations in that group.
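
As a rough base R analogue of the filter-then-count idea (not part of the data.table answer), the sketch below keeps only the complete rows and then counts the rows in each Month group, which is the role .N plays above. The names ok_rows, complete_only and n_per_month are illustrative.

```r
# Step 1: keep only rows where both Ozone and Solar.R are present
ok_rows <- complete.cases(airquality$Ozone, airquality$Solar.R)
complete_only <- airquality[ok_rows, ]

# Step 2: split the remaining rows by Month and count the rows in each piece
n_per_month <- sapply(split(complete_only, complete_only$Month), nrow)
n_per_month
#  5  6  7  8  9
# 24  9 26 23 29
```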




Answer 3:


This seems to be what you are looking for:

foo <- table(airquality[!ok,"Month"])
data.frame(Month=names(foo),Count=as.vector(foo))
#   Month Count
# 1     5     7
# 2     6    21
# 3     7     5
# 4     8     8
# 5     9     1

(This looks a little different from your edit. Is it possible that there is some small confusion between ok and !ok?)
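
To see both sets of counts side by side, a two-way table is one option. This is a sketch that rebuilds ok as in the question; the name both is illustrative.

```r
# ok is TRUE for rows where Ozone and Solar.R are both present
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)

# Cross-tabulate Month against ok: the FALSE column counts incomplete rows,
# the TRUE column counts complete rows
both <- table(Month = airquality$Month, ok = ok)
both
#      ok
# Month FALSE TRUE
#     5     7   24
#     6    21    9
#     7     5   26
#     8     8   23
#     9     1   29
```

The TRUE column matches the expected outcome in the question, and the FALSE column matches the counts shown in this answer.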



Source: https://stackoverflow.com/questions/23634572/r-sum-complete-cases-in-one-column-grouped-by-or-sorted-by-a-value-in-another
