Question
I'm using the airquality data set available in R, and attempting to count the number of rows within the data that do not contain any NAs, while aggregating by Month.
The data looks like this:
head(airquality)
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
As you can see, I have NAs in columns Ozone and Solar.R. I used the function complete.cases as follows:
x <- airquality[,1] # for the Ozone
y <- airquality[,2] # for the Solar.R
ok <- complete.cases(x,y)
And then to check:
nrow(airquality)
# [1] 153
sum(!ok)
# [1] 42
sum(ok)
# [1] 111
which is great.
But now I'd like to break that count down by Month (column 5), and this is where I'm running into problems: aggregating or grouping by the value in column 5 (Month).
I was able to get this to run, although it doesn't group by Month yet (I just wanted to make sure I could get the function to run):
aggregate(x = sum(complete.cases(airquality)), by= list(nrow(airquality)), FUN = sum)
# Group.1 x
# 1 153 111
OK, so on to the grouping. I am trying to use the by argument of the aggregate function to group by Month. I tried several ways of referring to column 5 of airquality:
- airquality[,5]
- airquality[,"Month"]
I get these errors:
aggregate(x = sum(complete.cases(airquality)), by= list(airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) :
# arguments must have same length
aggregate(x = sum(complete.cases(airquality)), by=
list(sum(complete.cases(airquality)),airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) :
# arguments must have same length
I looked further into the ?aggregate documentation, specifically the by argument:
by - a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.
I looked up ?factor, but can't see how to apply it (if it is even necessary in this case). I also tried putting break = into it, but that didn't work.
None of the "Questions that may already have your answer" seem to apply, many of which give solutions in C# and SQL.
Edit: Expected outcome
Count Month
24 5
9 6
26 7
23 8
29 9
Answer 1:
As an addition to the other answers, you could do it with dplyr.
library(dplyr)
airquality %>%
group_by(Month) %>%
summarize(incomplete = sum(!complete.cases(Ozone, Solar.R)),
complete = sum(complete.cases(Ozone, Solar.R)))
# Month incomplete complete
#1 5 7 24
#2 6 21 9
#3 7 5 26
#4 8 8 23
#5 9 1 29
Answer 2:
I like data.table for these kinds of problems. It handles grouping with by very well and intuitively:
require( data.table )
dt <- data.table( airquality )
dt[ , list( Count = sum( complete.cases( Ozone , Solar.R ) ) ), by = Month ]
# Month Count
#1: 5 24
#2: 6 9
#3: 7 26
#4: 8 23
#5: 9 29
Keeping it in base R I'd do...
airquality$ok <- complete.cases( airquality$Ozone , airquality$Solar.R )
aggregate( ok ~ Month , data = airquality , FUN = sum )
# Month ok
#1 5 24
#2 6 9
#3 7 26
#4 8 23
#5 9 29
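For comparison with the calls in the question, here is a minimal sketch of the same count using the x/by interface of aggregate instead of the formula interface. The "arguments must have same length" error arises because sum(complete.cases(airquality)) collapses to a single number before any grouping happens, while airquality[,5] has 153 elements. Passing the unsummed logical vector keeps both at one element per row. The names res, Count and Month are chosen here for illustration.

```r
# complete.cases() returns one TRUE/FALSE per row, so it lines up with Month
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)

# x and every element of `by` must have the same length (153 rows here);
# FUN = sum then counts the TRUE values within each Month group
res <- aggregate(x = list(Count = ok),
                 by = list(Month = airquality$Month),
                 FUN = sum)
res
#   Month Count
# 1     5    24
# 2     6     9
# 3     7    26
# 4     8    23
# 5     9    29
```

Naming the list elements (Count, Month) avoids the default Group.1 and x column names seen in the question's output.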
Edit: Another variation of @Simon's solution using data.table:
dt[complete.cases(Ozone, Solar.R), list(count = .N), by=Month]
# Month count
# 1: 5 24
# 2: 6 9
# 3: 7 26
# 4: 8 23
# 5: 9 29
The variation is that we first filter/subset only those with no NAs and then get the aggregation by Month.
Note:
.N is an inbuilt variable in data.table - an integer vector of length 1, which gives the total number of observations in that group.
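The same filter-first, then-count idea can be sketched in base R, assuming nothing beyond the built-in airquality data. The names complete_rows and counts are illustrative only.

```r
# Flag rows with no NA in Ozone or Solar.R
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)

# Step 1: keep only the complete rows
complete_rows <- airquality[ok, ]

# Step 2: count the remaining rows per Month
counts <- as.data.frame(table(Month = complete_rows$Month),
                        responseName = "count")
counts
```

One consequence of filtering first: a month with no complete rows at all would be missing from the result rather than shown with a count of 0, whereas the sum(complete.cases(...)) versions would report 0 for it. That does not happen with this data set, since every month has at least 9 complete rows.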
Answer 3:
This seems to be what you are looking for:
foo <- table(airquality[!ok, "Month"])
data.frame(Month = names(foo), Count = as.vector(foo))
# Month Count
# 1 5 7
# 2 6 21
# 3 7 5
# 4 8 8
# 5 9 1
(This looks a little different from your edit. Is it possible that there is some small confusion between ok and !ok?)
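To see both readings side by side, a two-way table of Month against the complete-case flag may help. This is only a sketch: ok is recomputed here so the snippet stands alone, and tab is an illustrative name.

```r
# TRUE = row has values for both Ozone and Solar.R
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)

# Cross-tabulate Month against the flag: one row per month, one column per flag value
tab <- table(Month = airquality$Month, complete = ok)
tab
#      complete
# Month FALSE TRUE
#     5     7   24
#     6    21    9
#     7     5   26
#     8     8   23
#     9     1   29
```

The TRUE column matches the expected outcome in the question's edit, and the FALSE column matches the counts obtained with !ok.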
Source: https://stackoverflow.com/questions/23634572/r-sum-complete-cases-in-one-column-grouped-by-or-sorted-by-a-value-in-another