Question
I'm using the airquality data set available in R, and attempting to count the number of rows within the data that do not contain any NAs, while aggregating by Month.
The data looks like this:
head(airquality)
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
As you can see, I have NAs in columns Ozone and Solar.R. I used the function complete.cases as follows:
x <- airquality[,1] # for the Ozone
y <- airquality[,2] # for the Solar.R
ok <- complete.cases(x,y)
And then to check:
nrow(airquality)
# [1] 153
sum(!ok)
# [1] 42
sum(ok)
# [1] 111
which is great.
But now I'd like to split those counts by Month (column 5), and this is where I'm running into problems: trying to aggregate or sort by the value in column 5 (Month).
I was able to get this to run. It doesn't group by Month yet (I just wanted to make sure I could get the function to run):
aggregate(x = sum(complete.cases(airquality)), by= list(nrow(airquality)), FUN = sum)
# Group.1 x
# 1 153 111
OK... so to sort it out, I am trying to use the by part of the aggregate function to group. I tried several ways of referring to column 5 within airquality:
- airquality[,5]
- airquality[,"Month"]
I get these errors:
aggregate(x = sum(complete.cases(airquality)), by= list(airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) :
# arguments must have same length
aggregate(x = sum(complete.cases(airquality)), by=
list(sum(complete.cases(airquality)),airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) :
# arguments must have same length
I tried to search further into the ?aggregate(x, ...) function, namely the by part...
by - a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.
I looked up ?factor, but can't see how to apply it (if it is even necessary in this case). I also tried putting break = into the call, but that didn't work.
None of the "Questions that may already have your answer" seem to apply, many of which give solutions in C# and SQL.
Edit: Expected outcome
Count Month
24 5
9 6
26 7
23 8
29 9
Answer 1:
As an addition to the other answers, you could do it with dplyr.
library(dplyr)
# %>% replaces the old %.% chain operator, which was removed from dplyr
airquality %>%
  group_by(Month) %>%
  summarize(incomplete = sum(!complete.cases(Ozone, Solar.R)),
            complete = sum(complete.cases(Ozone, Solar.R)))
# Month incomplete complete
#1 5 7 24
#2 6 21 9
#3 7 5 26
#4 8 8 23
#5 9 1 29
Answer 2:
I like data.table for these kinds of problems. It handles by-group operations well and intuitively...
require( data.table )
dt <- data.table( airquality )
dt[ , list( Count = sum( complete.cases( Ozone , Solar.R ) ) ), by = Month ]
# Month Count
#1: 5 24
#2: 6 9
#3: 7 26
#4: 8 23
#5: 9 29
Keeping it in base R I'd do...
airquality$ok <- complete.cases( airquality$Ozone , airquality$Solar.R )
aggregate( ok ~ Month , data = airquality , FUN = sum )
# Month ok
#1 5 24
#2 6 9
#3 7 26
#4 8 23
#5 9 29
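The x = / by = form from the question can also be made to work. The "arguments must have same length" error arises because sum(complete.cases(airquality)) collapses to a single number, while the by list has 153 elements. A minimal sketch, assuming the goal is still to count rows where both Ozone and Solar.R are present (the names Count, Month and res are only illustrative):

```r
# Per-row logical vector: TRUE where both Ozone and Solar.R are present.
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)

# x now has one element per row (153), the same length as the grouping
# vector in `by`, so aggregate() can split it by month and sum each piece.
res <- aggregate(x = list(Count = ok),
                 by = list(Month = airquality$Month),
                 FUN = sum)
res
```

Summing a logical vector counts its TRUE values, so FUN = sum yields the number of complete rows per month.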
Edit: Another variation of @Simon's solution using data.table:
dt[complete.cases(Ozone, Solar.R), list(count = .N), by=Month]
# Month count
# 1: 5 24
# 2: 6 9
# 3: 7 26
# 4: 8 23
# 5: 9 29
The variation is that we first filter/subset only those rows with no NAs and then get the aggregation by Month.
Note: .N is a built-in variable in data.table - an integer vector of length 1, which gives the total number of observations in that group.
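To see what .N reports with and without the row filter, here is a small sketch, assuming data.table is installed (the names all_rows and complete_rows are only illustrative):

```r
library(data.table)

dt <- data.table(airquality)

# No row filter: .N is the total number of rows in each month.
all_rows <- dt[, .N, by = Month]

# With the filter in i: .N counts only the rows that survive it.
complete_rows <- dt[complete.cases(Ozone, Solar.R), .N, by = Month]

all_rows
complete_rows
```

When .N is used unnamed in j, the result column is called N.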
Answer 3:
This seems to be what you are looking for:
> foo <- table(airquality[!ok,"Month"])
> data.frame(Month=names(foo),Count=as.vector(foo))
Month Count
1 5 7
2 6 21
3 7 5
4 8 8
5 9 1
(This looks a little different from your edit. Is it possible that there is some small confusion between ok and !ok?)
Source: https://stackoverflow.com/questions/23634572/r-sum-complete-cases-in-one-column-grouped-by-or-sorted-by-a-value-in-another