R aggregate gives differently structured results using subsets from the same data

Question

I'm making diurnal cycles of windspeed based on a dataframe (ball) of several years' hourly data. I want to plot them by season, so I subset out the dates I need and join them like this:

b8 = subset(ball, as.Date(date)>="2008-09-01 00:00:00, GMT" & as.Date(date)<= "2008-11-30 23:00:00, GMT"  )
b9  = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT"  )
b10 = subset(ball,  as.Date(date)>="2010-09-01 00:00:00, GMT" & as.Date(date)<= "2010-11-30 23:00:00, GMT")
ballspr = rbind(b8,b9,b10)

I then get a diurnal cycle using this:

sprwsdiurnal <- aggregate(ballspr["ws"], format(ballspr["date"],"%H"),summary, na.rm=T)

For three out of four seasons this makes an object with this structure:

   date                                               ws
1    00  0.200, 1.000, 1.600, 2.021, 2.500, 8.000, 5.000
2    01  0.100, 1.000, 1.600, 1.988, 2.500, 8.600, 1.000
3    02  0.100, 1.000, 1.700, 1.982, 2.600, 8.900, 1.000

...through to 24 hours...

23   22  0.100, 1.200, 1.800, 2.222, 2.950, 9.100, 1.000
24   23  0.100, 1.000, 1.600, 2.072, 2.700, 8.800, 1.000

This is what I want, as boxplot works with this:

par(  mar = c(5, 5, 2, 2))
boxplot(sprwsdiurnal$ws, col="dodger blue",pch=16,font.lab=2,cex.lab=1.5,cex.axis=2,xlab="Hour",range=0, ylab=quote(Windspeed ~ "(" * m ~ s ^-1 * ")"),xaxt="n",main="Spring")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5, font.lab=2)

The trouble is one season comes out like this:

      date ws.Min. ws.1st Qu. ws.Median ws.Mean ws.3rd Qu. ws.Max. ws.NA's
1    00   0.000      1.300     2.100   2.539      3.200  10.500   2.000
2    01   0.100      1.275     2.100   2.499      3.200   9.800   2.000
3    02   0.200      1.200     2.000   2.514      3.400   9.000   2.000

...through to 24 hours...

23   22   0.100      1.200     1.950   2.582      3.325  11.900   2.000
24   23   0.100      1.300     2.000   2.585      3.400  11.200   2.000

Boxplot does not work with this format. I can't explain why this happens, when all the code for each season is the same and they are being subsetted from the same dataframe. Why does one come out differently? Any ideas appreciated.

EDIT: Here's the data. I've checked these two seasons and they still give the two different formats shown above.

https://www.dropbox.com/s/v5kss0bgjyhrtw1/ball.csv

ball=read.csv("ball.csv", header=T)
ball$date = as.POSIXct(strptime(ball$date, format = "%Y-%m-%d %H:%M:%S", "GMT"))

win9  = subset(ball, as.Date(date)>="2009-06-01 00:00:00, GMT" & as.Date(date)<= "2009-08-31 23:00:00, GMT"  )
aut9  = subset(ball, as.Date(date)>="2009-03-01 00:00:00, GMT" & as.Date(date)<= "2009-05-31 23:00:00, GMT"  )
spr9  = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT"  )
sum9  = subset(ball, as.Date(date)>="2008-12-01 00:00:00, GMT" & as.Date(date)<= "2009-02-28 23:00:00, GMT"  )


sprdiurnal <- aggregate(spr9["ws"], format(spr9["date"],"%H"),summary, na.rm=T)
par(  mar = c(5, 5, 4, 2))
 boxplot(sprdiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5) 

windiurnal <- aggregate(win9["ws"], format(win9["date"],"%H"),summary, na.rm=T)
par(  mar = c(5, 5, 4, 2))
boxplot(windiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)

Answer 1:

The "problem", so far as I can tell, is that for "sprdiurnal" summary returns the same number of values for every hour, so aggregate simplifies the result into a rectangular matrix. For your other subsets, some hours include NA values and others don't, so the summaries have different lengths (summary adds an NA's entry only when NAs are present). That result is not rectangular, so R stores it as a list.

I'll demonstrate with the "iris" dataset, but first, I'll also create an "iris_2" dataset that has one NA value.

iris_2 <- iris
iris_2$Sepal.Length[10] <- NA

Let's compare the aggregation output, which in these cases will just be the second column. You'll see that the "iris" dataset, which has no missing values, returns a rectangular matrix as the second "column" of the data.frame. Because of our one NA value, the summaries for the "iris_2" dataset have different lengths, so that column gets stored as a list instead, which is what you want for your particular purpose.

(irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary))[[2]]
#      Min. 1st Qu. Median  Mean 3rd Qu. Max.
# [1,]  4.3   4.800    5.0 5.006     5.2  5.8
# [2,]  4.9   5.600    5.9 5.936     6.3  7.0
# [3,]  4.9   6.225    6.5 6.588     6.9  7.9
(iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary))[[2]]
# $`0`
#     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    4.300   4.800   5.000   5.008   5.200   5.800       1 
# 
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900
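
A quick way to confirm which structure you have is to inspect the second column directly. The sketch below rebuilds the two example aggregates and checks their types; the names m and l are only illustrative.

```r
# Rebuild the two example datasets so the block runs on its own.
iris_2 <- iris
iris_2$Sepal.Length[10] <- NA

irisagg   <- aggregate(iris["Sepal.Length"],   iris["Species"],   summary)
iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary)

m <- irisagg[[2]]    # no NAs: every summary has 6 values
l <- iris_2agg[[2]]  # one NA: the first summary gains an "NA's" entry

is.matrix(m)         # TRUE
is.list(l)           # TRUE
sapply(l, length)    # summary lengths differ between groups, so no matrix can be built
```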

Here's how we would convert the matrix from "irisagg" into a list, with one element per row.

irisagg$Summary <- unlist(apply(irisagg[[2]], 1, list), recursive = FALSE)
irisagg$Summary
# [[1]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.300   4.800   5.000   5.006   5.200   5.800 
# 
# [[2]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# [[3]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900

Of course, a much more direct approach would be to make use of the simplify argument for aggregate and do:

(iris_3agg <- aggregate(iris["Sepal.Length"], 
                        iris["Species"], summary, 
                        simplify = FALSE))[[2]]
# $`0`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.300   4.800   5.000   5.006   5.200   5.800 
# 
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   5.600   5.900   5.936   6.300   7.000 
# 
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   4.900   6.225   6.500   6.588   6.900   7.900

Applying it to your example, "sprdiurnal" is the subset that's giving you trouble. View sprdiurnal$ws by itself and verify that it's a matrix. Let's convert it to a list.

sprdiurnal$ws2 <- unlist(apply(sprdiurnal$ws, 1, list), recursive=FALSE)

Now you can proceed with boxplot as you were doing with the other seasons.

boxplot(sprdiurnal$ws2, e..t..c...)
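
As an alternative to converting the matrix, boxplot has a matrix method with a use.cols argument; setting use.cols = FALSE groups the values by row instead of by column. The following is a sketch on the "iris" aggregate rather than on your data, and plot = FALSE is used only so the grouping can be inspected without drawing anything.

```r
irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary)
m <- irisagg[[2]]                       # 3 x 6 numeric matrix

# Default matrix behaviour: one box per column (per summary statistic).
by_col <- boxplot(m, plot = FALSE)

# use.cols = FALSE: one box per row (per group), like the list version.
by_row <- boxplot(m, use.cols = FALSE, plot = FALSE)

ncol(by_col$stats)  # 6
ncol(by_row$stats)  # 3
```

For the data in the question, that would presumably be boxplot(sprdiurnal$ws, use.cols = FALSE, ...) with the same graphical arguments as for the other seasons.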

Or, remake your sprdiurnal object using:

sprdiurnal <- aggregate(spr9["ws"], 
                        format(spr9["date"],"%H"), 
                        summary, na.rm = TRUE, 
                        simplify = FALSE)

And proceed as before.
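
To see the effect on hourly data without downloading the CSV, here is a minimal sketch on made-up values. The toy data frame, its dates, and the random wind speeds are assumptions for illustration only; they are not the data from the question.

```r
# Hypothetical stand-in for the question's data: 10 days of hourly wind speeds.
set.seed(1)
dates <- seq(as.POSIXct("2009-09-01 00:00:00", tz = "GMT"),
             by = "hour", length.out = 24 * 10)
toy <- data.frame(date = dates, ws = round(runif(length(dates), 0, 10), 1))

# Group by hour of day, with and without simplification.
hour <- list(date = format(toy$date, "%H", tz = "GMT"))
as_matrix <- aggregate(toy["ws"], hour, summary)                    # no NAs: matrix
as_list   <- aggregate(toy["ws"], hour, summary, simplify = FALSE)  # always a list

is.matrix(as_matrix$ws)  # TRUE
is.list(as_list$ws)      # TRUE
length(as_list$ws)       # 24
```

With simplify = FALSE the structure no longer depends on whether a given season happens to contain NA values, so the same boxplot call can be reused for every season.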

Source: https://stackoverflow.com/questions/14634633/r-aggregate-gives-differently-structured-results-using-subsets-from-the-same-dat

Tags

aggregate

boxplot

summary