Question
I'm making diurnal cycles of wind speed from a data frame (ball) containing several years of hourly data. I want to plot them by season, so I subset the dates I need and join them like this:
b8 = subset(ball, as.Date(date)>="2008-09-01 00:00:00, GMT" & as.Date(date)<= "2008-11-30 23:00:00, GMT" )
b9 = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT" )
b10 = subset(ball, as.Date(date)>="2010-09-01 00:00:00, GMT" & as.Date(date)<= "2010-11-30 23:00:00, GMT")
ballspr = rbind(b8,b9,b10)
I then get a diurnal cycle using this:
sprwsdiurnal <- aggregate(ballspr["ws"], format(ballspr["date"],"%H"),summary, na.rm=T)
For three out of four seasons this produces an object with this structure:
   date                                              ws
1    00 0.200, 1.000, 1.600, 2.021, 2.500, 8.000, 5.000
2    01 0.100, 1.000, 1.600, 1.988, 2.500, 8.600, 1.000
3    02 0.100, 1.000, 1.700, 1.982, 2.600, 8.900, 1.000
...through to 24 hours...
23   22 0.100, 1.200, 1.800, 2.222, 2.950, 9.100, 1.000
24   23 0.100, 1.000, 1.600, 2.072, 2.700, 8.800, 1.000
This is what I want as boxplot will work with this:
par( mar = c(5, 5, 2, 2))
boxplot(sprwsdiurnal$ws, col="dodger blue",pch=16,font.lab=2,cex.lab=1.5,cex.axis=2,xlab="Hour",range=0, ylab=quote(Windspeed ~ "(" * m ~ s ^-1 * ")"),xaxt="n",main="Spring")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5, font.lab=2)
The trouble is one season comes out like this:
   date ws.Min. ws.1st Qu. ws.Median ws.Mean ws.3rd Qu. ws.Max. ws.NA's
1    00   0.000      1.300     2.100   2.539      3.200  10.500   2.000
2    01   0.100      1.275     2.100   2.499      3.200   9.800   2.000
3    02   0.200      1.200     2.000   2.514      3.400   9.000   2.000
...through to 24 hours...
23   22   0.100      1.200     1.950   2.582      3.325  11.900   2.000
24   23   0.100      1.300     2.000   2.585      3.400  11.200   2.000
Boxplot does not work with this format. I can't explain why this happens, when all the code for each season is the same and they are being subsetted from the same dataframe. Why does one come out differently? Any ideas appreciated.
EDIT: Here's the data. I've checked these two seasons and they still give the two different formats shown above.
https://www.dropbox.com/s/v5kss0bgjyhrtw1/ball.csv
ball=read.csv("ball.csv", header=T)
ball$date = as.POSIXct(strptime(ball$date, format = "%Y-%m-%d %H:%M:%S", "GMT"))
win9 = subset(ball, as.Date(date)>="2009-06-01 00:00:00, GMT" & as.Date(date)<= "2009-08-31 23:00:00, GMT" )
aut9 = subset(ball, as.Date(date)>="2009-03-01 00:00:00, GMT" & as.Date(date)<= "2009-05-31 23:00:00, GMT" )
spr9 = subset(ball, as.Date(date)>="2009-09-01 00:00:00, GMT" & as.Date(date)<= "2009-11-30 23:00:00, GMT" )
sum9 = subset(ball, as.Date(date)>="2008-12-01 00:00:00, GMT" & as.Date(date)<= "2009-02-28 23:00:00, GMT" )
sprdiurnal <- aggregate(spr9["ws"], format(spr9["date"],"%H"),summary, na.rm=T)
par( mar = c(5, 5, 4, 2))
boxplot(sprdiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)
windiurnal <- aggregate(win9["ws"], format(win9["date"],"%H"),summary, na.rm=T)
par( mar = c(5, 5, 4, 2))
boxplot(windiurnal$ws, col=colours()[109],pch=16,cex.lab=1.5,cex.axis=1.5,xlab="Hour",range=0, ylab=quote(Wind ~ speed ~ "(" * m * "s" ^-1 * ")"),xaxt="n",main="")
axis(1, at=seq(1,24, by=1),labels=seq(1,24, by=1),cex.axis=1.5, cex.lab=1.5)
Answer 1:
The "problem", as far as I can tell, is that the result of summary in your aggregate call for "sprdiurnal" is a rectangular dataset, which R stores as a matrix. For your other subsets, some hours include NA values and others don't, so the summaries differ in length, the result is not rectangular, and R stores it as a list.
I'll demonstrate with the "iris" dataset, but first, I'll also create an "iris_2" dataset that has one NA value.
iris_2 <- iris
iris_2$Sepal.Length[10] <- NA
Let's compare the aggregation output, which in these cases will just be the second column. You'll see that the "iris" dataset, which has no missing values, returns a rectangular matrix as the second "column" in your data.frame. Because of our one NA value, the "iris_2" dataset, however, gets stored as a list, which is what you want for your particular purpose.
(irisagg <- aggregate(iris["Sepal.Length"], iris["Species"], summary))[[2]]
#      Min. 1st Qu. Median  Mean 3rd Qu. Max.
# [1,]  4.3   4.800    5.0 5.006     5.2  5.8
# [2,]  4.9   5.600    5.9 5.936     6.3  7.0
# [3,]  4.9   6.225    6.5 6.588     6.9  7.9
(iris_2agg <- aggregate(iris_2["Sepal.Length"], iris_2["Species"], summary))[[2]]
# $`0`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#   4.300   4.800   5.000   5.008   5.200   5.800       1
#
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   5.600   5.900   5.936   6.300   7.000
#
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   6.225   6.500   6.588   6.900   7.900
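The same contrast can be checked without reading the printed output. The following is a minimal sketch on a made-up data frame (toy, g and x are illustrative names, not from the question). It assumes, as the outputs above suggest, that aggregate only binds the per-group summaries into a matrix when they all have the same length.

```r
# Toy data: two groups of five values (all names here are hypothetical)
toy <- data.frame(g = rep(c("a", "b"), each = 5), x = c(1:5, 6:10))

# No NAs: every summary has six entries, so the column is a matrix
res_none <- aggregate(toy["x"], toy["g"], summary)
is.matrix(res_none$x)

# NA in only one group: one summary gains an "NA's" entry, so the
# lengths differ and the column stays a list
toy_some <- toy
toy_some$x[1] <- NA
res_some <- aggregate(toy_some["x"], toy_some["g"], summary)
is.list(res_some$x)

# NA in every group: all summaries have seven entries, so the column
# is a matrix again (the situation of the odd season in the question)
toy_all <- toy
toy_all$x[c(1, 6)] <- NA
res_all <- aggregate(toy_all["x"], toy_all["g"], summary)
is.matrix(res_all$x)
```

In other words, which shape you get depends on whether the NA pattern is the same across groups, not on the code used to build each subset.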
Here's how we would convert the matrix in "irisagg" into a list.
irisagg$Summary <- unlist(apply(irisagg[[2]], 1, list), recursive = FALSE)
irisagg$Summary
# [[1]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.300   4.800   5.000   5.006   5.200   5.800
#
# [[2]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   5.600   5.900   5.936   6.300   7.000
#
# [[3]]
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   6.225   6.500   6.588   6.900   7.900
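To see what the unlist(apply(...)) idiom is doing, here is a small sketch on a hypothetical 2 x 3 matrix (m and its column names are made up for illustration). apply with list wraps each row in its own one-element list, and unlist with recursive = FALSE strips that one level of wrapping, leaving a plain list with one vector per row. The lapply version is offered as an alternative that should yield the same rows.

```r
# Hypothetical matrix standing in for the summary matrix
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# Step 1: each row becomes list(<row vector>)
wrapped <- apply(m, 1, list)

# Step 2: remove one level of nesting, giving a list of row vectors
rows <- unlist(wrapped, recursive = FALSE)

# Alternative: index the rows directly
rows_alt <- lapply(seq_len(nrow(m)), function(i) m[i, ])

length(rows)   # one element per row of m
rows[[2]]      # second row, with its column names kept
```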
Of course, a much more direct approach would be to make use of the simplify argument for aggregate and do:
(iris_3agg <- aggregate(iris["Sepal.Length"],
                        iris["Species"], summary,
                        simplify = FALSE))[[2]]
# $`0`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.300   4.800   5.000   5.006   5.200   5.800
#
# $`1`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   5.600   5.900   5.936   6.300   7.000
#
# $`2`
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   4.900   6.225   6.500   6.588   6.900   7.900
Applying it to your example, "sprdiurnal" is the subset that's giving you trouble. View sprdiurnal$ws by itself and verify that it's a matrix. Let's convert it to a list.
sprdiurnal$ws2 <- unlist(apply(sprdiurnal$ws, 1, list), recursive=FALSE)
Now you can proceed with boxplot as you were doing with the other seasons.
boxplot(sprdiurnal$ws2)  # plus the same graphical arguments as in your other boxplot calls
Or, remake your sprdiurnal object using:
sprdiurnal <- aggregate(spr9["ws"],
                        format(spr9["date"], "%H"),
                        summary, na.rm = TRUE,
                        simplify = FALSE)
And proceed as before.
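For anyone who wants to try this without the original file, here is a rough end-to-end sketch on synthetic data. The demo data frame and its random ws values are invented stand-ins for ball.csv, and the grouping is written as a named list rather than with format on a one-column data frame, which is an assumption about what is equivalent rather than the code from the question.

```r
set.seed(1)

# Ten days of synthetic hourly data with no missing values (hypothetical)
hrs <- seq(as.POSIXct("2009-09-01 00:00:00", tz = "GMT"),
           as.POSIXct("2009-09-10 23:00:00", tz = "GMT"),
           by = "hour")
demo <- data.frame(date = hrs, ws = runif(length(hrs), 0, 10))
hour <- list(hour = format(demo$date, "%H"))

# Default behaviour: equal-length summaries are bound into a matrix
default <- aggregate(demo["ws"], hour, summary)
is.matrix(default$ws)

# simplify = FALSE: the summaries are kept as a list
fixed <- aggregate(demo["ws"], hour, summary, simplify = FALSE)
is.list(fixed$ws)
length(fixed$ws)   # one summary per hour of the day

# As in the question, each box is computed from that hour's summary values
boxplot(fixed$ws, range = 0, names = fixed$hour, xlab = "Hour", ylab = "ws")
```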
Source: https://stackoverflow.com/questions/14634633/r-aggregate-gives-differently-structured-results-using-subsets-from-the-same-dat