问题
I would like to use R for plotting performance evaluation results of distinct DB systems. For each system I loaded the same data and execute the same queries in several iterations.
The data for a single systems looks like this:
"iteration", "lines", "loadTime", "query1", "query2", "query3"
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2
What I need now (for plotting) is a matrix or data frame containing the average of these measurements.
At the moment I am doing this:
# read the file
all_results <- read.csv(file="file.csv", head=TRUE, sep=",")
# split the results by iteration
results <- split(all_results, all_results$iteration)
# convert each result into a data frane
r1 = as.data.frame(results[1])
r2 = as.data.frame(results[2])
r3 = as.data.frame(results[3])
# calculate the average
(r1 + r2 +r3) / 3
I could put all this into a function and calculate the average matrix in a for loop, but I have the vague feeling that there must be a more elegant solution. Any ideas?
What can I do for cases when I have incomplete results, e.g., when one iteration has less rows than the others?
Thanks!
回答1:
If I understand you correctly, on a given DB system, in each "iteration" (1...N) you are loading a sequence of DataSets (1,2,3) and running queries on them. It seems like at the end you want to calculate the average time across all iterations, for each DataSet. If so, you actually need to have an additional column DataSet
in your all_results
table that identifies the DataSet. We can add this column as follows:
all_results <- cbind( data.frame( DataSet = rep(1:3,3) ), all_results )
> all_results
DataSet iteration lines loadTime query1 query2 query3
1 1 1 100000 120.4 0.5 6.4 1.2
2 2 1 100000 110.1 0.1 5.2 2.1
3 3 1 50000 130.3 0.2 4.3 2.2
4 1 2 100000 120.4 0.1 2.4 1.2
5 2 2 100000 300.2 0.2 4.5 1.4
6 3 2 50000 235.3 0.4 4.2 0.5
7 1 3 100000 233.5 0.7 8.3 6.7
8 2 3 100000 300.1 0.9 0.5 4.4
9 3 3 50000 100.2 0.4 9.2 1.2
Now you can use the ddply
function from the plyr
package to easily extract the averages for the load and query times for each DataSet.
> ddply(all_results, .(DataSet), colwise(mean, .(loadTime, query1, query2)))
DataSet loadTime query1 query2
1 1 158.1000 0.4333333 5.7
2 2 236.8000 0.4000000 3.4
3 3 155.2667 0.3333333 5.9
Incidentally, I highly recommend you look at Hadley Wickham's plyr package for a rich set of data-manipulation functions
回答2:
I don't see why you need to split all_results
by iteration
. You can just use aggregate
on all_results
. There's no need for all iterations to have the same number of observations.
Lines <- "iteration, lines, loadTime, query1, query2, query3
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2"
all_results <- read.csv(textConnection(Lines))
aggregate(all_results[,-1], by=all_results[,"iteration",drop=FALSE], mean)
回答3:
Did you have something like this in mind?
do.call("rbind", lapply(results, mean))
回答4:
Try this:
> Reduce("+", results) / length(results)
DataSet iteration lines loadTime query1 query2 query3
1 1 2 1e+05 158.1000 0.4333333 5.7 3.033333
2 2 2 1e+05 236.8000 0.4000000 3.4 2.633333
3 3 2 5e+04 155.2667 0.3333333 5.9 1.300000
An aggregate
solution which also works for the unbalanced case follows. Assume that the ith row of any iteration is for data set i and that we simply average within datasets. Using aggregate
is straight forward. The only tricky part is getting the assignment of rows to data sets correct so that it works in the unbalanced case too. That is done by the list(data.set = ...)
expression.
> it <- all_results$iteration
> aggregate(all_results, list(data.set = seq_along(it) - match(it, it) + 1), mean)
data.set iteration lines loadTime query1 query2 query3
1 1 2 1e+05 158.1000 0.4333333 5.7 3.033333
2 2 2 1e+05 236.8000 0.4000000 3.4 2.633333
3 3 2 5e+04 155.2667 0.3333333 5.9 1.300000
回答5:
Try, e.g.,
with(all_results, tapply(lines, iteration, mean))
来源:https://stackoverflow.com/questions/4737753/calculate-average-over-multiple-data-frames