Question

I'm new to R, and I wrote some code to summarize data from a .csv file according to my needs.

Here is the code.

raw <- read.csv("trees.csv")

The data looks like this:

                                     SNAME     CNAME        FAMILY PLOT INDIVIDUAL CAP   H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae    5        176  15 9.5
2               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        321  12 6.0
3               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        326  14 7.0
4               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        327  18 5.0
5               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        328  12 6.0
6               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        329  21 7.0

# Add two new columns: VOLUME and BASALAREA
for (i in 1:nrow(raw)) {
  raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])  
  raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}

# Here comes the main part. I need a new data frame with the means of columns H and CAP and the sums of columns VOLUME and BASALAREA. This data frame is grouped by column SNAME and subgrouped by column PLOT.

plotSummary = merge(
  aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
  aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))

plotSummary = merge(
  plotSummary,
  aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))


plotSummary = merge(
  plotSummary,
  aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))

The functions treeVolume and treeBasalArea just return numbers.

treeVolume <- function(radius, height) {
  return (0.000074230*radius**1.707348*height**1.16873)
}

treeBasalArea <- function(radius) {
  return (((radius**2)*pi)/40000)
}

I'm sure that there is a better way of doing this, but how?

Answer 1:

I couldn't read your example data in, but I think I've made something that generally represents it, so give this a whirl. This answer builds on Greg's suggestion to look at plyr: ddply groups your data.frame into segments, and numcolwise applies your statistic of interest to every numeric column.

#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2), 
                  CAP = rnorm(6), 
                  H = rlnorm(6), 
                  VOLUME = runif(6),
                  BASALAREA = rlnorm(6)
                  )


#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
  sname plot        CAP        H    VOLUME BASALAREA
1     a    a  0.4844135 1.182481 0.3248043  1.614668
2     b    b  0.2565755 3.313614 0.6279025  1.397490
3     c    c -0.8280485 1.627634 0.1768697  2.538273

EDIT - response to updated question

Ok - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the values for VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:

dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
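To make the vectorization point concrete, here is a minimal sketch that checks the one-pass transform call against the row-by-row loop. It assumes the two formulas from the question and uses only the CAP and H values of the six sample rows shown there; the names vec, looped and same are just for illustration.

```r
# Formulas copied from the question (written with ^ instead of **)
treeVolume <- function(radius, height) {
  0.000074230 * radius^1.707348 * height^1.16873
}
treeBasalArea <- function(radius) {
  (radius^2 * pi) / 40000
}

# CAP and H values from the six sample rows in the question
raw <- data.frame(CAP = c(15, 12, 14, 18, 12, 21),
                  H   = c(9.5, 6, 7, 5, 6, 7))

# Vectorized: each function is called once on whole columns
vec <- transform(raw,
                 VOLUME    = treeVolume(CAP, H),
                 BASALAREA = treeBasalArea(CAP))

# Row-by-row version, kept only for comparison
looped <- raw
looped$VOLUME    <- NA_real_
looped$BASALAREA <- NA_real_
for (i in seq_len(nrow(looped))) {
  looped$VOLUME[i]    <- treeVolume(looped$CAP[i], looped$H[i])
  looped$BASALAREA[i] <- treeBasalArea(looped$CAP[i])
}

# Both approaches should produce the same data frame
same <- isTRUE(all.equal(vec, looped))
print(same)
```

The loop is not wrong, it is just more code and typically slower on large data frames, because the arithmetic operators already work element by element on whole vectors.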

Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:

ddply(dat, c("sname", "plot"), summarize,
  meanCAP = mean(CAP),
  meanH = mean(H),
  sumVOLUME = sum(VOLUME),
  sumBASAL = sum(BASALAREA)
  )

Which will give you an output that looks like:

  sname plot   meanCAP     meanH    sumVOLUME     sumBASAL
1     a    a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2     b    b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3     c    c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
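If plyr is not available, roughly the same summary can be sketched in base R. This is an alternative to the ddply call, not part of it: it assumes the column names and formulas from the question, uses the six sample rows shown there (with the scientific names shortened), and relies on aggregate accepting cbind(...) on the left of the formula to summarize several columns at once.

```r
# Six sample rows from the question (author citations dropped from SNAME)
raw <- data.frame(
  SNAME = c("Alchornea triplinervia", rep("Andira fraxinifolia", 5)),
  PLOT  = c(5, 3, 3, 3, 3, 3),
  CAP   = c(15, 12, 14, 18, 12, 21),
  H     = c(9.5, 6, 7, 5, 6, 7)
)

# Add the derived columns in one vectorized pass (formulas from the question)
raw <- transform(raw,
                 VOLUME    = 0.000074230 * CAP^1.707348 * H^1.16873,
                 BASALAREA = (CAP^2 * pi) / 40000)

# Means of CAP and H per SNAME/PLOT group
means <- aggregate(cbind(CAP, H) ~ SNAME + PLOT, data = raw, FUN = mean)

# Sums of VOLUME and BASALAREA per SNAME/PLOT group
sums <- aggregate(cbind(VOLUME, BASALAREA) ~ SNAME + PLOT, data = raw, FUN = sum)

# One merge on the grouping columns instead of three
plotSummary <- merge(means, sums, by = c("SNAME", "PLOT"))
print(plotSummary)
```

Using bare column names with data = raw, rather than raw$CAP inside the formula, also keeps the result columns named CAP, H, VOLUME and BASALAREA.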

The help pages for ?ddply, ?transform, ?summarize should be insightful.

Answer 2:

Look at the plyr package. It will split the data by the SNAME variable for you, then you give it code to do the set of summaries that you want (mixing mean and sum and whatever), then it will put the pieces back together for you. You probably want either the 'ddply' or the 'daply' function in that package.
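The split, summarize, recombine idea can be sketched by hand in base R to show roughly what such a function does for you. This is only an illustration of the pattern, not plyr's actual implementation; the helper name summarise_piece is made up, and the data are the six sample rows from the question with shortened names.

```r
# Six sample rows from the question (author citations dropped from SNAME)
raw <- data.frame(
  SNAME = c("Alchornea triplinervia", rep("Andira fraxinifolia", 5)),
  PLOT  = c(5, 3, 3, 3, 3, 3),
  CAP   = c(15, 12, 14, 18, 12, 21),
  H     = c(9.5, 6, 7, 5, 6, 7)
)
raw$BASALAREA <- (raw$CAP^2 * pi) / 40000  # formula from the question

# Split: one data frame per SNAME/PLOT combination that actually occurs
pieces <- split(raw, list(raw$SNAME, raw$PLOT), drop = TRUE)

# Apply: reduce each piece to a one-row summary, mixing mean and sum
# (summarise_piece is a hypothetical helper for this sketch)
summarise_piece <- function(p) {
  data.frame(SNAME    = p$SNAME[1],
             PLOT     = p$PLOT[1],
             meanCAP  = mean(p$CAP),
             meanH    = mean(p$H),
             sumBASAL = sum(p$BASALAREA))
}
rows <- lapply(pieces, summarise_piece)

# Combine: stack the one-row summaries back into a single data frame
result <- do.call(rbind, rows)
rownames(result) <- NULL
print(result)
```

ddply wraps these three steps in a single call, which is why the summary code you pass to it only has to describe one group.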

Source: https://stackoverflow.com/questions/10805295/summarize-data-from-csv-using-r
