Calculating subtotals in R

问题

I have a data frame with 900,000 rows and 11 columns in R. The column names and types are as follows:

column name: date / mcode / mname / ycode / yname / yissue  / bsent   / breturn / tsent   / treturn / csales
type:        Date / Char  / Char  / Char  / Char  / Numeric / Numeric / Numeric / Numeric / Numeric / Numeric

I want to calculate the subtotals. For example, I want to calculate the sums at each change in yname, and add subtotal to all numerical variables. There are 160 distinct ynames, so the resulting table should tell me the subtotal of each yname. I haven't sorted the data yet, but this is not a problem because I can sort the data in whatever way I want. Below is a snippet from my data:

             date     mcode mname            ycode    yname   yissue bsent breturn tsent treturn csales
417572 2010-07-28     45740 ENDPOINT A        5772    XMAG  20100800     7       0     7       0      0
417573 2010-07-31     45740 ENDPOINT A        5772    XMAG  20100800     0       0     0       0      1
417574 2010-08-04     45740 ENDPOINT A        5772    XMAG  20100800     0       0     0       0      1
417575 2010-08-14     45740 ENDPOINT A        5772    XMAG  20100800     0       0     0       0      1
417576 2010-08-26     45740 ENDPOINT A        5772    XMAG  20100800     0       4     0       0      0
417577 2010-07-28     45741 ENDPOINT L        5772    XMAG  20100800     2       0     2       0      0
417578 2010-08-04     45741 ENDPOINT L        5772    XMAG  20100800     2       0     2       0      0
417579 2010-08-26     45741 ENDPOINT L        5772    XMAG  20100800     0       4     0       0      0
417580 2010-07-28     46390 ENDPOINT R        5772    XMAG  20100800     3       0     3       0      1
417581 2010-07-29     46390 ENDPOINT R        5772    XMAG  20100800     0       0     0       0      2
417582 2010-08-01     46390 ENDPOINT R        5779    YMAG  20100800     3       0     3       0      0
417583 2010-08-11     46390 ENDPOINT R        5779    YMAG  20100800     0       0     0       0      1
417584 2010-08-20     46390 ENDPOINT R        5779    YMAG  20100800     0       0     0       0      1
417585 2010-08-24     46390 ENDPOINT R        5779    YMAG  20100800     2       0     2       0      1
417586 2010-08-26     46390 ENDPOINT R        5779    YMAG  20100800     0       2     0       2      0
417587 2010-07-28     46411 ENDPOINT D        5779    YMAG  20100800     6       0     6       0      0
417588 2010-08-08     46411 ENDPOINT D        5779    YMAG  20100800     0       0     0       0      1
417589 2010-08-11     46411 ENDPOINT D        5779    YMAG  20100800     0       0     0       0      1
417590 2010-08-26     46411 ENDPOINT D        5779    YMAG  20100800     0       4     0       4      0

What function should I use here? Maybe something like SQL group by?

回答1:

OK. Assuming your data are in a data frame named foo:

> head(foo)
             date mcode      mname ycode yname   yissue bsent breturn tsent
417572 2010/07/28 45740 ENDPOINT A  5772  XMAG 20100800     7       0     7
417573 2010/07/31 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
417574 2010/08/04 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
417575 2010/08/14 45740 ENDPOINT A  5772  XMAG 20100800     0       0     0
417576 2010/08/26 45740 ENDPOINT A  5772  XMAG 20100800     0       4     0
417577 2010/07/28 45741 ENDPOINT L  5772  XMAG 20100800     2       0     2
       treturn csales
417572       0      0
417573       0      1
417574       0      1
417575       0      1
417576       0      0
417577       0      0

Then this will do the aggregation of the numeric columns in your data:

> aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo, 
+           FUN = sum)
  yname bsent breturn tsent treturn csales
1  XMAG    14       8    14       0      6
2  YMAG    11       6    11       6      5

That was using the snippet of data you included in your Q. I used the formula interface to aggregate(), which is a bit nicer in this instance because you don't need all the foo$ bits on the variable names you wish the aggregate. If you have missing data (NA)in your full data set, then you'll need add an extra argument na.rm = TRUE which will get passed to sum(), like so:

> aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo, 
+           FUN = sum, na.rm = TRUE)

回答2:

Or the plyr library, which is easily extensible to other data classes:

> library(plyr)
> result.2 <- ddply(df$a, .(df$b), sum)
> result.2
  df.b V1
1 down 30
2   up 25

回答3:

You can also use xtabs or tapply:

xtabs(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data)

tapply(data$bsent, data$yname, sum)

回答4:

if your data is large and speed matters, i would recommend using the R function rowsum, which is a lot faster. i applied the 3 methods (f1 = aggregate, f2 = ddply, f3 = tapply) suggested in the answers to compare it with f4 = rowsum and here is what i find:

   test replications elapsed relative
4 f4()          100   0.033     1.00
3 f3()          100   0.046     1.39
1 f1()          100   0.165     5.00
2 f2()          100   0.605    18.33

i have added my code below if someone wants to explore in more detail.

library(plyr);
library(rbenchmark);

val  = rnorm(50);
name = rep(letters[1:5], each = 10);
data = data.frame(val, name);

f1 = function(){aggregate(data$val, by=list(data$name), FUN=sum)}
f2 = function(){ddply(data, .(name), summarise, sum = sum(val))}
f3 = function(){tapply(data$val, data$name, sum)}
f4 = function(){rowsum(x = data$val, group = data$name)}

benchmark(f1(), f2(), f3(), f4(),
          columns=c("test", "replications", "elapsed", "relative"),
          order="relative", replications=100)

回答5:

You can use aggregate

For instance, say that you have

val = rnorm(50)
name = rep(letters[1:5], each=10)
data <- data.frame(val, name)

Then you can do

aggregate(data$val, by=list(data$name), FUN=sum)

回答6:

There is a R package called sqldf that enables you to use SQL commands on R data.frames. Besides like you already said, GROUP BY would be nice. You can easily store your data in a local MySQL database and connect to R using the package RMySQL (You can use most other DBMS too but MySQL is the easiest to set up).

As far as I can judge it plyr is a great package, too. But from the way you ask and compare your problem to GROUP BY, I guess you know something about SQL, so using this might be easier for you. There are comfortable functions like dbReadTable, plus if your data grows bigger you can select only subparts of your data to only run your analysis with what you really need.

回答7:

Google wasn't super helpful when I tried to find an answer to a similar question. I thought I would share my solution below using the library(janitor) package with split(), and purrr::map_df().

My use case was to run a script that would grab CC expenses from many different people to be reviewed by a person.


library(janitor)
library(purrr)
library(dplyr)

mtcars %>% 
  split(.[,"cyl"]) %>% ## splits each change in cyl into a list of dataframes 
  map_df(., janitor::adorn_totals)

   mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
  24.4   4  146.7   62  3.69  3.190  20.00  1  0    4    2
  22.8   4  140.8   95  3.92  3.150  22.90  1  0    4    2
  32.4   4   78.7   66  4.08  2.200  19.47  1  1    4    1
  30.4   4   75.7   52  4.93  1.615  18.52  1  1    4    2
  33.9   4   71.1   65  4.22  1.835  19.90  1  1    4    1
  21.5   4  120.1   97  3.70  2.465  20.01  1  0    3    1
  27.3   4   79.0   66  4.08  1.935  18.90  1  1    4    1
    26   4  120.3   91  4.43  2.140  16.70  0  1    5    2
  30.4   4   95.1  113  3.77  1.513  16.90  1  1    5    2
  21.4   4  121.0  109  4.11  2.780  18.60  1  1    4    2
 Total  44 1156.5  909 44.78 25.143 210.51 10  8   45   17
    21   6  160.0  110  3.90  2.620  16.46  0  1    4    4
    21   6  160.0  110  3.90  2.875  17.02  0  1    4    4
  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
  18.1   6  225.0  105  2.76  3.460  20.22  1  0    3    1
  19.2   6  167.6  123  3.92  3.440  18.30  1  0    4    4
  17.8   6  167.6  123  3.92  3.440  18.90  1  0    4    4
  19.7   6  145.0  175  3.62  2.770  15.50  0  1    5    6
 Total  42 1283.2  856 25.10 21.820 125.84  4  3   27   24
  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
  14.3   8  360.0  245  3.21  3.570  15.84  0  0    3    4
  16.4   8  275.8  180  3.07  4.070  17.40  0  0    3    3
  17.3   8  275.8  180  3.07  3.730  17.60  0  0    3    3
  15.2   8  275.8  180  3.07  3.780  18.00  0  0    3    3
  10.4   8  472.0  205  2.93  5.250  17.98  0  0    3    4
  10.4   8  460.0  215  3.00  5.424  17.82  0  0    3    4
  14.7   8  440.0  230  3.23  5.345  17.42  0  0    3    4
  15.5   8  318.0  150  2.76  3.520  16.87  0  0    3    2
  15.2   8  304.0  150  3.15  3.435  17.30  0  0    3    2
  13.3   8  350.0  245  3.73  3.840  15.41  0  0    3    4
  19.2   8  400.0  175  3.08  3.845  17.05  0  0    3    2
  15.8   8  351.0  264  4.22  3.170  14.50  0  1    5    4
    15   8  301.0  335  3.54  3.570  14.60  0  1    5    8
 Total 112 4943.4 2929 45.21 55.989 234.81  0  2   46   49


# if you're sending the output to be reviewed by a person, add a row! 

mtcars %>% 
  split(.[,"cyl"]) %>% 
  map_df(., ~janitor::adorn_totals(.x) %>% 
           dplyr::add_row()) %>% 
  write.csv(., "demo.csv")

来源：https://stackoverflow.com/questions/4104406/calculating-subtotals-in-r

标签

subtotal