Question
This question is a follow-up to the discussion from this answer.
What is the difference between using c(... %*% ...) and sum(... * ...) in a group_by() function from dplyr?
Both of these code snippets give the same result:
#1
library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
  group_by(Date, Market) %>%
  group_by(Revenue = c(Quantity %*% Price),
           TotalCost = c(Quantity %*% Cost),
           Product, .add = TRUE) %>%
  summarise(Sold = sum(Quantity)) %>%
  pivot_wider(names_from = Product, values_from = Sold)
#2
library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
  group_by(Date, Market) %>%
  group_by(Revenue = sum(Quantity * Price),
           TotalCost = sum(Quantity * Cost),
           Product, .add = TRUE) %>%
  summarise(Sold = sum(Quantity)) %>%
  pivot_wider(names_from = Product, values_from = Sold)
# A tibble: 2 x 7
# Groups:   Date, Market, Revenue, TotalCost [2]
#   Date      Market Revenue TotalCost Apple Banana Orange
#   <chr>     <chr>    <dbl>     <dbl> <int>  <int>  <int>
# 1 6/24/2020 A          135      37.5    35     20     20
# 2 6/25/2020 A           25      15      10     15     NA
Is one of c(... %*% ...) and sum(... * ...) better/quicker/preferred/neater?
The DATA in the original answer:
df1 <- structure(list(Date = c("6/24/2020", "6/24/2020", "6/24/2020",
"6/24/2020", "6/25/2020", "6/25/2020"), Market = c("A", "A",
"A", "A", "A", "A"), Salesman = c("MF", "RP", "RP", "FR", "MF",
"MF"), Product = c("Apple", "Apple", "Banana", "Orange", "Apple",
"Banana"), Quantity = c(20L, 15L, 20L, 20L, 10L, 15L), Price = c(1L,
1L, 2L, 3L, 1L, 1L), Cost = c(0.5, 0.5, 0.5, 0.5, 0.6, 0.6)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Answer 1:
I'll compile the comments into an answer; others can jump in if I miss anything.
%*% and * are drastically different operators: * does element-wise multiplication, and %*% does linear algebra matrix multiplication. Those are very different operations, demonstrated with:

    1:4 * 2:5
    # [1]  2  6 12 20
    1:4 %*% 2:5
    #      [,1]
    # [1,]   40
    sum(1:4 * 2:5)
    # [1] 40

If you are looking for a single summary statistic from multiplying two vectors, and the matrix multiplication from linear algebra makes sense, then %*% is the right tool for you.

There should be something said about declarative code: while you can do the third operation (sum(.*.)), to me it may be better to use %*%, for two reasons:

1. Declarative intent. I am saying that I have two matrices that I intend to do "linear algebra" on.
2. Safeguards. If there is any dimensional mismatch (e.g., sum(1:4 * 2:3) still works syntactically but 1:4 %*% 2:3 does not), I want to know it right away. With sum(.*.), the mismatch is silently ignored (one reason I think recycling can be a big problem).

The reason is not performance: while with smaller vectors/matrices %*%'s performance is on par with sum(.*.), as the size of the data gets larger, %*% is relatively more expensive.

    m1 <- 1:100 ; m2 <- m1+1 ; m3 <- 1:100000; m4 <- m3+1
    microbenchmark::microbenchmark(sm1 = sum(m1*m2), sm2 = m1%*%m2, lg1 = sum(m3*m4), lg2 = m3%*%m4)
    # Unit: nanoseconds
    #  expr    min     lq   mean median     uq      max neval
    #   sm1    800   1100 112900   1600   2100 11083600   100
    #   sm2   1100   1400   2143   1900   2450    10200   100
    #   lg1 239700 249550 411235 270800 355300 11102800   100
    #   lg2 547900 575550 634763 637850 678250   780500   100
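The safeguard point above can be checked directly in base R. This is a minimal sketch; the names x, y_ok, and y_bad are made up for illustration and do not come from the question's data.

```r
# Hypothetical vectors, chosen only for illustration.
x     <- 1:4
y_ok  <- 2:5   # same length as x
y_bad <- 2:3   # shorter; 4 is a multiple of 2, so recycling gives no warning

# With matching lengths, %*% returns a 1 x 1 matrix; c() drops the
# dimensions, which is why the question wraps it as c(... %*% ...).
dot <- x %*% y_ok
stopifnot(is.matrix(dot), c(dot) == sum(x * y_ok))

# With mismatched lengths, sum(x * y) recycles y silently:
# 1*2 + 2*3 + 3*2 + 4*3
recycled <- sum(x * y_bad)

# ... while %*% stops with an error, captured here as a message string.
mm <- tryCatch(x %*% y_bad, error = function(e) conditionMessage(e))

print(recycled)
print(mm)
```

Inside a grouped dplyr call the lengths always match, since both columns come from the same rows, so the safeguard matters more for free-standing vectors than for the group_by() example in the question.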
All of the discussion so far has been on vectors, which are effectively 1d matrices (as far as %*% seems to think ... though even that is not fully accurate). Once you start getting into true matrices, it becomes more difficult to interchange them ... in fact, I don't know of an easier way to emulate %*% (short of for loops, etc):

    m1 %*% m2
    #      [,1] [,2] [,3] [,4]
    # [1,]   22   49   76  103
    # [2,]   28   64  100  136
    t(sapply(seq_len(nrow(m1)), function(i) sapply(seq_len(ncol(m2)), function(j) sum(m1[i,] * m2[,j]))))
    #      [,1] [,2] [,3] [,4]
    # [1,]   22   49   76  103
    # [2,]   28   64  100  136

(And while that nested-sapply may not be the fastest non-%*% way to do the matrix-y stuff, %*% is 1-2 orders of magnitude faster, since it is .Internal and compiled and meant for "Math!" like this.)
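The matrix definitions behind the output above are not shown in this answer. As an assumption, matrix(1:6, nrow = 2) and matrix(1:12, nrow = 3) reproduce the printed values, so the comparison can be sketched as follows (by_operator and by_loops are illustrative names):

```r
# Assumption: these definitions are not in the original text; they were
# picked because they reproduce the 2 x 4 result printed above.
m1 <- matrix(1:6,  nrow = 2)   # 2 x 3
m2 <- matrix(1:12, nrow = 3)   # 3 x 4

# Matrix product via the operator.
by_operator <- m1 %*% m2

# The same product emulated with sum(. * .) for each row/column pair.
by_loops <- t(sapply(seq_len(nrow(m1)), function(i)
  sapply(seq_len(ncol(m2)), function(j) sum(m1[i, ] * m2[, j]))))

print(by_operator)
print(isTRUE(all.equal(by_operator, by_loops)))
```

Each entry of the result is one sum(row * column), which is why sum(.*.) and %*% coincide for two plain vectors but need this extra looping once real matrices are involved.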
Bottom line, while %*% does use the * operator internally (for one of a couple steps), the two are otherwise different. Heck, one might also compare * and ^ in the same vein ... with a similar outcome.
Cheers!
Source: https://stackoverflow.com/questions/62603566/difference-between-c-and-sum