Question
I am trying to apply a self-written function to the rows of a data frame.
library(dplyr) # only used for data_frame
DF = data_frame(x = c(50, 49, 20), y = c(132, 124, 130), z = c(0.82, 1, 0.63))
x y z
<dbl> <dbl> <dbl>
1 50 132 0.82
2 49 124 1.00
3 20 130 0.63
The actual data frame has thousands of rows; this is just a sample.
My function is very complicated and does many things, and in the end it produces a new row for each row of DF. For simplicity, let's say the function adds 1 to column 1, 2 to column 2 and 3 to column 3 (this can of course be vectorized, but my function, let's call it Funct, does much more). So:
Funct = function(DF) {
    DF[1] = DF[1] + 1
    DF[2] = DF[2] + 2
    DF[3] = DF[3] + 3
    return(DF)
}
How do I apply this function in the most efficient way, so that I end up with a new data frame containing this output?
> DF
x y z
<dbl> <dbl> <dbl>
1 51 134 3.82
2 50 126 4.00
3 21 132 3.63
Answer 1:
apply is a bad option for data frames because it is designed for matrices, and thus coerces a data frame input to a matrix before iterating. Aside from occasionally being an expensive conversion (which has to be reversed afterwards), the real problem is that a matrix in R can hold only a single type, whereas a data frame can have a different type for each variable. Thus, while it will work fine for the data here, you'll often end up with silent type coercion in a matrix you never see, e.g. numbers coerced to character because another column is a factor. If you really want to use apply, explicitly coerce to a matrix beforehand so you can see what it is working with, and you'll avoid a lot of annoying bugs.
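To make the coercion concrete, here is a minimal base R sketch. The frame `mixed` and its columns are hypothetical, made up only for illustration; the point is that one non-numeric column turns the whole intermediate matrix into character.

```r
# Hypothetical mixed-type frame: one numeric column, one factor column.
mixed <- data.frame(n = c(1.5, 2.5), f = factor(c("a", "b")))

# apply() converts a data frame with as.matrix() before iterating;
# doing the conversion explicitly shows what the row function will receive.
m <- as.matrix(mixed)
typeof(m)  # the whole matrix is character, including column n

# Each row therefore reaches the function as a character vector,
# so arithmetic on the "numeric" column would fail inside apply().
row_types <- apply(mixed, 1, function(row) typeof(row))
row_types
```

Checking `typeof(as.matrix(DF))` before calling apply is a cheap way to catch this early.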
But there's a better option than apply: iterate in parallel over the variables (columns) and then coerce the resulting list back to a data frame. purrr::pmap_dfr will handle both parts:
library(tidyverse)
DF = data_frame(x = c(50, 49, 20),
                y = c(132, 124, 130),
                z = c(0.82, 1, 0.63))
DF %>%
    pmap_dfr(~list(x = ..1 + 1,
                   y = ..2 + 2,
                   z = ..3 + 3))
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 51. 134. 3.82
#> 2 50. 126. 4.00
#> 3 21. 132. 3.63
You can do the same thing in base R with
do.call(rbind, do.call(Map,
    c(function(...){
          data.frame(x = ..1 + 1,
                     y = ..2 + 2,
                     z = ..3 + 3)
      },
      DF)
))
#> x y z
#> 1 51 134 3.82
#> 2 50 126 4.00
#> 3 21 132 3.63
...though it's not terribly pretty.
Note that a vectorized solution, when possible, will be much, much faster.
DF %>%
    mutate(x = x + 1,
           y = y + 2,
           z = z + 3)
#> # A tibble: 3 x 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 51. 134. 3.82
#> 2 50. 126. 4.00
#> 3 21. 132. 3.63
Answer 2:
Just use apply...
DF2 <- as.data.frame(t(apply(DF, 1, Funct)))
DF2
x y z
1 51 134 3.82
2 50 126 4.00
3 21 132 3.63
Answer 3:
If this is perfectly numeric, you can get away with
as.data.frame(t(apply(as.matrix(DF), 1, `+`, c(1,2,3))))
as.data.frame(t(apply(DF, 1, Funct))) # better, per AndrewGustar's answer
which will likely be the fastest you can do. However, if you have anything other than numeric in the data (e.g., integer or *gasp* character), using apply will result in conversion out of numeric, which is not what you want. (I am including as.matrix in the first example to demonstrate what is actually happening within apply, not because you need it in your code. This matrix conversion is why apply can be problematic for non-homogeneous frames.)
As was stated in other comments, if your data is truly all-numeric, you will gain significant performance (and, if relevant, storage) improvements by converting it to a matrix and dealing with it as such.
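As a rough sketch of what "dealing with it as a matrix" could look like for the simplified example, the per-column offsets can be added in a single vectorized step with base R's sweep. The names `DF_num`, `offsets` and `res` are assumptions for illustration, and this only stands in for the toy Funct, not for a more complicated row function.

```r
# Same sample data as in the question, built with base data.frame()
# so the sketch does not depend on dplyr.
DF_num <- data.frame(x = c(50, 49, 20),
                     y = c(132, 124, 130),
                     z = c(0.82, 1, 0.63))

m <- as.matrix(DF_num)   # safe here: every column is numeric
offsets <- c(1, 2, 3)    # amount added to columns x, y and z

# sweep() applies `+` along margin 2, i.e. offsets[j] is added to column j.
res <- sweep(m, 2, offsets, `+`)

# Convert back only if a data frame is needed downstream.
res_df <- as.data.frame(res)
```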
For heterogeneous-class frames (or if you just want to be robust against future changes), try this:
do.call(rbind, by(DF, seq_len(nrow(DF)), Funct))
# # A tibble: 3 × 3
# x y z
# * <dbl> <dbl> <dbl>
# 1 51 134 3.82
# 2 50 126 4.00
# 3 21 132 3.63
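The reason this works for mixed types is that by hands each row to the function as a one-row data frame rather than as a vector, so no matrix conversion takes place. A small sketch under that assumption, with a hypothetical frame `DF_mixed` and a hypothetical row function `add_one`:

```r
# Hypothetical frame with a numeric and a character column.
DF_mixed <- data.frame(x = c(50, 49, 20),
                       label = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

# Hypothetical row function: receives a one-row data frame and returns one.
add_one <- function(row) {
    row$x <- row$x + 1
    row
}

# Split into single rows, apply the function, then bind the rows back together.
out <- do.call(rbind, by(DF_mixed, seq_len(nrow(DF_mixed)), add_one))

# Column types survive the round trip.
sapply(out, class)
```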
Edit
If you need to include all data when you aggregate each row:
- Pass the whole DF as another argument, such as Funct(DF1, DFall). This would be called as by(DF, seq_len(nrow(DF)), Funct, DFall=DF);
- If your access to all rows is merely an aggregation that can be calculated once and passed to Funct as an additional argument (think Funct(DF1, DFall)), then do that calculation once and pass it as above in place of the whole frame;
- Otherwise, use a for loop. None of the offered solutions (nor any that I can think of now) facilitate this type of view.
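A minimal sketch of the for-loop fallback, assuming a hypothetical row function `share_of_total` that needs the whole frame alongside the current row (neither the function nor the `x_share` column comes from the question). Results are collected in a preallocated list and bound once at the end, which avoids growing a data frame inside the loop.

```r
DF_num <- data.frame(x = c(50, 49, 20),
                     y = c(132, 124, 130),
                     z = c(0.82, 1, 0.63))

# Hypothetical row function: uses the current row and the full frame.
share_of_total <- function(row, all_rows) {
    row$x_share <- row$x / sum(all_rows$x)
    row
}

# Preallocate one slot per row, fill it in the loop, bind once afterwards.
results <- vector("list", nrow(DF_num))
for (i in seq_len(nrow(DF_num))) {
    results[[i]] <- share_of_total(DF_num[i, , drop = FALSE], DF_num)
}
out <- do.call(rbind, results)
```

If the only thing needed from the full frame is an aggregate such as `sum(DF_num$x)`, computing it once before the loop and passing that value in instead would avoid recomputing it for every row.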
Source: https://stackoverflow.com/questions/49696899/in-r-apply-a-function-to-the-rows-of-a-data-frame-and-return-a-data-frame