In R, apply a function to the rows of a data frame and return a data frame

Question

I am trying to apply a self-written function to the rows of a data frame.

library(dplyr) # only used for tibble(); data_frame() is deprecated
DF = tibble(x = c(50, 49, 20), y = c(132, 124, 130), z = c(0.82, 1, 0.63))

     x     y     z
   <dbl> <dbl> <dbl>
1    50   132  0.82
2    49   124  1.00
3    20   130  0.63

The actual data frame has thousands of rows; this is just a sample.

My function is very complicated and does many things, and in the end it produces a new row for each row of DF. For simplicity, let's say the function adds 1 to column 1, 2 to column 2 and 3 to column 3 (this could of course be vectorized, but my function, let's call it Funct, does much more). So:

Funct = function(DF) {
   DF[1] = DF[1] + 1
   DF[2] = DF[2] + 2
   DF[3] = DF[3] + 3
   return(DF)
}

What is the most efficient way to apply this function to every row and end up with a new data frame containing the output:

> DF
     x     y     z
   <dbl> <dbl> <dbl>
1    51   134  3.82
2    50   126  4.00
3    21   132  3.63

Answer 1:

apply is a poor fit for data frames because it is designed for matrices, so it coerces a data frame to a matrix before iterating. That conversion can be expensive (and has to be reversed afterwards), but the real problem is that a matrix in R holds a single type, whereas a data frame can have a different type for each variable. So while apply works fine for the data here, you will often get silent type coercion in a matrix you never see, for example numbers turned into character strings because another column is a factor. If you really want to use apply, coerce to a matrix explicitly beforehand so you can see what it is working with; you'll avoid a lot of annoying bugs.
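To make the coercion concrete, here is a minimal base R sketch. The data frame `mixed` and its columns are made up for illustration and are not part of the question's data.

```r
# Hypothetical mixed-type frame: one numeric and one character column
mixed <- data.frame(n = c(1.5, 2.5), label = c("a", "b"),
                    stringsAsFactors = FALSE)

# apply() converts the frame with as.matrix() first, so each row reaches
# the function as a character vector, numbers included
row_classes <- apply(mixed, 1, class)

# Doing the conversion yourself makes the coercion visible
m <- as.matrix(mixed)
matrix_type <- typeof(m)

# Arithmetic on a row now fails, because the "number" is a string
sum_worked <- tryCatch({
  apply(mixed, 1, function(r) r[["n"]] + 1)
  TRUE
}, error = function(e) FALSE)
```

Here `row_classes` and `matrix_type` both report character, and `sum_worked` ends up FALSE because the addition raises an error.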

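But there's a better option than apply: iterate over the variables (columns) in parallel, so that each step sees one row's values, and then bind the resulting list back into a data frame. purrr::pmap_dfr handles both parts (in purrr 1.0.0 and later it is marked as superseded in favor of pmap() followed by list_rbind(), though it still works):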

library(tidyverse)

DF = tibble(x = c(50, 49, 20), 
            y = c(132, 124, 130), 
            z = c(0.82, 1, 0.63))

DF %>% 
    pmap_dfr(~list(x = ..1 + 1,
                   y = ..2 + 2,
                   z = ..3 + 3))
#> # A tibble: 3 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   51.  134.  3.82
#> 2   50.  126.  4.00
#> 3   21.  132.  3.63

You can do the same thing in base R with

do.call(rbind, do.call(Map, 
                       c(function(...){
                           data.frame(x = ..1 + 1,
                                      y = ..2 + 2,
                                      z = ..3 + 3)
                       }, 
                       DF)
))
#>    x   y    z
#> 1 51 134 3.82
#> 2 50 126 4.00
#> 3 21 132 3.63

...though it's not terribly pretty.
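A somewhat more readable base R variant is sketched below. It assumes that Funct accepts a one-row data frame, as in the question, and it swaps in split() plus lapply() for the Map call above, so treat it as an alternative rather than a rewrite of the same code.

```r
# Funct from the question: takes one row, returns one row
Funct <- function(DF) {
  DF[1] <- DF[1] + 1
  DF[2] <- DF[2] + 2
  DF[3] <- DF[3] + 3
  DF
}

DF <- data.frame(x = c(50, 49, 20),
                 y = c(132, 124, 130),
                 z = c(0.82, 1, 0.63))

# One single-row data frame per row, so column types are preserved
rows <- split(DF, seq_len(nrow(DF)))

# Apply Funct to each row, then stack the results back together
result <- do.call(rbind, lapply(rows, Funct))
```

Because every piece stays a data frame, nothing is coerced to a matrix along the way. The cost is speed: building and binding thousands of one-row data frames is slow compared with a vectorized solution.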

Note that a vectorized solution, when possible, will be much, much faster.

DF %>% 
    mutate(x = x + 1,
           y = y + 2,
           z = z + 3)
#> # A tibble: 3 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   51.  134.  3.82
#> 2   50.  126.  4.00
#> 3   21.  132.  3.63
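To get a feel for the difference on your own machine, a rough base R comparison might look like the sketch below. The frame `big`, its size, and the helper names `row_wise` and `vectorized` are assumptions made for illustration. Timings vary by machine, so none are shown.

```r
set.seed(1)
n <- 2000
big <- data.frame(x = runif(n), y = runif(n), z = runif(n))

# Row-by-row: build one single-row data frame per row, then bind them
row_wise <- function(d) {
  rows <- lapply(seq_len(nrow(d)), function(i) {
    r <- d[i, ]
    r$x <- r$x + 1
    r$y <- r$y + 2
    r$z <- r$z + 3
    r
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL  # reset row names so the results compare cleanly
  out
}

# Vectorized: one operation per column
vectorized <- function(d) transform(d, x = x + 1, y = y + 2, z = z + 3)

t_row <- system.time(res_row <- row_wise(big))
t_vec <- system.time(res_vec <- vectorized(big))

# Both approaches should produce the same data frame
same_result <- isTRUE(all.equal(res_row, res_vec))
```

Compare `t_row[["elapsed"]]` with `t_vec[["elapsed"]]` to see the gap for your data size.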

Answer 2:

Just use apply...

DF2 <- as.data.frame(t(apply(DF, 1, Funct)))

DF2
   x   y    z
1 51 134 3.82
2 50 126 4.00
3 21 132 3.63

Answer 3:

If the data frame is entirely numeric, you can get away with

as.data.frame(t(apply(as.matrix(DF), 1, `+`, c(1,2,3))))
as.data.frame(t(apply(DF, 1, Funct))) # better, per AndrewGustar's answer

which will likely be the fastest you can do. However, if the data contains anything other than doubles (e.g., integer or, *gasp*, character), apply will change types: an integer column mixed with doubles is promoted to double, and a single character or factor column turns every value into a character string, which is not what you want. (I include as.matrix in the first example to show what actually happens inside apply; you don't need it in your own code. This matrix conversion is why apply can be problematic for non-homogeneous frames.)

As was stated in other comments, if your data is truly all-numeric, you will gain significant performance (and, if relevant, storage) improvements by converting it to a matrix and dealing with it as such.
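As a minimal sketch of the matrix route, assuming all columns are numeric: the example below uses sweep() to add a per-column offset. sweep() is one base R way to do this and is swapped in for illustration; the named vector `offsets` is likewise made up here.

```r
DF <- data.frame(x = c(50, 49, 20),
                 y = c(132, 124, 130),
                 z = c(0.82, 1, 0.63))

# Work on a numeric matrix directly
M <- as.matrix(DF)

# One offset per column, in column order
offsets <- c(x = 1, y = 2, z = 3)

# MARGIN = 2 applies offsets[j] to every element of column j
shifted <- sweep(M, 2, offsets, `+`)

# Convert back only if a data frame is needed downstream
result <- as.data.frame(shifted)
```

This only covers the simplified add-a-constant case from the question; a genuinely row-dependent Funct would still need one of the row-wise approaches.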

For heterogeneous-class frames (or if you just want to be robust against future changes), try this:

do.call(rbind, by(DF, seq_len(nrow(DF)), Funct))
# # A tibble: 3 × 3
#       x     y     z
# * <dbl> <dbl> <dbl>
# 1    51   134  3.82
# 2    50   126  4.00
# 3    21   132  3.63

Edit

If you need to include all data when you aggregate each row:

Pass the whole DF as another argument, such as Funct(DF1, DFall). This would be called as by(DF, seq_len(nrow(DF)), Funct, DFall=DF);
If all you need from the other rows is an aggregate that can be calculated once, do that calculation once and pass the result to Funct as the additional argument, in place of the whole frame;
Otherwise, use a for loop. None of the solutions offered here (nor any other I can think of right now) gives the function that kind of view.
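The first two options can be sketched as follows. The functions `center_x` and `center_x_pre` and the centering task itself are hypothetical stand-ins for a Funct that needs information from the whole frame.

```r
DF <- data.frame(x = c(50, 49, 20),
                 y = c(132, 124, 130),
                 z = c(0.82, 1, 0.63))

# Option 1: the row function receives the whole frame as an extra argument
center_x <- function(row, DFall) {
  row$x <- row$x - mean(DFall$x)  # recomputed for every row
  row
}
res_whole <- do.call(rbind,
                     by(DF, seq_len(nrow(DF)), center_x, DFall = DF))

# Option 2: compute the aggregate once and pass only that
center_x_pre <- function(row, x_mean) {
  row$x <- row$x - x_mean
  row
}
x_mean <- mean(DF$x)
res_pre <- do.call(rbind,
                   by(DF, seq_len(nrow(DF)), center_x_pre, x_mean = x_mean))
```

Both calls rely on by() forwarding extra named arguments to the function. The second avoids recomputing the mean for every row, which matters once the frame has thousands of rows.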

Source: https://stackoverflow.com/questions/49696899/in-r-apply-a-function-to-the-rows-of-a-data-frame-and-return-a-data-frame

Tags

dataframe

vectorization