Is Scala functional programming slower than traditional coding?

离开以前 2020-12-22 22:23

In one of my first attempts to create functional code, I ran into a performance issue.

I started with a common task: multiply the elements of two arrays and sum up the results.

9 answers
  • 2020-12-22 22:57

    This is a microbenchmark, and it depends on how the compiler optimizes your code. You have 3 loops composed here:

    zip . map . fold

    Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
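
    To make the three-loops point concrete, here is a minimal Scala sketch. The object and method names are made up for illustration and are not from the question. The first method is the strict combinator pipeline; the second is roughly what a fusing compiler would have to produce, written by hand.

```scala
object FusionSketch {
  // Strict pipeline: zip allocates an Array[(Float, Float)], map allocates an
  // Array[Float], and only then does the fold run over that second array.
  def dotPipeline(xs: Array[Float], ys: Array[Float]): Float =
    xs.zip(ys).map { case (x, y) => x * y }.foldLeft(0f)(_ + _)

  // The same computation fused by hand into one loop: no intermediate arrays.
  def dotFused(xs: Array[Float], ys: Array[Float]): Float = {
    val n = math.min(xs.length, ys.length)
    var acc = 0f
    var i = 0
    while (i < n) {
      acc += xs(i) * ys(i)
      i += 1
    }
    acc
  }
}
```

    Both methods compute the same value; they differ only in how many temporary arrays are built along the way.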

    Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.

    There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.

    So, in summary:

    • naive translations may have unexpected copies and inefficiencies
    • clever FP compilers remove this overhead (but Scala can't yet)
    • sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
  • 2020-12-22 22:57

    Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecode, and not only does the Scala compiler not fuse the three loops into one, but it is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, and you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.

    Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.

    Is scala functional programming slower than traditional coding?

    Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.

  • 2020-12-22 22:59

    The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.

    That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:

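    import scala.language.implicitConversions

    // Non-generic wrapper: every method works directly on Array[Float] with while loops.
    class FastFloatOps(a: Array[Float]) {
      // Applies f to each element in place and returns this wrapper for chaining.
      def fastMapOnto(f: Float => Float): FastFloatOps = {
        var i = 0
        while (i < a.length) { a(i) = f(a(i)); i += 1 }
        this
      }
      // Combines a and b element-wise into a new array, up to the shorter length.
      def fastMapWith(b: Array[Float])(f: (Float,Float) => Float): Array[Float] = {
        val len = a.length min b.length
        val c = new Array[Float](len)
        var i = 0
        while (i < len) { c(i) = f(a(i),b(i)); i += 1 }
        c
      }
      // Reduces left to right; returns NaN for an empty array.
      def fastReduce(f: (Float,Float) => Float): Float = {
        if (a.length==0) Float.NaN
        else {
          var r = a(0)
          var i = 1
          while (i < a.length) { r = f(r,a(i)); i += 1 }
          r
        }
      }
    }
    // Makes the fast operations available on any Array[Float].
    implicit def farray2fastfarray(a: Array[Float]): FastFloatOps = new FastFloatOps(a)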
    

    and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
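
    The "create your own abstract class" trick mentioned in the parenthesis could look roughly like the sketch below. FloatOp2, mapWith, reduce and dot are hypothetical names chosen for this example. The idea is that apply takes and returns primitive Floats, so calls do not go through the generic Function2 interface.

```scala
object HandSpecialized {
  // Hypothetical non-generic function type: apply works on primitive Floats,
  // so calling it does not go through the generic Function2 interface.
  abstract class FloatOp2 { def apply(a: Float, b: Float): Float }

  // Element-wise combination of two arrays into a new array.
  def mapWith(a: Array[Float], b: Array[Float])(f: FloatOp2): Array[Float] = {
    val len = math.min(a.length, b.length)
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }

  // Left-to-right reduction; NaN for an empty array, as in FastFloatOps.
  def reduce(a: Array[Float])(f: FloatOp2): Float =
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }

  val times: FloatOp2 = new FloatOp2 { def apply(a: Float, b: Float): Float = a * b }
  val plus: FloatOp2 = new FloatOp2 { def apply(a: Float, b: Float): Float = a + b }

  // Dot product built from the two operations above.
  def dot(a: Array[Float], b: Array[Float]): Float =
    reduce(mapWith(a, b)(times))(plus)
}
```

    This still allocates one intermediate array in mapWith; how much it helps in practice depends on the Scala version and the JIT, so measure before relying on it.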

    Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.

  • 2020-12-22 22:59

    I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this? It can be tail-call optimized, so performance should be OK.

    def multiply_and_sum(l1:List[Int], l2:List[Int], sum:Int):Int = {
        if (l1 != Nil && l2 != Nil) {
            multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
        }
        else {
            sum
        }
    }
    
    val first = Array(1,2,3,4,5)
    val second = Array(6,7,8,9,10)
    multiply_and_sum(first.toList, second.toList, 0)  //Returns: 130
    
  • 2020-12-22 23:03

    Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforestation, and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:

    let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( + )
    

    may be rewritten using fold2 so as to avoid all temporary data structures:

    let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
    

    This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
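
    As a rough idea of what that transformation might look like in Scala: the standard library has no fold2 on arrays as far as I know, so the helper below is a hypothetical one written for this sketch. It requires both arrays to have the same length.

```scala
object Fold2Sketch {
  // Hypothetical helper modeled on F#'s Array.fold2: walks both arrays in
  // lockstep, threading one accumulator, and allocates no intermediate array.
  def fold2(xs: Array[Double], ys: Array[Double])(zero: Double)(
      f: (Double, Double, Double) => Double): Double = {
    require(xs.length == ys.length, "arrays must have the same length")
    var acc = zero
    var i = 0
    while (i < xs.length) {
      acc = f(acc, xs(i), ys(i))
      i += 1
    }
    acc
  }

  // Dot product as a single fold over both arrays.
  def dot(xs: Array[Double], ys: Array[Double]): Double =
    fold2(xs, ys)(0.0)((t, x, y) => t + x * y)
}
```

    Note that the function argument here is a generic Function3, so the JVM may still box the Doubles on each call; the temporary arrays and tuples are gone, but this is not automatically as fast as a hand-written loop.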

  • 2020-12-22 23:04

    The main reasons why these two examples are so different in speed are:

    • the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
    • the faster one doesn't create temporary collections and, thus, avoids extra memory copies.

    Let's consider the slower one by parts. First:

    first.zip(second)
    

    That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.

    map{ case (a,b) => a*b }
    

    Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).

    We are not finished, though. Here's the next part:

    reduceLeft(_+_)
    

    That one is relatively harmless, except that it still does boxing/unboxing and creates a java.lang.Float at each iteration, since reduceLeft receives a Function2, which is parameterized.

    Scala 2.8 introduces a feature called specialization, which will get rid of a lot of this boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:

    sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
    

    We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:

    sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
    

    This last one doesn't save much, actually, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:

    sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
    

    This gives a much better result than the former. It is better than the foldLeft one too, though not by much. Unfortunately, we can't combine zipped with foldLeft, because the former doesn't support the latter.
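
    For readers on newer versions: in Scala 2.13 and later, zipped is deprecated in favor of lazyZip. A minimal sketch of the same idea with that API follows (the object name is made up, and I have not benchmarked it against the versions above):

```scala
object LazyZipSketch {
  // Assumes Scala 2.13 or later, where lazyZip replaces the deprecated zipped.
  // lazyZip does not build an array of tuples; map produces one Array[Float],
  // which sum then reduces.
  def dot(first: Array[Float], second: Array[Float]): Float =
    first.lazyZip(second).map(_ * _).sum
}
```

    The map step still allocates one array for the products, so this removes the tuple array but not every intermediate result.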

    The last one is the fastest I could get. Anything faster needs specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.

    So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. Similar code in Haskell using supercompilation would perform as well as anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
