Performance of scala parallel collection processing

元气小坏坏 提交于 2020-02-01 04:37:46

问题


I have scenarios where I will need to process thousands of records at a time. Sometime, it might be in hundreds, may be upto 30000 records. I was thinking of using the scala's parallel collection. So just to understand the difference, I wrote a simple pgm like below:

object Test extends App{
  val list = (1 to 100000).toList
  Util.seqMap(list)
  Util.parMap(list)
}

object Util{
  def seqMap(list:List[Int]) = {
    val start = System.currentTimeMillis
    list.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken =" + (end - start))
    end - start
  }
  def parMap(list:List[Int]) = {
    val start = System.currentTimeMillis
    list.par.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken=" + (end - start))
    end - start
  }
}

I expected that running in parallel will be faster. However, the output I was getting was

time taken =32
time taken=127

machine config :

Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8

What am I doing wrong? Is this not a correct scenario for parallel mapping?


回答1:


The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.

Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.

ALSO: Never do .par on a List because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, then it doesn't have to do this extra work.




回答2:


The overhead in parallelizing the list proves more time-consuming than the actual processing of the x + 1 operations sequentially.

Yet consider this modification where we include an operation that elapses over 1 millisecond approximately,

case class Delay() {
  Thread.sleep(1)
}

and replace

list.map(x => x + 1).toList.sum

with

list.map(_ => Delay()).toList

Now for val list = (1 to 10000).toList (note 10000 instead of 100000), in a quadcore 8GB machine,

scala> Util.parMap(list)
time taken=3451
res4: Long = 3451

scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816

We can infer (better, guess) that for large collections with time-consuming operations, the overhead of parallelizing a collection does not significantly affect the elapsed time, in contrast with a sequential collection processing.




回答3:


If you are doing benchmarks, consider using something like JMH to avoid all the possible problems you might encounter, if you are measuring it in the way your program shows. For example, JIT may change your results dramatically, but only after some iterations.

In my experience parallel collections are normally slower, if the input is not large enough: If the input is small the initial split and the "putting together" at the end does not pay off.

So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).

Moreover if you do numerical processing, consider using Array (instead of List) and while (instead of map). These are "more native" (= faster) to the underlying JVM whereas in your case you are possibly measuring the performance of the garbage collector. As for Array you could store the result of the operation "in place".




回答4:


Parallel collections initialize threads before performing operation that tooks some time.

So when you perform operations by parallel collections with low number of elements or operations takes low time parallel collections will performs slower



来源:https://stackoverflow.com/questions/28497786/performance-of-scala-parallel-collection-processing

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!