scala ranges versus lists performance on large collections

Posted on 2020-01-01 03:46:06

Question


I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.

Can anybody explain why creating a Range.ByOne results in performance that is better than a simple array of primitives, while converting that same range to a list results in performance even worse than the worst-case scenario?

Create 10,000,000 elements, and print out those that are multiples of 1,000,000. The JVM size is always set to the same min and max: -Xms?m -Xmx?m

import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LightAndFastRange extends App {

def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
  val start = System.nanoTime()
  val result: A = f
  val end = System.nanoTime()
  (result, timeUnit.convert(end-start, NANOSECONDS))
}

  def millions(): List[Int] =  (0 to 10000000).filter(_ % 1000000 == 0).toList

  val results = chrono(millions())
  results._1.foreach(x => println ("x: " + x))
  println("Time: " + results._2);
}

It takes 141 milliseconds with a JVM size of 27 MB

In comparison, converting to List affects performance dramatically:

import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeLinkedList extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
  val start = System.nanoTime()
  val result: A = f
  val end = System.nanoTime()
  (result, timeUnit.convert(end-start, NANOSECONDS))
}

  val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
  results._1.foreach(x => println ("x: " + x))
  println("Time: " + results._2)
}

It takes 8514-10896 ms with 455-460 MB

In contrast, this Java implementation uses an array of primitives:

import static java.util.concurrent.TimeUnit.*;

public class LargePrimitiveArray {

    public static void main(String[] args){
            long start = System.nanoTime();
            int[] elements = new int[10000000];
            for(int i = 0; i < 10000000; i++){
                    elements[i] = i;
            }
            for(int i = 0; i < 10000000; i++){
                    if(elements[i] % 1000000 == 0) {
                            System.out.println("x: " + elements[i]);
                    }
            }
            long end = System.nanoTime();
            System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
    }
}

It takes 116 ms with a JVM size of 59 MB

Java List of Integers

import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;

public class LargeArrayList {

    public static void main(String[] args){
            long start = System.nanoTime();
            List<Integer> elements = new ArrayList<Integer>();
            for(int i = 0; i < 10000000; i++){
                    elements.add(i);
            }
            for(Integer x: elements){
                    if(x % 1000000 == 0) {
                            System.out.println("x: " + x);
                    }
            }
            long end = System.nanoTime();
            System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
    }

}

It takes 3993 ms with a JVM size of 283 MB

My question is: why is the first example so fast, while the second is affected so badly? I tried creating views, but wasn't successful at reproducing the performance benefits of the range.

All tests were run on Mac OS X Snow Leopard, Java 6u26 64-Bit Server, Scala 2.9.1.final.

EDIT:

For completeness, here's the actual implementation using a LinkedList (which is a fairer comparison in terms of space than ArrayList since, as rightly pointed out, Scala's List is a linked list)

import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;

public class LargeLinkedList {

    public static void main(String[] args){
            LargeLinkedList test = new LargeLinkedList();
            long start = System.nanoTime();
            List<Integer> elements = test.createElements();
            test.findElementsToPrint(elements);
            long end = System.nanoTime();
            System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
    }

    private List<Integer> createElements(){
            List<Integer> elements = new LinkedList<Integer>();
            for(int i = 0; i < 10000000; i++){
                    elements.add(i);
            }
            return elements;
    }

    private void findElementsToPrint(List<Integer> elements){
            for(Integer x: elements){
                    if(x % 1000000 == 0) {
                            System.out.println("x: " + x);
                    }
            }
    }

}

Takes 3621-6749 ms with 460-480 MB. That's much more in line with the performance of the second Scala example.

Finally, a LargeArrayBuffer:

import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeArrayBuffer extends App {

 def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
  val start = System.nanoTime()
  val result: A = f
  val end = System.nanoTime()
  (result, timeUnit.convert(end-start, NANOSECONDS))
 }

 def millions(): List[Int] = {
    val size = 10000000
    val items = new ArrayBuffer[Int](size)
    (0 to size).foreach (items += _)
    items.filter(_ % 1000000 == 0).toList
 }

 val results = chrono(millions())
  results._1.foreach(x => println ("x: " + x))
  println("Time: " + results._2);
 }

Taking about 2145 ms and 375 MB

Thanks a lot for the answers.


Answer 1:


Oh So Many Things going on here!!!

Let's start with Java int[]. Arrays in Java are the only collection that is not type erased. The run time representation of an int[] is different from the run time representation of Object[], in that it actually uses int directly. Because of that, there's no boxing involved in using it.

In memory terms, you have 40.000.000 consecutive bytes that are read and written 4 at a time whenever an element is read or written.

In contrast, an ArrayList<Integer> -- as well as pretty much any other generic collection -- is composed of 40.000.000 or 80.000.000 consecutive bytes (on 32- and 64-bit JVMs respectively), PLUS 80.000.000 bytes spread all around memory in groups of 8 bytes. Every read and write of an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.

So, back to Scala, and the second example, where you manipulate a List. Now, Scala's List is much more like Java's LinkedList than the grossly misnamed ArrayList. Each element of a List is an object called Cons, which takes 16 bytes and holds a pointer to the element and a pointer to the rest of the list. So, a List of 10.000.000 elements is composed of 160.000.000 bytes spread all around memory in groups of 16 bytes, plus 80.000.000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList is even more so for List.

Finally, Range. A Range is a sequence of integers with a lower and an upper boundary, plus a step. A Range of 10.000.000 elements takes 40 bytes: three ints (not generic) for the lower bound, upper bound and step, plus a few pre-computed values (last, numRangeElements) and two other ints used for lazy val thread safety. Just to be clear, that's NOT 40 times 10.000.000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
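This can be illustrated directly. A small sketch (the exact internals vary by Scala version, but the point, that a Range computes its elements rather than storing them, holds):

```scala
object RangeIsTiny extends App {
  // A Range keeps only start, end and step; elements are computed on demand.
  val r = 0 to 10000000

  // length is derived arithmetically from the bounds, not by counting stored elements
  assert(r.length == 10000001)

  // indexing is arithmetic too: element i is start + i * step
  assert(r(9999999) == 9999999)

  println("Range built and queried without materializing 10,000,000 ints")
}
```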

Now, because a Range is a Seq[Int], it still has to go through boxing for most uses: an int will be converted into an Integer and then back into an int again, which is sadly wasteful.

Cons Size Calculation

So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:

  1. Java uses 8 bytes for normal objects, and 12 for object arrays, for "housekeeping" information (what the class of this object is, etc.).
  2. Objects are allocated in 8-byte chunks. If your object is smaller than that, it will be padded to fill the chunk.
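That allocation rule can be sketched in one line (assuming the 8-byte granularity quoted above; real header and padding sizes depend on the JVM):

```scala
object PaddingSketch extends App {
  // Round a raw object size up to the JVM's 8-byte allocation granularity.
  def padded(rawBytes: Int): Int = ((rawBytes + 7) / 8) * 8

  assert(padded(16) == 16) // 8-byte header + two 4-byte references fits exactly
  assert(padded(12) == 16) // a 12-byte object gets 4 bytes of padding
  println("padding rule holds")
}
```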

I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:

public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;

References are at least 4 bytes (could be more on 64-bit JVMs). So we have:

8 bytes Java header
4 bytes hd
4 bytes tl

Which makes it only 16 bytes long. Pretty good, actually. In the example, hd will point to an Integer object, which I assume is 8 bytes long. As for tl, it points to another cons, which we are already counting.
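Putting those figures together (using this answer's own assumptions: 16-byte Cons cells and 8-byte boxed Integers), the list's footprint works out to:

```scala
object ListFootprintEstimate extends App {
  val elements  = 10000000L
  val consBytes = elements * 16 // one :: (Cons) cell per element
  val boxBytes  = elements * 8  // one boxed Integer per element (assumed 8 bytes)

  // 160.000.000 + 80.000.000 = 240.000.000 bytes, before any GC or fragmentation overhead
  assert(consBytes + boxBytes == 240000000L)
  println(s"~${(consBytes + boxBytes) / (1024 * 1024)} MB for the list alone")
}
```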

I'm going to revise the estimates, with actual data where possible.




Answer 2:


In the first example you create a linked list with 11 elements by stepping through the range.

In the second example you create a linked list with 10 million elements and filter it down to a new linked list with 11 elements.

In the third example you create an array-backed buffer with 10 million elements, which you then traverse and filter; no second 10-million-element buffer is created.

Conclusion:

Every piece of code does something different; that's why the performance varies greatly.
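The difference between the first two can be seen side by side: same result, very different intermediate allocation:

```scala
object FilterOrderSketch extends App {
  // Filter the Range first: only the 11 matching elements are ever materialized.
  val small = (0 to 10000000).filter(_ % 1000000 == 0).toList

  // Build the full list first: 10,000,001 cons cells exist before filtering.
  val large = (0 to 10000000).toList.filter(_ % 1000000 == 0)

  assert(small == large)     // identical results
  assert(small.length == 11) // 0, 1000000, ..., 10000000
  println("same output, vastly different allocation")
}
```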




Answer 3:


This is an educated guess ...

I think it is because in the fast version the Scala compiler is able to translate the key statement into something like this (in Java):

List<Integer> millions = new ArrayList<Integer>();
for (int i = 0; i <= 10000000; i++) {
    if (i % 1000000 == 0) {
        millions.add(i);
    }
}

As you can see, (0 to 10000000) doesn't generate an intermediate list of 10,000,000 Integer objects.

By contrast, in the slow version the Scala compiler is not able to do that optimization, and is generating that list.

(The intermediate data structure could possibly be an int[], but the observed JVM size suggests that it is not.)




Answer 4:


It's hard to read the Scala source on my iPad, but it looks like Range's constructor isn't actually producing a list, just remembering the start, increment and end. It uses these to produce its values on request, so that iterating over a range is a lot closer to a simple for loop than examining the elements of an array.
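In other words, iterating a Range amounts to something like this (a simplified sketch; the real implementation also handles exclusive ends, negative steps and overflow):

```scala
object RangeLoopSketch extends App {
  // What Range.foreach boils down to for a simple inclusive ascending range.
  def foreachRange(start: Int, end: Int, step: Int)(f: Int => Unit): Unit = {
    var i = start
    while (i <= end) { f(i); i += step }
  }

  var count = 0
  foreachRange(0, 10000000, 1)(i => if (i % 1000000 == 0) count += 1)
  assert(count == 11) // no 10-million-element structure was ever allocated
  println("iterated like a plain loop")
}
```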

As soon as you say range.toList you are forcing Scala to produce a linked list of the values in the range (allocating memory for both the values and the links), and then you are iterating over that. Being a linked list, its performance is going to be worse than that of your Java ArrayList example.



Source: https://stackoverflow.com/questions/8027801/scala-ranges-versus-lists-performance-on-large-collections
