Scala immutable Map slow

柔情痞子 submitted on 2019-12-06 05:13:35

Question


I have a piece of code where I create a map like this:

 val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap

Then I use this map to create my object:

case class MyObject(attribute1: String, attribute2: Map[String, String])

I'm reading millions of lines and converting them to MyObject instances using an iterator, like this:

MyObject("1", map)

When I do this it is really slow: more than 1 hour for 2'000'000 entries.

If I remove the map from the object creation but still do the split step (section 1):

val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap
MyObject("1", null)

then the script runs in less than 1 minute for the 2'000'000 entries.

I did some profiling, and it looks like the slowdown happens when the object is created: assigning the val map to the object's map field is what makes the process slow. What am I missing?

Update to explain the problem better:

My code iterates over 2000000 lines, converting each line to an internal object. To iterate I do:

it.map(createNewObject).toList

This iterator goes through all the lines and converts them to my objects using the function createNewObject.

This is actually really fast, especially with a large heap as dk14 said. The performance problem is inside my

`createNewObject(line: String)`

This function creates an object

`class MyObject(val attribute1:String, val attribute2:Map[String, String])` 

The function takes the line and first does

`val attributeArr = line.split("\t")` 

The first element of the array is the attribute1 of my object, and the second attribute is

`val map = attributeArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap` 

If I only print the number of elements in map, the program ends in 2 minutes; if I pass map to my new object, as in MyObject(attribute1, map), the program is really slow.


Answer 1:


(0 to 2000000).toList and (0 to 2000000).map(x => x -> x).toMap have similar performance if you give them enough memory (I tried -Xmx4G, i.e. 4 gigabytes). The toMap implementation does a lot of cloning, so a lot of memory is being allocated and deallocated. Under memory starvation, the GC therefore becomes overactive.

When I ran (0 to 2000000).toList with 128 MB it took several seconds, but (0 to 2000000).map(x => x -> x).toMap took at least 2 minutes with 10% GC activity (VisualVM) and then died with an out-of-memory error.

However, when I tried -Xmx4G both were pretty fast.
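
The comparison above can be reproduced with a rough sketch like the following. The `ToMapTiming` object and its `time` helper are hypothetical names introduced here for illustration, and a single wall-clock measurement is only indicative, not a proper benchmark. Run it once with a small heap and once with a large one to see the difference.

```scala
object ToMapTiming {
  // Hypothetical helper: evaluates `body` once and returns (result, elapsed milliseconds).
  def time[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 2000000
    // Same two expressions as in the answer, each timed once.
    val (list, listMs) = time((0 to n).toList)
    val (map, mapMs)   = time((0 to n).map(x => x -> x).toMap)
    println(s"toList: ${list.size} elements in $listMs ms")
    println(s"toMap:  ${map.size} entries in $mapMs ms")
  }
}
```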


P.S. What toMap does is repeatedly add an element to a prefix tree, so it has to clone (Array.copy) a lot for every element: https://github.com/scala/scala/blob/99a82be91cbb85239f70508f6695c6b21fd3558c/src/library/scala/collection/immutable/HashMap.scala#L321.

So toMap calls updated0 repeatedly (2000000 times), which in turn does an Array.copy pretty often. That requires lots of memory allocations, which (in the low-memory case) causes the GC to run mark-and-sweep (slow garbage collection) most of the time (as far as I can see from jconsole).
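
As a conceptual illustration only (this is not the library's actual implementation), building an immutable map one pair at a time looks like the fold below: every `+` returns a new map and leaves the previous one untouched, which is where the intermediate allocations come from. The object name `IncrementalMap` and the example keys are hypothetical.

```scala
object IncrementalMap {
  // Adds the pairs one by one; each `+` produces a new immutable Map.
  def build(pairs: Seq[(String, String)]): Map[String, String] =
    pairs.foldLeft(Map.empty[String, String])((acc, kv) => acc + kv)

  // Hypothetical attribute pairs, for illustration only.
  val pairs: Seq[(String, String)] =
    Seq("gene_id" -> "g1", "transcript_id" -> "t1")

  val before: Map[String, String] = Map("gene_id" -> "g1")
  // `after` is a new map; `before` still has a single entry.
  val after: Map[String, String] = before + ("transcript_id" -> "t1")
}
```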


Solution: either increase the memory (-Xmx/-Xms JVM parameters) or, if you need more complex operations on your data set, use something like Apache Spark (or any batch-oriented map-reduce framework) to process your data in a distributed way.
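
If you go the -Xmx route, it may be worth checking that the setting actually reached the JVM running your code. A minimal sketch, assuming a hypothetical `HeapCheck` object, using the standard `Runtime.maxMemory` call:

```scala
object HeapCheck {
  // Maximum amount of memory the JVM will attempt to use, in megabytes.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"max heap: $maxHeapMb MB")
}
```

The reported value is typically somewhat below the -Xmx figure, so treat it as a sanity check rather than an exact readout.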



Source: https://stackoverflow.com/questions/39053215/scala-immutable-map-slow
