I have a large dataset called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which causes only a reference to the object to be stored in the task itself, while the actual object data is sent separately.
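As a rough sketch (reusing sc, rdd, someBigObject, and doSomething from the example above), the broadcast version looks like this:

val bigObjectBroadcast = sc.broadcast(someBigObject) // shipped to each executor once
rdd.mapPartitions { x => doSomething(bigObjectBroadcast.value, x) }.count()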
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need one just to avoid frame size errors.
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize setting in SparkConf prior to creating your SparkContext.
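In a standalone application, that looks something like this (the app name here is hypothetical; the frame size value is in MB):

import org.apache.spark.{SparkConf, SparkContext}

// Configure the frame size before the SparkContext is created
val conf = new SparkConf()
  .setAppName("MyApp")               // hypothetical app name
  .set("spark.akka.frameSize", "16") // frame size in MB
val sc = new SparkContext(conf)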
As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.
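Concretely, that looks something like this (assuming you're launching from the Spark installation directory):

# Spark 1.1.0 and higher
./bin/spark-shell --conf spark.akka.frameSize=16

# Spark 1.0.1 / 1.0.2
./bin/spark-shell --driver-java-options "-Dspark.akka.frameSize=16"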