Spark Python Performance Tuning

Submitted on 2019-12-21 04:51:52

Question


I brought up an IPython notebook for Spark development using the command below:

ipython notebook --profile=pyspark

And I created a SparkContext named sc using Python code like this:

import sys
import os
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python")
sys.path.append("/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip")
from pyspark import SparkContext, SparkConf
from pyspark.sql import *

conf = (SparkConf().setMaster("spark://701.datafireball.com:7077")
    .setAppName("sparkapp1")
    .set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

I want to have a better understanding of spark.executor.memory. The documentation says:

Amount of memory to use per executor process, in the same format as JVM memory strings

Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number as high as possible?

Here is also a list of some of the properties. Are there other parameters I can tweak from their defaults to boost performance?

Thanks!


Answer 1:


Does that mean the accumulated memory of all the processes running on one node will not exceed that cap?

Yes, if you use Spark in YARN-client mode; otherwise it limits only the JVM.

However, there is a tricky thing about this setting with YARN. YARN limits the accumulated memory to spark.executor.memory, and Spark uses the same limit for the executor JVM, so there is no memory left for Python within that limit. That is why I had to turn the YARN limits off.

As for the honest answer to your question, given your standalone Spark configuration: no, spark.executor.memory does not limit Python's memory allocation.

BTW, setting the option via SparkConf has no effect on Spark standalone executors, as they are already up by then. Read more about conf/spark-defaults.conf.
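Since the setting has to be in place before the executors start, it belongs in conf/spark-defaults.conf rather than in SparkConf. A minimal sketch (the master URL and memory value are placeholders taken from the question):

```
# conf/spark-defaults.conf
spark.master            spark://701.datafireball.com:7077
spark.executor.memory   6g
```

After editing this file, restart the standalone workers so the executors pick up the new limit.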

If that is the case, should I set that number to a number that as high as possible?

You should set it to a balanced number. The JVM has a specific behavior here: it will eventually allocate all of spark.executor.memory and never release it. You cannot set spark.executor.memory to TOTAL_RAM / EXECUTORS_COUNT, as that would leave all the memory to Java.

In my environment, I use spark.executor.memory = (TOTAL_RAM / EXECUTORS_COUNT) / 1.5, which means that 0.6 * spark.executor.memory is used by the Spark cache, 0.4 * spark.executor.memory by the executor JVM, and 0.5 * spark.executor.memory is left over for Python.
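The arithmetic behind that formula can be sketched in plain Python. The machine size and executor count below are hypothetical, chosen only to make the numbers concrete:

```python
# Sketch of the memory split described above, assuming a hypothetical
# worker node with 64 GB of RAM running 4 executors.
TOTAL_RAM_GB = 64
EXECUTORS_COUNT = 4

# The answer's rule of thumb: divide evenly, then leave 1.5x headroom.
executor_memory = (TOTAL_RAM_GB / EXECUTORS_COUNT) / 1.5  # JVM heap per executor

spark_cache = 0.6 * executor_memory       # spark.storage.memoryFraction default
jvm_tasks = 0.4 * executor_memory         # rest of the executor JVM heap
python_headroom = 0.5 * executor_memory   # outside the JVM, for Python workers

# JVM heap plus Python headroom fills each executor's share of the node.
per_executor_total = executor_memory + python_headroom

print(f"per-executor JVM heap: {executor_memory:.2f} GB")
print(f"per-executor total footprint: {per_executor_total:.2f} GB")
```

With these numbers, each executor gets roughly a 10.67 GB JVM heap and a 16 GB total footprint, so the four executors together account for the full 64 GB.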

You may also want to tune spark.storage.memoryFraction, which is 0.6 by default.
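If the application caches few RDDs, lowering that fraction leaves more heap for task execution. A possible spark-defaults.conf entry (the value here is illustrative, not a recommendation):

```
# conf/spark-defaults.conf — shrink the cache share, leave more heap for tasks
spark.storage.memoryFraction   0.4
```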




Answer 2:


Does that mean the accumulated memory of all the processes running on one node will not exceed that cap? If that is the case, should I set that number to a number that as high as possible?

Nope. You normally have multiple executors on a node, so spark.executor.memory specifies how much memory one executor may take.

You should also check spark.driver.memory and tune it up if you expect a significant amount of data to be returned from Spark to the driver.

And yes, it partially covers Python memory too: the part that gets translated through Py4J and runs in the JVM.

Spark uses Py4J internally to bridge your Python code to the JVM and run it there. For example, if your Spark pipeline is written as lambda functions on RDDs, that Python code will actually run on the executors. On the other hand, if you call rdd.collect() and then work with the result as a local Python variable, that runs through Py4J on your driver.
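The driver-versus-executor distinction can be sketched as follows; this assumes the SparkContext named sc from the question is already available:

```python
# Sketch, assuming `sc` is the SparkContext created in the question.

# The lambda below is serialized and shipped to the executors;
# the squaring happens in Python worker processes on the cluster.
squares = sc.parallelize(range(10)).map(lambda x: x * x)

# collect() pulls the results back through Py4J to the driver.
# Everything after this line is plain local Python in the driver process,
# so spark.driver.memory is what constrains it.
local_result = squares.collect()
total = sum(local_result)  # local computation on the driver
```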



Source: https://stackoverflow.com/questions/27757117/spark-python-performance-tuning
