Spark (pyspark) having difficulty calling statistics methods on worker node


Question


I am hitting a library error when running pyspark (from an ipython-notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.

On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key:Statistics.chiSqTest(value).pValue for key,value in keys_to_bucketed.iteritems()}

but if I do the same directly on the RDD, I hit issues:

keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()

results in the following exception

Traceback (most recent call last):
  File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
    jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'

I had an issue early on in my Spark install with numpy not being found, because Mac OS X has two Python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with numpy).

  1. Install Details
    • Mac OS X Yosemite
    • Spark spark-1.4.0-bin-hadoop2.6
    • Python is specified via spark-env.sh as:
    • PYSPARK_PYTHON=/usr/bin/python
    • PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
    • alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
    • PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
    • declare -x PYSPARK_DRIVER_PYTHON="ipython"

Answer 1:


As you've noticed in your comment, sc on the worker nodes is None; the SparkContext is only defined on the driver node. MLlib's Python wrappers (callMLlibFunc in the traceback) reach the JVM through sc._jvm, so Statistics.chiSqTest can only be called on the driver, not inside an RDD transformation that runs on the executors.
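A minimal workaround sketch (not part of the original answer): since the MLlib wrapper needs the driver's SparkContext, one option is to run the chi-square test in pure Python inside the closure, for example with scipy.stats.chisquare, assuming SciPy is installed in the worker Python environment and that its goodness-of-fit test against a uniform expected distribution is an acceptable substitute for MLlib's.

from scipy.stats import chisquare

# vectors is assumed to be the same RDD of (key, list(int)) pairs as above.
# chisquare runs entirely in the Python worker process, so no JVM gateway
# (sc._jvm) is needed; index [1] of the result is the p-value.
keys_to_chi = vectors.mapValues(lambda obs: chisquare(obs)[1])
keys_to_chi.collectAsMap()

Alternatively, keep the collectAsMap approach from the question and call Statistics.chiSqTest on the driver, which is where the SparkContext exists.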



Source: https://stackoverflow.com/questions/30998543/spark-pyspark-having-difficulty-calling-statistics-methods-on-worker-node
