I'm submitting the application as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html
Spark version 1.1.0:
./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \
    /home/hadoop/loganalysis/ship-test.py
and the conf in the code:
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis")
        .set("spark.executor.memory", "1g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.num", "2")
        .set("spark.driver.memory", "4g")
        .set("spark.kryoserializer.buffer.mb", "128"))
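For reference, the zip can also be attached programmatically rather than (or in addition to) via --py-files; a minimal sketch of that alternative, using the standard SparkContext pyFiles argument / addPyFile with the paths from above (this only changes how the file is shipped, not the import semantics):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("LogAnalysis"))

# pyFiles entries are shipped to the executors and added to their sys.path
sc = SparkContext(conf=conf,
                  pyFiles=["/home/hadoop/loganalysis/parser-src.zip"])

# or, after the context already exists:
# sc.addPyFile("/home/hadoop/loganalysis/parser-src.zip")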
and the slave nodes complain with an ImportError:
14/12/25 05:09:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
ImportError: No module named parser
and parser-src.zip itself works when tested locally:
[hadoop@ip-172-31-10-231 ~]$ python
Python 2.7.8 (default, Nov 3 2014, 10:17:30)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.insert(1, '/home/hadoop/loganalysis/parser-src.zip')
>>> from parser import parser
>>> parser.parse
<function parse at 0x7fa5ef4c9848>
>>>
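A quick way to double-check the layout inside the archive (a sketch using only the standard library): zipimport can only find the package if parser/__init__.py and parser/parser.py sit at the top level of the zip, not under an extra directory.

import zipfile

# list what is actually inside the archive; for `from parser import parser`
# to work via zipimport, entries like parser/__init__.py and parser/parser.py
# have to sit at the top level of the zip
with zipfile.ZipFile('/home/hadoop/loganalysis/parser-src.zip') as zf:
    for name in zf.namelist():
        print(name)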
I'm trying to get information from the remote workers: whether the files were copied there, what sys.path looks like, and so on. It's tricky.
UPDATE: using the snippet below, I found that the zip file was shipped and sys.path was set, but the import still fails.
import os
import sys
from parser import parser

data = list(range(4))
disdata = sc.parallelize(data)
result = disdata.map(lambda x: "sys.path: {0}\nDIR: {1}\nFILES: {2}\nparser: {3}".format(
    sys.path, os.getcwd(), os.listdir('.'), str(parser)))
result.collect()
print(result.take(4))
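To get the actual failure text back from an executor, a variant like this can help (a sketch; try_import is just a throwaway helper name, not part of my code). It does the import inside the task and returns the traceback instead of killing the job:

import traceback

def try_import(_):
    # run the same import the closure needs, but inside the executor,
    # and hand back the traceback text instead of failing the task
    try:
        from parser import parser
        return "import ok: " + repr(parser)
    except Exception:
        return traceback.format_exc()

print(sc.parallelize([0], 1).map(try_import).collect())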
It seems I have to dig into cloudpickle, which means I first need to understand how cloudpickle works and where it fails.
: An error occurred while calling o40.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 23, ip-172-31-10-8.cn-north-1.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 75, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark/python/pyspark/serializers.py", line 150, in _read_with_length
    return self.loads(obj)
  File "/home/hadoop/spark/python/pyspark/cloudpickle.py", line 811, in subimport
    __import__(name)
ImportError: ('No module named parser', <function subimport at 0x7f219ffad7d0>, ('parser.parser',))
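From the traceback, the deserializer on the worker calls cloudpickle's subimport helper to re-import every module the pickled closure refers to; roughly (paraphrased, not necessarily the exact source) it does:

import sys

def subimport(name):
    # what pyspark/cloudpickle.py does when it unpickles a module reference:
    # re-import the module by its dotted name on the worker
    __import__(name)
    return sys.modules[name]

# so on the executor the deserializer effectively runs:
#   subimport('parser.parser')
# which fails unless 'parser' is importable from the worker's sys.path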
UPDATE:
Someone encountered the same problem in Spark 0.8: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Importing-other-py-files-in-PYTHONPATH-td2301.html
But he put his lib in Python's dist-packages and the import worked. I tried that and still get the ImportError.
UPDATE:
Oh gosh.. I think the problem is caused by my not understanding zip files and Python import behaviour. Passing parser.py itself to --py-files works (it then complains about another dependency), and zipping only the .py files [not including .pyc] seems to work too.
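For anyone reproducing this, a minimal way to build such a .py-only zip (a sketch using only the standard library; it assumes the package source sits at /home/hadoop/loganalysis/parser, which may differ from the real layout):

import os
import zipfile

# assumption: the package source lives at /home/hadoop/loganalysis/parser
src_dir = '/home/hadoop/loganalysis/parser'
out_zip = '/home/hadoop/loganalysis/parser-src.zip'

with zipfile.ZipFile(out_zip, 'w') as zf:
    for root, dirs, files in os.walk(src_dir):
        for name in files:
            if name.endswith('.py'):  # include .py only, skip .pyc
                full = os.path.join(root, name)
                # store paths relative to the parent of the package, so the
                # zip root contains parser/__init__.py, parser/parser.py, ...
                arcname = os.path.relpath(full, os.path.dirname(src_dir))
                zf.write(full, arcname)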
But I still don't quite understand why.