Apache Spark: pyspark crash for large dataset

清歌不尽 · 2021-01-01 20:48

I am new to Spark. I have an input file with training data of size 4000x1800. When I try to train on this data (written in Python), I get the following error:

  1. 14/11/15 22:39:13
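
The question does not include the training code, but for context, a minimal sketch of the kind of PySpark MLlib job that could hit this is below. The input path, the parsing scheme (label in the first column, space-separated features), and the choice of LogisticRegressionWithSGD are assumptions; the post only says the data is a 4000x1800 matrix trained from Python on a Spark 1.x cluster.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="train-4000x1800")

    def parse_line(line):
        # assumed layout: label first, then 1800 space-separated feature values
        values = [float(x) for x in line.split()]
        return LabeledPoint(values[0], values[1:])

    # hypothetical path; the original post does not give one
    data = sc.textFile("hdfs:///path/to/training_data.txt").map(parse_line)

    # train a simple classifier over the parsed rows
    model = LogisticRegressionWithSGD.train(data, iterations=100)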

5 Answers
  •  予麋鹿 · 2021-01-01 21:01

    I got the same error, and then I found a related answer under "pyspark process big datasets problems".

    The solution is to add some code to python/pyspark/worker.py.

    Add the following two lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
    

    so the process function now looks like this (in Spark 1.5.2, at least):

     def process():
         # deserialize the partition's records streamed in from the JVM
         iterator = deserializer.load_stream(infile)
         # run the user's function and stream the results back out
         serializer.dump_stream(func(split_index, iterator), outfile)
         # drain whatever the user's function did not consume
         for obj in iterator:
             pass
    

    This works for me. Presumably draining the leftover input keeps the Python worker from shutting down its end of the stream while the JVM is still sending data, which otherwise shows up as a crash on large inputs.
