Apache Spark: pyspark crash for large dataset

清歌不尽 · 2021-01-01 20:48

I am new to Spark. I have an input file with training data of size 4000x1800. When I try to train on this data (written in Python), I get the following error:

  1. 14/11/15 22:39:13
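
The question does not include the training code, but for context, a minimal sketch of the kind of PySpark MLlib job that could hit this is below. The input path, the parsing scheme (label in the first column, space-separated features), and the choice of LogisticRegressionWithSGD are assumptions; the post only says the data is a 4000x1800 matrix trained from Python on a Spark 1.x cluster.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="train-4000x1800")

    def parse_line(line):
        # assumed layout: label first, then 1800 space-separated feature values
        values = [float(x) for x in line.split()]
        return LabeledPoint(values[0], values[1:])

    # hypothetical path; the original post does not give one
    data = sc.textFile("hdfs:///path/to/training_data.txt").map(parse_line)

    # train a simple classifier over the parsed rows
    model = LogisticRegressionWithSGD.train(data, iterations=100)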

5 Answers
  •  予麋鹿 · 2021-01-01 21:01

    I got the same error, and then I found a related answer under "pyspark process big datasets problems".

    The solution is to add some code to python/pyspark/worker.py.

    Add the following two lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
    

    so the process function now looks like this (in Spark 1.5.2, at least):

     def process():
         # deserialize the partition's records streamed in from the JVM
         iterator = deserializer.load_stream(infile)
         # run the user's function and stream the results back out
         serializer.dump_stream(func(split_index, iterator), outfile)
         # drain whatever the user's function did not consume
         for obj in iterator:
             pass
    

    This works for me. Presumably draining the leftover input keeps the Python worker from shutting down its end of the stream while the JVM is still sending data, which otherwise shows up as a crash on large inputs.
