Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB

一生所求 · 2020-12-01 10:32

I've just created a Python list from range(1, 100000).

Using SparkContext, I did the following:

a = sc.parallelize([i for i in range(1, 100000)])
3 Answers
轻奢々 (OP) · 2020-12-01 10:57

Spark natively ships a copy of each closure variable along with every task. For large variables, you may want to use Broadcast Variables instead.

If you are still facing size problems, then perhaps this data should be an RDD in itself.

    edit: Updated the link
