Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB

一生所求 2020-12-01 10:32

I've just created a Python list from range(1, 100000).

Using SparkContext, I did the following:

    a = sc.parallelize([i for i in range(1, 100000)])

This raised the warning in the title.

3 Answers
  •  失恋的感觉 2020-12-01 11:15

    Expanding on @leo9r's comment: consider using sc.range (https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.SparkContext.range) rather than a Python range.

    That way you avoid transferring the huge list from your driver to the executors.

    Of course, such RDDs are usually used for testing purposes only, so you do not want them to be broadcast from the driver at all.
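
    A minimal sketch of the difference, assuming an existing SparkContext named sc (the variable names a and b are purely illustrative):

        # Serializes the entire driver-side list into the tasks it ships out,
        # which is what triggers the "task of very large size" warning:
        a = sc.parallelize([i for i in range(1, 100000)])

        # Each partition only receives its start/end bounds and generates
        # its slice of numbers on the executor, so no large list is shipped:
        b = sc.range(1, 100000)

        # Both RDDs contain the same numbers:
        assert a.sum() == b.sum()

    Because sc.range hands each partition just the bounds rather than the data itself, the per-task payload stays small no matter how large the range is.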
