Caching ordered Spark DataFrame creates unwanted job

再見小時候 2020-12-11 15:54

I want to convert an RDD to a DataFrame and cache the result:

from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql
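
The code block above appears to be cut off. A minimal sketch of the pattern being described (the schema, column names, and sample data here are my own assumptions, not the original code) might look like this:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Build a DataFrame from an RDD, impose a global order, and ask Spark to cache it.
rdd = sc.parallelize([Row(id=i, value=float(i * i)) for i in range(1000)])
df = spark.createDataFrame(rdd).orderBy("id").cache()

Even though cache() is lazy, calling it on the ordered DataFrame already shows a job in the Spark UI, which is the unwanted job the title refers to.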
1 Answer
  • 2020-12-11 16:54

    I submitted a bug ticket and it was closed with the following reason:

    Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
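
    To illustrate what that means in code (a sketch with assumed column names and data, not taken from the ticket): the eager job comes from the range exchange that a global orderBy introduces, while a sort that stays within partitions needs no partition bounds and so should not trigger it.

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()
    rows = [Row(id=i, value=float(i * i)) for i in range(1000)]
    unsorted_df = spark.createDataFrame(spark.sparkContext.parallelize(rows))

    ordered = unsorted_df.orderBy("id")
    ordered.explain()  # physical plan shows an Exchange rangepartitioning(id ASC ...) step

    # Caching the globally ordered DataFrame launches a scan right away:
    # the range partitioner has to sample the data to find the partition bounds.
    ordered.cache()

    # A per-partition sort needs no exchange and no bounds, so caching it
    # should stay lazy (no extra job):
    unsorted_df.sortWithinPartitions("id").cache()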
