Caching ordered Spark DataFrame creates unwanted job

Submitted by 旧城冷巷雨未停 on 2019-11-27 07:39:15

Question


I want to convert an RDD to a DataFrame and cache the result:

from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4), #.cache(),
    schema=schema,
    verifySchema=False
).orderBy("t") #.cache()
  • If you don't call cache at all, no job is generated (the three placements are sketched after this list).
  • If you call cache only after the orderBy, one job is generated for the cache.
  • If you call cache only on the RDD (after the parallelize), no job is generated.
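
For reference, here is a sketch of the three cache placements described in the list above, assuming the same spark, sc, schema, and Row as in the snippet:

rdd = sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4)

# Case 1: no cache at all -- the plan stays lazy, so no job is triggered.
df_plain = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t")

# Case 2: cache after the orderBy -- the variant that triggers one job.
df_cached_sorted = spark.createDataFrame(rdd, schema=schema, verifySchema=False) \
    .orderBy("t") \
    .cache()

# Case 3: cache only the input RDD -- no job is triggered.
df_from_cached_rdd = spark.createDataFrame(rdd.cache(), schema=schema, verifySchema=False) \
    .orderBy("t")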

Why does cache generate a job in this one case? How can I avoid the job triggered by cache (while caching the DataFrame rather than the RDD)?

Edit: I investigated the problem further and found that without the orderBy("t") no job is generated. Why?


Answer 1:


I submitted a bug ticket and it was closed with the following reason:

Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
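
Based on that explanation, one possible workaround (a sketch under my own assumptions, not part of the original answer) is to cache before the global sort, or to use sortWithinPartitions, which sorts each partition locally and therefore should not need the partition-bounds scan. Using the rdd and schema defined earlier:

# Sketch: cache the unsorted DataFrame, then apply the global sort on top of the cached plan.
df_unsorted = spark.createDataFrame(rdd, schema=schema, verifySchema=False).cache()
df_sorted = df_unsorted.orderBy("t")   # the global sort still runs when an action is executed

# Sketch: a per-partition sort avoids range partitioning entirely,
# but only works if a global ordering is not actually required.
df_local_sort = spark.createDataFrame(rdd, schema=schema, verifySchema=False) \
    .sortWithinPartitions("t") \
    .cache()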



Source: https://stackoverflow.com/questions/42951939/caching-ordered-spark-dataframe-creates-unwanted-job
