Caching ordered Spark DataFrame creates unwanted job

Submitted by 旧城冷巷雨未停 on 2019-11-27 07:39:15

Question


I want to convert an RDD to a DataFrame and cache the result:

from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4), #.cache(),
    schema=schema,
    verifySchema=False
).orderBy("t") #.cache()
  • If you don't call cache at all, no job is generated (the three placements are sketched after this list).
  • If you call cache only after the orderBy, one job is generated for the cache.
  • If you call cache only on the RDD (after the parallelize), no job is generated.
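
For reference, here is a sketch of the three cache placements described in the list above, assuming the same spark, sc, schema, and Row as in the snippet:

rdd = sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4)

# Case 1: no cache at all -- the plan stays lazy, so no job is triggered.
df_plain = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t")

# Case 2: cache after the orderBy -- the variant that triggers one job.
df_cached_sorted = spark.createDataFrame(rdd, schema=schema, verifySchema=False) \
    .orderBy("t") \
    .cache()

# Case 3: cache only the input RDD -- no job is triggered.
df_from_cached_rdd = spark.createDataFrame(rdd.cache(), schema=schema, verifySchema=False) \
    .orderBy("t")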

Why does cache generate a job in this one case? How can I avoid the job triggered by cache (while caching the DataFrame rather than the RDD)?

Edit: I investigated the problem further and found that without the orderBy("t") no job is generated. Why?


Answer 1:


I submitted a bug ticket and it was closed with the following reason:

Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
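
Based on that explanation, one possible workaround (a sketch under my own assumptions, not part of the original answer) is to cache before the global sort, or to use sortWithinPartitions, which sorts each partition locally and therefore should not need the partition-bounds scan. Using the rdd and schema defined earlier:

# Sketch: cache the unsorted DataFrame, then apply the global sort on top of the cached plan.
df_unsorted = spark.createDataFrame(rdd, schema=schema, verifySchema=False).cache()
df_sorted = df_unsorted.orderBy("t")   # the global sort still runs when an action is executed

# Sketch: a per-partition sort avoids range partitioning entirely,
# but only works if a global ordering is not actually required.
df_local_sort = spark.createDataFrame(rdd, schema=schema, verifySchema=False) \
    .sortWithinPartitions("t") \
    .cache()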



Source: https://stackoverflow.com/questions/42951939/caching-ordered-spark-dataframe-creates-unwanted-job
