I want to convert an RDD to a DataFrame and cache the result:
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as fn

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4),  # .cache(),
    schema=schema,
    verifySchema=False
).orderBy("t")  # .cache()
- If you don't use a cache function, no job is generated.
- If you use cache only after the orderBy, one job is generated for the cache (this can be verified with the job-counting sketch below).
- If you use cache only after the parallelize, no job is generated.
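The job counts above come from watching the Spark UI. A minimal sketch for counting them programmatically is shown below; it is not part of the original post, and the SparkSession/SparkContext setup and the jobs_triggered_by helper are illustrative. It tags each statement with a job group and asks the status tracker how many jobs ran in that group.

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
rdd = sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4)
df = spark.createDataFrame(rdd, schema=schema, verifySchema=False)

def jobs_triggered_by(label, action):
    # Run `action` inside a named job group and report how many jobs it started.
    sc.setJobGroup(label, label)
    action()
    n = len(sc.statusTracker().getJobIdsForGroup(label))
    print(f"{label}: {n} job(s)")

jobs_triggered_by("cache-unsorted", lambda: df.cache())             # per the question: 0 jobs
jobs_triggered_by("cache-sorted", lambda: df.orderBy("t").cache())  # per the question: 1 job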
Why does cache generate a job in this one case?
How can I avoid the job generation of cache (caching the DataFrame and not an RDD)?
Edit: I investigated the problem further and found that without the orderBy("t") no job is generated. Why?
I submitted a bug ticket and it was closed with the following reason:
Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
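Not from the original thread, but given that explanation, a hedged workaround sketch: if the extra job is a problem, one option is to cache before the global sort so the sort stays a lazy step on top of the cached data; another is sortWithinPartitions when a per-partition order is enough, since it avoids range partitioning and therefore the bounds scan, but gives no total order. Variable names below are illustrative.

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])
rdd = sc.parallelize([Row(t=float(i / 10), value=float(i * i)) for i in range(1000)], 4)
df = spark.createDataFrame(rdd, schema=schema, verifySchema=False)

# Option 1: cache the unsorted DataFrame (no job), and sort lazily when consuming it.
df_cached = df.cache()
result = df_cached.orderBy("t")  # the global sort is planned on top of the cached data

# Option 2: per-partition sort; no global range partitioning, hence no bounds scan,
# but rows are ordered only within each partition, not across the whole DataFrame.
df_local_sort = df.sortWithinPartitions("t").cache()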
Source: https://stackoverflow.com/questions/42951939/caching-ordered-spark-dataframe-creates-unwanted-job