Question
In this piece of code from the Spark Programming Guide,
# The result of loading a parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")
# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect()
What exactly happens in the Java heap (how is the Spark memory managed) when each line is executed?
I have these questions specifically:
- Is sqlContext.read.parquet lazy? Does it cause the whole parquet file to be loaded in memory?
- When the collect action is executed, for the SQL query to be applied,
  a. is the entire parquet file first stored as an RDD and then processed, or
  b. is the parquet file processed first to select only the name column, then stored as an RDD, and then filtered based on the age condition by Spark?
Answer 1:
Is sqlContext.read.parquet lazy?
Yes. By default, all transformations in Spark are lazy; read.parquet builds a DataFrame but does not load the whole file into memory.
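To make the laziness concrete, here is a small sketch (assuming the same sqlContext and people.parquet file as in the question); no Spark job runs until the final collect:

parquetFile = sqlContext.read.parquet("people.parquet")  # returns a DataFrame right away; only Parquet metadata (the schema) is touched
parquetFile.registerTempTable("parquetFile")             # just registers a name for SQL, reads no data
teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")  # builds a logical plan, still no data read
rows = teenagers.collect()                               # the action: Spark now schedules tasks and scans the file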
When the collect action is executed, for the SQL query to be applied
a. is the entire parquet file first stored as an RDD and then processed, or
b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?
For each action, Spark generates a new RDD. Also, Parquet is a columnar format, so only the columns the query needs (here name and age) have to be read at all. In addition, Parquet readers use push-down filters to further reduce disk I/O: push-down filters allow filtering decisions to be made before the data is even read into Spark. So only part of the file will be loaded into memory.
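One way to check this yourself (a sketch, assuming the same temp table as in the question) is to print the physical plan before calling the action:

teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.explain()
# The Parquet scan in the printed plan lists only the columns it actually needs
# (name, age) and typically a PushedFilters entry such as
# [GreaterThanOrEqual(age,13), LessThanOrEqual(age,19)], showing that the age
# predicate is handed to the Parquet reader rather than applied after a full scan.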
Source: https://stackoverflow.com/questions/37747122/lazy-evaluation-in-sparksql