Question
In this piece of code from the Spark Programming Guide,
# The result of loading a parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")
# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect()
What exactly happens in the Java heap (how is the Spark memory managed) when each line is executed?
I have these questions specifically:
- Is sqlContext.read.parquet lazy? Does it cause the whole parquet file to be loaded in memory?
- When the collect action is executed, for the SQL query to be applied,
  a. is the entire parquet file first stored as an RDD and then processed, or
  b. is the parquet file processed first to select only the name column, then stored as an RDD, and then filtered based on the age condition by Spark?
Answer 1:
Is sqlContext.read.parquet lazy?
Yes. By default, all transformations in Spark are lazy; read.parquet builds a DataFrame but does not load the whole file into memory.
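To make the laziness concrete, here is a small sketch (assuming the same sqlContext and people.parquet file as in the question); no Spark job runs until the final collect:

parquetFile = sqlContext.read.parquet("people.parquet")  # returns a DataFrame right away; only Parquet metadata (the schema) is touched
parquetFile.registerTempTable("parquetFile")             # just registers a name for SQL, reads no data
teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")  # builds a logical plan, still no data read
rows = teenagers.collect()                               # the action: Spark now schedules tasks and scans the file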
When the collect action is executed, for the SQL query to be applied
a. is the entire parquet file first stored as an RDD and then processed, or
b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?
For each action, Spark generates a new RDD. Also, Parquet is a columnar format, so only the columns the query needs (here name and age) have to be read at all. In addition, Parquet readers use push-down filters to further reduce disk I/O: push-down filters allow filtering decisions to be made before the data is even read into Spark. So only part of the file will be loaded into memory.
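One way to check this yourself (a sketch, assuming the same temp table as in the question) is to print the physical plan before calling the action:

teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.explain()
# The Parquet scan in the printed plan lists only the columns it actually needs
# (name, age) and typically a PushedFilters entry such as
# [GreaterThanOrEqual(age,13), LessThanOrEqual(age,19)], showing that the age
# predicate is handed to the Parquet reader rather than applied after a full scan.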
Source: https://stackoverflow.com/questions/37747122/lazy-evaluation-in-sparksql