Lazy Evaluation in SparkSQL


Question


In this piece of code from the Spark Programming Guide,

# The result of loading a parquet file is also a DataFrame.
parquetFile = sqlContext.read.parquet("people.parquet")

# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect()

What exactly happens in the Java heap (how is the Spark memory managed) when each line is executed?

I have these questions specifically

  1. Is sqlContext.read.parquet lazy? Does it cause the whole parquet file to be loaded in memory?
  2. When the collect action is executed, for the SQL query to be applied,

    a. is the entire parquet first stored as an RDD and then processed or

    b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?


Answer 1:


Is sqlContext.read.parquet lazy?

Yes. By default, all transformations in Spark are lazy.
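
A minimal sketch of this, reusing the question's code (the file "people.parquet" and the sqlContext are assumed to exist): every line below except the last returns without running a Spark job.

parquetFile = sqlContext.read.parquet("people.parquet")   # returns quickly; at most the Parquet footer/schema is read, not the data
parquetFile.registerTempTable("parquetFile")              # only registers a table name, no data is read
teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")  # builds a query plan, still no job

# Only the action triggers a job that actually reads the file and computes the result:
teenagers.collect()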

When the collect action is executed, for the SQL query to be applied

a. is the entire parquet first stored as an RDD and then processed or

b. is the parquet file processed first to select only the name column, then stored as an RDD and then filtered based on the age condition by Spark?

On each action, Spark generates a new RDD. Also, Parquet is a columnar format, and Parquet readers use push-down filters to further reduce disk I/O. Push-down filters allow early data-selection decisions to be made before the data is even read into Spark, so only the needed part of the file is loaded into memory.
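
You can see both effects (column pruning and predicate push-down) by printing the query plan. A small sketch, assuming the same temp table as above; the exact plan text varies across Spark versions:

teenagers = sqlContext.sql(
    "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

# The physical plan for the Parquet scan typically lists only the columns the
# query needs (name, age) and shows the age predicates as pushed filters,
# meaning the filter is handed to the Parquet reader instead of being applied
# after loading every row.
teenagers.explain(True)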



Source: https://stackoverflow.com/questions/37747122/lazy-evaluation-in-sparksql
