Temp table caching with spark-sql

Asked 2020-12-01 14:08

Is a table registered with registerTempTable (createOrReplaceTempView with spark 2.+) cached?

Using Zeppelin, I register a DataFrame in my Scala code after heavy computation, and then within %pyspark I want to access it and filter it further. Will it use the memory-cached version of the table, or will it be rebuilt each time?

2 Answers
  • 2020-12-01 14:56

    It is not. You should cache explicitly:

    sqlContext.cacheTable("someTable")
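
    In Spark 2.+, where SparkSession replaces SQLContext as the entry point, the equivalent call is sketched below (same placeholder table name, assuming the usual spark session variable):

    // Spark 2.+ equivalent, via the SparkSession catalog
    spark.catalog.cacheTable("someTable")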
    
  • 2020-12-01 14:57

    Registered tables are not cached in memory.

    The registerTempTable (createOrReplaceTempView in Spark 2.+) method just creates or replaces a view of the given DataFrame with a given query plan.

    It converts the query plan to a canonicalized SQL string and stores it as view text in the metastore only when a permanent view needs to be created.
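
    To make the distinction concrete, here is a minimal sketch; the names tmp_view, perm_view and base_table are hypothetical, and the permanent view assumes base_table already exists as a metastore table:

    // Temporary view: only a query plan in the session catalog, nothing in the metastore
    df.createOrReplaceTempView("tmp_view")

    // Permanent view: the canonicalized SQL text is stored in the metastore
    spark.sql("CREATE OR REPLACE VIEW perm_view AS SELECT * FROM base_table")

    // Neither call caches any data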

    You'll need to cache your DataFrame explicitly, e.g.:

    df.createOrReplaceTempView("my_table")  # df.registerTempTable("my_table") for Spark < 2.0
    spark.catalog.cacheTable("my_table")    # sqlContext.cacheTable("my_table") for Spark < 2.0
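
    As a sanity check, the Spark 2.+ catalog also exposes isCached (a Scala sketch, reusing the view registered above):

    // true once the table has been marked for caching
    spark.catalog.isCached("my_table")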
    

    EDIT:

    Let's illustrate this with an example:

    Using cacheTable:

    scala> val df = Seq(("1",2),("b",3)).toDF
    // df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
    
    scala> sc.getPersistentRDDs
    // res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
    
    scala> df.createOrReplaceTempView("my_table")
    
    scala> sc.getPersistentRDDs
    // res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
    
    scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("...") before Spark 2.0
    
    scala> sc.getPersistentRDDs
    // res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(2 -> In-memory table my_table MapPartitionsRDD[2] at cacheTable at <console>:26)
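
    To release the cache again, a sketch (the output shown is the expected result, not captured from the session above):

    scala> spark.catalog.uncacheTable("my_table")

    scala> sc.getPersistentRDDs
    // expected: an empty Map() again, since the in-memory table was dropped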
    

    Now the same example using cache.registerTempTable (cache.createOrReplaceTempView in Spark 2.+):

    scala> sc.getPersistentRDDs
    // res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
    
    scala> val df = Seq(("1",2),("b",3)).toDF
    // df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
    
    scala> df.createOrReplaceTempView("my_table")
    
    scala> sc.getPersistentRDDs
    // res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
    
    scala> df.cache.createOrReplaceTempView("my_table")
    
    scala> sc.getPersistentRDDs
    // res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = 
    // Map(2 -> ConvertToUnsafe
    // +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
    //  MapPartitionsRDD[2] at cache at <console>:28)
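
    One caveat: cache (like persist) is lazy, so the entry above only marks the plan for persistence; the in-memory table is materialized when the first action runs. A minimal sketch:

    // an action such as count() materializes the cached table
    spark.table("my_table").count()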
    