Spark SQL: how to cache sql query result without using rdd.cache()

Submitted by 百般思念 on 2019-12-20 09:38:47

Question


Is there any way to cache a SQL query result without using rdd.cache()? For example:

output = sqlContext.sql("SELECT * From people")

We can use output.cache() to cache the result, but then we can no longer query it with SQL.

So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?


Answer 1:


You should use sqlContext.cacheTable("table_name") to cache it, or alternatively run the CACHE TABLE table_name SQL statement.

Here's an example. I've got this file on HDFS:

1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com

Then the code in PySpark (note: inferSchema and registerTempTable are the Spark 1.x API):

from pyspark.sql import Row

# Read the delimited file and map each line to a Row
people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
# Infer the schema from the Rows and register the result as a temp table
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')

Now we have a table and can query it:

sqlContext.sql('select * from people').collect()

To persist it, we have 3 options:

# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()

The 1st and 2nd options are preferred, because they cache the data in an optimized in-memory columnar format, while the 3rd caches it like any other RDD, in row-oriented fashion.
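To confirm which tables are cached, the SQLContext API offers isCached. A minimal sketch (this assumes a running SparkContext and the 'people' temp table registered as above):

```python
# Check whether the 'people' table is currently cached
sqlContext.cacheTable('people')
sqlContext.isCached('people')   # True after cacheTable

# Caching is lazy: the data is actually materialized in memory
# the first time a query scans the table
sqlContext.sql('select count(*) from people').collect()
```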

So going back to your question, here's one possible solution:

output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
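When you are done with the cached result, you can release the memory explicitly. A short sketch, assuming the 'people2' temp table from the snippet above:

```python
# Drop the cached data for this table (the temp table itself remains queryable;
# subsequent queries just recompute from source)
sqlContext.uncacheTable('people2')

# The SQL equivalent:
sqlContext.sql('UNCACHE TABLE people2')
```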



Answer 2:


The following is the closest equivalent to calling .cache() on an RDD, and it is helpful in Zeppelin or similar SQL-heavy environments:

CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query

You then get cached reads both for subsequent uses of interesting_query and for all queries against CACHED_TABLE.

This answer builds on the accepted one, but the power of AS is what makes the call useful in more constrained SQL-only environments, where you cannot call .collect() or perform RDD/DataFrame operations at all.
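A concrete sketch of the pattern, staying entirely in SQL (the table and column names here are illustrative, not from the original question):

```sql
-- Cache the result of a query under a new name
CACHE TABLE active_people AS
SELECT id, name, email FROM people WHERE email IS NOT NULL;

-- All further queries hit the cached columnar data
SELECT count(*) FROM active_people;

-- Release the memory when finished
UNCACHE TABLE active_people;
```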



Source: https://stackoverflow.com/questions/28027160/spark-sql-how-to-cache-sql-query-result-without-using-rdd-cache
