Difference between caching mechanisms in Spark SQL

∥☆過路亽.° submitted on 2019-12-06 13:32:47

In Spark SQL, caching behaves differently depending on whether you use SQL directly or the DataFrame DSL. With the DSL, caching is lazy, so after calling

my_df.cache()

the data is not cached immediately. Only the information that it should be cached is added to the query plan, and the data is actually cached the first time an action is called on the DataFrame.

With SQL used directly, as in your example, caching is eager by default. So in your Method 1, a job runs immediately and the data is loaded into memory. In your Method 2, a job runs as soon as the caching statement is executed:

cache table test_cache;

With SQL, caching can also be made lazy by using the lazy keyword explicitly:

cache lazy table test_cache;

In this case no job runs immediately, and the data is put into memory only when an action is first called against the table test_cache.

To conclude, both of your methods are equivalent in terms of caching: the data is cached eagerly as soon as the code block has run.
