How to cache a Spark data frame and reference it in another script


Question


Is it possible to cache a data frame and then reference (query) it in another script? My goal is as follows:

  1. In script 1, create a data frame (df)
  2. Run script 1 and cache df
  3. In script 2, query data in df

Answer 1:


Spark >= 2.1.0

Since Spark 2.1 you can create global temporary views (createGlobalTempView), which can be shared across multiple SparkSessions within the same application, as long as that application is kept alive:

The lifetime of this temporary view is tied to this Spark application.

Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
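A minimal PySpark sketch of this approach, assuming both "scripts" run as sessions of the same application (the view name people and the sample rows are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-views").getOrCreate()

# "Script 1": create a DataFrame and register it as a global temporary view.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createGlobalTempView("people")

# "Script 2": any other session in the SAME application can query the view,
# as long as it uses the qualified global_temp.<view> name.
other_session = spark.newSession()
other_session.sql("SELECT * FROM global_temp.people").show()
```

Note that this does not work across separate spark-submit invocations; each of those is its own application, and the view is dropped when the creating application terminates.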

Spark < 2.1.0

It is not possible using standard Spark binaries. A Spark DataFrame is bound to the specific SQLContext that was used to create it and is not accessible outside of it.

There are tools, for example Apache Zeppelin or Databricks, which use a shared context injected into different sessions. This way you can share temporary tables between different sessions and/or guest languages.

There are other platforms, including spark-jobserver and Apache Ignite, which provide alternative ways to share distributed data structures. You can also take a look at the Livy server.

See also: Share SparkContext between Java and R Apps under the same Master




Answer 2:


You could also persist the actual data to a file or database and load it again. Spark provides methods to do this, so you don't need to collect the data to the driver.
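For example, a minimal sketch using Parquet (the path /tmp/shared_df and the sample DataFrame are illustrative); unlike global temp views, this works across entirely separate applications:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-df").getOrCreate()

# Script 1: write the DataFrame out in parallel from the executors;
# nothing is collected to the driver.
df = spark.range(100).withColumnRenamed("id", "value")
df.write.mode("overwrite").parquet("/tmp/shared_df")

# Script 2 (possibly a separate application): load the same data back.
df2 = spark.read.parquet("/tmp/shared_df")
df2.show(5)
```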



Source: https://stackoverflow.com/questions/35583493/how-to-cache-a-spark-data-frame-and-reference-it-in-another-script
