I have an RMI cluster. Each RMI server has a Spark context. Is there any way to share an RDD between different Spark contexts?
As Daniel Darabos already stated, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications, you have to use a shared context (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object, there is really not much to share.
Sharing data is a completely different problem. You can use a specialized in-memory cache (such as Apache Ignite) or a distributed in-memory file system (such as Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot avoid it entirely.
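For a concrete feel of the shared file system approach, here is a minimal sketch assuming an Alluxio master at alluxio://alluxio-master:19998 and the Alluxio client jar on the Spark classpath (the address and path are placeholders, not anything from the question):

```scala
// Sketch only: the Alluxio master address and path are placeholders, and the
// Alluxio client must be on the Spark classpath for the alluxio:// scheme.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("producer"))

// Application A: materialize the RDD into the shared in-memory file system.
val numbers = sc.parallelize(1 to 1000000)
numbers.saveAsTextFile("alluxio://alluxio-master:19998/shared/numbers")

// Application B (a different driver with its own SparkContext) rebuilds an
// equivalent RDD by reading the same path, ideally without touching disk:
// val loaded = otherSc.textFile("alluxio://alluxio-master:19998/shared/numbers")
```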
No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would hold the SparkContext and kick off operations on the RDDs.
If you just want to move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
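A rough sketch of that hand-off, assuming both driver programs can reach the same HDFS path (the path and application names are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver program 1: persist the RDD's contents to shared storage.
val sc1 = new SparkContext(new SparkConf().setAppName("driver-1"))
val rdd = sc1.parallelize(Seq("a", "b", "c"))
rdd.saveAsObjectFile("hdfs:///tmp/shared-rdd") // serialized Java objects
sc1.stop()

// Driver program 2 (a separate application with its own SparkContext):
// rebuild an equivalent RDD by reading the same path.
val sc2 = new SparkContext(new SparkConf().setAppName("driver-2"))
val reloaded = sc2.objectFile[String]("hdfs:///tmp/shared-rdd")
println(reloaded.count())
```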
You can't natively. In my understanding, an RDD is not data, but a way to create data via transformations/filters from the original data.
Another idea is to share the final data instead: store the RDD's output in a data store such as:
- HDFS (a Parquet file, etc.)
- Elasticsearch
- Apache Ignite (in-memory)
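For example, the Parquet-on-HDFS option might look like the sketch below, written against the Spark 1.x SQLContext API mentioned elsewhere on this page (paths and column names are placeholders):

```scala
// Sketch only: paths and column names are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("writer"))
val sqlContext = new SQLContext(sc)

// Application A writes the computed result once.
val users = sqlContext.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
users.write.mode("overwrite").parquet("hdfs:///shared/users.parquet")

// Application B, with its own SparkContext/SQLContext, reads it back later:
// val reloaded = otherSqlContext.read.parquet("hdfs:///shared/users.parquet")
```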
I think you will love Apache Ignite: https://ignite.apache.org/features/igniterdd.html
Apache Ignite provides an implementation of the Spark RDD abstraction which allows you to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications.
IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed either within the Spark job executing process, on a Spark worker, or in its own cluster.
(I'll let you dig through their documentation to find what you are looking for.)
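As a starting point, here is a hedged sketch of what sharing through IgniteRDD can look like, assuming the ignite-spark module is on the classpath and an Ignite node is reachable from the executors; the cache name and configuration are placeholders, and the exact signatures vary a little between Ignite versions:

```scala
// Sketch only: cache name and configuration are placeholders.
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-share"))
val ic = new IgniteContext(sc, () => new IgniteConfiguration())

// Application A writes key/value pairs into a named, shared Ignite cache.
val shared = ic.fromCache[Int, Int]("sharedRdd")
shared.savePairs(sc.parallelize(1 to 100000).map(i => (i, i * i)))

// Application B (a different SparkContext) attaches to the same cache name and
// sees the same data in memory, e.g.:
// val other = new IgniteContext(sc2, () => new IgniteConfiguration()).fromCache[Int, Int]("sharedRdd")
// other.filter(_._2 > 100).count()
```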
Source: https://stackoverflow.com/questions/27917784/how-to-share-spark-rdd-between-2-spark-contexts