Question
To broadcast a variable so that it is stored exactly once in memory on each node of the cluster, one can do: val myVarBroadcasted = sc.broadcast(myVar)
then retrieve it in RDD transformations like so:
myRdd.map(blar => {
  val myVarRetrieved = myVarBroadcasted.value
  // some code that uses it
})
.someAction
But suppose I now wish to perform more actions with a new broadcast variable - what if I don't have enough heap space because of the old broadcast variables? I want a function like
myVarBroadcasted.remove()
Now I can't seem to find a way of doing this.
Also, a closely related question: where do broadcast variables live? Do they go into the cache fraction of the total memory, or just into the heap fraction?
Answer 1:
If you want to remove the broadcast variable from both the executors and the driver you have to use destroy; using unpersist only removes it from the executors:
myVarBroadcasted.destroy()
This method is blocking.
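For context, a minimal sketch of the full lifecycle might look like the following (run in a Spark shell or an application with an existing SparkContext sc; lookupTable is a hypothetical example value, not from the original post):
// Hypothetical example data; replace with your own value.
val lookupTable = Map("a" -> 1, "b" -> 2)
val lookupBc = sc.broadcast(lookupTable)

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(key => lookupBc.value.getOrElse(key, 0))
  .sum()

// Once no running or future job still reads lookupBc, release it on
// both the driver and the executors. After destroy() the broadcast
// can no longer be used.
lookupBc.destroy()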
Answer 2:
You are looking for unpersist, available from Spark 1.0.0:
myVarBroadcasted.unpersist(blocking = true)
Broadcast variables are stored as ArrayBuffers of deserialized Java objects or as serialized ByteBuffers. (Storage-wise they are treated similarly to RDDs - confirmation needed.)
The unpersist method removes them from both memory and disk on each executor node. The variable stays on the driver node, though, so it can be re-broadcast.
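As a rough sketch of that behaviour (assuming an existing SparkContext sc and the same kind of broadcast as in the question; the data is made up for illustration):
val lookupBc = sc.broadcast(Map("a" -> 1, "b" -> 2))

val firstPass = sc.parallelize(Seq("a", "b"))
  .map(key => lookupBc.value.getOrElse(key, 0))
  .collect()

// Drop the copies cached on the executors; blocking = true waits
// until the blocks have actually been removed.
lookupBc.unpersist(blocking = true)

// The value is still kept on the driver, so a later job that reads
// lookupBc.value triggers a fresh broadcast to the executors.
val secondPass = sc.parallelize(Seq("b"))
  .map(key => lookupBc.value.getOrElse(key, 0))
  .collect()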
Source: https://stackoverflow.com/questions/24585705/how-to-remove-dispose-a-broadcast-variable-from-heap-in-spark