Update the internal state of RDD elements

余生长醉 提交于 2019-12-13 03:07:36

问题


I'm newbie in Spark and I want to update the internal state of my RDD's elements with rdd.foreach method, but it doesn't work. Here is my code example:

class Test extends Serializable{
  var foo = 0.0
  var bar = 0.0

  def updateFooBar() = {
    foo = Math.random()
    bar = Math.random()
  }
}

var testList = Array.fill(5)(new Test())
var testRDD = sc.parallelize(testList)
testRDD.foreach{ x => x.updateFooBar() }
testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) }

and the result is:

0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0

回答1:


RDDs are immutable by design. This design choice makes them more robust, as mutation is a common source of bugs, and it supports the "resilient" part of the RDD name (resilient distributed dataset); if a partition in a downstream RDD is lost, Spark can reconstruct it from its parents. So, it's best to think of Spark programming as construction of dataflows, even when you're not doing streaming.

On foreach, it's designed for "pure side effect" operations, like writing to disk, a database, or the console.



来源:https://stackoverflow.com/questions/29679587/update-the-internal-state-of-rdd-elements

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!