Modify collection inside a Spark RDD foreach

Submitted by anonymous (unverified) on 2019-12-03 02:05:01

Question:

I'm trying to add elements to a map while iterating the elements of an RDD. I'm not getting any errors, but the modifications are not happening.

Everything works fine when I add entries directly or iterate over other collections:

scala> val myMap = new collection.mutable.HashMap[String,String]
myMap: scala.collection.mutable.HashMap[String,String] = Map()

scala> myMap("test1")="test1"

scala> myMap
res44: scala.collection.mutable.HashMap[String,String] = Map(test1 -> test1)

scala> List("test2", "test3").foreach(w => myMap(w) = w)

scala> myMap
res46: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

But when I try to do the same from an RDD:

scala> val fromFile = sc.textFile("tests.txt")
...
scala> fromFile.take(3)
...
res48: Array[String] = Array(test4, test5, test6)

scala> fromFile.foreach(w => myMap(w) = w)

scala> myMap
res50: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

To make sure the closure sees the same variable, I printed the map's pre-existing contents from inside the foreach, and they print correctly:

fromFile.foreach(w => println(myMap("test1")))
...
test1
test1
test1
...

I've also printed the modified element from inside the foreach, and it prints as modified; yet once the operation completes, the map appears unchanged.

scala> fromFile.foreach({w => myMap(w) = w; println(myMap(w))})
...
test4
test5
test6
...

scala> myMap
res55: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test1 -> test1, test3 -> test3)

Converting the RDD to an array (collect) also works fine:

fromFile.collect.foreach(w => myMap(w) = w)

scala> myMap
res89: scala.collection.mutable.HashMap[String,String] = Map(test2 -> test2, test5 -> test5, test1 -> test1, test4 -> test4, test6 -> test6, test3 -> test3)

Is this a context problem? Am I accessing a copy of the data that is being modified somewhere else?

Answer 1:

It becomes clearer when running on a Spark cluster rather than a single machine. The RDD is spread over several machines. When you call foreach, you tell each machine what to do with the piece of the RDD that it has. Any local variables you refer to (like myMap) get serialized and sent to those machines so they can use them, but nothing comes back: each executor mutates its own deserialized copy of the map, so your original myMap on the driver is unaffected.
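If the goal is simply to end up with the map on the driver, the idiomatic route is to build the key-value pairs on the executors and bring only the result back. A minimal sketch, reusing the tests.txt file and the key-equals-value pairing from the question, and assuming the result fits in driver memory:

// Transform on the executors, collect the result on the driver.
val fromFile = sc.textFile("tests.txt")

// collectAsMap() ships the (key, value) pairs back to the driver
// and assembles them into a local Map.
val myMap = fromFile.map(w => (w, w)).collectAsMap()

For genuine cross-executor aggregation from inside a foreach, Spark's accumulators are the supported mechanism; but for building a collection like this, transform-then-collect is the simpler choice.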

I think this answers your question, but obviously you are trying to accomplish something and you will not be able to get there this way. Feel free to explain here or in a separate question what you are trying to do, and I will try to help.


