scala

Spark: Dataset Serialization

Submitted by 落花浮王杯 on 2020-12-28 23:50:08
Question: If I have a dataset, each record of which is a case class, and I persist that dataset as shown below so that serialization is used:

    myDS.persist(StorageLevel.MEMORY_ONLY_SER)

does Spark use Java/Kryo serialization to serialize the dataset? Or, as with a DataFrame, does Spark have its own way of storing the data in the dataset?

Answer 1: A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects into Spark's internal binary format.
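Below is a minimal sketch of that distinction, assuming a running SparkSession named spark and a hypothetical case class Person; the Dataset is persisted via its implicitly derived Encoder, while persisting the underlying RDD would fall back to the configured Java/Kryo serializer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("encoder-demo").getOrCreate()
    import spark.implicits._ // derives Encoder[Person] for the case class

    val myDS = Seq(Person("a", 1), Person("b", 2)).toDS()

    // Stored in Spark's compact internal binary format via the Encoder,
    // not with Java or Kryo serialization.
    myDS.persist(StorageLevel.MEMORY_ONLY_SER)

    // By contrast, the underlying RDD of JVM objects is serialized with
    // the configured serializer (Java by default, Kryo if enabled).
    myDS.rdd.persist(StorageLevel.MEMORY_ONLY_SER)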

How to remove / dispose a broadcast variable from heap in Spark?

Submitted by 人走茶凉 on 2020-12-28 13:20:17
Question: To broadcast a variable so that it occurs exactly once in memory per node of a cluster, one can do:

    val myVarBroadcasted = sc.broadcast(myVar)

and then retrieve it in RDD transformations like so:

    myRdd.map { blar =>
      val myVarRetrieved = myVarBroadcasted.value
      // some code that uses it
    }.someAction

But suppose I now wish to perform some more actions with a new broadcast variable - what if I haven't got enough heap space because of the old broadcast variables? I want a function like
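Spark's Broadcast API does provide such functions: unpersist() drops the cached copies on the executors (the variable can be re-broadcast if it is used again), while destroy() releases it everywhere and makes any further use fail. A minimal sketch, assuming sc, myVar, and myRdd are already in scope:

    val myVarBroadcasted = sc.broadcast(myVar)

    myRdd.map { x =>
      // use the broadcast value on the executors
      (x, myVarBroadcasted.value)
    }.count()

    // Remove the executor-side copies but keep the variable reusable;
    // blocking = true waits until the memory is actually freed.
    myVarBroadcasted.unpersist(blocking = true)

    // Or release it on driver and executors alike; any later call to
    // myVarBroadcasted.value will throw.
    myVarBroadcasted.destroy()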

Difference Await.ready and Await.result

Submitted by 末鹿安然 on 2020-12-27 17:15:42
Question: I know this is quite an open-ended question and I apologize. I can see that Await.ready returns the Awaitable itself while Await.result returns a T, but I still confuse the two. What is the difference between them? Is one blocking and the other non-blocking?

Answer 1: They both block until the future completes; the difference is just their return type. The difference is useful when your Future throws an exception:

    def a = Future { Thread.sleep(2000); 100 }
    def b = Future { Thread.sleep(2000); throw new
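A self-contained sketch of that difference, assuming the global ExecutionContext: Await.ready blocks and hands back the completed Future, so a failure stays wrapped in its value, whereas Await.result blocks and unwraps, rethrowing any exception:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    val ok   = Future { Thread.sleep(100); 42 }
    val boom = Future { Thread.sleep(100); throw new IllegalStateException("boom") }

    // Await.ready blocks, then returns the future itself; inspect .value.
    Await.ready(ok, 2.seconds).value   // Some(Success(42))
    Await.ready(boom, 2.seconds).value // Some(Failure(IllegalStateException))

    // Await.result blocks, then unwraps: the value, or the rethrown exception.
    Await.result(ok, 2.seconds)        // 42
    // Await.result(boom, 2.seconds)   // would throw IllegalStateException: boom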

Scala Futures and java 8 CompletableFuture

Submitted by 岁酱吖の on 2020-12-27 17:07:47
Question: The introduction of CompletableFuture in Java 8 brought to the language features previously available in scala.concurrent.Future, such as monadic transformations. What are the differences, and why should a Scala developer prefer Scala Futures over the Java 8 CompletableFuture? Are there still good reasons to use scala.concurrent.Future from Java through the akka.dispatch bridge?

Answer 1: What are the differences, and why should a Scala developer prefer Scala Futures over the Java 8 CompletableFuture? Rephrasing
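One concrete difference worth showing: Scala's combinators each take an implicit ExecutionContext, while CompletableFuture chains through thenApply/thenCompose on its own executor. A sketch of both sides plus the bridge, assuming Scala 2.13+ for scala.jdk.FutureConverters:

    import java.util.concurrent.CompletableFuture
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.jdk.FutureConverters._ // asScala / asJava conversions

    // Scala: map/flatMap each take the (implicit) ExecutionContext.
    val scalaF: Future[Int] = Future(21).map(_ * 2)

    // Java: thenApply is the rough analogue of map.
    val javaF: CompletableFuture[Int] =
      CompletableFuture.supplyAsync(() => 21).thenApply((i: Int) => i * 2)

    // Crossing the boundary in either direction:
    val bridgedToScala: Future[Int] = javaF.asScala
    val bridgedToJava = scalaF.asJava // a java.util.concurrent.CompletionStage[Int]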

Scala: Passing one implicit parameter implicitly and the other explicitly. Is it possible?

Submitted by ╄→尐↘猪︶ㄣ on 2020-12-27 08:54:50
Question: Let's consider the function def foo(implicit a: Int, b: String) = println(a, b). Now, let us assume that there is an implicit String and an implicit Int (implicit val i1 = 1) in scope, but we want to pass another, non-implicit Int (val i2 = 2) explicitly to foo. How can we do that? Is it possible? Thanks for reading.

Answer 1: All I can add is:

    def foo(implicit a: Int, b: String) = println(a, b)
    implicit val i1 = 1
    implicit val s = ""
    val i2 = 2
    foo(i2, implicitly[String])

Answer 2: In case your method has many
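A self-contained version of the trick from Answer 1, sketched below; the point is that once any argument is passed explicitly, the whole implicit parameter list must be filled in, with implicitly[...] standing in for the values that should still come from scope:

    object ImplicitDemo extends App {
      def foo(implicit a: Int, b: String): Unit = println((a, b))

      implicit val i1: Int = 1
      implicit val s: String = "hello"
      val i2 = 2

      foo                         // both resolved implicitly: prints (1,hello)
      foo(i2, implicitly[String]) // Int explicit, String from scope: prints (2,hello)
    }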

Is there better way to display entire Spark SQL DataFrame?

Submitted by 自古美人都是妖i on 2020-12-27 07:57:32
Question: I would like to display an entire Apache Spark SQL DataFrame with the Scala API. I can use the show() method:

    myDataFrame.show(Int.MaxValue)

Is there a better way to display an entire DataFrame than using Int.MaxValue?

Answer 1: It is generally not advisable to display an entire DataFrame on stdout, because that means you need to pull the entire DataFrame (all of its values) to the driver (unless the DataFrame is already local, which you can check with df.isLocal). Unless you know ahead of time
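A few common alternatives, sketched below; note that every one of them still materializes all rows on the driver, so they only make sense for small DataFrames (myDataFrame is assumed to be in scope):

    // Size the show() call to the actual row count, without truncating columns.
    val n = myDataFrame.count().toInt
    myDataFrame.show(n, truncate = false)

    // Or collect to the driver and print row by row.
    myDataFrame.collect().foreach(println)

    // For genuinely large data, writing it out is safer than printing:
    myDataFrame.write.csv("/tmp/myDataFrame") // illustrative output path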