scala

Spark: Dataset Serialization

Submitted by 落花浮王杯 on 2020-12-28 23:50:08
Question: If I have a dataset, each record of which is a case class, and I persist that dataset as shown below so that serialization is used:

    myDS.persist(StorageLevel.MEMORY_ONLY_SER)

does Spark use Java/Kryo serialization to serialize the dataset? Or, as with a DataFrame, does Spark have its own way of storing the data in the dataset?

Answer 1: A Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects into Spark's internal binary format.
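Below is a minimal sketch of that distinction, assuming a running SparkSession named spark and a hypothetical case class Person; the Dataset is persisted via its implicitly derived Encoder, while persisting the underlying RDD would fall back to the configured Java/Kryo serializer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("encoder-demo").getOrCreate()
    import spark.implicits._ // derives Encoder[Person] for the case class

    val myDS = Seq(Person("a", 1), Person("b", 2)).toDS()

    // Stored in Spark's compact internal binary format via the Encoder,
    // not with Java or Kryo serialization.
    myDS.persist(StorageLevel.MEMORY_ONLY_SER)

    // By contrast, the underlying RDD of JVM objects is serialized with
    // the configured serializer (Java by default, Kryo if enabled).
    myDS.rdd.persist(StorageLevel.MEMORY_ONLY_SER)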

How to remove / dispose a broadcast variable from heap in Spark?

Submitted by 人走茶凉 on 2020-12-28 13:20:17
Question: To broadcast a variable so that it occurs exactly once in memory per node of a cluster, one can do:

    val myVarBroadcasted = sc.broadcast(myVar)

and then retrieve it in RDD transformations like so:

    myRdd.map { blar =>
      val myVarRetrieved = myVarBroadcasted.value
      // some code that uses it
    }.someAction

But suppose I now wish to perform some more actions with a new broadcast variable - what if I haven't got enough heap space because of the old broadcast variables? I want a function like
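Spark's Broadcast API does provide such functions: unpersist() drops the cached copies on the executors (the variable can be re-broadcast if it is used again), while destroy() releases it everywhere and makes any further use fail. A minimal sketch, assuming sc, myVar, and myRdd are already in scope:

    val myVarBroadcasted = sc.broadcast(myVar)

    myRdd.map { x =>
      // use the broadcast value on the executors
      (x, myVarBroadcasted.value)
    }.count()

    // Remove the executor-side copies but keep the variable reusable;
    // blocking = true waits until the memory is actually freed.
    myVarBroadcasted.unpersist(blocking = true)

    // Or release it on driver and executors alike; any later call to
    // myVarBroadcasted.value will throw.
    myVarBroadcasted.destroy()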

Difference Await.ready and Await.result

Submitted by 末鹿安然 on 2020-12-27 17:15:42
Question: I know this is quite an open-ended question and I apologize. I can see that Await.ready returns the Awaitable itself while Await.result returns a T, but I still confuse the two. What is the difference between them? Is one blocking and the other non-blocking?

Answer 1: They both block until the future completes; the difference is just their return type. The difference is useful when your Future throws an exception:

    def a = Future { Thread.sleep(2000); 100 }
    def b = Future { Thread.sleep(2000); throw new
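A self-contained sketch of that difference, assuming the global ExecutionContext: Await.ready blocks and hands back the completed Future, so a failure stays wrapped in its value, whereas Await.result blocks and unwraps, rethrowing any exception:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    val ok   = Future { Thread.sleep(100); 42 }
    val boom = Future { Thread.sleep(100); throw new IllegalStateException("boom") }

    // Await.ready blocks, then returns the future itself; inspect .value.
    Await.ready(ok, 2.seconds).value   // Some(Success(42))
    Await.ready(boom, 2.seconds).value // Some(Failure(IllegalStateException))

    // Await.result blocks, then unwraps: the value, or the rethrown exception.
    Await.result(ok, 2.seconds)        // 42
    // Await.result(boom, 2.seconds)   // would throw IllegalStateException: boom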

Scala Futures and java 8 CompletableFuture

Submitted by 岁酱吖の on 2020-12-27 17:07:47
Question: The introduction of CompletableFuture in Java 8 brought to the language features previously available in scala.concurrent.Future, such as monadic transformations. What are the differences, and why should a Scala developer prefer Scala Futures over the Java 8 CompletableFuture? Are there still good reasons to use scala.concurrent.Future from Java through the akka.dispatch bridge?

Answer 1: What are the differences, and why should a Scala developer prefer Scala Futures over the Java 8 CompletableFuture? Rephrasing
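One concrete difference worth showing: Scala's combinators each take an implicit ExecutionContext, while CompletableFuture chains through thenApply/thenCompose on its own executor. A sketch of both sides plus the bridge, assuming Scala 2.13+ for scala.jdk.FutureConverters:

    import java.util.concurrent.CompletableFuture
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.jdk.FutureConverters._ // asScala / asJava conversions

    // Scala: map/flatMap each take the (implicit) ExecutionContext.
    val scalaF: Future[Int] = Future(21).map(_ * 2)

    // Java: thenApply is the rough analogue of map.
    val javaF: CompletableFuture[Int] =
      CompletableFuture.supplyAsync(() => 21).thenApply((i: Int) => i * 2)

    // Crossing the boundary in either direction:
    val bridgedToScala: Future[Int] = javaF.asScala
    val bridgedToJava = scalaF.asJava // a java.util.concurrent.CompletionStage[Int]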

Scala: Passing one implicit parameter implicitly and the other explicitly. Is it possible?

Submitted by ╄→尐↘猪︶ㄣ on 2020-12-27 08:54:50
Question: Let's consider the function def foo(implicit a: Int, b: String) = println(a, b). Now, let us assume that there is an implicit String and an implicit Int (implicit val i1 = 1) in scope, but we want to pass another, non-implicit Int (val i2 = 2) explicitly to foo. How can we do that? Is it possible? Thanks for reading.

Answer 1: All I can add is:

    def foo(implicit a: Int, b: String) = println(a, b)
    implicit val i1 = 1
    implicit val s = ""
    val i2 = 2
    foo(i2, implicitly[String])

Answer 2: In case your method has many
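A self-contained version of the trick from Answer 1, sketched below; the point is that once any argument is passed explicitly, the whole implicit parameter list must be filled in, with implicitly[...] standing in for the values that should still come from scope:

    object ImplicitDemo extends App {
      def foo(implicit a: Int, b: String): Unit = println((a, b))

      implicit val i1: Int = 1
      implicit val s: String = "hello"
      val i2 = 2

      foo                         // both resolved implicitly: prints (1,hello)
      foo(i2, implicitly[String]) // Int explicit, String from scope: prints (2,hello)
    }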

Is there better way to display entire Spark SQL DataFrame?

Submitted by 自古美人都是妖i on 2020-12-27 07:57:32
Question: I would like to display an entire Apache Spark SQL DataFrame with the Scala API. I can use the show() method:

    myDataFrame.show(Int.MaxValue)

Is there a better way to display an entire DataFrame than using Int.MaxValue?

Answer 1: It is generally not advisable to display an entire DataFrame on stdout, because that means you need to pull the entire DataFrame (all of its values) to the driver (unless the DataFrame is already local, which you can check with df.isLocal). Unless you know ahead of time
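A few common alternatives, sketched below; note that every one of them still materializes all rows on the driver, so they only make sense for small DataFrames (myDataFrame is assumed to be in scope):

    // Size the show() call to the actual row count, without truncating columns.
    val n = myDataFrame.count().toInt
    myDataFrame.show(n, truncate = false)

    // Or collect to the driver and print row by row.
    myDataFrame.collect().foreach(println)

    // For genuinely large data, writing it out is safer than printing:
    myDataFrame.write.csv("/tmp/myDataFrame") // illustrative output path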