How to create a Spark Dataset from an RDD

被刻印的时光 ゝ 提交于 2019-12-02 22:05:45
javadba

Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a DataSet using the desired object type - in this case a LabeledPoint:

val sqlContext = new SQLContext(sc)
val pointsTrainDf =  sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]

Update Ever heard of a SparkSession ? (neither had I until now..)

So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:

Spark 2.0.0+ approaches

Notice in both of the below approaches (simpler one of which credit @zero323) we have accomplished an important savings as compared to the SQLContext approach: no longer is it necessary to first create a DataFrame.

val sparkSession =  SparkSession.builder().getOrCreate()
val pointsTrainDf =  sparkSession.createDataset(training)
val model = new LogisticRegression()
   .train(pointsTrainDs.as[LabeledPoint])

Second way for Spark 2.0.0+ Credit to @zero323

val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val trainDs = training.toDS()

Traditional Spark 1.X and earlier approach

val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training**.toDS()**

See also: How to store custom objects in Dataset? by the esteemed @zero323 .

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!