Question
I have a dataset of 10 fields loaded into a DataFrame. Is it possible to perform RDD operations such as map, flatMap, etc. on it?
Here is my sample code:
df.select("COUNTY","VEHICLES").show();
This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations to the new RDD.
Here is the code showing how I converted the DataFrame to an RDD:
RDD<Row> java = df.select("COUNTY","VEHICLES").rdd();
After converting to an RDD, I am not able to see the RDD results. I tried:
java.collect();
java.take(10);
java.foreach();
In all of the above cases I failed to get results. Please help me out.
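The conversion the question attempts can be sketched as follows (a minimal, untested outline assuming a SparkSession is already running and `df` is the DataFrame from the question; the key point is that `df.rdd` yields an `RDD[Row]`, so the values have to be extracted from each `Row` before they are readable):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Select the two columns, then drop down to the RDD API
val rowRdd: RDD[Row] = df.select("COUNTY", "VEHICLES").rdd

// collect() returns Array[Row]; print each row's fields explicitly
rowRdd.collect().foreach { row =>
  println(s"${row.getAs[String]("COUNTY")} -> ${row.getAs[String]("VEHICLES")}")
}
```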
Answer 1:
val myRdd: RDD[String] = ds.rdd
Check the Spark API documentation for Dataset-to-RDD conversion:
lazy val rdd: RDD[T]
In your case, create the DataFrame with the selected records by performing select, and after that call .rdd; it will convert it to an RDD.
Answer 2:
Since Spark 2.0 you can convert a DataFrame to a Dataset using the toDS function in order to use RDD operations.
I recommend this great article about mastering Spark 2.0.
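A minimal sketch of that conversion (the case class Vehicle and the sample values are hypothetical stand-ins for the question's schema; note that a DataFrame is turned into a typed Dataset with `as[T]`, while `toDS` builds a Dataset from a local Seq or an RDD via the spark.implicits import):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema for the COUNTY/VEHICLES data
case class Vehicle(county: String, vehicles: String)

val spark = SparkSession.builder()
  .appName("df-to-rdd")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A local Seq becomes a typed Dataset via toDS
val ds = Seq(Vehicle("Kings", "200"), Vehicle("Queens", "300")).toDS()

// A DataFrame becomes a typed Dataset via as[T]
val typed = ds.toDF().as[Vehicle]

// RDD operations (map, flatMap, ...) are then available on .rdd
val counties = typed.rdd.map(_.county)
```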
Answer 3:
For Spark 1.6 :
You won't be able to see the results, because when you convert a DataFrame to an RDD, what it does is convert it into an RDD[Row].
Hence, when you try any of these:
java.collect();
java.take(10);
java.foreach();
the result is based on Array[Row], from which you cannot read the column values directly.
Solution:
You can convert each Row to its respective values and get an RDD of those values, like this:
val newDF = df.select("COUNTY", "VEHICLES")
val resultantRDD = newDF.rdd.map { row =>
  val county = row.getAs[String]("COUNTY")
  val vehicles = row.getAs[String]("VEHICLES")
  (county, vehicles)
}
And now you can apply the foreach and collect functions to get the values.
P.S.: The code is written in Scala, but you can get the essence of what I am trying to do!
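For instance, continuing from the snippet above (a sketch; it assumes resultantRDD as defined in this answer):

```scala
// collect() now yields Array[(String, String)] holding plain values,
// so the results are directly printable
resultantRDD.collect().foreach { case (county, vehicles) =>
  println(s"$county: $vehicles")
}
```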
Answer 4:
Try persisting the RDD before reading the data from it:
val finalRdd = mbnfinal.rdd
finalRdd.cache()
finalRdd.count()
Source: https://stackoverflow.com/questions/41137198/perform-rdd-operations-on-dataframes