问题
I am new to scala and spark and now I have two RDD like A is [(1,2),(2,3)] and B is [(4,5),(5,6)] and I want to get RDD like [(1,2),(2,3),(4,5),(5,6)]. But thing is my data is large, suppose both A and B is 10GB. I use sc.union(A,B) but it is slow. I saw in spark UI there are 28308 tasks in this stage.
Is there more efficient way to do this?
回答1:
Why don't you convert the two RDDs
to dataframes
and use union
function.
Converting to dataframe
is easy you just need to import sqlContext.implicits._
and apply .toDF()
function with header names
.
for example:
val sparkSession = SparkSession.builder().appName("testings").master("local").config("", "").getOrCreate()
val sqlContext = sparkSession.sqlContext
var firstTableColumns = Seq("col1", "col2")
var secondTableColumns = Seq("col3", "col4")
import sqlContext.implicits._
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns:_*)
var secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)) .toDF(secondTableColumns: _*)
firstDF = firstDF.union(secondDF)
It should be very easy for you to work with dataframes
than with RDDs
. Changing dataframe
to RDD
is quite easy too, just call .rdd
function
val rddData = firstDF.rdd
来源:https://stackoverflow.com/questions/43554601/efficiently-using-union-in-spark