Efficiently using union in Spark

Question

I am new to Scala and Spark. I have two RDDs: A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose A and B are each 10 GB. I use sc.union(A,B), but it is slow. The Spark UI shows 28308 tasks in this stage.

Is there a more efficient way to do this?

Answer 1:

Why not convert the two RDDs to DataFrames and use the union function?
Converting to a DataFrame is easy: import sqlContext.implicits._ and call .toDF() with the column names.
For example:

    import org.apache.spark.sql.SparkSession

    // Local session for demonstration only
    val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()

    val sqlContext = sparkSession.sqlContext

    val firstTableColumns = Seq("col1", "col2")
    val secondTableColumns = Seq("col3", "col4")

    // Brings .toDF() into scope
    import sqlContext.implicits._

    var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)

    val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

    // union matches columns by position, so the result keeps the names col1 and col2
    firstDF = firstDF.union(secondDF)
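The example above builds its DataFrames from local Seq values, while the question starts from two existing RDDs. A minimal sketch of the same idea applied to RDDs of pairs might look like the following. The object name `RddToDataFrameSketch`, the method name `unionAsDataFrame`, and the column names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark: converts two pair RDDs to
// DataFrames and unions them.
object RddToDataFrameSketch {
  def unionAsDataFrame(
      spark: SparkSession,
      a: RDD[(Int, Int)],
      b: RDD[(Int, Int)]): DataFrame = {
    // Brings the RDD-to-DataFrame conversion (.toDF) into scope
    import spark.implicits._

    // Column names are assumptions; union matches columns by position
    val aDF = a.toDF("col1", "col2")
    val bDF = b.toDF("col1", "col2")
    aDF.union(bDF)
  }
}
```

Calling `RddToDataFrameSketch.unionAsDataFrame(spark, A, B)` with the RDDs from the question would return a single DataFrame with the rows of both inputs; duplicates are kept, as with the RDD union.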

DataFrames should be easier for you to work with than RDDs. Converting a DataFrame back to an RDD is easy too: just call .rdd

    val rddData = firstDF.rdd
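On the task count mentioned in the question: an RDD union does not shuffle data. When the inputs do not share a partitioner, it simply concatenates their partitions, so the result has as many partitions (and therefore tasks) as both inputs combined. That may be part of why the stage shows so many tasks. As a sketch under that assumption, the partition count can be reduced after the union with `coalesce`. The object name `UnionPartitionsSketch`, the method name `unionAndCoalesce`, and the `targetPartitions` parameter are hypothetical, and a suitable target depends on the cluster and data size.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark: unions two pair RDDs and
// merges the resulting partitions down to a smaller number.
object UnionPartitionsSketch {
  def unionAndCoalesce(
      a: RDD[(Int, Int)],
      b: RDD[(Int, Int)],
      targetPartitions: Int): RDD[(Int, Int)] = {
    // union keeps every partition of both inputs (no shuffle)
    val merged = a.union(b)
    // coalesce merges partitions without a shuffle when reducing the count
    merged.coalesce(targetPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("union-sketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val a = sc.parallelize(Seq((1, 2), (2, 3)), numSlices = 4)
    val b = sc.parallelize(Seq((4, 5), (5, 6)), numSlices = 4)

    // Partition counts before and after coalescing
    println(a.union(b).getNumPartitions)
    println(unionAndCoalesce(a, b, 2).getNumPartitions)

    spark.stop()
  }
}
```

Whether fewer, larger partitions actually help depends on the workload; checking `getNumPartitions` on A and B first shows how much of the task count comes from the inputs themselves.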

Source: https://stackoverflow.com/questions/43554601/efficiently-using-union-in-spark

Tags

scala

apache-spark

rdd
