Efficiently using union in Spark
Question: I am new to Scala and Spark. I have two RDDs: A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose A and B are each 10 GB. I use sc.union(A, B), but it is slow, and in the Spark UI I see 28308 tasks in this stage. Is there a more efficient way to do this?

Answer 1: Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy; you just need to import sqlContext.implicits._.
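The answer's suggestion can be sketched as follows. This is a minimal example, assuming a Spark 1.x-style SQLContext (as implied by the answer's mention of sqlContext); the variable names rddA/rddB and the column names "k"/"v" are illustrative, not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("union-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables .toDF on RDDs of tuples

    // Small stand-ins for the 10 GB RDDs in the question
    val rddA = sc.parallelize(Seq((1, 2), (2, 3)))
    val rddB = sc.parallelize(Seq((4, 5), (5, 6)))

    // Plain RDD union, as in the question: the result carries
    // all partitions of both inputs
    val unionedRdd = rddA.union(rddB)

    // The DataFrame route suggested in the answer
    val dfA = rddA.toDF("k", "v")
    val dfB = rddB.toDF("k", "v")
    val dfUnion = dfA.unionAll(dfB) // Spark 1.x; use .union in Spark 2.x+

    dfUnion.show()
    sc.stop()
  }
}
```

Note that dfUnion concatenates rows positionally by column, so both DataFrames must share the same schema.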