14. zip operation
Apply the zip operation to an RDD holding the numbers 1 to 3 and an RDD holding the letters A to C, combining them into a single new RDD.
scala> val rddData1 = sc.parallelize(1 to 10,5)
rddData1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> val rddData2 = rddData1.glom
rddData2: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[33] at glom at <console>:26
scala> rddData2.collect
res13: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4), Array(5, 6), Array(7, 8), Array(9, 10))
scala> val rddData1 = sc.parallelize(1 to 3, 2)
rddData1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24
scala> val rddData2 = sc.parallelize(Array("A","B","C"),2)
rddData2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[35] at parallelize at <console>:24
scala> val rddData3 = rddData1.zip(rddData2)
rddData3: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[36] at zip at <console>:28
scala> rddData3.collect
res14: Array[(Int, String)] = Array((1,A), (2,B), (3,C))
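Notes:
The zip operation pairs the elements of two RDDs by position and returns them as key-value pairs: the i-th element of the first RDD becomes the key, and the i-th element of the second RDD becomes the value.
When using zip, the two RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown.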
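The constraint above can be checked with a short sketch. It assumes the `sc` SparkContext that spark-shell provides; the names `nums`, `lettersThreeParts`, `fourLetters`, and `paired` are illustrative only and are not part of the original example. The last part shows one possible workaround (indexing both RDDs with `zipWithIndex` and joining on the index), which is a different technique from `zip` and involves a shuffle.

```scala
import scala.util.Try

// Assumption: `sc` is the SparkContext created by spark-shell.
val nums = sc.parallelize(1 to 3, 2)

// Case 1: different partition counts (2 vs 3).
// The error surfaces when the job runs, so the action is wrapped in Try.
val lettersThreeParts = sc.parallelize(Seq("A", "B", "C"), 3)
val unequalPartitions = Try(nums.zip(lettersThreeParts).collect())

// Case 2: same partition count (2), but 3 elements vs 4 elements,
// so the per-partition element counts do not line up.
val fourLetters = sc.parallelize(Seq("A", "B", "C", "D"), 2)
val unequalElements = Try(nums.zip(fourLetters).collect())

println(unequalPartitions.isFailure) // true
println(unequalElements.isFailure)   // true

// Possible workaround when the partitioning differs:
// key each element by its position, then join on that position.
val numsByIndex = nums.zipWithIndex().map { case (n, i) => (i, n) }
val lettersByIndex = lettersThreeParts.zipWithIndex().map { case (s, i) => (i, s) }
val paired = numsByIndex.join(lettersByIndex).sortByKey().values.collect()
```

Here `paired` should hold the same pairs as the zip example, (1,A), (2,B), (3,C), even though the two input RDDs have different partition counts. The join only keeps positions present in both RDDs, so unlike zip it silently drops unmatched elements instead of failing.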
Source: CSDN
Author: 钟兴宇
Link: https://blog.csdn.net/weixin_43744732/article/details/104113837