How to print elements of particular RDD partition in Spark?


Question


How can I print the elements of a particular partition, say the 5th, alone?

val distData = sc.parallelize(1 to 50, 10)

Answer 1:


Using Spark/Scala:

val data = 1 to 50
val distData = sc.parallelize(data, 10)

// For each partition, print its elements only when the partition
// index matches; returning an empty iterator keeps the result RDD empty.
distData.mapPartitionsWithIndex { (index, it) =>
  if (index == 5) it.foreach(println)
  Iterator.empty
}.collect()

produces:

26
27
28
29
30
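
Partition indices are zero-based, so index == 5 is actually the sixth partition (use index == 4 for the fifth). If you want the elements back on the driver instead of printed from inside the task, a minimal variant of the same idea (a sketch against the same distData) keeps only the matching partition:

// Return the matching partition's elements and drop all others;
// collect() then brings just those elements to the driver.
val partitionElems = distData.mapPartitionsWithIndex { (index, it) =>
  if (index == 5) it else Iterator.empty
}.collect()

partitionElems.foreach(println)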



Answer 2:


You could use a counter together with the foreachPartition() API to achieve this.

Here is a Java program that prints the content of each partition:

    import java.util.Arrays;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.VoidFunction;

    JavaSparkContext context = new JavaSparkContext(conf);

    JavaRDD<Integer> myArray = context.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
    JavaRDD<Integer> partitionedArray = myArray.repartition(2);

    System.out.println("partitioned array size is " + partitionedArray.count());
    partitionedArray.foreachPartition(new VoidFunction<Iterator<Integer>>() {
        public void call(Iterator<Integer> arg0) throws Exception {
            while (arg0.hasNext()) {
                System.out.println(arg0.next());
            }
        }
    });
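
The above prints every partition. To restrict the printing to a single partition, a minimal sketch in Spark/Scala (using TaskContext, which exposes the current task's partition id; distData is the RDD from the question) would be:

import org.apache.spark.TaskContext

// Print only the chosen partition. The output appears on the driver
// console in local mode, but in the executor logs on a cluster.
distData.foreachPartition { it =>
  if (TaskContext.get().partitionId() == 5) it.foreach(println)
}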



Answer 3:


Assuming you are doing this just for testing purposes, use glom(). See the Spark documentation: https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.glom

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
>>> rdd.glom().collect()[1]
[3, 4]

Edit: Example in Scala:

scala> val distData = sc.parallelize(1 to 50, 10)
scala> distData.glom().collect()(4)
res2: Array[Int] = Array(21, 22, 23, 24, 25)
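
Note that glom().collect() ships every partition to the driver, which can be expensive for a large RDD. One way to pull back only the partition you care about (a sketch; glom yields one array per partition, and zipWithIndex numbers elements in partition order):

// Each glommed element is one partition's contents, so the
// element index equals the partition index.
val fifthPartition = distData.glom()
  .zipWithIndex()
  .filter { case (_, idx) => idx == 4 }
  .map { case (part, _) => part }
  .collect()
  .head

fifthPartition.foreach(println)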


Source: https://stackoverflow.com/questions/30077425/how-to-print-elements-of-particular-rdd-partition-in-spark
