Initialize an RDD to empty


Question


I have an RDD called

JavaPairRDD<String, List<String>> existingRDD; 

Now I need to initialize this existingRDD to empty so that when I get the actual RDDs I can do a union with this existingRDD. How do I initialize existingRDD to an empty RDD, other than initializing it to null? Here is my code:

JavaPairRDD<String, List<String>> existingRDD;
if(ai.get()%10==0)
{
    existingRDD.saveAsNewAPIHadoopFile("s3://manthan-impala-test/kinesis-dump/" + startTime + "/" + k + "/" + System.currentTimeMillis() + "/",
    NullWritable.class, Text.class, TextOutputFormat.class); //on worker failure this will get overwritten                                  
}
else
{
    existingRDD.union(rdd);
}
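
One way to phrase the intended pattern (a minimal sketch; jsc stands for the JavaSparkContext and rdd for an incoming batch) is to seed existingRDD with an empty pair RDD and reassign on each step, since union returns a new RDD rather than modifying the one it is called on:

// Seed the accumulator with an empty pair RDD (zero elements).
JavaPairRDD<String, List<String>> existingRDD =
        jsc.parallelizePairs(java.util.Collections.<scala.Tuple2<String, List<String>>>emptyList());
...
// union does not mutate in place; the result must be reassigned.
existingRDD = existingRDD.union(rdd);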

Answer 1:


To create an empty RDD in Java, you just need to do the following:

// Get an RDD that has no partitions or elements.
JavaSparkContext jsc;
...
JavaRDD<T> emptyRDD = jsc.emptyRDD();

I trust you know how to use generics; otherwise, for your case, you'll need:

JavaRDD<Tuple2<String,List<String>>> emptyRDD = jsc.emptyRDD();
JavaPairRDD<String,List<String>> emptyPairRDD = JavaPairRDD.fromJavaRDD(
  emptyRDD
);

You can also use the mapToPair method to convert your JavaRDD to a JavaPairRDD.
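
For example, a rough sketch of that conversion, reusing the emptyRDD declared above (assumes import scala.Tuple2):

// mapToPair turns each Tuple2 element into a (key, value) entry of a JavaPairRDD.
JavaPairRDD<String, List<String>> viaMapToPair =
        emptyRDD.mapToPair(t -> new Tuple2<>(t._1(), t._2()));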

In Scala:

val sc: SparkContext = ???
... 
val emptyRDD = sc.emptyRDD
// emptyRDD: org.apache.spark.rdd.EmptyRDD[Nothing] = EmptyRDD[1] at ...



Answer 2:


val emptyRdd=sc.emptyRDD[String]

The statement above creates an empty RDD of type String.

From the SparkContext class:

Get an RDD that has no partitions or elements

def emptyRDD[T: ClassTag]: EmptyRDD[T] = new EmptyRDD[T](this)



Answer 3:


In Scala, I used the parallelize method.

val emptyRDD = sc.parallelize(Seq.empty[String])



Answer 4:


@eliasah's answer is very useful, so I am providing code to create an empty pair RDD. Consider a scenario in which you need to create an empty pair RDD (key, value). The following Scala code illustrates how to create an empty pair RDD with a String key and an Int value.

type pairRDD = (String,Int)
var resultRDD = sparkContext.emptyRDD[pairRDD]

The RDD would be created as follows:

resultRDD: org.apache.spark.rdd.EmptyRDD[(String, Int)] = EmptyRDD[0] at emptyRDD at <console>:29



Answer 5:


In Java, creating the empty RDD was a little more complex. I tried using scala.reflect.ClassTag, but it did not work either. After many tests, the code that worked was even simpler.

private JavaRDD<Foo> getEmptyJavaRdd() {

    /* this code does not compile, because emptyRDD requires a <T> type parameter */
    //    JavaRDD<Foo> emptyRDD = sparkContext.emptyRDD();
    //    return emptyRDD;

    /* this should be the solution that tries to emulate the Scala <T> */
    /* but I could not make it work either */
    //    ClassTag<Foo> tag = scala.reflect.ClassTag$.MODULE$.apply(Foo.class);
    //    return sparkContext.emptyRDD(tag);

    /* this alternative worked with Java 8: parallelize an empty list */
    return sparkContext.parallelize(
            java.util.Arrays.asList()
    );
}



Answer 6:


In Java, create an empty pair RDD as follows:

JavaPairRDD<T, T> emptyPairRDD = JavaPairRDD.fromJavaRDD(jsc.emptyRDD());
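
With the question's concrete types, that might look like the following sketch (assuming jsc is a JavaSparkContext and scala.Tuple2 is imported):

JavaPairRDD<String, List<String>> emptyPairs =
        JavaPairRDD.fromJavaRDD(jsc.<Tuple2<String, List<String>>>emptyRDD());
System.out.println(emptyPairs.count()); // 0: no partitions, no elements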


Source: https://stackoverflow.com/questions/33472829/initialize-an-rdd-to-empty
