Calling distinct and map together throws an NPE in the Spark library

醉酒成梦 2020-11-30 12:44

I am unsure whether this is a bug, but if you do something like this:

// d:spark.RDD[String]
d.distinct().map(x => d.filter(_.equals(x)))

you will get a NullPointerException.

2 Answers
  •  执念已碎
    2020-11-30 13:45

    Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list.
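
If you really do need one RDD per distinct value, one possible workaround (a sketch, not something taken from the mailing-list thread) is to bring the distinct values back to the driver first and build the filtered RDDs there. As far as I understand it, the original code fails because the function passed to map runs inside a Spark task, where d cannot be used. The sketch below assumes a current Spark release with the org.apache.spark package names rather than the older spark.RDD shown in the question; the object and the helpers splitByValue and countsPerValue are hypothetical names used only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NestedRddWorkaround {
  // Hypothetical helper: builds one filtered RDD per distinct value of d.
  def splitByValue(d: RDD[String]): Map[String, RDD[String]] = {
    // collect() returns the distinct values to the driver as a local Array.
    val keys: Array[String] = d.distinct().collect()
    // This map is a plain Scala collection operation running on the driver,
    // so calling d.filter here is fine; the filter closure only captures a String.
    keys.map(k => k -> d.filter(_ == k)).toMap
  }

  // Hypothetical helper: counts the elements of each per-value RDD (one job per value).
  def countsPerValue(d: RDD[String]): Map[String, Long] =
    splitByValue(d).map { case (k, rdd) => k -> rdd.count() }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for trying the sketch out.
    val conf = new SparkConf().setAppName("nested-rdd-workaround").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val d = sc.parallelize(Seq("Hello", "World", "Hello"))
      countsPerValue(d).foreach { case (k, n) => println(s"$k -> $n") }
    } finally sc.stop()
  }
}
```

Note that this runs a separate Spark job for every distinct value, so it only makes sense when there are few of them.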

    It looks like your current code is trying to group the elements of d by value; you can do this efficiently with the groupBy() RDD method:

    scala> val d = sc.parallelize(Seq("Hello", "World", "Hello"))
    d: spark.RDD[java.lang.String] = spark.ParallelCollection@55c0c66a
    
    scala> d.groupBy(x => x).collect()
    res6: Array[(java.lang.String, Seq[java.lang.String])] = Array((World,ArrayBuffer(World)), (Hello,ArrayBuffer(Hello, Hello)))
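
For reference, here is roughly how the same grouping might look as a standalone program rather than a shell session. This is a sketch that assumes a current Spark release (org.apache.spark packages, where groupBy yields an Iterable per key instead of the Seq in the transcript above); GroupByValue, groupEqual and countEqual are hypothetical names. The countEqual variant swaps in reduceByKey, which is a different technique from groupBy and is only useful if you need the number of occurrences rather than the grouped elements themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByValue {
  // Groups equal elements together without any nested RDDs.
  def groupEqual(d: RDD[String]): Map[String, List[String]] =
    d.groupBy(x => x).mapValues(_.toList).collect().toMap

  // Alternative when only counts are needed: pair each element with 1 and sum per key.
  def countEqual(d: RDD[String]): Map[String, Long] =
    d.map(x => (x, 1L)).reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for trying the sketch out.
    val conf = new SparkConf().setAppName("group-by-value").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val d = sc.parallelize(Seq("Hello", "World", "Hello"))
      println(groupEqual(d))
      println(countEqual(d))
    } finally sc.stop()
  }
}
```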
    
