How to use byte array as key in RDD?

Posted by 十年热恋 on 2020-01-24 20:14:47

Question


I want to use Array[Byte] as the key of an RDD. For example:

val rdd1: RDD[(Array[Byte], (String, Int))] = ??? // from src rdd
val rdd2: RDD[(Array[Byte], (String, Int))] = ??? // from dest rdd

val resultRdd = rdd1.join(rdd2)

I want to join rdd1 and rdd2 using Array[Byte] as the key, but resultRdd.count() is always 0.

So I tried wrapping the Array[Byte] in a serializable class. That works as expected, but only for a small dataset.

val serRdd1= rdd1.map { case (k,v) =>  (new SerByteArr(k), v) }
val serRdd2= rdd2.map { case (k,v) =>  (new SerByteArr(k), v) }

class SerByteArr(val bytes: Array[Byte]) extends Serializable {
   override val hashCode = bytes.deep.hashCode
   override def equals(obj:Any) = obj.isInstanceOf[SerByteArr] && obj.asInstanceOf[SerByteArr].bytes.deep == this.bytes.deep
 }

For a large dataset, I get java.lang.OutOfMemoryError: GC overhead limit exceeded. The problem occurs when creating the wrapper objects (new SerByteArr(k)).

How can I avoid the GC overhead limit exceeded error? Can anyone help me?


Answer 1:


You can use a built-in Scala wrapper for arrays, WrappedArray[Byte]. An array can be converted to a WrappedArray with the toSeq method. Plain arrays are compared by reference, whereas WrappedArray implements equals and hashCode based on its elements, so two different arrays with the same elements are considered equal once wrapped.

scala> val a = Array(1,2,3,4,5)
a: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = Array(1,2,3,4,5)
b: Array[Int] = Array(1, 2, 3, 4, 5)

scala> a == b
res0: Boolean = false

scala> a.toSeq
res1: Seq[Int] = WrappedArray(1, 2, 3, 4, 5)

scala> a.toSeq == b.toSeq
res2: Boolean = true
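
Applied to the keyed pairs from the question, the same conversion would look like rdd1.map { case (k, v) => (k.toSeq, v) } on both sides before the join. The following is a minimal sketch of that idea which runs without Spark: joinByKey is a hypothetical helper that stands in for RDD.join by doing a hash lookup over in-memory pairs, and the sample data is made up for illustration. Note that on Scala 2.13 and later, toSeq returns an immutable ArraySeq (a copy) rather than a WrappedArray; it also compares by elements, so the sketch should behave the same way.

```scala
object ByteArrayKeyDemo {
  // Hypothetical stand-in for RDD.join: index the left side by key,
  // then probe that index with every key from the right side.
  def joinByKey[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val index: Map[K, Seq[A]] =
      left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    for {
      (k, b) <- right
      a <- index.getOrElse(k, Seq.empty)
    } yield (k, (a, b))
  }

  // Made-up sample data shaped like the question's (Array[Byte], (String, Int)) pairs.
  val src: Seq[(Array[Byte], (String, Int))] =
    Seq((Array[Byte](1, 2, 3), ("a", 1)), (Array[Byte](4, 5), ("b", 2)))
  val dest: Seq[(Array[Byte], (String, Int))] =
    Seq((Array[Byte](1, 2, 3), ("x", 10)))

  // Raw arrays as keys: equality is by reference, so nothing matches.
  val rawMatches = joinByKey(src, dest)

  // Keys converted with toSeq: equality is by elements, so equal bytes match.
  val wrappedMatches = joinByKey(
    src.map { case (k, v) => (k.toSeq, v) },
    dest.map { case (k, v) => (k.toSeq, v) }
  )

  def main(args: Array[String]): Unit = {
    println(s"matches with Array[Byte] keys: ${rawMatches.size}")
    println(s"matches with toSeq keys: ${wrappedMatches.size}")
  }
}
```

With raw arrays the lookup finds no matches, which mirrors the empty join result described in the question; with toSeq keys the one pair of equal byte sequences is matched.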


Source: https://stackoverflow.com/questions/39754445/how-to-use-byte-array-as-key-in-rdd
