Why doesn't Spark allow map-side combining with array keys?


Question


I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow the use of array keys in map-side combining. Here is a piece of the combineByKey function:

if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
}
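For reference, this check fires for any shuffle operation that combines on the map side (reduceByKey, aggregateByKey, etc.) when the keys are arrays; a minimal spark-shell sketch (session output abridged and illustrative):

scala> val rdd = sc.parallelize(Seq((Array(1), 1), (Array(1), 2)))
rdd: org.apache.spark.rdd.RDD[(Array[Int], Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> rdd.reduceByKey(_ + _)
org.apache.spark.SparkException: Cannot use map-side combining with array keys.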

Answer 1:


Basically, for the same reason the default partitioner cannot partition array keys.

A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on the contents:

scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)

scala> val h = x.hashCode
h: Int = 630226932

scala> x(0) = -1

scala> x.hashCode() == h
res3: Boolean = true

Equality is reference-based as well, so two arrays with exactly the same contents are not equal:

scala> x
res4: Array[Int] = Array(-1, 2, 3)

scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)

scala> y == x
res5: Boolean = false
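Content-based comparison is available, but only through explicit helpers such as sameElements (or deep in Scala 2.x); for example:

scala> x.sameElements(y)
res6: Boolean = true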

As a result, arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key in a Scala Map:

scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
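Lookups behave accordingly; a fresh array with the same contents won't find either entry (continuing the session above, output illustrative):

scala> res7.get(Array(1))
res8: Option[Int] = None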

If you want to use a collection as a key, you should use an immutable data structure with structural equality, like a Vector or a List.

scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
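The same conversion works for Spark pair RDDs; a minimal sketch, assuming a spark-shell session with a SparkContext named sc (output illustrative):

scala> sc.parallelize(Seq((Array(1), 1), (Array(1), 2))).map {
     |   case (k, v) => (k.toVector, v)
     | }.reduceByKey(_ + _).collect()
res16: Array[(Vector[Int], Int)] = Array((Vector(1),3))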

See also:

  • SI-1607
  • How does HashPartitioner work?
  • A list as a key for PySpark's reduceByKey


Source: https://stackoverflow.com/questions/32698428/why-spark-doesnt-allow-map-side-combining-with-array-keys
