How to partition an RDD by key in Spark?


How about just doing a groupByKey using kind? Or another PairRDDFunctions method?

It sounds to me like you don't really care about the partitioning itself, just that you get all records of a specific kind in one processing flow?

The pair functions allow this:

rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition(...)
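
A minimal runnable sketch of that pattern follows; the DeviceData case class, its fields, and the partition count are stand-ins for whatever your actual data looks like:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Stand-in for your actual type; only the kind field matters here.
case class DeviceData(kind: String, value: Double)

val sc = new SparkContext(
  new SparkConf().setAppName("partition-by-kind").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(
  DeviceData("thermometer", 21.5),
  DeviceData("hygrometer", 0.43),
  DeviceData("thermometer", 19.8)
))

val numPartitions = 4  // stand-in for PARTITIONS above

rdd.keyBy(_.kind)
  .partitionBy(new HashPartitioner(numPartitions))
  .foreachPartition { iter =>
    // All records of a given kind land in the same partition,
    // though one partition may hold several kinds.
    iter.foreach { case (kind, data) => println(s"$kind -> $data") }
  }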

However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(....)

or mapValues, or a number of the other pair functions that guarantee you get all the pieces for a key together.
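
For instance, a per-kind count (illustrative, reusing the rdd of DeviceData from the sketch above):

// reduceByKey shuffles so that all values for a kind are combined,
// regardless of which partition the records started out on.
val countPerKind = rdd.keyBy(_.kind)
  .mapValues(_ => 1)
  .reduceByKey(_ + _)

countPerKind.collect().foreach { case (kind, n) => println(s"$kind: $n") }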

zero323

Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?

It wouldn't be. If you take a look at the Java Object.hashCode documentation you'll find the following information about the general contract of hashCode:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

So unless a notion of equality based purely on the kind of device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but here it is not required.
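
For completeness, a custom partitioner would look roughly like the sketch below; again, this is illustrative and, as said, not required here:

import org.apache.spark.Partitioner

// Routes each record by the hash of its String key (the kind) alone.
class KindPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    // floorMod keeps the result non-negative even for negative hash codes
    case kind: String => java.lang.Math.floorMod(kind.hashCode, numPartitions)
    case _            => 0
  }
}

// Usage: rdd.keyBy(_.kind).partitionBy(new KindPartitioner(4))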

Since, excluding specialized scenarios in SQL and GraphX, partitionBy is valid only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:

deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))

Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
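
One way to sanity-check that concern (illustrative; deviceDataRdd and n are from the snippet above):

// import org.apache.spark.HashPartitioner
val partitioned = deviceDataRdd.map(dev => (dev.kind, dev))
  .partitionBy(new HashPartitioner(n))

// The partitioner is now carried by the RDD and reused by later
// co-partitioned operations such as joins.
partitioned.partitioner.foreach(p => println(s"partitions: ${p.numPartitions}"))

// Records per partition; with few or skewed kinds expect empty
// and overloaded partitions side by side.
partitioned.mapPartitionsWithIndex { (i, iter) =>
  Iterator((i, iter.size))
}.collect().foreach { case (i, size) => println(s"partition $i: $size records") }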
