How to partition an RDD by key in Spark?


How about just doing a groupByKey using kind? Or another PairRDDFunctions method?

It sounds to me like you don't really care about the partitioning itself, just that you get all records of a specific kind in one processing flow?

The pair functions allow this:

rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition(...)
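
A minimal runnable sketch of that pattern follows; the DeviceData case class, its fields, and the partition count are stand-ins for whatever your actual data looks like:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Stand-in for your actual type; only the kind field matters here.
case class DeviceData(kind: String, value: Double)

val sc = new SparkContext(
  new SparkConf().setAppName("partition-by-kind").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(
  DeviceData("thermometer", 21.5),
  DeviceData("hygrometer", 0.43),
  DeviceData("thermometer", 19.8)
))

val numPartitions = 4  // stand-in for PARTITIONS above

rdd.keyBy(_.kind)
  .partitionBy(new HashPartitioner(numPartitions))
  .foreachPartition { iter =>
    // All records of a given kind land in the same partition,
    // though one partition may hold several kinds.
    iter.foreach { case (kind, data) => println(s"$kind -> $data") }
  }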

However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(....)

or mapValues, or a number of the other pair functions that guarantee you get all the pieces for a key together.
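
For instance, a per-kind count (illustrative, reusing the rdd of DeviceData from the sketch above):

// reduceByKey shuffles so that all values for a kind are combined,
// regardless of which partition the records started out on.
val countPerKind = rdd.keyBy(_.kind)
  .mapValues(_ => 1)
  .reduceByKey(_ + _)

countPerKind.collect().foreach { case (kind, n) => println(s"$kind: $n") }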

zero323

Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?

It wouldn't be. If you take a look at the Java Object.hashCode documentation you'll find the following information about the general contract of hashCode:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

So unless a notion of equality based purely on the kind of device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but here it is not required.
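
For completeness, a custom partitioner would look roughly like the sketch below; again, this is illustrative and, as said, not required here:

import org.apache.spark.Partitioner

// Routes each record by the hash of its String key (the kind) alone.
class KindPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    // floorMod keeps the result non-negative even for negative hash codes
    case kind: String => java.lang.Math.floorMod(kind.hashCode, numPartitions)
    case _            => 0
  }
}

// Usage: rdd.keyBy(_.kind).partitionBy(new KindPartitioner(4))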

Since, excluding specialized scenarios in SQL and GraphX, partitionBy is valid only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:

deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))

Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
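
One way to sanity-check that concern (illustrative; deviceDataRdd and n are from the snippet above):

// import org.apache.spark.HashPartitioner
val partitioned = deviceDataRdd.map(dev => (dev.kind, dev))
  .partitionBy(new HashPartitioner(n))

// The partitioner is now carried by the RDD and reused by later
// co-partitioned operations such as joins.
partitioned.partitioner.foreach(p => println(s"partitions: ${p.numPartitions}"))

// Records per partition; with few or skewed kinds expect empty
// and overloaded partitions side by side.
partitioned.mapPartitionsWithIndex { (i, iter) =>
  Iterator((i, iter.size))
}.collect().foreach { case (i, size) => println(s"partition $i: $size records") }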
