How to assign unique contiguous numbers to elements in a Spark RDD

無奈伤痛 2020-12-04 14:00

I have a dataset of (user, product, review), and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numbers, whil

5 answers
  •  执笔经年
    2020-12-04 15:01

    People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.

    However, in my experience (caveat - Spark 1.6), if you use it on a single partition (repartition to 1 first), no partition prefix ends up in the upper bits, so the IDs are simply 0, 1, 2, ... and can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows, and repartitioning to 1 pushes all the data through a single task.
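
To see why the single-partition trick works, here is a minimal sketch in plain Python (no Spark needed). It mimics the layout the Spark documentation describes for monotonically_increasing_id(): partition ID in the upper 31 bits, record number within the partition in the lower 33 bits. The names `simulated_id` and `fits_in_int` are made up for this illustration and are not part of any Spark API.

```python
# Illustration only: mimics the documented bit layout of
# monotonically_increasing_id(); this is not Spark's own code.
INT_MAX = 2**31 - 1  # same value as Integer.MAX_VALUE on the JVM


def simulated_id(partition_index: int, row_in_partition: int) -> int:
    """Partition index in the upper 31 bits, row number in the lower 33 bits."""
    return (partition_index << 33) | row_in_partition


def fits_in_int(value: int) -> bool:
    """True if the value can be cast to a 32-bit signed Int without loss."""
    return 0 <= value <= INT_MAX


# Partition 0: the prefix is zero, so the IDs are just the row numbers.
single = [simulated_id(0, r) for r in range(5)]
# Partition 1: every ID starts at 2**33, far outside the Int range.
second = [simulated_id(1, r) for r in range(3)]

print(single)  # [0, 1, 2, 3, 4]
print(second)  # [8589934592, 8589934593, 8589934594]
print(all(fits_in_int(v) for v in single))  # True
print(any(fits_in_int(v) for v in second))  # False
```

In PySpark the corresponding call would look roughly like `df.repartition(1).withColumn("id", monotonically_increasing_id().cast("int"))`, where `df` is a hypothetical DataFrame. I have only seen this behave as described on Spark 1.6, so check the resulting IDs on your own version before relying on it.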
