I have a dataset of (user, product, review), and want to feed it into MLlib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are not.
People have already recommended monotonically_increasing_id(), and mentioned the problem that it creates Longs, not Ints.
However, in my experience (caveat: Spark 1.6), if you run it on a single partition (repartition to 1 first), no partition prefix ends up in the upper bits, so the IDs are simply 0, 1, 2, ... and can be safely cast to Int. Obviously, you need to have fewer than Integer.MAX_VALUE rows.
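
To see why repartitioning to 1 makes the cast safe, here is a rough Python model of the ID layout that the Spark documentation describes for monotonically_increasing_id(): the partition index goes in the upper 31 bits and the row position within the partition in the lower 33 bits. This is not Spark code; modeled_id and fits_in_int are names made up for illustration only.

```python
# Rough model of the documented layout of monotonically_increasing_id():
# upper 31 bits = partition index, lower 33 bits = row position in that partition.
INT_MAX = 2**31 - 1  # same value as Java's Integer.MAX_VALUE


def modeled_id(partition_index, row_in_partition):
    """Hypothetical helper: the ID a row would get under that layout."""
    return (partition_index << 33) + row_in_partition


def fits_in_int(value):
    """Hypothetical helper: True if the value survives a cast to a 32-bit Int."""
    return 0 <= value <= INT_MAX


# With a single partition (index 0) the IDs are just the row positions.
single_partition_ids = [modeled_id(0, row) for row in range(5)]
print(single_partition_ids)  # [0, 1, 2, 3, 4]

# The very first row of a second partition already overflows an Int.
first_id_in_partition_1 = modeled_id(1, 0)
print(first_id_in_partition_1, fits_in_int(first_id_in_partition_1))  # 8589934592 False
```

In PySpark the corresponding step would look roughly like df.repartition(1).withColumn("id", monotonically_increasing_id().cast("int")), where df stands in for your own DataFrame. Keep in mind that a single partition means one task processes every row, so this only suits data that one task can handle.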