I have a dataset of (user, product, review), and want to feed it into MLlib's ALS algorithm.
The algorithm needs users and products to be numbers, whil
monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS, since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for details).
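To make the 64-bit problem concrete, here is a small plain-Python sketch (no Spark needed). It assumes the bit layout described in the Spark docs for monotonically_increasing_id(): partition ID in the upper 31 bits, record number within the partition in the lower 33 bits. The helper name monotonic_id is made up for illustration and is not part of Spark.

```python
# Largest value a signed 32-bit integer can hold
INT32_MAX = 2**31 - 1


def monotonic_id(partition_id, row_in_partition):
    """Mimic the documented layout (hypothetical helper, not a Spark API):
    partition ID in the upper 31 bits, record number in the lower 33 bits."""
    return (partition_id << 33) | row_in_partition


# Rows in partition 0 get small IDs...
first_partition_ids = [monotonic_id(0, r) for r in range(3)]
# ...but the very first row of partition 1 is already past the 32-bit range
second_partition_ids = [monotonic_id(1, r) for r in range(3)]

print(first_partition_ids)
print(second_partition_ids)
print(second_partition_ids[0] > INT32_MAX)
```

So as soon as your data spans more than one partition, the generated IDs no longer fit in a 32-bit integer, no matter how few rows you have.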
The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:
If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:
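# PySpark code
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# zipWithIndex() pairs each distinct user with a contiguous 0-based index
user_als_id_LUT = sqlContext.createDataFrame(
    userids.rdd.map(lambda x: x[0]).zipWithIndex(),
    StructType([
        StructField("userid", StringType(), True),
        StructField("user_als_id", IntegerType(), True),
    ]),
)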
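Now you can use this LUT to translate your user IDs into ALS-friendly integer IDs, and to do the reverse lookup when you need to get back from an ALS ID to the original one.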
Do the same for items, obviously.
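Once both LUTs exist, swapping the string keys for the integer ones is just two joins. Here is a rough sketch of how that could look. It assumes all three arguments are PySpark DataFrames, and apart from userid and user_als_id the names are my own placeholders (attach_als_ids, itemid, item_als_id, rating), so adjust them to your schema.

```python
def attach_als_ids(ratings, user_lut, item_lut):
    """Replace string keys with the integer IDs from the two lookup tables.

    Assumes `ratings` has columns userid, itemid, rating (illustrative names),
    `user_lut` has userid, user_als_id, and `item_lut` has itemid, item_als_id.
    """
    return (
        ratings
        .join(user_lut, on="userid", how="inner")   # adds user_als_id
        .join(item_lut, on="itemid", how="inner")   # adds item_als_id
        .select("user_als_id", "item_als_id", "rating")
    )


# Usage (assuming the DataFrames already exist in your session):
# als_input = attach_als_ids(ratings, user_als_id_LUT, item_als_id_LUT)
```

The result has only integer IDs plus the rating, which is the shape ALS wants. To go back after predicting, join the predictions against the same LUTs on user_als_id and item_als_id.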