How to assign unique contiguous numbers to elements in a Spark RDD

Asked by 無奈伤痛 on 2020-12-04 14:00 · 5 answers

I have a dataset of (user, product, review) triples and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numbers, while mine are strings, so I need to map each distinct string to a unique contiguous integer ID.

5 Answers
  •  天命终不由人
    2020-12-04 14:45

    monotonically_increasing_id() appears to be the answer, but unfortunately it won't work for ALS, since it produces 64-bit numbers and ALS expects 32-bit ones (see my comment below radek1st's answer for details).
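    To see why those 64-bit IDs break a 32-bit consumer, here is a pure-Python sketch (not the Spark API itself) of the bit layout monotonically_increasing_id() uses: the partition index sits in the upper 31 bits and the per-partition record number in the lower 33 bits, so any row outside partition 0 already exceeds the signed 32-bit range.

```python
# Pure-Python illustration (assumption: mirrors the documented bit layout of
# Spark's monotonically_increasing_id, not a call into Spark itself).
def monotonic_id(partition_index, row_in_partition):
    # partition index in the upper 31 bits, record number in the lower 33 bits
    return (partition_index << 33) + row_in_partition

INT32_MAX = 2**31 - 1

# Rows in partition 0 stay small...
print(monotonic_id(0, 5))               # 5
# ...but the very first row of partition 1 already overflows a 32-bit int,
# which is why casting these IDs to Int for ALS silently corrupts them.
print(monotonic_id(1, 0))               # 8589934592
print(monotonic_id(1, 0) > INT32_MAX)   # True
```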

    The solution I found is to use zipWithIndex(), as mentioned in Darabos' answer. Here's how to implement it:

    If you already have a single-column DataFrame with your distinct users called userids, you can create a lookup table (LUT) as follows:

    # PySpark code
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    user_als_id_LUT = sqlContext.createDataFrame(
        userids.rdd.map(lambda x: x[0]).zipWithIndex(),
        StructType([
            StructField("userid", StringType(), True),
            StructField("user_als_id", IntegerType(), True),
        ]))


    Now you can:

    • Use this LUT to get ALS-friendly integer IDs to provide to ALS
    • Use this LUT to do a reverse-lookup when you need to go back from ALS ID to the original ID

    Do the same for items, obviously.
