How to assign unique contiguous numbers to elements in a Spark RDD

無奈伤痛 2020-12-04 14:00

I have a dataset of (user, product, review) triples and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numbers, while mine are strings, so I need a way to assign a unique numeric ID to each distinct user and product.

5 Answers
  •  不思量自难忘° 2020-12-04 14:46

    For a similar use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

    // Mask to the low 23 bits so the resulting ID is a non-negative Int.
    def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF
    val tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))


    It sounds like you're already doing something like this, although hashing can be easier to manage.
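
    As a minimal sketch of how the hashed IDs could feed ALS: the reviews RDD of (userName, sku, score) triples and the trainFromStrings helper below are assumptions for illustration, and the ALS hyperparameters are not tuned values.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Hash both strings down to non-negative Ints, then train MLlib ALS.
    // nnHash is the masking function defined above; `reviews` is assumed
    // to hold (userName, sku, score) triples.
    def trainFromStrings(reviews: RDD[(String, String, Double)]) = {
      val ratings = reviews.map { case (user, sku, score) =>
        Rating(nnHash(user), nnHash(sku), score)
      }
      ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda: illustrative
    }

    Note that the mask keeps IDs non-negative but not contiguous, and distinct strings can collide; for a recommender that is often tolerable, but it is a real trade-off.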

    Matei suggested here an approach for emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partition that are globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E. A sketch of that idea follows.
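
    A minimal sketch of that approach, assuming Spark's RDD API; the helper name assignIds is hypothetical, and later Spark releases expose this directly as RDD.zipWithIndex and zipWithUniqueId:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Count each partition, turn the counts into per-partition starting
    // offsets, then hand out contiguous IDs partition by partition.
    def assignIds[T: ClassTag](rdd: RDD[T]): RDD[(T, Long)] = {
      // One extra Spark job to learn how many elements each partition holds.
      val counts = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
      // Running totals give the first ID of each partition.
      val offsets = counts.scanLeft(0L)(_ + _)
      rdd.mapPartitionsWithIndex { (pid, it) =>
        it.zipWithIndex.map { case (elem, i) => (elem, offsets(pid) + i) }
      }
    }

    Unlike hashing, this yields contiguous, collision-free IDs, at the cost of an extra pass over the data to compute the partition counts.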
