How to assign unique contiguous numbers to elements in a Spark RDD


無奈伤痛 2020-12-04 14:00

I have a dataset of (user, product, review) and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numbers, whil

5 answers

不思量自难忘° (original poster)

2020-12-04 14:46
For a similar use case, I just hashed the string values. See http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
```
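// Non-negative hash: keeps only the low 23 bits of hashCode, so distinct tags can collide.
def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF
// postIDTags is assumed to be an RDD of (postID, tag) pairs, as in the linked post.
val tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))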
```
It sounds like you're already doing something like this, although hashing can be easier to manage.
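
One caveat with hashing is that the IDs are neither contiguous nor guaranteed unique, so it may be worth checking for collisions before training. The sketch below is plain Scala rather than Spark code; `collisions` is a hypothetical helper, not something from the linked post, and the sample values are made up for illustration.

```scala
// Same masking hash as in the snippet above.
def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFF

// Hypothetical helper: groups the distinct values by hash and keeps
// only the buckets that contain more than one value.
def collisions(tags: Seq[String]): Map[Int, Seq[String]] =
  tags.distinct.groupBy(nnHash).filter { case (_, vs) => vs.size > 1 }

// "Aa" and "BB" have the same String.hashCode on the JVM, so they collide.
val sample = Seq("Aa", "BB", "spark")
val clashes = collisions(sample)
clashes.foreach { case (h, vs) => println(s"hash $h is shared by ${vs.mkString(", ")}") }
```

If the check reports collisions, the colliding values would be treated as the same user or product by ALS, which is one reason to prefer an index-based scheme.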

Matei suggested an approach to emulating zipWithIndex on an RDD, which amounts to assigning IDs within each partition in such a way that they are globally unique: https://groups.google.com/forum/#!topic/spark-users/WxXvcn2gl1E
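
A minimal sketch of one way to do this, assuming a two-pass scheme: count the elements in each partition, turn the counts into starting offsets, then number each element as offset plus local position. This is not copied from the linked thread; `zipWithContiguousIndex` is a hypothetical name.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: pairs every element with a contiguous, globally unique Long ID.
def zipWithContiguousIndex[T](rdd: RDD[T]): RDD[(T, Long)] = {
  // Pass 1: one count per partition, collected to the driver in partition order.
  val counts: Array[Long] = rdd
    .mapPartitions(it => Iterator(it.size.toLong), preservesPartitioning = true)
    .collect()

  // offsets(i) is the total number of elements in partitions 0 until i.
  val offsets: Array[Long] = counts.scanLeft(0L)(_ + _)

  // Pass 2: ID = starting offset of the partition + position within the partition.
  rdd.mapPartitionsWithIndex { (pid, it) =>
    it.zipWithIndex.map { case (x, i) => (x, offsets(pid) + i) }
  }
}
```

The RDD is traversed twice, so it should produce the same elements in the same order on both passes; caching it first is one way to help with that. Spark 1.0 and later also ship a built-in `RDD.zipWithIndex`, which is likely the simpler choice where it is available.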