How to assign unique contiguous numbers to elements in a Spark RDD


無奈伤痛 2020-12-04 14:00

I have a dataset of (user, product, review) tuples and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numbers, whil

5 answers

自闭症患者 (original poster)

2020-12-04 14:44
Starting with Spark 1.0 there are two methods you can use to solve this easily:
- RDD.zipWithIndex works just like Seq.zipWithIndex: it pairs each element with a contiguous (Long) index. To do so it first has to count the elements in each partition, so your input will be evaluated twice. Cache your input RDD if you want to use this.
- RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it will not cause double evaluation.
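To make the difference between the two methods concrete, here is a minimal pure-Python sketch that simulates how each one numbers elements. It is not Spark code: partitions are modeled as plain lists, and the function names `zip_with_index` and `zip_with_unique_id` are made up for illustration. The ID scheme used for the second function (the k-th partition gets k, n+k, 2n+k, ... for n partitions) is the one described in the Spark API docs for zipWithUniqueId.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Simulate RDD.zipWithIndex on a list of partitions (plain lists)."""
    # First pass: count the elements in each partition.
    # In Spark this is the extra job that evaluates the input a second time.
    counts = [len(p) for p in partitions]
    # Each partition starts where the previous ones end.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Second pass: number elements from the partition's starting offset.
    return [
        [(x, off + i) for i, x in enumerate(p)]
        for p, off in zip(partitions, offsets)
    ]


def zip_with_unique_id(partitions):
    """Simulate RDD.zipWithUniqueId: no counting pass is needed."""
    n = len(partitions)
    # Element i of partition k gets ID i * n + k, which depends only on
    # the element's local position and the number of partitions.
    return [
        [(x, i * n + k) for i, x in enumerate(p)]
        for k, p in enumerate(partitions)
    ]


if __name__ == "__main__":
    # Hypothetical user names spread unevenly over three partitions.
    users = [["alice", "bob", "carol"], ["dave"], ["erin", "frank"]]
    print(zip_with_index(users))      # IDs 0..5, contiguous
    print(zip_with_unique_id(users))  # IDs 0, 3, 6 / 1 / 2, 5 (4 is skipped)
```

With the uneven partitions above, the second function skips ID 4, which is the gap the answer warns about. With equally sized partitions such as `[["a", "b"], ["c", "d"]]` the IDs come out as 0, 2 and 1, 3, so together they are contiguous.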