How to use long user ID in PySpark ALS

Submitted by 橙三吉。 on 2019-12-06 07:58:34
Matej Jakimov

This is a known issue (https://issues.apache.org/jira/browse/SPARK-2465), but it will not be solved soon, because changing the interface to a long userId would slow down computation.

There are a few solutions:

  • you can hash userId to an int with the hash() function; since this amounts to randomly merging a few rows, collisions shouldn't really affect the accuracy of your recommender. See the discussion in the first link.

  • you can generate unique int userIds with RDD.zipWithUniqueId(), or the slower RDD.zipWithIndex(), just like in this thread: How to assign unique contiguous numbers to elements in a Spark RDD
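Both workarounds can be sketched in plain Python (the real pipeline would apply these transformations to an RDD or DataFrame; the IDs and helper names below are illustrative, not from the original answer):

```python
# Illustrative long user IDs that overflow the 32-bit int range ALS expects.
long_user_ids = [10_000_000_001, 10_000_000_002, 98_765_432_100]

# Option 1: fold each long ID into the positive 32-bit int range.
# Collisions are possible but rare relative to realistic ID counts.
def to_int_id(uid: int) -> int:
    return hash(uid) & 0x7FFFFFFF  # keep the low 31 bits -> non-negative int

hashed = [to_int_id(uid) for uid in long_user_ids]

# Option 2: assign dense surrogate ints -- the local analogue of
# RDD.zipWithUniqueId() / RDD.zipWithIndex(). Keep the mapping so
# recommendations can be translated back to the original IDs.
id_map = {uid: i for i, uid in enumerate(long_user_ids)}
reverse_map = {surrogate: uid for uid, surrogate in id_map.items()}
```

With option 2, remember to persist `id_map` (e.g. join it back onto the results), since ALS will only ever see the surrogate ints.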
