Custom aggregation on PySpark dataframes

Assaf Mendelson

You have several options:

  1. Create a user-defined aggregate function (UDAF). The problem is that you will need to write the UDAF in Scala and wrap it for use in Python.
  2. You can use the collect_list function to collect all values into a list and then write a UDF to combine them (see the sketch after this list).
  3. You can move to the RDD API and use aggregate or aggregateByKey (see the second sketch below).
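A minimal sketch of option 2, assuming a DataFrame with illustrative `key` and `value` columns and a product aggregation chosen only as an example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Illustrative data: a few numeric values per key.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)],
    ["key", "value"],
)

# Combine the collected values in a plain Python UDF (here, a product).
def product_of(values):
    result = 1.0
    for v in values:
        result *= v
    return result

product_udf = F.udf(product_of, DoubleType())

result = (
    df.groupBy("key")
      .agg(F.collect_list("value").alias("values"))
      .withColumn("product", product_udf("values"))
)
result.show()
```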

Both options 2 and 3 would be relatively inefficient (costing both CPU and memory).
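A minimal sketch of option 3 under the same assumptions (illustrative `key`/`value` columns, product aggregation): convert to an RDD of key/value pairs and use aggregateByKey, where the first function folds a value into the per-partition accumulator and the second merges accumulators across partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)],
    ["key", "value"],
)

# Pair RDD of (key, value).
rdd = df.rdd.map(lambda row: (row["key"], row["value"]))

products = rdd.aggregateByKey(
    1.0,                      # zero value for the accumulator
    lambda acc, v: acc * v,   # fold one value into the accumulator
    lambda a, b: a * b,       # merge accumulators from different partitions
)
print(products.collect())
```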
