How to batch up items from a PySpark DataFrame

Submitted by 佐手、 on 2020-01-06 15:01:17

Question


I have a PySpark DataFrame, and for each batch of records I want to call an API. So basically, say I have 100000k records; I want to batch the items into groups of, say, 1000 and call an API for each group. How can I do this with PySpark? The reason for batching is that the API probably will not accept a huge chunk of data from a Big Data system.

I first thought of LIMIT, but that won't be "deterministic". Furthermore, it seems like it would be inefficient.


Answer 1:


df.foreachPartition { ele =>
  ele.grouped(1000).foreach { chunk =>
    postToServer(chunk)
  }
}

The code above is in Scala; the same approach works in Python. It creates batches of 1000 records within each partition.
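Since the question is about PySpark, here is a rough Python equivalent of the Scala snippet, as a sketch: post_to_server is a hypothetical stand-in for whatever API call you actually make, and df is assumed to be the DataFrame from the question.

from itertools import islice

def post_partition(rows):
    # Pull rows off this partition's iterator in chunks of up to 1000
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, 1000))
        if not chunk:
            break
        post_to_server(chunk)  # hypothetical: replace with the real API call

df.foreachPartition(post_partition)

Because foreachPartition hands each partition to the function as a plain iterator, the chunking happens on the executors without collecting data to the driver.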




Answer 2:


Using foreachPartition and then something like the approach from "how to split an iterable in constant-size chunks" to batch each partition's iterable into groups of 1000 is arguably the most efficient way to do this in terms of Spark resource usage.

def handle_iterator(it):
    # batch the iterable and call API
    pass
df.foreachPartition(handle_iterator)

Note: this makes parallel calls to the API from the executors, which might not be the way to go in practice if, for example, rate limiting is an issue.
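For illustration, here is one way handle_iterator might be fleshed out, assuming a hypothetical HTTP endpoint and the requests library; the time.sleep call is only a crude per-executor throttle, not a real rate limiter.

import time

import requests  # assumption: the API is an HTTP endpoint reachable via requests

API_URL = "https://example.com/api/items"  # hypothetical endpoint

def handle_iterator(it):
    # Accumulate rows into constant-size chunks of 1000 and post each chunk
    buf = []
    for row in it:
        buf.append(row.asDict())
        if len(buf) == 1000:
            requests.post(API_URL, json=buf)
            time.sleep(1)  # crude per-executor pause if rate limiting is a concern
            buf = []
    if buf:  # send any leftover rows
        requests.post(API_URL, json=buf)

df.foreachPartition(handle_iterator)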



Source: https://stackoverflow.com/questions/55979027/how-to-batch-up-items-from-a-pyspark-dataframe
