How to retrieve the content of a PCollection and assign it to a normal variable?

Submitted by 人盡茶涼 on 2019-12-07 22:32:35

Question


I am using Apache-Beam with the Python SDK.

Currently, my pipeline reads multiple files, parses them, and generates pandas dataframes from their data. Then it groups them into a single dataframe.

What I want now is to retrieve this single fat dataframe and assign it to a normal Python variable.

Is this possible?


Answer 1:


A PCollection is simply a logical node in the execution graph, and its contents are not necessarily stored anywhere, so this is not possible directly.

However, you can ask your pipeline to write the PCollection to a file (e.g. convert elements to strings and use WriteToText with num_shards=1), run the pipeline and wait for it to finish, and then read that file from your main program.



Source: https://stackoverflow.com/questions/48668686/how-to-retrieve-the-content-of-a-pcollection-and-assign-it-to-a-normal-variable
