Sample in Dataflow / Beam with Python

Submitted by 戏子无情 on 2019-12-13 17:30:23

Question


I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam.

While it's not documented, Sample.FixedSizeGlobally(n) exists.

When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection with the samples. Is that correct?

Is this the best way to turn that single-item PCollection into a PCollection of the individual items?

| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)

Answer 1:


Currently, yes. The Sample.FixedSizeGlobally() transform returns a PCollection with a single list element. You can turn it into a PCollection of single elements like you said:

Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)

We'll make sure to add a PCollection-to-PCollection transform, and we also welcome your contributions to Beam :) In the meantime, that's what we've got.



Source: https://stackoverflow.com/questions/47101680/sample-in-dataflow-beam-with-python
