Question
I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam.
While it's not documented, Sample.FixedSizeGlobally(n) exists.
When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection with the samples. Is that correct?
Is this the best way to turn that single-item PCollection into a PCollection of the individual items?
| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)
Answer 1:
Currently, yes. The Sample.FixedSizeGlobally() transform returns a PCollection with a single list element. You can turn it into a PCollection of single elements like you said:
| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)
We'll make sure to add a PCollection-to-PCollection transform, and we also welcome your contributions to Beam :) In the meantime, that's what we've got.
Source: https://stackoverflow.com/questions/47101680/sample-in-dataflow-beam-with-python