Sideload static data

Question

When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be sideloading? I read the pages they call documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and they mention apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but I get zero results when I Google how to use them and I can't find any example for Python.

How can I read a file from storage from within a ParDo? Or do I sideload it into my Pipeline before the ParDo, and if so, how do I use this second source within the ParDo?

[EDIT]

My main data comes from BQ: beam.io.Read(beam.io.BigQuerySource(...
The side input also comes from BQ, using the same BigQuerySource.

When I then add a step after the main data that takes the other data as a side input, I get some strange errors. I notice that it works when I apply beam.Map(lambda x: x) to the side input.

side input

schema_data = (p | "read schema data" >> beam.io.Read(beam.io.BigQuerySource(query=f"select * from `{schema_table}` limit 1", use_standard_sql=True, flatten_results=True))
                         | beam.Map(lambda x: x)
                       )

main data

    source_data = (p | "read source data" >> beam.io.Read(beam.io.BigQuerySource(query=f"select {columns} from `{source_table}` limit 10", use_standard_sql=True, flatten_results=True)))

combining

validated_records = source_data | 'record validation' >> beam.ParDo(Validate(), pvalue.AsList(schema_data))

Answer 1:

I would use the docs you mention as a library reference and go through the Beam programming guide for more detailed walkthroughs, in particular the side input section. I'll try to help with a couple of examples, for which we'll download a BigQuery schema from a public table and upload it to GCS:

bq show --schema bigquery-public-data:usa_names.usa_1910_current > schema.json
gsutil cp schema.json gs://$BUCKET

Our data will be some CSV rows without headers, so we have to use the schema stored in GCS:

data = [('NC', 'F', 2020, 'Hello', 3200),
        ('NC', 'F', 2020, 'World', 3180)]

Using side inputs

We read the JSON file into a schema PCollection:

schema = (p 
  | 'Read Schema from GCS' >> ReadFromText('gs://{}/schema.json'.format(BUCKET)))

and then we pass it to the ParDo as a side input so that it's broadcast to every worker that executes the DoFn. In this case, we can use AsSingleton as we just want to supply the schema as a single value:

(p
  | 'Create Events' >> beam.Create(data) \
  | 'Enrich with side input' >> beam.ParDo(EnrichElementsFn(), pvalue.AsSingleton(schema)) \
  | 'Log elements' >> beam.ParDo(LogElementsFn()))

Now we can access the schema in the process method of EnrichElementsFn:

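import json

class EnrichElementsFn(beam.DoFn):
  """Zips data with schema stored in GCS"""
  def process(self, element, schema):
    # schema arrives as the JSON text read from GCS
    field_names = [x['name'] for x in json.loads(schema)]
    # list(...) so that Python 3 emits the pairs instead of a lazy zip object
    yield list(zip(field_names, element))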

Note that it would be better to do the schema processing (to construct field_names) before saving it as a singleton to avoid duplicated work but this is just an illustrative example.
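As a rough sketch of that suggestion, the schema could be parsed once in its own step so that the side input already holds the list of field names. The helper names (parse_field_names, enrich, build) and the schema_path parameter are made up for illustration, and the sketch assumes the schema file is a single line of JSON, since ReadFromText emits one element per line:

```python
import json


def parse_field_names(schema_json):
    """Return the column names from a BigQuery schema given as JSON text."""
    return [field['name'] for field in json.loads(schema_json)]


def enrich(element, field_names):
    """Pair each value of a row with its column name."""
    return list(zip(field_names, element))


def build(p, data, schema_path):
    """Wire the steps onto an existing pipeline p (hypothetical helper)."""
    # Imported here so the two helpers above can be used without Beam.
    import apache_beam as beam

    # Parse the schema once, before it becomes a side input.
    field_names = (p
        | 'Read Schema from GCS' >> beam.io.ReadFromText(schema_path)
        | 'Parse field names' >> beam.Map(parse_field_names))

    # Every element now receives the ready-made list of names.
    return (p
        | 'Create Events' >> beam.Create(data)
        | 'Enrich with side input' >> beam.Map(
            enrich, field_names=beam.pvalue.AsSingleton(field_names)))
```

With this layout the JSON is decoded once for the schema element instead of once per data element.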

Using start bundle

In this case we don't pass any additional input to the ParDo:

(p
  | 'Create Events' >> beam.Create(data) \
  | 'Enrich with start bundle' >> beam.ParDo(EnrichElementsFn()) \
  | 'Log elements' >> beam.ParDo(LogElementsFn()))

And now we use the Python Client Library (we need to install google-cloud-storage) to read the schema each time that a worker initializes a bundle:

import json

class EnrichElementsFn(beam.DoFn):
  """Zips data with schema stored in GCS"""
  def start_bundle(self):
    from google.cloud import storage

    client = storage.Client()
    blob = client.get_bucket(BUCKET).get_blob('schema.json')
    self.schema = blob.download_as_string()

  def process(self, element):
    field_names = [x['name'] for x in json.loads(self.schema)]
    # list(...) so that Python 3 emits the pairs instead of a lazy zip object
    yield list(zip(field_names, element))

The output is the same in both cases:

INFO:root:[(u'state', 'NC'), (u'gender', 'F'), (u'year', 2020), (u'name', 'Hello'), (u'number', 3200)]
INFO:root:[(u'state', 'NC'), (u'gender', 'F'), (u'year', 2020), (u'name', 'World'), (u'number', 3180)]

Tested with 2.16.0 SDK and the DirectRunner.

Full code for both examples here.

Answer 2:

I found a similar question here. According to the comments on that post, if your schema file (in this case JSON) is in a known location in GCS, you can add a ParDo to your pipeline that reads it directly from GCS using a start_bundle() implementation.

You can use Beam's FileSystem abstraction if you need to abstract out the file-system that you use to store the schema file (not just GCS).
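A minimal sketch of what that could look like, assuming the schema has the shape produced by bq show --schema. The function names field_names_from_bytes and load_field_names are hypothetical; FileSystems.open is the Beam call that picks the file system from the path prefix:

```python
import json


def field_names_from_bytes(raw):
    """Return the column names from raw schema bytes (a JSON array)."""
    return [field['name'] for field in json.loads(raw)]


def load_field_names(path):
    """Read a schema file through Beam's FileSystems abstraction.

    The path may be 'gs://bucket/schema.json', a local path, or any
    other scheme for which the installed Beam SDK has a file system.
    """
    # Imported lazily, in the same way the storage client is imported
    # inside start_bundle.
    from apache_beam.io.filesystems import FileSystems

    with FileSystems.open(path) as f:
        return field_names_from_bytes(f.read())
```

Calling load_field_names from start_bundle() in place of the storage client would keep the DoFn independent of where the schema file lives.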

Also, you can read/download files from storage using the Google Cloud Storage’s API.

I also found a blog here that talks about the different source reading patterns when using Google Cloud Dataflow.

I hope this helps.

Source: https://stackoverflow.com/questions/59458599/sideload-static-data

Tags

python-3.x

google-cloud-dataflow

apache-beam