Dataflow Streaming using Python SDK: Transform for PubSub Messages to BigQuery Output

Asked by 猫巷女王i · 2021-01-06 15:34

I am attempting to use Dataflow to read a Pub/Sub message and write it to BigQuery. I was given alpha access by the Google team and have gotten the provided examples working.

3 Answers
  •  醉话见心 · answered 2021-01-06 16:29

    I was able to successfully parse the pubsub string by defining a function that loads it into a JSON object (see parse_pubsub() below). One odd issue I ran into was that I could not import json at the global scope: I kept getting "NameError: global name 'json' is not defined" until I moved the import inside the function. This is the usual symptom of Dataflow workers not seeing the main module's globals: module-level imports are not shipped along with the serialized functions unless the main session is saved, so importing inside the function guarantees the name exists where the code actually runs.
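
    As an alternative to the in-function import, you can ask Beam to pickle the main session so that top-level imports travel to the workers. This is a minimal sketch using the standard Beam pipeline options; pipeline_args here stands for the leftover flags returned by parse_known_args(), as in the full pipeline below:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    # pipeline_args: leftover command-line flags from parser.parse_known_args().
    options = PipelineOptions(pipeline_args)
    # Pickle the main module's global namespace and restore it on each worker,
    # so module-level imports such as `json` resolve on the remote machines.
    options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=options) as p:
        ...  # build the same transforms as in the pipeline below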

    See my working code below:

    from __future__ import absolute_import
    
    import logging
    import argparse
    import apache_beam as beam
    import apache_beam.transforms.window as window
    
    def parse_pubsub(line):
        """Normalize a pubsub message string to a (mac, status, datetime) tuple.

        Messages look like this:
        {"datetime": "2017-07-13T21:15:02Z", "mac": "FC:FC:48:AE:F6:94", "status": 1}
        """
        import json  # imported here so the name resolves on Dataflow workers
        record = json.loads(line)
        return record['mac'], record['status'], record['datetime']
    
    def run(argv=None):
      """Build and run the pipeline."""
    
      parser = argparse.ArgumentParser()
      parser.add_argument(
          '--input_topic', required=True,
          help='Input PubSub topic of the form "/topics/<PROJECT>/<TOPIC>".')
      parser.add_argument(
          '--output_table', required=True,
          help=
          ('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
           'or DATASET.TABLE.'))
      known_args, pipeline_args = parser.parse_known_args(argv)
    
      with beam.Pipeline(argv=pipeline_args) as p:
        # Read the pubsub topic into a PCollection, parse each message, and
        # stream the resulting rows to BigQuery. (Newer SDKs replace
        # ReadStringsFromPubSub with beam.io.ReadFromPubSub, which emits
        # bytes that must be decoded.)
        (p
         | beam.io.ReadStringsFromPubSub(known_args.input_topic)
         | beam.Map(parse_pubsub)
         # Python 3 removed tuple-parameter unpacking in lambda signatures,
         # so index into the parsed tuple instead.
         | beam.Map(lambda fields: {'mac': fields[0],
                                    'status': fields[1],
                                    'datetime': fields[2]})
         | beam.io.WriteToBigQuery(
             known_args.output_table,
             schema='mac:STRING, status:INTEGER, datetime:TIMESTAMP',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    
    if __name__ == '__main__':
      logging.getLogger().setLevel(logging.INFO)
      run()
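
    This runs on the DirectRunner and on the Dataflow service alike. A hypothetical invocation would look like the following (the script name, topic, table, project, and bucket are placeholders, not values from my setup):

    python pubsub_to_bigquery.py \
        --input_topic /topics/<PROJECT>/<TOPIC> \
        --output_table <PROJECT>:<DATASET>.<TABLE> \
        --runner DataflowRunner \
        --project <PROJECT> \
        --temp_location gs://<BUCKET>/tmp \
        --streaming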
    
