Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

死守一世寂寞 2020-11-29 12:25

I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but am unsure whether this is currently possible in version 1.6 of the Dataflow SDK.

6 Answers
  •  北海茫月
    2020-11-29 12:47

    I have written data into BigQuery partitioned tables through Dataflow. These writes are dynamic: if data already exists in a given partition, I can either append to it or overwrite it.

    I have written the code in Python. It is a batch-mode load operation into BigQuery.

    from google.cloud import bigquery

    client = bigquery.Client(project=projectName)
    dataset_ref = client.dataset(datasetName)
    table_ref = dataset_ref.table(bqTableName)
    job_config = bigquery.LoadJobConfig()
    job_config.skip_leading_rows = skipLeadingRows
    job_config.source_format = bigquery.SourceFormat.CSV
    if tableExists(client, table_ref):  # user-defined helper, sketched below
        job_config.autodetect = autoDetect
        previous_rows = client.get_table(table_ref).num_rows
        #assert previous_rows > 0
        if allowJaggedRows is True:
            job_config.allow_jagged_rows = True  # public attribute is snake_case
        if allowFieldAddition is True:
            # use the public API rather than poking _properties
            job_config.schema_update_options = [
                bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION
            ]
        if isPartitioned is True:
            job_config.time_partitioning = bigquery.TimePartitioning(
                type_=bigquery.TimePartitioningType.DAY
            )
        if schemaList is not None:
            job_config.schema = schemaList
        # replace the existing contents rather than appending
        job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    else:
        job_config.autodetect = autoDetect
        # the load job creates the table itself, so no separate create_table call is needed
        job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
        job_config.schema = schemaList
        if isPartitioned is True:
            job_config.time_partitioning = bigquery.TimePartitioning(
                type_=bigquery.TimePartitioningType.DAY
            )
    load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
    assert load_job.job_type == 'load'
    load_job.result()  # blocks until the load completes
    assert load_job.state == 'DONE'
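
    The snippet assumes a tableExists helper; a minimal sketch of it (my addition, not part of the original answer) could call get_table and catch NotFound:

    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound

    def tableExists(client, table_ref):
        # Returns True if the table already exists, False otherwise.
        # Hypothetical helper assumed by the snippet above.
        try:
            client.get_table(table_ref)
            return True
        except NotFound:
            return False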
    

    It works fine.
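
    If you only need to replace or append to a single day's partition rather than the whole table, BigQuery load jobs accept a $YYYYMMDD partition decorator in the destination table ID. A hedged sketch, reusing the variables above (the date is made up):

    # Load only into the 2020-11-29 partition; with WRITE_TRUNCATE this
    # replaces just that partition, with WRITE_APPEND it appends to it.
    table_ref = dataset_ref.table(bqTableName + '$20201129')
    load_job = client.load_table_from_uri(gcsFileName, table_ref, job_config=job_config)
    load_job.result()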
