Code logic not working as expected: mistake in my logic in Apache Beam on Google Cloud

大城市里の小女人 submitted on 2020-01-06 04:37:09

Question


I am trying to implement CDC (change data capture) in apache_beam.

Here, I have unloaded the master data and the new data, which is expected to arrive daily.

The join is not working as expected. Something is amiss.

Can anyone please help me find my mistake? Am I missing a step?

    master_data = (
        p
        | 'Read base from BigQuery ' >> beam.io.Read(
            beam.io.BigQuerySource(query=master_data, use_standard_sql=True))
        | 'Map id in master' >> beam.Map(lambda master: (master['id'], master))
    )
    new_data = (
        p
        | 'Read Delta from BigQuery ' >> beam.io.Read(
            beam.io.BigQuerySource(query=new_data, use_standard_sql=True))
        | 'Map id in new' >> beam.Map(lambda new: (new['id'], new))
    )

    joined_dicts = (
        {'master_data': master_data, 'new_data': new_data}
        | beam.CoGroupByKey()
        | beam.FlatMap(join_lists)
        | 'mergeddicts' >> beam.Map(lambda (masterdict, newdict): newdict.update(masterdict))
    )

    def join_lists((k,v)):
        itertools.product(v['master_data'], v['new_data'])

Observations (on sample data):

Data in master:

1, 'A', 3232

2, 'B', 234

Data in new:

1, 'A', 44

4, 'D', 45

Expected in the master table after running the code:

1, 'A', 44

2, 'B', 234

4, 'D', 45

But what I am getting in the master table:

1, 'A', 44

4, 'D', 45

Answer 1:


You don't need the FlatMap after CoGroupByKey, because it splits the grouped elements apart again.
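To make the point about the flatten step concrete, here is a minimal plain-Python sketch (no Beam required) of what a cross product over the grouped values does. The `grouped` list is a hand-written stand-in for the per-key output of CoGroupByKey, and the `name` and `value` field names are assumptions for illustration, since the question only shows the `id` column by name. It is not meant to reproduce the exact output reported in the question, only to show that `itertools.product` yields nothing for a key that is missing on either side.

```python
import itertools

# Hand-written stand-in for CoGroupByKey output: one (key, {tag: values}) pair
# per key, with an empty list when the key is absent from that input.
# The 'name' and 'value' field names are hypothetical.
grouped = [
    (1, {'master_data': [{'id': 1, 'name': 'A', 'value': 3232}],
         'new_data': [{'id': 1, 'name': 'A', 'value': 44}]}),
    (2, {'master_data': [{'id': 2, 'name': 'B', 'value': 234}],
         'new_data': []}),
    (4, {'master_data': [],
         'new_data': [{'id': 4, 'name': 'D', 'value': 45}]}),
]


def cross_join(element):
    """Pair every master row with every new row for one key."""
    key, tagged = element
    return list(itertools.product(tagged['master_data'], tagged['new_data']))


pairs_per_key = {key: cross_join((key, tagged)) for key, tagged in grouped}

# A product with an empty list is empty, so only keys present on both sides
# produce any pairs; this behaves like an inner join.
surviving_keys = [key for key, pairs in pairs_per_key.items() if pairs]
print(surviving_keys)
```

Because keys that exist on only one side produce no pairs, a product-based join cannot carry unchanged master rows or brand-new rows through to the result.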

Here is the sample code.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def join_lists(e):
        (k, v) = e
        # Emit the new values when they differ from the master values, otherwise None.
        return (k, v['new_data']) if v['new_data'] != v['master_data'] else (k, None)

    # The with block runs the pipeline and waits for it to finish on exit,
    # so no explicit p.run() is needed.
    with beam.Pipeline(options=PipelineOptions()) as p:
        master_data = (
            p
            | 'Read base from BigQuery ' >> beam.Create([('A', [3232]), ('B', [234])])
        )
        new_data = (
            p
            | 'Read Delta from BigQuery ' >> beam.Create([('A', [44]), ('D', [45])])
        )

        joined_dicts = (
            {'master_data': master_data, 'new_data': new_data}
            | beam.CoGroupByKey()
            | 'mergeddicts' >> beam.Map(join_lists)
        )
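As a possible variation for the upsert behaviour the question expects (new rows win, unmatched master rows are kept), the per-key merge could be sketched as below. This is plain Python over a hand-written stand-in for the grouped input. The function name `upsert` and the `name` and `value` fields are hypothetical, and the sketch assumes at most one row per key on each side (if there are more, it simply takes the last one).

```python
def upsert(element):
    """Return (key, row), letting fields from the new row override the master row."""
    key, tagged = element
    master_rows = list(tagged['master_data'])
    new_rows = list(tagged['new_data'])
    # Assumption: at most one row per key per side; fall back to an empty dict.
    master = master_rows[-1] if master_rows else {}
    new = new_rows[-1] if new_rows else {}
    # Build a fresh dict instead of calling dict.update(), which mutates in
    # place and returns None (so a Map using it would emit None).
    return key, {**master, **new}


# Hand-written stand-in for CoGroupByKey output on the sample data.
grouped = [
    (1, {'master_data': [{'id': 1, 'name': 'A', 'value': 3232}],
         'new_data': [{'id': 1, 'name': 'A', 'value': 44}]}),
    (2, {'master_data': [{'id': 2, 'name': 'B', 'value': 234}],
         'new_data': []}),
    (4, {'master_data': [],
         'new_data': [{'id': 4, 'name': 'D', 'value': 45}]}),
]

merged = dict(upsert(element) for element in grouped)
for key in sorted(merged):
    print(key, merged[key])
```

In a pipeline, a function like this would be applied directly after the grouping step, for example `| beam.CoGroupByKey() | beam.Map(upsert)`, with no FlatMap in between.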


Source: https://stackoverflow.com/questions/59426850/code-logic-not-working-as-expected-mistake-in-my-logic-in-apache-beam-on-google
