Question
I am trying to implement CDC (change data capture) in Apache Beam.
I have unloaded the master data and the new data, which is expected to arrive daily.
The join is not working as expected.
Can anyone please help me find my mistake? Am I missing a step?
master_data = (
    p
    | 'Read base from BigQuery ' >> beam.io.Read(
        beam.io.BigQuerySource(query=master_data, use_standard_sql=True))
    | 'Map id in master' >> beam.Map(
        lambda master: (master['id'], master)))
new_data = (
    p
    | 'Read Delta from BigQuery ' >> beam.io.Read(
        beam.io.BigQuerySource(query=new_data, use_standard_sql=True))
    | 'Map id in new' >> beam.Map(
        lambda new: (new['id'], new)))
joined_dicts = (
    {'master_data': master_data, 'new_data': new_data}
    | beam.CoGroupByKey()
    | beam.FlatMap(join_lists)
    | 'mergeddicts' >> beam.Map(lambda (masterdict, newdict): newdict.update(masterdict))
)
def join_lists((k,v)):
    itertools.product(v['master_data'], v['new_data'])
Observations (on sample data):
Data in master:
1, 'A', 3232
2, 'B', 234
Data in new:
1, 'A', 44
4, 'D', 45
Expected master table after running the code:
1, 'A', 44
2, 'B', 234
4, 'D', 45
But what I am getting in the master table:
1, 'A', 44
4, 'D', 45
Answer 1:
You don't need to flatten after the group-by: CoGroupByKey has already collected the rows for each key, and FlatMap separates the elements again.
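To see what the grouped elements look like, here is a small plain-Python sketch that does not use Beam at all. The `grouped` list is a hand-written stand-in for what CoGroupByKey would emit for the sample data, not output captured from a real pipeline, and the `name` and `value` field names are assumptions (the question only shows `id`). It shows that a Cartesian product over the two lists behaves like an inner join, so any key with an empty side yields nothing.

```python
import itertools

# Hand-written stand-in for CoGroupByKey output on the sample data:
# one (key, {'master_data': [...], 'new_data': [...]}) pair per id.
# The 'name' and 'value' field names are assumed for illustration.
grouped = [
    (1, {'master_data': [{'id': 1, 'name': 'A', 'value': 3232}],
         'new_data': [{'id': 1, 'name': 'A', 'value': 44}]}),
    (2, {'master_data': [{'id': 2, 'name': 'B', 'value': 234}],
         'new_data': []}),
    (4, {'master_data': [],
         'new_data': [{'id': 4, 'name': 'D', 'value': 45}]}),
]


def product_pairs(element):
    """Pair every master row with every new row for one key."""
    _key, rows = element
    return list(itertools.product(rows['master_data'], rows['new_data']))


pairs_per_key = {key: product_pairs((key, rows)) for key, rows in grouped}

# Keys whose product is non-empty, i.e. keys present on both sides.
surviving_ids = [key for key, pairs in pairs_per_key.items() if pairs]
print(surviving_ids)
```

Only id 1 survives here, because ids 2 and 4 each have one empty list. Note also that `join_lists` in the question has no `return` statement, so it returns None rather than the product.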
Here is the sample code.
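import apache_beam as beam
from apache_beam import Pipeline
from apache_beam.options.pipeline_options import PipelineOptions

def join_lists(e):
    # e is one CoGroupByKey result: (key, {'master_data': [...], 'new_data': [...]})
    (k, v) = e
    # Emit the new values when they differ from the master values, otherwise None.
    return (k, v['new_data']) if v['new_data'] != v['master_data'] else (k, None)

# Exiting the with block runs the pipeline and waits for it to finish,
# so a separate p.run() / wait_until_finish() call is not needed.
with Pipeline(options=PipelineOptions()) as p:
    master_data = (
        p
        | 'Read base from BigQuery ' >> beam.Create([('A', [3232]), ('B', [234])])
    )
    new_data = (
        p
        | 'Read Delta from BigQuery ' >> beam.Create([('A', [44]), ('D', [45])])
    )
    joined_dicts = (
        {'master_data': master_data, 'new_data': new_data}
        | beam.CoGroupByKey()
        | 'mergeddicts' >> beam.Map(join_lists)
    )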
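One thing to watch with `join_lists` above: for a key that exists only in master (such as 'B'), `new_data` is empty and differs from `master_data`, so the function would emit the empty list rather than the master row, assuming the grouped values compare like plain lists. A possible alternative, sketched below in plain Python, is to take the new rows when there are any and fall back to the master rows otherwise. The name `merge_rows` is hypothetical, and `grouped` is a hand-written stand-in for the CoGroupByKey output of the sample pipeline rather than real pipeline output.

```python
def merge_rows(element):
    """Return the new rows for a key if any exist, otherwise the master rows."""
    key, rows = element
    # list() is used because grouped values may arrive as generic iterables
    # rather than lists, and an empty list is falsy while an iterator is not.
    new_rows = list(rows['new_data'])
    master_rows = list(rows['master_data'])
    return key, (new_rows or master_rows)


# Hand-written stand-in for CoGroupByKey output on the sample data.
grouped = [
    ('A', {'master_data': [[3232]], 'new_data': [[44]]}),
    ('B', {'master_data': [[234]], 'new_data': []}),
    ('D', {'master_data': [], 'new_data': [[45]]}),
]

merged = [merge_rows(element) for element in grouped]
for key, rows in merged:
    print(key, rows)
```

In the pipeline this function could be passed to `beam.Map` in place of `join_lists`. It keeps 'A' with the new value, keeps 'B' from master, and adds 'D' from the new data, which matches the expected table in the question.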
Source: https://stackoverflow.com/questions/59426850/code-logic-not-working-as-expected-mistake-in-my-logic-in-apache-beam-on-google