问题
I'm trying to understand Apache Beam. I was following the programming guide and in one example, they say talk about The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection..
I was quite surprised, because I didn't saw at any point a ParDo operation, so I started to wondering if the | was actually the ParDo. The code looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
emails_list = [
('amy', 'amy@example.com'),
('carl', 'carl@example.com'),
('julia', 'julia@example.com'),
('carl', 'carl@email.com'),
]
phones_list = [
('amy', '111-222-3333'),
('james', '222-333-4444'),
('amy', '333-444-5555'),
('carl', '444-555-6666'),
]
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
emails = p | 'CreateEmails' >> beam.Create(emails_list)
phones = p | 'CreatePhones' >> beam.Create(phones_list)
results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())
def join_info(name_info):
(name, info) = name_info
return '%s; %s; %s' %\
(name, sorted(info['emails']), sorted(info['phones']))
contact_lines = results | beam.Map(join_info)
I do notice that emails and phones are read at the start of the pipeline, so I guess that both of them are different PCollections, right? But where is the ParDo executed? What do the "|" and ">>" actually means? And how I can see the actual output of this? Does it matter if the join_info function, the emails_list and phones_list are defined outside the DAG?
回答1:
The | represents a separation between steps, this is (using p as Pbegin): p | ReadFromText(..) | ParDo(..) | GroupByKey().
You can also reference other PCollections before |:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
That's equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()
The >> are used between | and the PTransform to name the steps: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
来源:https://stackoverflow.com/questions/64921747/what-do-the-and-means-in-apache-beam