Question:
I'm currently trying to find the best way of processing two very large datasets.
I have two BigQuery tables:
- One table containing streamed events (billions of rows)
- One table containing tags and the associated event properties (100,000 rows)
I want to tag each event with the appropriate tags based on the event properties (an event can have multiple tags). However, a SQL cross join seems to be too slow at this dataset size.
What is the best way to proceed using a pipeline of MapReduce jobs while avoiding a very costly shuffle phase, given that each event has to be compared to every tag?
Also, I'm planning to use Google Cloud Dataflow; is this tool suited to the task?
Answer 1:
Google Cloud Dataflow is a good fit for this.
Assuming the tags data is small enough to fit in memory, you can avoid a shuffle by passing it as a side input.
Your pipeline would look like the following (a sketch follows the list):
- Use two BigQueryIO transforms to read from each table.
- Create a DoFn to tag each event with its tags.
- The input PCollection to your DoFn should be the events; pass the table of tags as a side input.
- Use a BigQueryIO transform to write the result back to BigQuery (assuming you want to use BigQuery for the output).
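Here is a minimal sketch of that pipeline using the Apache Beam Java SDK (the open-source successor of the Dataflow SDK). The table names, the `property`/`tag_name` fields, and the `matches` predicate are placeholders for your own schema and matching rule:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import java.util.List;

public class TagEventsPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the small tags table (~100k rows) and materialize it as an
    // in-memory side input, broadcast to every worker.
    PCollectionView<List<TableRow>> tagsView =
        p.apply("ReadTags", BigQueryIO.readTableRows().from("my-project:my_dataset.tags"))
         .apply("TagsAsSideInput", View.asList());

    // Read the large events table; this stays a regular, distributed PCollection.
    PCollection<TableRow> events =
        p.apply("ReadEvents", BigQueryIO.readTableRows().from("my-project:my_dataset.events"));

    // For each event, scan the tag list and emit one output row per matching tag.
    // No shuffle is needed: the tags are broadcast, not joined by key.
    PCollection<TableRow> tagged = events.apply("TagEvents",
        ParDo.of(new DoFn<TableRow, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            TableRow event = c.element();
            for (TableRow tag : c.sideInput(tagsView)) {
              if (matches(event, tag)) {
                c.output(event.clone().set("tag", tag.get("tag_name")));
              }
            }
          }
        }).withSideInputs(tagsView));

    // Write back to BigQuery; CREATE_NEVER assumes the output table already exists.
    tagged.apply("WriteTagged", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.tagged_events")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }

  // Placeholder predicate: does this event's properties satisfy this tag's rule?
  static boolean matches(TableRow event, TableRow tag) {
    return tag.get("property").equals(event.get("property"));
  }
}
```

With 100,000 tag rows the side input should fit comfortably in worker memory; each worker then tags events in parallel without any cross-worker data movement.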
If your tags data is too large to fit in memory, you will most likely have to use a join; see the sketch below.
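One way to express such a join in the Beam Java SDK is CoGroupByKey, assuming you can derive a shared key from both tables (the `property` field below is a hypothetical example; a pure cross join has no such key, so this only applies if the matching rule can be keyed):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

public class TagEventsJoinPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<TableRow> events =
        p.apply("ReadEvents", BigQueryIO.readTableRows().from("my-project:my_dataset.events"));
    PCollection<TableRow> tags =
        p.apply("ReadTags", BigQueryIO.readTableRows().from("my-project:my_dataset.tags"));

    // Key both collections by the shared value so the shuffle only brings
    // together rows that could actually match.
    PCollection<KV<String, TableRow>> keyedEvents = events.apply("KeyEvents",
        WithKeys.of((TableRow r) -> (String) r.get("property"))
                .withKeyType(TypeDescriptors.strings()));
    PCollection<KV<String, TableRow>> keyedTags = tags.apply("KeyTags",
        WithKeys.of((TableRow r) -> (String) r.get("property"))
                .withKeyType(TypeDescriptors.strings()));

    final TupleTag<TableRow> eventsTag = new TupleTag<>();
    final TupleTag<TableRow> tagsTag = new TupleTag<>();

    // CoGroupByKey shuffles each input once, then groups the rows from
    // both tables that share the same key.
    KeyedPCollectionTuple.of(eventsTag, keyedEvents)
        .and(tagsTag, keyedTags)
        .apply(CoGroupByKey.create())
        .apply("EmitTaggedEvents", ParDo.of(new DoFn<KV<String, CoGbkResult>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Pair every event in this key group with every tag in the group.
            for (TableRow event : c.element().getValue().getAll(eventsTag)) {
              for (TableRow tag : c.element().getValue().getAll(tagsTag)) {
                c.output(event.clone().set("tag", tag.get("tag_name")));
              }
            }
          }
        }));
    // ... then write the result with BigQueryIO.writeTableRows() as in the previous sketch.

    p.run();
  }
}
```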
Source: https://stackoverflow.com/questions/33254689/best-strategy-for-joining-two-large-datasets