Question:
I'm currently trying to find the best way of processing two very large datasets.
I have two BigQuery tables:
- One table containing streamed events (billions of rows)
- One table containing tags and the associated event properties (100,000 rows)
I want to tag each event with the appropriate tags based on the event properties (an event can have multiple tags). However, a SQL cross join seems to be too slow at this dataset size.
What is the best way to proceed using a pipeline of MapReduce jobs while avoiding a very costly shuffle phase, given that each event has to be compared to every tag?
Also, I'm planning to use Google Cloud Dataflow; is this tool suited to the task?
Answer 1:
Google Cloud Dataflow is a good fit for this.
Assuming the tags data is small enough to fit in memory, you can avoid a shuffle by passing it as a side input.
Your pipeline would look like the following (a sketch follows the list):
- Use two BigQueryIO transforms to read from each table.
- Create a DoFn to tag each event with its tags.
- The input PCollection to your DoFn should be the events; pass the table of tags as a side input.
- Use a BigQueryIO transform to write the result back to BigQuery (assuming you want to use BigQuery for the output).
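Here is a minimal sketch of that pipeline using the Apache Beam Java SDK (the open-source successor of the Dataflow SDK). The table names, the `property`/`tag_name` fields, and the `matches` predicate are placeholders for your own schema and matching rule:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import java.util.List;

public class TagEventsPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the small tags table (~100k rows) and materialize it as an
    // in-memory side input, broadcast to every worker.
    PCollectionView<List<TableRow>> tagsView =
        p.apply("ReadTags", BigQueryIO.readTableRows().from("my-project:my_dataset.tags"))
         .apply("TagsAsSideInput", View.asList());

    // Read the large events table; this stays a regular, distributed PCollection.
    PCollection<TableRow> events =
        p.apply("ReadEvents", BigQueryIO.readTableRows().from("my-project:my_dataset.events"));

    // For each event, scan the tag list and emit one output row per matching tag.
    // No shuffle is needed: the tags are broadcast, not joined by key.
    PCollection<TableRow> tagged = events.apply("TagEvents",
        ParDo.of(new DoFn<TableRow, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            TableRow event = c.element();
            for (TableRow tag : c.sideInput(tagsView)) {
              if (matches(event, tag)) {
                c.output(event.clone().set("tag", tag.get("tag_name")));
              }
            }
          }
        }).withSideInputs(tagsView));

    // Write back to BigQuery; CREATE_NEVER assumes the output table already exists.
    tagged.apply("WriteTagged", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.tagged_events")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }

  // Placeholder predicate: does this event's properties satisfy this tag's rule?
  static boolean matches(TableRow event, TableRow tag) {
    return tag.get("property").equals(event.get("property"));
  }
}
```

With 100,000 tag rows the side input should fit comfortably in worker memory; each worker then tags events in parallel without any cross-worker data movement.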
If your tags data is too large to fit in memory, you will most likely have to use a join; see the sketch below.
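One way to express such a join in the Beam Java SDK is CoGroupByKey, assuming you can derive a shared key from both tables (the `property` field below is a hypothetical example; a pure cross join has no such key, so this only applies if the matching rule can be keyed):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

public class TagEventsJoinPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<TableRow> events =
        p.apply("ReadEvents", BigQueryIO.readTableRows().from("my-project:my_dataset.events"));
    PCollection<TableRow> tags =
        p.apply("ReadTags", BigQueryIO.readTableRows().from("my-project:my_dataset.tags"));

    // Key both collections by the shared value so the shuffle only brings
    // together rows that could actually match.
    PCollection<KV<String, TableRow>> keyedEvents = events.apply("KeyEvents",
        WithKeys.of((TableRow r) -> (String) r.get("property"))
                .withKeyType(TypeDescriptors.strings()));
    PCollection<KV<String, TableRow>> keyedTags = tags.apply("KeyTags",
        WithKeys.of((TableRow r) -> (String) r.get("property"))
                .withKeyType(TypeDescriptors.strings()));

    final TupleTag<TableRow> eventsTag = new TupleTag<>();
    final TupleTag<TableRow> tagsTag = new TupleTag<>();

    // CoGroupByKey shuffles each input once, then groups the rows from
    // both tables that share the same key.
    KeyedPCollectionTuple.of(eventsTag, keyedEvents)
        .and(tagsTag, keyedTags)
        .apply(CoGroupByKey.create())
        .apply("EmitTaggedEvents", ParDo.of(new DoFn<KV<String, CoGbkResult>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Pair every event in this key group with every tag in the group.
            for (TableRow event : c.element().getValue().getAll(eventsTag)) {
              for (TableRow tag : c.element().getValue().getAll(tagsTag)) {
                c.output(event.clone().set("tag", tag.get("tag_name")));
              }
            }
          }
        }));
    // ... then write the result with BigQueryIO.writeTableRows() as in the previous sketch.

    p.run();
  }
}
```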
Source: https://stackoverflow.com/questions/33254689/best-strategy-for-joining-two-large-datasets