Tag huge list of elements with lat/long with large list of geolocation data

爱⌒轻易说出口 提交于 2019-12-06 08:12:51
Jeremy Lewi

Using Dataflow you can do a cross join pretty easily using CoGroupByKey. Using this approach only the Event and POI you are joining would need to fit in memory (Dataflow will automatically spill to disk if the list of items for a given key is too large to fit in memory).

Here's some more detail.

  • Create a PCollection of events keyed by latitude and longitude.
  • Create a PCollection of POI keyed by latitude and longitude
  • Use a CoGroupByKey to join the two PCollections.
  • Write a DoFn that processes the CoGbkResult
  • The DoFn would look something like:

    
    PCollection<T> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
      new DoFn<KV<K, CoGbkResult>, T>() {
        @Override
        public void processElement(ProcessContext c) {
          KV<K, CoGbkResult> e = c.element();
          // Get all collection 1 values
          Iterable<Event> eventVals = e.getValue().getAll(eventTag);
          // Now get collection 2 values
          Iterable<Poi> poiVals = e.getValue().getAll(poiTag);
          for (Event e : eventVals) {
            for (Poi p : poiVal) {
              ...
              c.output(...tagged event...);
            }
          }
        }
      }));
    

As discussed in this Answer you could also use a side input to pass a map whose keys were latitude and longitude and the values were the details of a POI. That approach will work if the data can fit in memory. If you only have 1 million POI and you are only storing the fields listed it will probably fit in memory.

Note: I'm on the Dataflow team.

Felipe Hoffa

Can you open up the dataset containing the open street map table? (it's open data after all). With that table I can try to optimize the query.

For a similar question, I optimized it via generating keys for each row that can be used to avoid having to CROSS JOIN over the whole dataset.

http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

StackO: How to improve performance of GeoIP query in BigQuery?

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!