How to filter datastore data before mapping to cloud storage using the MapReduce API?


Question


Regarding the codelab here, how can we filter datastore data within the MapReduce job itself, rather than fetching all objects of a given entity kind?

In the mapper pipeline definition below, the only input reader parameter is the entity kind to process, and I can't see any filter-like parameter in the InputReader class that could help.

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)
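
For context, "main.datastore_map" referenced above is the codelab's mapper function. A minimal sketch of what such a mapper might look like, assuming a simple JSON-lines serialization (the actual codelab serializes entities to match its BigQuery schema):

import json

from google.appengine.ext import db

def datastore_map(entity):
    # DatastoreInputReader calls this once per entity; whatever we
    # yield is written to Google Cloud Storage by FileOutputWriter.
    data = db.to_dict(entity)
    yield "%s\n" % json.dumps(data, default=str)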

Since Google BigQuery plays better with a denormalized data model, it would be nice to be able to build one table from several datastore entity kinds (the equivalent of a JOIN), but I can't see how to do that either.


Answer 1:


Depending on your application, you might be able to solve this by passing a filters parameter, which the documentation describes as "an optional list of filters to apply to the query. Each filter is a tuple: (<property_name_as_str>, <query_operation_as_str>, <value>)."

So, in your input reader parameters:

"input_reader":{
          "entity_kind": entity_type,
          "filters": [("datastore_property", "=", 12345),
                      ("another_datastore_property", ">", 200)]
}
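
Plugged into the pipeline from the question, a filtered version might look like the following (the property names and values here are placeholders; only entities matching all filters reach the mapper, and, as with any datastore query, inequality filters are limited to a single property):

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
            # Restricts the scan so the mapper never sees non-matching entities.
            "filters": [("datastore_property", "=", 12345),
                        ("another_datastore_property", ">", 200)],
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)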


Source: https://stackoverflow.com/questions/11849456/how-to-filter-datastore-data-before-mapping-to-cloud-storage-using-the-mapreduce
