How to filter datastore data before mapping to cloud storage using the MapReduce API?


Question


Regarding the codelab here, how can we filter datastore data within the MapReduce job itself, rather than fetching all objects of a given entity kind?

In the mapper pipeline definition below, the only input reader parameter is the entity kind to process, and I can't see any filter-like parameter in the InputReader class that could help.

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)
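
For context, "main.datastore_map" referenced above is the codelab's mapper function. A minimal sketch of what such a mapper might look like, assuming a simple JSON-lines serialization (the actual codelab serializes entities to match its BigQuery schema):

import json

from google.appengine.ext import db

def datastore_map(entity):
    # DatastoreInputReader calls this once per entity; whatever we
    # yield is written to Google Cloud Storage by FileOutputWriter.
    data = db.to_dict(entity)
    yield "%s\n" % json.dumps(data, default=str)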

Since Google BigQuery plays better with a denormalized data model, it would be nice to be able to build one table from several datastore entity kinds (the equivalent of a JOIN), but I can't see how to do that either.


Answer 1:


Depending on your application, you might be able to solve this by passing a filters parameter, which the documentation describes as "an optional list of filters to apply to the query. Each filter is a tuple: (<property_name_as_str>, <query_operation_as_str>, <value>)."

So, in your input reader parameters:

"input_reader":{
          "entity_kind": entity_type,
          "filters": [("datastore_property", "=", 12345),
                      ("another_datastore_property", ">", 200)]
}
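
Plugged into the pipeline from the question, a filtered version might look like the following (the property names and values here are placeholders; only entities matching all filters reach the mapper, and, as with any datastore query, inequality filters are limited to a single property):

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
            # Restricts the scan so the mapper never sees non-matching entities.
            "filters": [("datastore_property", "=", 12345),
                        ("another_datastore_property", ">", 200)],
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)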


Source: https://stackoverflow.com/questions/11849456/how-to-filter-datastore-data-before-mapping-to-cloud-storage-using-the-mapreduce
