Apache Beam Google Datastore ReadFromDatastore entity protobuf


Apache Beam 2.13 (the latest release at the time of writing) deprecates this old approach based on the googledatastore library and adds a new implementation that uses the newer, more human-friendly google-cloud-datastore library.

https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.datastore.v1new.datastoreio.html

https://github.com/apache/beam/pull/8262

There's still an open issue to add an example, so for now you'll have to work that part out yourself; a rough sketch of what it might look like follows the issue link below.

https://issues.apache.org/jira/browse/BEAM-7350
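In the meantime, here is an untested sketch of what a v1new pipeline might look like, based only on the pydoc linked above. The project/kind values and the ToDict step are placeholders of mine, not anything from the Beam docs.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

project = 'your-project-id'  # placeholder
kind = 'YourKind'            # placeholder

with beam.Pipeline(options=PipelineOptions()) as p:
    # In v1new the query is a plain types.Query object, not a protobuf.
    query = Query(kind=kind, project=project)
    entities = p | 'ReadFromDatastore' >> ReadFromDatastore(query)
    # Each element is a v1new types.Entity; to_client_entity() returns a
    # google.cloud.datastore.entity.Entity, which behaves like a dict.
    rows = entities | 'ToDict' >> beam.Map(lambda e: dict(e.to_client_entity()))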

I was getting the same issue and the accepted answer didn't work for me.

The OP has 3 questions:

1. Is there something I can do to convert an entity_pb2.Entity to something usable?

You don't say exactly what difficulty you're having with the returned value, but every entity_pb2.Entity instance has a properties attribute, and you can use it to get values out of your entity, e.g. property_value = entity.properties.get('<your_property_name>')


Update: I think I now know what the OP meant by "usable": even after property_value = entity.properties.get('<your_property_name>'), the value in property_value is still in protocol buffer format. So to get a plain dict of properties you can do this...

from googledatastore import helper

# Convert each protobuf Value into a plain Python value, keyed by property name.
value_dict = {
    prop_name: helper.get_value(prop_value)
    for prop_name, prop_value in entity.properties.items()
}

2. Is the ReadFromDatastore just too new for real use right now?

I too initially thought the same but I seem to have it working now (see my answer to Q3 below).

3. Is there another approach I should be using?

You absolutely must not import the google-cloud-datastore library into your project. Doing so causes the TypeError: Couldn't build proto file into descriptor pool! error from your original question to be raised as soon as you import ReadFromDatastore from apache_beam.

From the investigation/debugging I've done, it seems the current version of the apache-beam library (v2.8.0) is simply incompatible with the google-cloud-datastore library (v1.7.1). This means we must use the bundled googledatastore library (v7.0.1) instead to achieve what we want.
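In practical terms that means pinning the combination this answer found to work, roughly like the sketch below (the exact pins are my assumption, taken from the versions mentioned above):

# requirements.txt sketch: versions reported above as compatible
apache-beam[gcp]==2.8.0
googledatastore==7.0.1
# deliberately no google-cloud-datastore here; importing it alongside the
# v1 datastoreio is what triggers the descriptor pool TypeError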

Further reading / reference(s):

https://cloud.google.com/blog/products/gcp/how-to-do-data-processing-and-analytics-from-google-app-engine-with-google-cloud-dataflow

https://github.com/amygdala/gae-dataflow

https://gcloud-python.readthedocs.io/en/0.10.0/_modules/gcloud/datastore/helpers.html

You can use the function google.cloud.datastore.helpers.entity_from_protobuf to convert entity_pb2.Entity to google.cloud.datastore.entity.Entity.

google.cloud.datastore.entity.Entity is a subclass of dict and will give you the usability you need.
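A minimal sketch of how that might slot into a pipeline follows; the project/kind placeholders and step names are mine, and the warning above about mixing google-cloud-datastore with apache-beam 2.8.0 still applies.

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.datastore.helpers import entity_from_protobuf
from google.cloud.proto.datastore.v1 import query_pb2

project = 'your-project-id'          # placeholder
query = query_pb2.Query()
query.kind.add().name = 'YourKind'   # placeholder kind

with beam.Pipeline() as p:
    entities_pb = p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)
    # entity_from_protobuf turns each entity_pb2.Entity into a
    # google.cloud.datastore.entity.Entity, which subclasses dict.
    rows = entities_pb | 'FromProtobuf' >> beam.Map(entity_from_protobuf)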

An alternative (and easier) way to specify the query is the following:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud import datastore
from google.cloud.datastore import query as datastore_query

project = 'your-project-id'  # placeholder
kind = 'YourKind'            # placeholder
pipeline_options = PipelineOptions()

p = beam.Pipeline(options=pipeline_options)

# Build the query with the google-cloud-datastore client...
ds_client = datastore.Client(project=project)
query = ds_client.query(kind=kind)
# Optional filters: query.add_filter('column', 'operator', criteria)
# query.add_filter('age', '>', 18)
# query.add_filter('name', '=', 'John')

# ...then convert it to the protobuf form that ReadFromDatastore expects.
query_pb = datastore_query._pb_from_query(query)

p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query_pb)
p.run().wait_until_finish()

When submitting the job to the DataflowRunner (in the cloud), make sure your local requirements line up with the setup.py file you submit to Google Cloud. In my experience you must install apache-beam 2.1.0 on your local machine and then specify the same version in your setup.py file for it to work on the cloud workers.
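For example, a minimal setup.py along these lines (the package name is a placeholder of mine; the Beam pin is the version that worked here):

import setuptools

setuptools.setup(
    name='my-dataflow-job',          # placeholder
    version='0.0.1',
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.1.0',   # pin to the same version installed locally
    ],
)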
