Question
I am trying to use Apache Beam's Google Datastore API to read entities with ReadFromDatastore:
p = beam.Pipeline(options=options)
(p
| 'Read from Datastore' >> ReadFromDatastore(gcloud_options.project, query)
| 'reformat' >> beam.Map(reformat)
| 'Write To Datastore' >> WriteToDatastore(gcloud_options.project))
The object passed to my reformat function has type
google.cloud.proto.datastore.v1.entity_pb2.Entity
It is a protobuf message, which is hard to read or modify.
I think I can convert an entity_pb2.Entity to a dict with
entity = dict(google.cloud.datastore.helpers._property_tuples(entity_pb))
But importing the following two libraries in the same module raises an error:
import google.cloud.datastore.helpers
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
Error:
Traceback (most recent call last):
File "/home/nburn42/MotoGarage/MotoGarage/MotoGarageBackgroundJobs/format_data.py", line 16, in <module>
import google.cloud.datastore.helpers
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/__init__.py", line 57, in <module>
from google.cloud.datastore.batch import Batch
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/batch.py", line 24, in <module>
from google.cloud.datastore import helpers
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/helpers.py", line 29, in <module>
from google.cloud.grpc.datastore.v1 import entity_pb2 as _entity_pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/grpc/datastore/v1/entity_pb2.py", line 28, in <module>
dependencies=[google_dot_api_dot_annotations__pb2.DESCRIPTOR,google_dot_protobuf_dot_struct__pb2.DESCRIPTOR,google_dot_protobuf_dot_timestamp__pb2.DESCRIPTOR,google_dot_type_dot_latlng__pb2.DESCRIPTOR,])
File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 824, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "google/cloud/grpc/datastore/v1/entity.proto":
google.datastore.v1.PartitionId.project_id: "google.datastore.v1.PartitionId.project_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.PartitionId.namespace_id: "google.datastore.v1.PartitionId.namespace_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.PartitionId: "google.datastore.v1.PartitionId" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.partition_id: "google.datastore.v1.Key.partition_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.path: "google.datastore.v1.Key.path" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.PathElement.id_type: "google.datastore.v1.Key.PathElement.id_type" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.PathElement.kind: "google.datastore.v1.Key.PathElement.kind" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.PathElement.id: "google.datastore.v1.Key.PathElement.id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.PathElement.name: "google.datastore.v1.Key.PathElement.name" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.PathElement: "google.datastore.v1.Key.PathElement" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key: "google.datastore.v1.Key" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.ArrayValue.values: "google.datastore.v1.ArrayValue.values" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.ArrayValue: "google.datastore.v1.ArrayValue" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.value_type: "google.datastore.v1.Value.value_type" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.null_value: "google.datastore.v1.Value.null_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.boolean_value: "google.datastore.v1.Value.boolean_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.integer_value: "google.datastore.v1.Value.integer_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.double_value: "google.datastore.v1.Value.double_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.timestamp_value: "google.datastore.v1.Value.timestamp_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.key_value: "google.datastore.v1.Value.key_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.string_value: "google.datastore.v1.Value.string_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.blob_value: "google.datastore.v1.Value.blob_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.geo_point_value: "google.datastore.v1.Value.geo_point_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.entity_value: "google.datastore.v1.Value.entity_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.array_value: "google.datastore.v1.Value.array_value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.meaning: "google.datastore.v1.Value.meaning" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value.exclude_from_indexes: "google.datastore.v1.Value.exclude_from_indexes" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Value: "google.datastore.v1.Value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity.key: "google.datastore.v1.Entity.key" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity.properties: "google.datastore.v1.Entity.properties" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity.PropertiesEntry.key: "google.datastore.v1.Entity.PropertiesEntry.key" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity.PropertiesEntry.value: "google.datastore.v1.Entity.PropertiesEntry.value" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity.PropertiesEntry: "google.datastore.v1.Entity.PropertiesEntry" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Entity: "google.datastore.v1.Entity" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
google.datastore.v1.Key.partition_id: "google.datastore.v1.PartitionId" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Key.path: "google.datastore.v1.Key.PathElement" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.ArrayValue.values: "google.datastore.v1.Value" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Value.key_value: "google.datastore.v1.Key" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Value.entity_value: "google.datastore.v1.Entity" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Value.array_value: "google.datastore.v1.ArrayValue" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Entity.PropertiesEntry.value: "google.datastore.v1.Value" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Entity.key: "google.datastore.v1.Key" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
google.datastore.v1.Entity.properties: "google.datastore.v1.Entity.PropertiesEntry" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/grpc/datastore/v1/entity.proto". To use it here, please add the necessary import.
Is there something I can do to convert an entity_pb2.Entity to something usable?
Is ReadFromDatastore just too new for real use right now?
Is there another approach I should be using?
Thanks,
Nathan
Answer 1:
As of Apache Beam 2.13, the old approach based on the googledatastore library is deprecated, and a new implementation (the v1new module) uses the newer and more human-friendly google-cloud-datastore library.
https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.datastore.v1new.datastoreio.html
https://github.com/apache/beam/pull/8262
There's still an open issue to add an example, so for now you'll have to figure that part out.
https://issues.apache.org/jira/browse/BEAM-7350
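Until an official example exists, the following is a minimal sketch of how the v1new transforms might be wired together, based on the pydoc linked above. The helper names uppercase_name, reformat and run, and the "name" property, are hypothetical; the Beam imports live inside the functions so that the pure helper can be exercised without Beam installed. Treat it as a starting point, not a verified pipeline.

```python
def uppercase_name(properties):
    """Pure transformation on a plain dict of entity properties.

    The "name" property is a hypothetical example field.
    """
    updated = dict(properties)
    if isinstance(updated.get("name"), str):
        updated["name"] = updated["name"].upper()
    return updated


def reformat(entity):
    """Build a new v1new Entity carrying the transformed properties."""
    # Imported lazily so this module loads even where Beam is not installed.
    from apache_beam.io.gcp.datastore.v1new.types import Entity

    new_entity = Entity(entity.key,
                        exclude_from_indexes=entity.exclude_from_indexes)
    new_entity.set_properties(uppercase_name(entity.properties))
    return new_entity


def run(project, kind, options=None):
    """Read every entity of `kind`, reformat it, and write it back."""
    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import (
        ReadFromDatastore, WriteToDatastore)
    from apache_beam.io.gcp.datastore.v1new.types import Query

    with beam.Pipeline(options=options) as p:
        (p
         | "Read from Datastore" >> ReadFromDatastore(
             Query(kind=kind, project=project))
         | "Reformat" >> beam.Map(reformat)
         | "Write to Datastore" >> WriteToDatastore(project))
```

With v1new, the elements reaching reformat are Beam's own Entity wrappers whose properties attribute is a plain dict, so no protobuf handling should be needed in the map step.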
Answer 2:
I was getting the same issue and the accepted answer didn't work for me.
The OP has 3 questions:
1. Is there something I can do to convert an entity_pb2.Entity to something usable?
You don't specify exactly what difficulty you are having with the returned value, but every instance of entity_pb2.Entity should have a properties attribute. You should be able to use that to get the values out of your entity, e.g. property_value = entity.properties.get('<your_property_name>')
Update: I think I now know what the OP meant by "usable": even when you do property_value = entity.properties.get('<your_property_name>'), the value you get in property_value is still in protocol buffer format. To get a dict of properties you can do this:
from googledatastore import helper
value_dict = dict((prop_name, helper.get_value(entity.properties.get(prop_name)),) for prop_name in entity.properties)
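The same conversion can be wrapped in a small function for use in the reformat step. This is a sketch under assumptions: entity_to_dict is a hypothetical name, and the value converter is passed in as a parameter so the example runs without googledatastore installed. fake_entity and fake_get_value are stand-ins invented for the demonstration; in a real pipeline you would pass helper.get_value from googledatastore, as in the snippet above.

```python
from types import SimpleNamespace


def entity_to_dict(entity, get_value):
    """Return {property_name: python_value} for an entity-like object.

    `entity.properties` must behave like a mapping of name -> Value message;
    `get_value` converts one Value message into a plain Python value.
    """
    return {name: get_value(value_pb)
            for name, value_pb in entity.properties.items()}


# Stand-in for a protobuf Entity (hypothetical, for demonstration only).
fake_entity = SimpleNamespace(properties={
    "name": SimpleNamespace(string_value="John"),
    "age": SimpleNamespace(integer_value=18),
})


def fake_get_value(value_pb):
    """Hypothetical stand-in for googledatastore's helper.get_value."""
    # Each fake Value holds exactly one attribute; return its content.
    return next(iter(vars(value_pb).values()))


row = entity_to_dict(fake_entity, fake_get_value)
```

Inside the pipeline this would be used as beam.Map(lambda e: entity_to_dict(e, helper.get_value)), assuming the helper import shown above works in your environment.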
2. Is the ReadFromDatastore just too new for real use right now?
I too initially thought the same but I seem to have it working now (see my answer to Q3 below).
3. Is there another approach I should be using?
You absolutely must not import the google-cloud-datastore library into your project. Doing so causes the TypeError: Couldn't build proto file into descriptor pool! error from your original question to be raised when you import ReadFromDatastore from apache_beam.
From the investigation and debugging I've been doing, it seems that the current version of the apache-beam library (v2.8.0) is simply incompatible with the google-cloud-datastore library (v1.7.1). This means we must use the bundled googledatastore library (v7.0.1) to achieve what we want.
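To see which of these libraries are present in an environment, a quick check along these lines may help. It is only a sketch: it assumes Python 3.8+ (for importlib.metadata), the function names are hypothetical, and it reports installed versions without proving whether a given combination actually conflicts.

```python
from importlib import metadata


def installed_versions(dist_names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def both_installed(versions, first, second):
    """True when both distributions appear with a version in `versions`."""
    return versions.get(first) is not None and versions.get(second) is not None


if __name__ == "__main__":
    # Distribution names taken from the answer above.
    found = installed_versions(
        ["apache-beam", "google-cloud-datastore", "googledatastore"])
    for name, version in found.items():
        print(name, version or "not installed")
    if both_installed(found, "apache-beam", "google-cloud-datastore"):
        print("Both apache-beam and google-cloud-datastore are installed; "
              "check that their versions are compatible.")
```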
Further reading / reference(s):
https://cloud.google.com/blog/products/gcp/how-to-do-data-processing-and-analytics-from-google-app-engine-with-google-cloud-dataflow
https://github.com/amygdala/gae-dataflow
https://gcloud-python.readthedocs.io/en/0.10.0/_modules/gcloud/datastore/helpers.html
Answer 3:
You can use the function google.cloud.datastore.helpers.entity_from_protobuf to convert an entity_pb2.Entity to a google.cloud.datastore.entity.Entity.
google.cloud.datastore.entity.Entity is a subclass of dict and will give you the usability you need.
Answer 4:
An alternative (and easier) way to specify the query is the following:
import apache_beam as beam
from google.cloud import datastore
from google.cloud.datastore import query as datastore_query
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
# pipeline_options, project and kind are assumed to be defined elsewhere
p = beam.Pipeline(options=pipeline_options)
ds_client = datastore.Client(project=project)
query = ds_client.query(kind=kind)
# possible filter: query.add_filter('column', 'operator', criteria)
# query.add_filter('age', '>', 18)
# query.add_filter('name', '=', 'John')
# convert the client-library query into the protobuf query that ReadFromDatastore expects
query = datastore_query._pb_from_query(query)
p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)
p.run().wait_until_finish()
When submitting the job to the DataflowRunner (in the cloud), make sure your local requirements match the setup.py file you send to Google Cloud. In my experience, you must install Apache Beam 2.1.0 on your local machine and specify the same version in your setup.py file for the job to work on the cloud workers.
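One way to catch a mismatch before submitting is to compare the pin in setup.py against the locally installed version. The sketch below is an assumption-laden illustration: INSTALL_REQUIRES stands in for the install_requires list of a hypothetical setup.py, the function names are made up, and only exact "name==version" pins are handled.

```python
def pinned_version(install_requires, dist_name):
    """Return the version from an exact 'name==version' pin, or None."""
    for requirement in install_requires:
        name, sep, version = requirement.partition("==")
        # Drop extras such as "[gcp]" before comparing names.
        base = name.split("[", 1)[0].strip().lower()
        if sep and base == dist_name.lower():
            return version.strip()
    return None


def matches_local(install_requires, dist_name, local_version):
    """True when the pin for `dist_name` equals the local version string."""
    return pinned_version(install_requires, dist_name) == local_version


# Hypothetical contents of install_requires in setup.py.
INSTALL_REQUIRES = ["apache-beam[gcp]==2.1.0"]
```

The local version string can be read from apache_beam.__version__ and passed as local_version, for example matches_local(INSTALL_REQUIRES, "apache-beam", apache_beam.__version__).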
Source: https://stackoverflow.com/questions/44117621/apache-beam-google-datastore-readfromdatastore-entity-protobuf