google-cloud-dataflow

Connecting to Cloud SQL from Dataflow Job

Submitted by 白昼怎懂夜的黑 on 2020-08-06 12:44:26
Question: I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project. I'm getting the following error: java.sql.SQLException: Cannot create PoolableConnectionFactory (Communications link failure. The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.) According to the documentation, the Dataflow service account *@dataflow-service-producer-prod.iam…
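
A note on the usual resolution: a "Communications link failure" generally means the worker cannot reach the instance over plain TCP. A commonly recommended route is the Cloud SQL JDBC socket factory, which tunnels the connection through the Cloud SQL API using the job's service account. A minimal sketch, assuming MySQL, the mysql-socket-factory artifact on the classpath, and placeholder instance, database, and credential values:

```java
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Read rows via the Cloud SQL socket factory instead of a direct TCP socket.
PCollection<KV<String, String>> rows =
    pipeline.apply(JdbcIO.<KV<String, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://google/my_db"
                    + "?cloudSqlInstance=my-project:us-central1:my-instance"
                    + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
            .withUsername("db_user")
            .withPassword("db_password"))
        .withQuery("SELECT id, name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) rs ->
            KV.of(rs.getString("id"), rs.getString("name")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
```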

How to load data into a nested array using Dataflow

Submitted by 自作多情 on 2020-08-06 05:18:08
Question: I am trying to load data into the table below. I am able to load the data into "array_data", but how do I load the data into the nested array "inside_array"? I have tried the commented-out part to load the data into inside_array, but it did not work. Here is my code: Pipeline p = Pipeline.create(options); org.apache.beam.sdk.values.PCollection<TableRow> output = p.apply(org.apache.beam.sdk.transforms.Create.of("temp")) .apply("O/P", ParDo.of(new DoFn<String, TableRow>() { /**…
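
In the BigQuery client model, a nested repeated record is simply a List of TableRows stored inside another TableRow, so the inner array is built the same way as the outer one. A minimal sketch reusing the question's column names (the leaf field names "name" and "value" are placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;

// Inside the DoFn's @ProcessElement method, as in the question's code:
TableRow inner1 = new TableRow().set("value", "a");
TableRow inner2 = new TableRow().set("value", "b");

// Each element of the outer array is itself a TableRow whose
// "inside_array" field is a List of TableRows (the nested array).
TableRow outerElement = new TableRow()
    .set("name", "first")
    .set("inside_array", Arrays.asList(inner1, inner2));

TableRow row = new TableRow().set("array_data", Arrays.asList(outerElement));
c.output(row);
```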

Local Pubsub Emulator won't work with Dataflow

Submitted by 你离开我真会死。 on 2020-07-23 08:08:08
Question: I am developing a Dataflow pipeline in Java whose input comes from Pub/Sub. Later, I saw a guide here on how to use the local Pub/Sub emulator so I would not need to deploy to GCP in order to test. Here is my simple code: private interface Options extends PipelineOptions, PubsubOptions, StreamingOptions { @Description("Pub/Sub topic to read messages from") String getTopic(); void setTopic(String topic); @Description("Pub/Sub subscription to read messages from") String getSubscription(); void setSubscription…
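
For local testing with the DirectRunner, the usual trick is to override the Pub/Sub root URL on PubsubOptions so the client talks to the emulator rather than the production endpoint (the Dataflow service itself always uses real Pub/Sub, so the emulator only helps locally). A minimal sketch, assuming the emulator listens on localhost:8085:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
// Redirect the Pub/Sub client away from the production endpoint;
// Options extends PubsubOptions, so the setter is available directly.
options.setPubsubRootUrl("http://localhost:8085");

Pipeline p = Pipeline.create(options);
p.apply("ReadMessages",
    PubsubIO.readStrings().fromSubscription(options.getSubscription()));
```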

Apache Beam: Refreshing a side input read from MongoDB using MongoDbIO.read()

Submitted by 白昼怎懂夜的黑 on 2020-06-29 04:20:09
Question: I am reading a PCollection mongodata from MongoDB and using this PCollection as a side input to my ParDo(DoFn).withSideInputs(PCollection). On the backend, the MongoDB collection is updated on a daily, monthly, or maybe yearly basis, and I need the newly added values in my pipeline. You can think of this as refreshing the Mongo collection's values in a running pipeline. For example, if the Mongo collection has 20K documents in total and after one day three more records are added to the collection…
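
Side inputs are not refreshed on their own; the documented workaround is the "slowly updating side input" pattern, where a GenerateSequence ticker periodically re-queries the source and re-emits the view in the global window. A sketch, assuming a hypothetical readAllFromMongo() helper built on the plain MongoDB Java driver (MongoDbIO.read() can only run at pipeline start, so the DoFn re-queries instead):

```java
import java.util.List;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.bson.Document;
import org.joda.time.Duration;

PCollectionView<List<Document>> mongoView = p
    // Emit one tick per day; each tick triggers a fresh query.
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1)))
    .apply(ParDo.of(new DoFn<Long, List<Document>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.output(readAllFromMongo()); // re-read the (possibly grown) collection
      }
    }))
    // Re-emit the singleton view in the global window on every firing.
    .apply(Window.<List<Document>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply(View.asSingleton());
```

The resulting mongoView is then passed to the main ParDo via .withSideInputs(mongoView) exactly as before; downstream workers see the refreshed contents after each firing.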

Which Compute Engine quotas need to be updated to run Dataflow with 50 workers (IN_USE_ADDRESSES, CPUS, CPUS_ALL_REGIONS ..)?

Submitted by 依然范特西╮ on 2020-06-28 03:51:43
Question: We are using a private GCP account and we would like to process 30 GB of data and do NLP processing using spaCy. We wanted to use more workers, and we decided to start with a maximum number of workers of 80, as shown below. We submitted our job and ran into issues with some of the standard GCP user quotas: QUOTA_EXCEEDED: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 in region XXX. So I decided to request new quotas of 50 for IN_USE_ADDRESSES in some regions (it took me a few iterations to find…
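
For orientation, the quotas interact roughly like this: each worker with an external IP consumes one IN_USE_ADDRESSES in the job's region, and CPUS (regional) plus CPUS_ALL_REGIONS must cover workers times cores per machine type. A hedged sketch of the relevant worker-pool options; the subnetwork name is a placeholder, and running without public IPs assumes Private Google Access is enabled on that subnet:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions opts = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
opts.setRunner(DataflowRunner.class);
opts.setMaxNumWorkers(50);   // autoscaling ceiling; drives CPUS usage in the job's region
opts.setUsePublicIps(false); // no external IPs, so IN_USE_ADDRESSES is not consumed
opts.setSubnetwork("regions/us-central1/subnetworks/my-subnet"); // placeholder
```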

Dataflow / Apache Beam: Trigger window on number of bytes in window

Submitted by 老子叫甜甜 on 2020-06-27 15:14:52
Question: I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic carrying many different message types of varying sizes. I want the result to be vertically partitioned in GCS accordingly: Schema/version/year/month/day/. Under that parent key there should be a group of files for that day, and the files should be a reasonable size, i.e. 10-200 MB. I'm using Scio, and I am able to do a groupBy operation to make an SCollection of [String, Iterable[Event]] where the key is based on the…
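
Beam triggers cannot fire on accumulated bytes, so the usual compromise, in Scio as in plain Beam, is fixed windows plus a dynamic file write, tuning window length and shard count until files land in the target size band. A Java sketch of the underlying Beam transform, with a hypothetical Event type that exposes schema/version/date accessors and a toJson() method:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

events
    .apply(Window.<Event>into(FixedWindows.of(Duration.standardMinutes(10))))
    .apply(FileIO.<String, Event>writeDynamic()
        // Partition key: Schema/version/year/month/day, per the question.
        .by(e -> e.getSchema() + "/" + e.getVersion() + "/" + e.getDate())
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(Event::toJson), TextIO.sink())
        .to("gs://my-bucket/events")
        .withNaming(key -> FileIO.Write.defaultNaming(key + "/part", ".json"))
        .withNumShards(20)); // tune window length x shards toward 10-200 MB files
```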

AttributeError: 'module' object has no attribute 'ensure_str'

Submitted by 最后都变了- on 2020-06-27 11:05:28
Question: I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception: AttributeError: 'module' object has no attribute 'ensure_str' Traceback for above exception (most recent call last): File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 197, in wrapper return fun(*args, **kwargs) File "…
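
This particular AttributeError usually points at an outdated six package: six.ensure_str only exists from six 1.12.0 onward, and the traceback shows apache_beam reaching it during the BigQuery query-location lookup. A quick diagnostic sketch in Python, assuming that is the cause here:

```python
# six.ensure_str was added in six 1.12.0; older versions raise exactly
# this AttributeError when apache_beam calls it.
import six

print(six.__version__)              # needs to be >= 1.12.0
print(hasattr(six, "ensure_str"))   # False reproduces the failure
```

If the check fails, upgrading with pip install --upgrade "six>=1.12.0" (and pinning it in the job's requirements file so remote workers pick it up too) is the usual fix.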
