google-cloud-dataflow

Connecting to Cloud SQL from Dataflow Job

Submitted by 白昼怎懂夜的黑 on 2020-08-06 12:44:26
Question: I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project. I'm getting the following error: java.sql.SQLException: Cannot create PoolableConnectionFactory (Communications link failure. The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.) According to the documentation, the Dataflow service account *@dataflow-service-producer-prod.iam…
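
A note on the usual resolution: a "Communications link failure" generally means the worker cannot reach the instance over plain TCP. A commonly recommended route is the Cloud SQL JDBC socket factory, which tunnels the connection through the Cloud SQL API using the job's service account. A minimal sketch, assuming MySQL, the mysql-socket-factory artifact on the classpath, and placeholder instance, database, and credential values:

```java
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Read rows via the Cloud SQL socket factory instead of a direct TCP socket.
PCollection<KV<String, String>> rows =
    pipeline.apply(JdbcIO.<KV<String, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://google/my_db"
                    + "?cloudSqlInstance=my-project:us-central1:my-instance"
                    + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
            .withUsername("db_user")
            .withPassword("db_password"))
        .withQuery("SELECT id, name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) rs ->
            KV.of(rs.getString("id"), rs.getString("name")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
```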

How to load data into a nested array using Dataflow

Submitted by 自作多情 on 2020-08-06 05:18:08
Question: I am trying to load data into the table below. I am able to load the data into "array_data", but how do I load the data into the nested array "inside_array"? I have tried the commented-out part to load the data into inside_array, but it did not work. Here is my code: Pipeline p = Pipeline.create(options); org.apache.beam.sdk.values.PCollection<TableRow> output = p.apply(org.apache.beam.sdk.transforms.Create.of("temp")) .apply("O/P", ParDo.of(new DoFn<String, TableRow>() { /**…
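
In the BigQuery client model, a nested repeated record is simply a List of TableRows stored inside another TableRow, so the inner array is built the same way as the outer one. A minimal sketch reusing the question's column names (the leaf field names "name" and "value" are placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;

// Inside the DoFn's @ProcessElement method, as in the question's code:
TableRow inner1 = new TableRow().set("value", "a");
TableRow inner2 = new TableRow().set("value", "b");

// Each element of the outer array is itself a TableRow whose
// "inside_array" field is a List of TableRows (the nested array).
TableRow outerElement = new TableRow()
    .set("name", "first")
    .set("inside_array", Arrays.asList(inner1, inner2));

TableRow row = new TableRow().set("array_data", Arrays.asList(outerElement));
c.output(row);
```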

Local Pubsub Emulator won't work with Dataflow

Submitted by 你离开我真会死。 on 2020-07-23 08:08:08
Question: I am developing a Dataflow pipeline in Java whose input comes from Pub/Sub. Later, I saw a guide here on how to use the local Pub/Sub emulator so I would not need to deploy to GCP in order to test. Here is my simple code: private interface Options extends PipelineOptions, PubsubOptions, StreamingOptions { @Description("Pub/Sub topic to read messages from") String getTopic(); void setTopic(String topic); @Description("Pub/Sub subscription to read messages from") String getSubscription(); void setSubscription…
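
For local testing with the DirectRunner, the usual trick is to override the Pub/Sub root URL on PubsubOptions so the client talks to the emulator rather than the production endpoint (the Dataflow service itself always uses real Pub/Sub, so the emulator only helps locally). A minimal sketch, assuming the emulator listens on localhost:8085:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
// Redirect the Pub/Sub client away from the production endpoint;
// Options extends PubsubOptions, so the setter is available directly.
options.setPubsubRootUrl("http://localhost:8085");

Pipeline p = Pipeline.create(options);
p.apply("ReadMessages",
    PubsubIO.readStrings().fromSubscription(options.getSubscription()));
```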

Apache Beam: Refreshing a side input read from MongoDB using MongoDbIO.read()

Submitted by 白昼怎懂夜的黑 on 2020-06-29 04:20:09
Question: I am reading a PCollection mongodata from MongoDB and using this PCollection as a side input to my ParDo(DoFn).withSideInputs(PCollection). On the backend, the MongoDB collection is updated on a daily, monthly, or maybe yearly basis, and I need the newly added values in my pipeline. You can think of this as refreshing the Mongo collection's values in a running pipeline. For example, if the Mongo collection has 20K documents in total and after one day three more records are added to the collection…
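
Side inputs are not refreshed on their own; the documented workaround is the "slowly updating side input" pattern, where a GenerateSequence ticker periodically re-queries the source and re-emits the view in the global window. A sketch, assuming a hypothetical readAllFromMongo() helper built on the plain MongoDB Java driver (MongoDbIO.read() can only run at pipeline start, so the DoFn re-queries instead):

```java
import java.util.List;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.bson.Document;
import org.joda.time.Duration;

PCollectionView<List<Document>> mongoView = p
    // Emit one tick per day; each tick triggers a fresh query.
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1)))
    .apply(ParDo.of(new DoFn<Long, List<Document>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.output(readAllFromMongo()); // re-read the (possibly grown) collection
      }
    }))
    // Re-emit the singleton view in the global window on every firing.
    .apply(Window.<List<Document>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply(View.asSingleton());
```

The resulting mongoView is then passed to the main ParDo via .withSideInputs(mongoView) exactly as before; downstream workers see the refreshed contents after each firing.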

Which Compute Engine quotas need to be updated to run Dataflow with 50 workers (IN_USE_ADDRESSES, CPUS, CPUS_ALL_REGIONS ..)?

Submitted by 依然范特西╮ on 2020-06-28 03:51:43
Question: We are using a private GCP account and we would like to process 30 GB of data and do NLP processing using spaCy. We wanted to use more workers, and we decided to start with a maximum number of workers of 80, as shown below. We submitted our job and ran into issues with some of the standard GCP user quotas: QUOTA_EXCEEDED: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 in region XXX. So I decided to request new quotas of 50 for IN_USE_ADDRESSES in some regions (it took me a few iterations to find…
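
For orientation, the quotas interact roughly like this: each worker with an external IP consumes one IN_USE_ADDRESSES in the job's region, and CPUS (regional) plus CPUS_ALL_REGIONS must cover workers times cores per machine type. A hedged sketch of the relevant worker-pool options; the subnetwork name is a placeholder, and running without public IPs assumes Private Google Access is enabled on that subnet:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions opts = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
opts.setRunner(DataflowRunner.class);
opts.setMaxNumWorkers(50);   // autoscaling ceiling; drives CPUS usage in the job's region
opts.setUsePublicIps(false); // no external IPs, so IN_USE_ADDRESSES is not consumed
opts.setSubnetwork("regions/us-central1/subnetworks/my-subnet"); // placeholder
```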

Dataflow / Apache Beam: Trigger window on number of bytes in window

Submitted by 老子叫甜甜 on 2020-06-27 15:14:52
Question: I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic carrying many different message types of varying sizes. I want the result to be vertically partitioned in GCS accordingly: Schema/version/year/month/day/. Under that parent key there should be a group of files for that day, and the files should be a reasonable size, i.e. 10-200 MB. I'm using Scio, and I am able to do a groupBy operation to make an SCollection of [String, Iterable[Event]] where the key is based on the…
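
Beam triggers cannot fire on accumulated bytes, so the usual compromise, in Scio as in plain Beam, is fixed windows plus a dynamic file write, tuning window length and shard count until files land in the target size band. A Java sketch of the underlying Beam transform, with a hypothetical Event type that exposes schema/version/date accessors and a toJson() method:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

events
    .apply(Window.<Event>into(FixedWindows.of(Duration.standardMinutes(10))))
    .apply(FileIO.<String, Event>writeDynamic()
        // Partition key: Schema/version/year/month/day, per the question.
        .by(e -> e.getSchema() + "/" + e.getVersion() + "/" + e.getDate())
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(Event::toJson), TextIO.sink())
        .to("gs://my-bucket/events")
        .withNaming(key -> FileIO.Write.defaultNaming(key + "/part", ".json"))
        .withNumShards(20)); // tune window length x shards toward 10-200 MB files
```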

AttributeError: 'module' object has no attribute 'ensure_str'

Submitted by 最后都变了- on 2020-06-27 11:05:28
Question: I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception: AttributeError: 'module' object has no attribute 'ensure_str' Traceback for above exception (most recent call last): File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 197, in wrapper return fun(*args, **kwargs) File "…
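
This particular AttributeError usually points at an outdated six package: six.ensure_str only exists from six 1.12.0 onward, and the traceback shows apache_beam reaching it during the BigQuery query-location lookup. A quick diagnostic sketch in Python, assuming that is the cause here:

```python
# six.ensure_str was added in six 1.12.0; older versions raise exactly
# this AttributeError when apache_beam calls it.
import six

print(six.__version__)              # needs to be >= 1.12.0
print(hasattr(six, "ensure_str"))   # False reproduces the failure
```

If the check fails, upgrading with pip install --upgrade "six>=1.12.0" (and pinning it in the job's requirements file so remote workers pick it up too) is the usual fix.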
