google-cloud-dataflow

Usage problem: add_value_provider_argument in a streaming pipeline (Apache Beam / Python)

Submitted by 一曲冷凌霜 on 2020-06-17 02:28:47
Question: We want to create a custom Dataflow template using add_value_provider_argument, but we are unable to launch the template-creation command without supplying values for the variables declared with add_value_provider_argument():

    class UserOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_value_provider_argument(
                '--input_topic',
                help='The Cloud Pub/Sub topic to read from.\n'
                     '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".'
            )
            parser.add_value_provider_argument( '-
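For context, here is a minimal sketch of the usual ValueProvider pattern, under the assumption that the parameter is consumed inside a DoFn at execution time; UserOptions, TagWithTopicFn, run() and all values are illustrative placeholders, not taken from the original post. The key point is that the ValueProvider object is stored as-is and .get() is only called while the job runs, never while the template is built. Note also that not every built-in Python source accepts a ValueProvider argument, which is often the real blocker with streaming templates.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class UserOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # A runtime parameter: no value has to be supplied when the
            # template is created, only when the templated job is launched.
            parser.add_value_provider_argument(
                '--input_topic',
                help='Cloud Pub/Sub topic, '
                     '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".')

    class TagWithTopicFn(beam.DoFn):
        def __init__(self, topic_provider):
            # Store the ValueProvider itself, not its resolved value.
            self.topic_provider = topic_provider

        def process(self, element):
            # .get() may only be called at execution time, never while the
            # pipeline graph (or template) is being constructed.
            yield (self.topic_provider.get(), element)

    def run(argv=None):
        options = PipelineOptions(argv)
        user_options = options.view_as(UserOptions)
        with beam.Pipeline(options=options) as p:
            (p | beam.Create(['hello'])
               | beam.ParDo(TagWithTopicFn(user_options.input_topic))
               | beam.Map(print))

    if __name__ == '__main__':
        run()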

Dataflow GroupByKey and CoGroupByKey are very slow

Submitted by 我们两清 on 2020-05-26 10:17:51
Question: Dataflow works great for pipelines with simple transforms, but when we have complex transforms such as joins the performance is really bad. Answer 1: I wrote this question so I could answer it myself. What's happening under the hood: the data transferred by Dataflow between PCollections (serializable objects) may not live on a single machine. Furthermore, a transformation like GroupByKey/CoGroupByKey requires all the data for a key to be collected in one place before the result can be populated. Recently I was
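For reference, a minimal sketch of the kind of join the answer refers to; the collection names and keys are illustrative only. Every element with the same key has to be shuffled to the same worker before the grouped result can be emitted, which is why these transforms are sensitive to data volume and key skew.

    import apache_beam as beam

    with beam.Pipeline() as p:
        emails = p | 'Emails' >> beam.Create([('amy', 'amy@example.com'),
                                              ('carl', 'carl@example.com')])
        phones = p | 'Phones' >> beam.Create([('amy', '111-222-3333'),
                                              ('james', '444-555-6666')])

        # CoGroupByKey shuffles all values for a given key to one place,
        # then emits (key, {'emails': [...], 'phones': [...]}).
        joined = ({'emails': emails, 'phones': phones}
                  | 'Join' >> beam.CoGroupByKey())

        joined | beam.Map(print)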

Array type in ClickHouseIO for Apache Beam (Dataflow)

Submitted by 江枫思渺然 on 2020-05-17 07:55:26
Question: I am using Apache Beam to consume JSON and insert it into ClickHouse. I am currently having a problem with the Array data type. Everything works fine until I add an array-typed field:

    Schema.Field.of("inputs.value", Schema.FieldType.array(Schema.FieldType.INT64).withNullable(true))

Code for the transformations:

    p.apply(transformNameSuffix + "ReadFromPubSub",
            PubsubIO.readStrings()
                    .fromSubscription(chainConfig.getPubSubSubscriptionPrefix() + "transactions")
                    .withIdAttribute(PUBSUB_ID_ATTRIBUTE))

Why are increments not supported in the Dataflow-BigTable connector?

Submitted by 狂风中的少年 on 2020-05-13 08:14:32
Question: We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that have finished processing), for which we need the increment operation. Looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The reason stated is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know
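A hedged sketch of a workaround that is sometimes used: call Cloud Bigtable's increment (ReadModifyWrite) directly from a DoFn with the google-cloud-bigtable client instead of going through the Dataflow/HBase connector. All identifiers below (project, instance, table, column family, column, row key) are placeholders, and the caveat the question alludes to still applies: an increment is not idempotent, so a retried bundle can count the same element twice.

    import apache_beam as beam

    class IncrementCounterFn(beam.DoFn):
        """Increments a Bigtable counter cell once per processed element."""

        def __init__(self, project_id, instance_id, table_id):
            self.project_id = project_id
            self.instance_id = instance_id
            self.table_id = table_id
            self.table = None

        def setup(self):
            # Import here so the dependency is resolved on the workers.
            from google.cloud import bigtable
            client = bigtable.Client(project=self.project_id)
            self.table = client.instance(self.instance_id).table(self.table_id)

        def process(self, element):
            # ReadModifyWrite: atomically add 1 to a 64-bit counter cell.
            # Not idempotent: if the bundle is retried, the counter can be
            # incremented more than once for the same element.
            row = self.table.append_row(b'items-finished')
            row.increment_cell_value('stats', b'count', 1)
            row.commit()
            yield element

    # Usage (placeholder values):
    # counted = messages | beam.ParDo(
    #     IncrementCounterFn('my-project', 'my-instance', 'my-table'))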

How can I import Google Analytics data into Google Cloud Platform?

Submitted by 烈酒焚心 on 2020-05-02 04:43:37
Question: I need to import data from Google Analytics into Google Cloud Platform (Cloud Storage, perhaps), then process that information and export it to Google Cloud SQL. I don't have a clear idea of which Google Cloud service I can use to run the import process. I was thinking of using Google Dataflow to do the extraction, transformation and load into Cloud SQL. Answer 1: As @jkff mentioned in the comments, if you have Google Analytics 360 you can enable the BigQuery integration and your raw data will

Google Dataflow - Wall Time/PCollection output numbers going backwards

Submitted by 独自空忆成欢 on 2020-04-15 12:26:53
Question: The first step of a Dataflow pipeline we're building is a read from BigQuery using the Python Beam API:

    beam.io.Read(
        beam.io.BigQuerySource(
            project=google_project,
            table=table_name,
            dataset=big_query_dataset_id
        )
    )

The table in question has 9 billion+ rows. The export jobs kicked off as a result of this call finish very quickly, usually within 3-5 minutes, with the expected amount of data in *.avro format in a folder for Dataflow to read. However when actually executing this,
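As a side note, a hedged sketch of the same read written with the newer ReadFromBigQuery transform, assuming a Beam SDK version that ships it; the gcs_location argument makes the intermediate Avro export location explicit, which can help when checking how much data the export step actually produced. All values below are placeholders standing in for the variables in the snippet above.

    import apache_beam as beam

    # Placeholder values for the variables used in the question's snippet.
    google_project = 'my-project'
    big_query_dataset_id = 'my_dataset'
    table_name = 'my_table'

    with beam.Pipeline() as p:
        (p | 'ReadTable' >> beam.io.ReadFromBigQuery(
                 project=google_project,
                 dataset=big_query_dataset_id,
                 table=table_name,
                 # Location where the intermediate Avro export is staged
                 # before the pipeline reads it.
                 gcs_location='gs://<your-temp-bucket>/bq-export')
           | 'Count' >> beam.combiners.Count.Globally()
           | 'Print' >> beam.Map(print))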

Exception Handling in Apache Beam pipelines using Python

Submitted by 天涯浪子 on 2020-04-13 16:47:12
Question: I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. A simple WriteToBigQuery example:

    output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')

I tried to put this inside a try/except block, but it doesn't work, because when it fails the exceptions seem to be thrown in a Java layer outside my Python execution: INFO
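A common pattern for this, sketched minimally with made-up element formats and step names: catch exceptions per element inside a DoFn and route failures to a tagged side output, so the pipeline keeps a dead-letter branch. A try/except around the pipeline code only runs while the graph is being constructed, not while elements are processed; errors raised inside WriteToBigQuery itself generally have to be handled through that transform's own failed-rows output rather than with Python exception handling.

    import json

    import apache_beam as beam
    from apache_beam import pvalue

    DEAD_LETTER_TAG = 'parse_errors'

    class ParseJsonFn(beam.DoFn):
        def process(self, element):
            try:
                yield json.loads(element)
            except Exception as err:
                # Send the raw record plus the error to the dead-letter output
                # instead of letting the exception fail the whole bundle.
                yield pvalue.TaggedOutput(DEAD_LETTER_TAG,
                                          {'raw': element, 'error': str(err)})

    with beam.Pipeline() as p:
        messages = p | 'Messages' >> beam.Create(['{"id": 1}', 'not json'])
        results = messages | 'Parse' >> beam.ParDo(ParseJsonFn()).with_outputs(
            DEAD_LETTER_TAG, main='parsed')

        results.parsed | 'UseGoodRows' >> beam.Map(print)
        # The dead-letter branch could instead be written to its own
        # BigQuery table, Cloud Storage, etc.
        results[DEAD_LETTER_TAG] | 'HandleErrors' >> beam.Map(print)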

DataflowRunner exits with “No files to stage has been found.”

Submitted by 为君一笑 on 2020-04-11 06:45:09
Question: I want to run the WordCount Java example from https://beam.apache.org/get-started/quickstart-java/, but somehow I get an error saying that no files to stage have been found by the ClasspathScanningResourcesDetector. I run the example exactly as described on the website:

    mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
        -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
        --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
        --inputFile=gs://apache-beam-samples