google-cloud-dataflow

Usage problem: add_value_provider_argument in a streaming pipeline (Apache Beam / Python)

Submitted by 一曲冷凌霜 on 2020-06-17 02:28:47
Question: We want to create a custom Dataflow template using add_value_provider_argument, but we are unable to launch the template-creation command without supplying values for the variables declared with add_value_provider_argument():

    class UserOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_value_provider_argument(
                '--input_topic',
                help='The Cloud Pub/Sub topic to read from.\n'
                     '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".'
            )
            parser.add_value_provider_argument( '-
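For context, here is a minimal sketch of the usual ValueProvider pattern, under the assumption that the parameter is consumed inside a DoFn at execution time; UserOptions, TagWithTopicFn, run() and all values are illustrative placeholders, not taken from the original post. The key point is that the ValueProvider object is stored as-is and .get() is only called while the job runs, never while the template is built. Note also that not every built-in Python source accepts a ValueProvider argument, which is often the real blocker with streaming templates.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class UserOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # A runtime parameter: no value has to be supplied when the
            # template is created, only when the templated job is launched.
            parser.add_value_provider_argument(
                '--input_topic',
                help='Cloud Pub/Sub topic, '
                     '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".')

    class TagWithTopicFn(beam.DoFn):
        def __init__(self, topic_provider):
            # Store the ValueProvider itself, not its resolved value.
            self.topic_provider = topic_provider

        def process(self, element):
            # .get() may only be called at execution time, never while the
            # pipeline graph (or template) is being constructed.
            yield (self.topic_provider.get(), element)

    def run(argv=None):
        options = PipelineOptions(argv)
        user_options = options.view_as(UserOptions)
        with beam.Pipeline(options=options) as p:
            (p | beam.Create(['hello'])
               | beam.ParDo(TagWithTopicFn(user_options.input_topic))
               | beam.Map(print))

    if __name__ == '__main__':
        run()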

Dataflow GroupByKey and CoGroupByKey are very slow

Submitted by 我们两清 on 2020-05-26 10:17:51
Question: Dataflow works great for pipelines with simple transforms, but when we have complex transforms such as joins the performance is really bad. Answer 1: I wrote this question so I could answer it myself. What's happening under the hood: the data transferred by Dataflow between PCollections (serializable objects) may not live on a single machine. Furthermore, a transformation like GroupByKey/CoGroupByKey requires all the data for a key to be collected in one place before the result can be populated. Recently I was
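For reference, a minimal sketch of the kind of join the answer refers to; the collection names and keys are illustrative only. Every element with the same key has to be shuffled to the same worker before the grouped result can be emitted, which is why these transforms are sensitive to data volume and key skew.

    import apache_beam as beam

    with beam.Pipeline() as p:
        emails = p | 'Emails' >> beam.Create([('amy', 'amy@example.com'),
                                              ('carl', 'carl@example.com')])
        phones = p | 'Phones' >> beam.Create([('amy', '111-222-3333'),
                                              ('james', '444-555-6666')])

        # CoGroupByKey shuffles all values for a given key to one place,
        # then emits (key, {'emails': [...], 'phones': [...]}).
        joined = ({'emails': emails, 'phones': phones}
                  | 'Join' >> beam.CoGroupByKey())

        joined | beam.Map(print)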

Array type in ClickHouseIO for Apache Beam (Dataflow)

Submitted by 江枫思渺然 on 2020-05-17 07:55:26
Question: I am using Apache Beam to consume JSON and insert it into ClickHouse. I am currently having a problem with the Array data type. Everything works fine until I add an array-typed field:

    Schema.Field.of("inputs.value", Schema.FieldType.array(Schema.FieldType.INT64).withNullable(true))

Code for the transformations:

    p.apply(transformNameSuffix + "ReadFromPubSub",
            PubsubIO.readStrings()
                    .fromSubscription(chainConfig.getPubSubSubscriptionPrefix() + "transactions")
                    .withIdAttribute(PUBSUB_ID_ATTRIBUTE))

Why are increments not supported in the Dataflow-BigTable connector?

Submitted by 狂风中的少年 on 2020-05-13 08:14:32
Question: We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that have finished processing), for which we need the increment operation. Looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The reason stated is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know
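A hedged sketch of a workaround that is sometimes used: call Cloud Bigtable's increment (ReadModifyWrite) directly from a DoFn with the google-cloud-bigtable client instead of going through the Dataflow/HBase connector. All identifiers below (project, instance, table, column family, column, row key) are placeholders, and the caveat the question alludes to still applies: an increment is not idempotent, so a retried bundle can count the same element twice.

    import apache_beam as beam

    class IncrementCounterFn(beam.DoFn):
        """Increments a Bigtable counter cell once per processed element."""

        def __init__(self, project_id, instance_id, table_id):
            self.project_id = project_id
            self.instance_id = instance_id
            self.table_id = table_id
            self.table = None

        def setup(self):
            # Import here so the dependency is resolved on the workers.
            from google.cloud import bigtable
            client = bigtable.Client(project=self.project_id)
            self.table = client.instance(self.instance_id).table(self.table_id)

        def process(self, element):
            # ReadModifyWrite: atomically add 1 to a 64-bit counter cell.
            # Not idempotent: if the bundle is retried, the counter can be
            # incremented more than once for the same element.
            row = self.table.append_row(b'items-finished')
            row.increment_cell_value('stats', b'count', 1)
            row.commit()
            yield element

    # Usage (placeholder values):
    # counted = messages | beam.ParDo(
    #     IncrementCounterFn('my-project', 'my-instance', 'my-table'))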

How can I import Google Analytics data into Google Cloud Platform?

Submitted by 烈酒焚心 on 2020-05-02 04:43:37
Question: I need to import data from Google Analytics into Google Cloud Platform (Cloud Storage, perhaps), then process that information and export it to Google Cloud SQL. I don't have a clear idea of which Google Cloud service I can use to run the import process. I was thinking of using Google Dataflow to do the extraction, transformation and load into Cloud SQL. Answer 1: As @jkff mentioned in the comments, if you have Google Analytics 360 you can enable the BigQuery integration and your raw data will

Google Dataflow - Wall Time/PCollection output numbers going backwards

Submitted by 独自空忆成欢 on 2020-04-15 12:26:53
Question: The first step of a Dataflow pipeline we're building is a read from BigQuery using the Python Beam API:

    beam.io.Read(
        beam.io.BigQuerySource(
            project=google_project,
            table=table_name,
            dataset=big_query_dataset_id
        )
    )

The table in question has 9 billion+ rows. The export jobs kicked off as a result of this call finish very quickly, usually within 3-5 minutes, with the expected amount of data in *.avro format in a folder for Dataflow to read. However when actually executing this,
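As a side note, a hedged sketch of the same read written with the newer ReadFromBigQuery transform, assuming a Beam SDK version that ships it; the gcs_location argument makes the intermediate Avro export location explicit, which can help when checking how much data the export step actually produced. All values below are placeholders standing in for the variables in the snippet above.

    import apache_beam as beam

    # Placeholder values for the variables used in the question's snippet.
    google_project = 'my-project'
    big_query_dataset_id = 'my_dataset'
    table_name = 'my_table'

    with beam.Pipeline() as p:
        (p | 'ReadTable' >> beam.io.ReadFromBigQuery(
                 project=google_project,
                 dataset=big_query_dataset_id,
                 table=table_name,
                 # Location where the intermediate Avro export is staged
                 # before the pipeline reads it.
                 gcs_location='gs://<your-temp-bucket>/bq-export')
           | 'Count' >> beam.combiners.Count.Globally()
           | 'Print' >> beam.Map(print))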

Exception Handling in Apache Beam pipelines using Python

Submitted by 天涯浪子 on 2020-04-13 16:47:12
Question: I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. A simple WriteToBigQuery example:

    output = json_output | 'Write to BigQuery' >> beam.io.WriteToBigQuery('some-project:dataset.table_name')

I tried to put this inside a try/except block, but it doesn't work, because when it fails the exceptions seem to be thrown in a Java layer outside my Python execution: INFO
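A common pattern for this, sketched minimally with made-up element formats and step names: catch exceptions per element inside a DoFn and route failures to a tagged side output, so the pipeline keeps a dead-letter branch. A try/except around the pipeline code only runs while the graph is being constructed, not while elements are processed; errors raised inside WriteToBigQuery itself generally have to be handled through that transform's own failed-rows output rather than with Python exception handling.

    import json

    import apache_beam as beam
    from apache_beam import pvalue

    DEAD_LETTER_TAG = 'parse_errors'

    class ParseJsonFn(beam.DoFn):
        def process(self, element):
            try:
                yield json.loads(element)
            except Exception as err:
                # Send the raw record plus the error to the dead-letter output
                # instead of letting the exception fail the whole bundle.
                yield pvalue.TaggedOutput(DEAD_LETTER_TAG,
                                          {'raw': element, 'error': str(err)})

    with beam.Pipeline() as p:
        messages = p | 'Messages' >> beam.Create(['{"id": 1}', 'not json'])
        results = messages | 'Parse' >> beam.ParDo(ParseJsonFn()).with_outputs(
            DEAD_LETTER_TAG, main='parsed')

        results.parsed | 'UseGoodRows' >> beam.Map(print)
        # The dead-letter branch could instead be written to its own
        # BigQuery table, Cloud Storage, etc.
        results[DEAD_LETTER_TAG] | 'HandleErrors' >> beam.Map(print)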

DataflowRunner exits with “No files to stage has been found.”

Submitted by 为君一笑 on 2020-04-11 06:45:09
Question: I want to run the WordCount Java example from https://beam.apache.org/get-started/quickstart-java/, but somehow I get an error saying that no files to stage have been found by the ClasspathScanningResourcesDetector. I run the example exactly as described on the website:

    mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
        -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
        --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
        --inputFile=gs://apache-beam-samples