google-cloud-dataflow

Dataflow can't read from BigQuery dataset in region “asia-northeast1”

Submitted by Deadly on 2020-01-15 09:42:14
Question: I have a BigQuery dataset located in the new "asia-northeast1" region. I'm trying to run a Dataflow templated pipeline (running in the Australia region) to read a table from it. It throws the following error, even though the dataset/table does indeed exist:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{ "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset grey-sort-challenge:Konnichiwa_Tokyo", "reason" : "notFound" } ],
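
A hedged sketch of one requirement commonly pointed to in this situation, not a confirmed fix: BigQueryIO reads a table by first exporting it to the pipeline's temp GCS location, and a BigQuery extract job requires that bucket to be co-located with the dataset. The bucket, table, and class names below are placeholders.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadTokyoDataset {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // The export triggered by BigQueryIO lands in tempLocation, so that bucket
    // should live in the same region as the dataset (asia-northeast1 here).
    options.setTempLocation("gs://my-asia-northeast1-bucket/tmp"); // placeholder bucket
    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromTokyoDataset",
        BigQueryIO.readTableRows()
            .from("grey-sort-challenge:Konnichiwa_Tokyo.some_table")); // placeholder table
    p.run();
  }
}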

Dataflow + Datastore = DatastoreException: I/O error

Submitted by 巧了我就是萌 on 2020-01-15 09:28:46
Question: I'm trying to write to Datastore from Dataflow using com.google.cloud.datastore. My code looks like this (inspired by the examples in [1]):

public void processElement(ProcessContext c) {
    LocalDatastoreHelper HELPER = LocalDatastoreHelper.create(1.0);
    Datastore datastore = HELPER.options().toBuilder().namespace("ghijklmnop").build().service();
    Key taskKey = datastore.newKeyFactory()
        .ancestors(PathElement.of("TaskList", "default"))
        .kind("Task")
        .newKey("sampleTask");
    Entity task = Entity
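
A hedged note and sketch, not the accepted answer: LocalDatastoreHelper is a testing utility that targets a local Datastore emulator, which Dataflow workers cannot reach, and that tends to surface as "DatastoreException: I/O error". One alternative is to build Datastore v1 proto entities in the DoFn and let Beam's DatastoreIO do the writing. The kind, property, and project ID below are placeholders.

import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

import com.google.datastore.v1.Entity;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;

class ToTaskEntityFn extends DoFn<String, Entity> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Build a proto Entity instead of opening a Datastore client per element.
    Entity task = Entity.newBuilder()
        .setKey(makeKey("Task", c.element()))                       // ancestor path omitted for brevity
        .putProperties("description", makeValue("sample").build())
        .build();
    c.output(task);
  }
}

// In the pipeline:
// taskNames.apply(ParDo.of(new ToTaskEntityFn()))
//          .apply(DatastoreIO.v1().write().withProjectId("my-project-id")); // placeholder project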

How to join BQ tables on two or more keys with Cloud Dataflow?

Submitted by 痴心易碎 on 2020-01-15 09:14:30
Question: I have two tables, A and B. Both of them have the fields session_id and cookie_id. How do I create a joined table output, joining A with B on (session_id, cookie_id), with the help of a Dataflow pipeline? The CoGroupByKey method only allows you to join on a single key, and I couldn't find anything helpful in the documentation either.

Answer 1: To expand on user9720010's answer: you can create a composite key by mapping the fields to a combination of session_id and cookie_id. This pattern is explained in the
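
A minimal Java SDK sketch of the composite-key idea from the answer above (the TableRow field handling and separator are assumptions, not taken from the thread): key each row by session_id plus cookie_id, then apply CoGroupByKey to the two keyed collections.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CompositeKeyJoin {
  static final TupleTag<TableRow> TAG_A = new TupleTag<>();
  static final TupleTag<TableRow> TAG_B = new TupleTag<>();

  // Key rows by the concatenation of session_id and cookie_id so a single-key join works.
  static PCollection<KV<String, TableRow>> byCompositeKey(PCollection<TableRow> rows, String name) {
    return rows.apply(name,
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of(row.get("session_id") + "|" + row.get("cookie_id"), row)));
  }

  // Each grouped element carries every A row and every B row that shares the composite key.
  static PCollection<KV<String, CoGbkResult>> join(PCollection<TableRow> tableA, PCollection<TableRow> tableB) {
    return KeyedPCollectionTuple.of(TAG_A, byCompositeKey(tableA, "KeyA"))
        .and(TAG_B, byCompositeKey(tableB, "KeyB"))
        .apply(CoGroupByKey.create());
  }
}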

Google DataFlow, how to wait for external webhook when Transforming a collection?

Submitted by 走远了吗. on 2020-01-15 09:12:34
Question: I have code that reads an Xlsx file and, for each line, runs a process on a specific column. The problem is related to the "Transform" part of the Dataflow pipeline. I implemented a method that gets the value sent from the reader, and this data is sent to an outside server. The outside server processes the data (which could take minutes), then makes a POST request with the result (the URL for the POST request is specified in the original request). My question is the following: how can I make my ParDo
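
The question is cut off, but the usual difficulty is that a ParDo cannot easily suspend and wait for an inbound webhook. Below is a purely illustrative sketch of one common workaround (the status URL, timing, and error handling are assumptions): poll the external service for the finished result inside processElement. Splitting the pipeline in two and bridging the callback through Pub/Sub is generally the more robust pattern for long waits.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

class PollForResultFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String statusUrl = c.element();              // assumes the upstream call returned a per-row status URL
    for (int attempt = 0; attempt < 60; attempt++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(statusUrl).openConnection();
      try {
        if (conn.getResponseCode() == 200) {     // the external server has finished processing
          try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            c.output(s.hasNext() ? s.next() : "");
          }
          return;
        }
      } finally {
        conn.disconnect();
      }
      Thread.sleep(10_000);                      // back off before polling again (~10 minute budget)
    }
    // No result in time: in real code, route the element to a dead-letter output instead of dropping it.
  }
}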

How do I perform a “diff” on two Sources given a key using Apache Beam Python SDK?

Submitted by 萝らか妹 on 2020-01-15 07:27:12
Question: I posed the question generically, because maybe there is a generic answer. But a specific example is comparing two BigQuery tables with the same schema but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first two columns.

Table A
C1  C2  C3
-----------
a   a   1
a   b   1
a   c   1

Table B
C1  C2  C3    # Notes if comparing B to A
-------------------------------------
a   a   1     # No change to the key a + a
a   b   2     # Key a + b changed from 1 to 2
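
A sketch of the classification step after a CoGroupByKey on the composite key (C1, C2). It is written with the Java SDK for consistency with the other snippets here; the Python SDK's CoGroupByKey yields the same per-key grouping, so the logic carries over. Column names follow the example tables above, and duplicate keys within a table are not handled.

import java.util.Iterator;
import java.util.Objects;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

class ClassifyDiffFn extends DoFn<KV<String, CoGbkResult>, String> {
  private final TupleTag<TableRow> tagA;
  private final TupleTag<TableRow> tagB;

  ClassifyDiffFn(TupleTag<TableRow> tagA, TupleTag<TableRow> tagB) {
    this.tagA = tagA;
    this.tagB = tagB;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Iterator<TableRow> inA = c.element().getValue().getAll(tagA).iterator();
    Iterator<TableRow> inB = c.element().getValue().getAll(tagB).iterator();
    String key = c.element().getKey();
    if (inA.hasNext() && !inB.hasNext()) {
      c.output("deleted: " + key);                  // key only in A
    } else if (!inA.hasNext() && inB.hasNext()) {
      c.output("added: " + key);                    // key only in B
    } else if (inA.hasNext() && inB.hasNext()) {
      TableRow a = inA.next();
      TableRow b = inB.next();                      // assumes the composite key is unique per table
      if (!Objects.equals(a.get("C3"), b.get("C3"))) {
        c.output("modified: " + key);               // a non-key column changed
      }
      // otherwise: unchanged, emit nothing
    }
  }
}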

Shutting down JVM after 8 consecutive periods of measured GC thrashing

Submitted by こ雲淡風輕ζ on 2020-01-15 04:44:11
Question: I am writing an Apache Beam batch Dataflow pipeline that writes from GCS to BQ. My data contains 4 million records. I have specified the n1-highmem-8 machine type. My Dataflow job works for a small amount of data. In my use case the schema is not fixed, so I have used the .getFailedInserts() method to collect the records that failed schema validation and were not inserted. I group them and write them to BQ using a BQ load job via GCS in the same Dataflow job. For this amount of data I get the following error 7 times and then my Dataflow job errors
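
The message comes from the Dataflow worker's memory monitor, so heap is the usual first lever. Below is a hedged sketch of knobs people typically try (machine type, worker count, and bucket are assumptions, not a verified fix for this job): a higher-memory machine type plus the Dataflow debug options for heap dumps, which help show whether the grouped failed-insert records are what is filling memory. The GCS heap-dump path option exists in recent SDK releases; treat it as an assumption on older versions.

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MemoryTuning {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setWorkerMachineType("n1-highmem-16");   // roughly double the memory of n1-highmem-8
    options.setNumWorkers(10);                       // spread the 4M records across more JVMs

    DataflowPipelineDebugOptions debug = options.as(DataflowPipelineDebugOptions.class);
    debug.setDumpHeapOnOOM(true);                    // capture what is filling the heap
    debug.setSaveHeapDumpsToGcsPath("gs://my-debug-bucket/heapdumps");  // placeholder bucket; newer SDKs only

    // ... build the GCS -> BQ pipeline with Pipeline.create(options) as before ...
  }
}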
