google-cloud-dataflow

Dataflow can't read from BigQuery dataset in region “asia-northeast1”

Submitted by Deadly on 2020-01-15 09:42:14
Question: I have a BigQuery dataset located in the new "asia-northeast1" region. I'm trying to run a Dataflow templated pipeline (running in the Australia region) to read a table from it. It throws the following error, even though the dataset/table does indeed exist:

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{ "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset grey-sort-challenge:Konnichiwa_Tokyo", "reason" : "notFound" } ],
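
A hedged sketch of one requirement commonly pointed to in this situation, not a confirmed fix: BigQueryIO reads a table by first exporting it to the pipeline's temp GCS location, and a BigQuery extract job requires that bucket to be co-located with the dataset. The bucket, table, and class names below are placeholders.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadTokyoDataset {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // The export triggered by BigQueryIO lands in tempLocation, so that bucket
    // should live in the same region as the dataset (asia-northeast1 here).
    options.setTempLocation("gs://my-asia-northeast1-bucket/tmp"); // placeholder bucket
    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromTokyoDataset",
        BigQueryIO.readTableRows()
            .from("grey-sort-challenge:Konnichiwa_Tokyo.some_table")); // placeholder table
    p.run();
  }
}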

Dataflow + Datastore = DatastoreException: I/O error

Submitted by 巧了我就是萌 on 2020-01-15 09:28:46
Question: I'm trying to write to Datastore from Dataflow using com.google.cloud.datastore. My code looks like this (inspired by the examples in [1]):

public void processElement(ProcessContext c) {
    LocalDatastoreHelper HELPER = LocalDatastoreHelper.create(1.0);
    Datastore datastore = HELPER.options().toBuilder().namespace("ghijklmnop").build().service();
    Key taskKey = datastore.newKeyFactory()
        .ancestors(PathElement.of("TaskList", "default"))
        .kind("Task")
        .newKey("sampleTask");
    Entity task = Entity
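
A hedged note and sketch, not the accepted answer: LocalDatastoreHelper is a testing utility that targets a local Datastore emulator, which Dataflow workers cannot reach, and that tends to surface as "DatastoreException: I/O error". One alternative is to build Datastore v1 proto entities in the DoFn and let Beam's DatastoreIO do the writing. The kind, property, and project ID below are placeholders.

import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

import com.google.datastore.v1.Entity;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;

class ToTaskEntityFn extends DoFn<String, Entity> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Build a proto Entity instead of opening a Datastore client per element.
    Entity task = Entity.newBuilder()
        .setKey(makeKey("Task", c.element()))                       // ancestor path omitted for brevity
        .putProperties("description", makeValue("sample").build())
        .build();
    c.output(task);
  }
}

// In the pipeline:
// taskNames.apply(ParDo.of(new ToTaskEntityFn()))
//          .apply(DatastoreIO.v1().write().withProjectId("my-project-id")); // placeholder project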

How to join BQ tables on two or more keys with Cloud Dataflow?

Submitted by 痴心易碎 on 2020-01-15 09:14:30
Question: I have two tables, A and B. Both of them have the fields session_id and cookie_id. How do I create a joined table output, joining A with B on (session_id, cookie_id), with the help of a Dataflow pipeline? The CoGroupByKey method only allows you to join on a single key, and I couldn't find anything helpful in the documentation either.

Answer 1: To expand on user9720010's answer: you can create a composite key by mapping the fields to a combination of session_id and cookie_id. This pattern is explained in the
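
A minimal Java SDK sketch of the composite-key idea from the answer above (the TableRow field handling and separator are assumptions, not taken from the thread): key each row by session_id plus cookie_id, then apply CoGroupByKey to the two keyed collections.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CompositeKeyJoin {
  static final TupleTag<TableRow> TAG_A = new TupleTag<>();
  static final TupleTag<TableRow> TAG_B = new TupleTag<>();

  // Key rows by the concatenation of session_id and cookie_id so a single-key join works.
  static PCollection<KV<String, TableRow>> byCompositeKey(PCollection<TableRow> rows, String name) {
    return rows.apply(name,
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of(row.get("session_id") + "|" + row.get("cookie_id"), row)));
  }

  // Each grouped element carries every A row and every B row that shares the composite key.
  static PCollection<KV<String, CoGbkResult>> join(PCollection<TableRow> tableA, PCollection<TableRow> tableB) {
    return KeyedPCollectionTuple.of(TAG_A, byCompositeKey(tableA, "KeyA"))
        .and(TAG_B, byCompositeKey(tableB, "KeyB"))
        .apply(CoGroupByKey.create());
  }
}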

Google DataFlow, how to wait for external webhook when Transforming a collection?

Submitted by 走远了吗. on 2020-01-15 09:12:34
Question: I have code that reads an Xlsx file and, for each line, runs a process on a specific column. The problem is related to the "Transform" part of the Dataflow pipeline. I implemented a method that gets the value sent from the reader, and this data is sent to an outside server. The outside server processes the data (which could take minutes), then makes a POST request with the result (the URL for the POST request is specified in the original request). My question is the following: how can I make my ParDo
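
The question is cut off, but the usual difficulty is that a ParDo cannot easily suspend and wait for an inbound webhook. Below is a purely illustrative sketch of one common workaround (the status URL, timing, and error handling are assumptions): poll the external service for the finished result inside processElement. Splitting the pipeline in two and bridging the callback through Pub/Sub is generally the more robust pattern for long waits.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

class PollForResultFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String statusUrl = c.element();              // assumes the upstream call returned a per-row status URL
    for (int attempt = 0; attempt < 60; attempt++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(statusUrl).openConnection();
      try {
        if (conn.getResponseCode() == 200) {     // the external server has finished processing
          try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            c.output(s.hasNext() ? s.next() : "");
          }
          return;
        }
      } finally {
        conn.disconnect();
      }
      Thread.sleep(10_000);                      // back off before polling again (~10 minute budget)
    }
    // No result in time: in real code, route the element to a dead-letter output instead of dropping it.
  }
}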

How do I perform a “diff” on two Sources given a key using Apache Beam Python SDK?

Submitted by 萝らか妹 on 2020-01-15 07:27:12
Question: I posed the question generically, because maybe there is a generic answer. But a specific example is comparing two BigQuery tables with the same schema but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first two columns.

Table A
C1  C2  C3
-----------
a   a   1
a   b   1
a   c   1

Table B
C1  C2  C3    # Notes if comparing B to A
-------------------------------------
a   a   1     # No change to the key a + a
a   b   2     # Key a + b changed from 1 to 2
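
A sketch of the classification step after a CoGroupByKey on the composite key (C1, C2). It is written with the Java SDK for consistency with the other snippets here; the Python SDK's CoGroupByKey yields the same per-key grouping, so the logic carries over. Column names follow the example tables above, and duplicate keys within a table are not handled.

import java.util.Iterator;
import java.util.Objects;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

class ClassifyDiffFn extends DoFn<KV<String, CoGbkResult>, String> {
  private final TupleTag<TableRow> tagA;
  private final TupleTag<TableRow> tagB;

  ClassifyDiffFn(TupleTag<TableRow> tagA, TupleTag<TableRow> tagB) {
    this.tagA = tagA;
    this.tagB = tagB;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Iterator<TableRow> inA = c.element().getValue().getAll(tagA).iterator();
    Iterator<TableRow> inB = c.element().getValue().getAll(tagB).iterator();
    String key = c.element().getKey();
    if (inA.hasNext() && !inB.hasNext()) {
      c.output("deleted: " + key);                  // key only in A
    } else if (!inA.hasNext() && inB.hasNext()) {
      c.output("added: " + key);                    // key only in B
    } else if (inA.hasNext() && inB.hasNext()) {
      TableRow a = inA.next();
      TableRow b = inB.next();                      // assumes the composite key is unique per table
      if (!Objects.equals(a.get("C3"), b.get("C3"))) {
        c.output("modified: " + key);               // a non-key column changed
      }
      // otherwise: unchanged, emit nothing
    }
  }
}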

Shutting down JVM after 8 consecutive periods of measured GC thrashing

Submitted by こ雲淡風輕ζ on 2020-01-15 04:44:11
Question: I am writing an Apache Beam batch Dataflow pipeline that writes from GCS to BQ. My data contains 4 million records. I have specified the n1-highmem-8 machine type. My Dataflow job works for a small amount of data. In my use case the schema is not fixed, so I have used the .getFailedInserts() method to collect the records that failed schema validation and were not inserted. I group them and write them to BQ using a BQ load job via GCS in the same Dataflow job. For this amount of data I get the following error 7 times and then my Dataflow job errors
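
The message comes from the Dataflow worker's memory monitor, so heap is the usual first lever. Below is a hedged sketch of knobs people typically try (machine type, worker count, and bucket are assumptions, not a verified fix for this job): a higher-memory machine type plus the Dataflow debug options for heap dumps, which help show whether the grouped failed-insert records are what is filling memory. The GCS heap-dump path option exists in recent SDK releases; treat it as an assumption on older versions.

import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MemoryTuning {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setWorkerMachineType("n1-highmem-16");   // roughly double the memory of n1-highmem-8
    options.setNumWorkers(10);                       // spread the 4M records across more JVMs

    DataflowPipelineDebugOptions debug = options.as(DataflowPipelineDebugOptions.class);
    debug.setDumpHeapOnOOM(true);                    // capture what is filling the heap
    debug.setSaveHeapDumpsToGcsPath("gs://my-debug-bucket/heapdumps");  // placeholder bucket; newer SDKs only

    // ... build the GCS -> BQ pipeline with Pipeline.create(options) as before ...
  }
}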
