apache-beam

Dataflow Error: 'Clients have non-trivial state that is local and unpickleable'

偶尔善良 submitted on 2019-12-11 02:34:23
Question: I have a pipeline that I can execute locally without any errors. I used to get this error in my locally run pipeline: 'Clients have non-trivial state that is local and unpickleable.' PicklingError: Pickling client objects is explicitly not supported. I believe I fixed this by downgrading to apache-beam==2.3.0; then it would run perfectly locally. Now I am using DataflowRunner, and in the requirements.txt file I have the following dependencies: apache-beam==2.3.0 google-cloud-bigquery==1.1.0 google…

Apache Beam with Dataflow - NullPointerException when reading from BigQuery

对着背影说爱祢 submitted on 2019-12-11 02:29:51
Question: I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I randomly get a NullPointerException when reading from the BigQuery table and my job fails: (288abb7678892196): java.lang.NullPointerException at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:98) at com.google.cloud.dataflow.worker.runners…

Running BeamSql Without a Coder or Making the Coder Dynamic

元气小坏坏 submitted on 2019-12-11 02:28:26
Question: I am reading data from a file and converting it to BeamRecord, but when I run a query on it, it fails with this error: Exception in thread "main" java.lang.ClassCastException: org.apache.beam.sdk.coders.SerializableCoder cannot be cast to org.apache.beam.sdk.coders.BeamRecordCoder at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.registerTables(BeamSql.java:173) at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:153) at org.apache.beam.sdk.extensions.sql…
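The usual fix for this ClassCastException is to set the BeamRecordCoder on the PCollection explicitly before handing it to BeamSql; otherwise Beam falls back to SerializableCoder, which registerTables() cannot cast. Below is a minimal sketch against the Beam 2.2-era SQL API that this stack trace comes from; the field names and types are hypothetical placeholders for whatever the file actually contains.

```java
import java.sql.Types;
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.BeamRecordSqlType;
import org.apache.beam.sdk.extensions.sql.BeamSql;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.BeamRecord;
import org.apache.beam.sdk.values.PCollection;

public class BeamSqlCoderFix {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical schema; replace with the fields parsed from the file.
    BeamRecordSqlType rowType = BeamRecordSqlType.create(
        Arrays.asList("id", "name"),
        Arrays.asList(Types.INTEGER, Types.VARCHAR));

    PCollection<BeamRecord> rows = p
        .apply(Create.of(new BeamRecord(rowType, 1, "alice")))
        // The crucial step: without an explicit coder, Beam infers
        // SerializableCoder and BeamSql's cast to BeamRecordCoder fails.
        .setCoder(rowType.getRecordCoder());

    rows.apply(BeamSql.query("SELECT id, name FROM PCOLLECTION"));
    p.run().waitUntilFinish();
  }
}
```

Later Beam releases replaced BeamRecord/BeamRecordCoder with Row/RowCoder (where the equivalent step is setRowSchema), so the snippet above applies to the SDK generation shown in the stack trace.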

Apache Beam - org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported)

时光毁灭记忆、已成空白 submitted on 2019-12-11 02:15:06
Question: I am trying to connect to a Hive instance installed in a cloud instance using Apache Beam on Dataflow. When I run this, I get the exception below. It happens whenever I access this database through Apache Beam. I have seen many related questions, but none of them are about Apache Beam or Google Dataflow. (c9ec8fdbe9d1719a): java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported) at com.google.cloud…
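For context, a JdbcIO read against Hive typically has the shape sketched below; the host, database, table, and query are hypothetical. "Method not supported" is the Hive JDBC driver's stock response to JDBC methods it does not implement, and commons-dbcp makes such calls while validating connections for its pool, which is why questions about plain Hive JDBC usage don't cover this error.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HiveJdbcReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical connection details; this is the shape of the failing setup.
    p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.apache.hive.jdbc.HiveDriver",
            "jdbc:hive2://my-host:10000/default"))
        .withQuery("SELECT name FROM my_table")
        .withCoder(StringUtf8Coder.of())
        // Map each ResultSet row to a pipeline element.
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1)));

    p.run().waitUntilFinish();
  }
}
```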

Why do Dataflow steps not start?

核能气质少年 submitted on 2019-12-11 00:55:44
Question: I have a linear three-step Dataflow pipeline. For some reason the last step started, but the preceding two steps hung in the "Not started" state for a long time before I gave up and killed the job. I'm not sure what caused this, as the same pipeline had run successfully in the past, and I'm surprised the logs showed no errors explaining what was preventing the first two steps from starting. What can cause such a situation, and how can I prevent it from happening? Answer 1: This was happening because…

Apache Beam: Batch Pipeline with Unbounded Source

心不动则不痛 submitted on 2019-12-11 00:39:36
Question: I'm currently using Apache Beam with Google Dataflow to process real-time data. The data comes from Google Pub/Sub, which is unbounded, so currently I'm using a streaming pipeline. However, it turns out that having a streaming pipeline running 24/7 is quite expensive. To reduce cost, I'm thinking of switching to a batch pipeline that runs at a fixed interval (e.g. every 30 minutes), since the processing doesn't really need to be real time for the user. I'm wondering if it's…

Apache Beam stream processing of JSON data

我只是一个虾纸丫 submitted on 2019-12-10 23:42:10
Question: I am analyzing Apache Beam stream processing of data. I have worked on Apache Kafka stream processing (producer, consumer, etc.) and now want to compare it with Beam. I want to stream simple JSON data with Apache Beam programmatically (Java): {"UserID":"1","Address":"XXX","ClassNo":"989","UserName":"Stella","ClassType":"YYY"} Can someone please guide me or direct me to an example link? Answer 1: There are multiple aspects to this: first you need to establish where the data is coming from: you…
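To make the answer's first point concrete, here is a minimal Java sketch of the parsing half. The source is a placeholder (Create.of with the sample record from the question); in a real streaming job you would swap in KafkaIO.read() or PubsubIO as the unbounded source, and Jackson is an assumed dependency for the JSON handling.

```java
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class JsonStreamSketch {

  // Turns each JSON line into a Map and emits a couple of its fields.
  static class ParseJson extends DoFn<String, String> {
    private transient ObjectMapper mapper;

    @Setup
    public void setup() {
      // Built per worker instance, so the DoFn itself stays serializable.
      mapper = new ObjectMapper();
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
      @SuppressWarnings("unchecked")
      Map<String, String> record = mapper.readValue(c.element(), Map.class);
      c.output(record.get("UserID") + ":" + record.get("UserName"));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of(
            "{\"UserID\":\"1\",\"Address\":\"XXX\",\"ClassNo\":\"989\","
                + "\"UserName\":\"Stella\",\"ClassType\":\"YYY\"}"))
        .apply(ParDo.of(new ParseJson()));
    p.run().waitUntilFinish();
  }
}
```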

BigQueryIO - Write performance with streaming and FILE_LOADS

江枫思渺然 submitted on 2019-12-10 14:57:05
Question: My pipeline: Kafka -> Dataflow streaming (Beam v2.3) -> BigQuery. Given that low latency isn't important in my case, I use FILE_LOADS to reduce costs, like this:

BigQueryIO.writeTableRows()
    .withJsonSchema(schema)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withMethod(Method.FILE_LOADS)
    .withTriggeringFrequency(triggeringFrequency)
    .withCustomGcsTempLocation(gcsTempLocation)
    .withNumFileShards(numFileShards) …
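For reference, here is what that configuration looks like with concrete values filled in; the table name, schema, temp bucket, ten-minute frequency, and shard count are all illustrative assumptions, not recommendations. The trade-off FILE_LOADS makes is that a lower triggering frequency means fewer, larger load jobs (cheaper than streaming inserts) at the cost of up to that much extra latency in the table.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.joda.time.Duration;

public class FileLoadsWriteSketch {
  // Hypothetical one-column schema in BigQuery's JSON schema format.
  static final String SCHEMA =
      "{\"fields\":[{\"name\":\"id\",\"type\":\"STRING\"}]}";

  static BigQueryIO.Write<TableRow> fileLoadsWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withJsonSchema(SCHEMA)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withMethod(Method.FILE_LOADS)
        // One load job every 10 minutes; raise this to cut cost further,
        // lower it to reduce how stale the table can get.
        .withTriggeringFrequency(Duration.standardMinutes(10))
        .withCustomGcsTempLocation(StaticValueProvider.of("gs://my-bucket/bq-tmp"))
        // Must be set explicitly when FILE_LOADS is used in streaming mode.
        .withNumFileShards(100);
  }
}
```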

How do I write to multiple files in Apache Beam?

故事扮演 submitted on 2019-12-10 14:36:07
Question: Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is a PCollection<KV<String, String>>, and I want to write the values to different files corresponding to their keys. For example, say the result consists of (key1, value1) (key2, value2) (key1, value3) (key1, value4). Then I want value1, value3, and value4 written to key1.txt, and value2 written to key2.txt. And in my case, the key set is determined while the pipeline is running, not when the pipeline is constructed.
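Beam 0.6.0 predates built-in support for this, but on reasonably current Beam releases the standard approach is FileIO.writeDynamic(), which derives the destination from each element at runtime, so the key set never has to be known when the pipeline is constructed. A minimal sketch, with the output directory as a hypothetical placeholder:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class WriteValuesPerKey {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(
            KV.of("key1", "value1"), KV.of("key2", "value2"),
            KV.of("key1", "value3"), KV.of("key1", "value4")))
        .apply(FileIO.<String, KV<String, String>>writeDynamic()
            .by(KV::getKey)                                  // destination = key
            .via(Contextful.fn(KV::getValue), TextIO.sink()) // write the value only
            .to("/tmp/per-key-output")                       // hypothetical directory
            .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

    p.run().waitUntilFinish();
  }
}
```

Each distinct key gets its own set of sharded files named after it, which matches the per-key layout the question asks for.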

apache_beam.transforms.util.Reshuffle() not available for GCP Dataflow

陌路散爱 submitted on 2019-12-10 13:38:38
Question: I have upgraded to the latest apache_beam[gcp] package via pip install --upgrade apache_beam[gcp]. However, I noticed that Reshuffle() does not appear in the [gcp] distribution. Does this mean that I will not be able to use Reshuffle() in any Dataflow pipelines? Is there any way around this? Or is it possible that the pip package is just not up to date, and if Reshuffle() is in master on GitHub then it will be available on Dataflow? Based on the response to this question I am trying to read…