apache-beam

Google Dataflow - Failed to import custom python modules

Submitted by 青春壹個敷衍的年華 on 2020-03-03 05:12:50
Question: My Apache Beam pipeline implements custom Transforms and ParDos in Python modules that in turn import other modules I wrote. On the local runner this works fine, since all the files are available on the same path. With the Dataflow runner, the pipeline fails with a module import error. How do I make the custom modules available to all the Dataflow workers? Please advise. Below is an example: ImportError: No module named DataAggregation at find_class (/usr/lib/python2.7/pickle.py:1130) at find
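
A common fix (not shown in the excerpt) is to package the locally written modules and point the pipeline at a setup.py via the --setup_file option, so Dataflow installs the package on every worker. A minimal sketch, with placeholder names:

```python
# setup.py, placed at the project root next to the package that contains DataAggregation.
import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',          # placeholder package name
    version='0.0.1',
    packages=setuptools.find_packages(),  # discovers the custom modules so workers can import them
)

# Launch the pipeline with the setup file staged, e.g.:
#   python main.py --runner DataflowRunner --setup_file ./setup.py ...
```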

RuntimeValueProviderError when creating a google cloud dataflow template with Apache Beam python

Submitted by 对着背影说爱祢 on 2020-03-01 05:09:07
Question: I can't stage a Cloud Dataflow template with Python 3.7. It fails on the one parameterized argument with apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible. Staging the template with Python 2.7 works fine. I have tried running Dataflow jobs with 3.7 and they work fine; only the template staging is broken. Is Python 3.7 still not supported in Dataflow templates, or did the
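
For context, RuntimeValueProvider.get() raises exactly this "not accessible" error whenever it is called while the pipeline graph is being built instead of at execution time, because template parameters only receive values when the template is launched. A minimal sketch of the deferred pattern, with illustrative names that are not taken from the question:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # A runtime parameter: its concrete value is only known when the template runs.
        parser.add_value_provider_argument(
            '--input',
            type=str,
            default='gs://dataflow-samples/shakespeare/kinglear.txt')


class UseInputAtRuntime(beam.DoFn):
    def __init__(self, input_vp):
        self._input_vp = input_vp  # keep the ValueProvider itself, not its value

    def process(self, element):
        # .get() is safe here because process() runs on the workers after launch.
        yield (self._input_vp.get(), element)


options = PipelineOptions()
template_options = options.view_as(TemplateOptions)
with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['hello'])
     | beam.ParDo(UseInputAtRuntime(template_options.input))
     | beam.Map(print))
```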

How to connect to Cloud SQL from Google Dataflow

Submitted by 和自甴很熟 on 2020-02-25 06:45:27
Question: I'm trying to create a pipeline task using the Beam Java SDK and Google Dataflow to move data from Cloud SQL to Elasticsearch. I've created the following class main method: public static void main(String[] args) throws Exception { DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setProject("staging"); options.setTempLocation("gs://csv_to_sql_staging/temp"); options.setRunner(DataflowRunner.class); options.setGcpTempLocation("gs://csv_to_sql

How to create read transform using ParDo and DoFn in Apache Beam

Submitted by 橙三吉。 on 2020-02-24 12:32:46
Question: According to the Apache Beam documentation, the recommended way to write simple sources is by using Read transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here. I'm trying to write a simple unbounded data source which emits events using a ParDo, but the compiler keeps complaining about the input type of the DoFn object: message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
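
For what it's worth, the compiler error is the pipeline saying that ParDo consumes a PCollection, not a PBegin: a pipeline needs a root transform (Create, GenerateSequence, Impulse, or a Read) before the first ParDo. A minimal sketch of that shape in the Python SDK, with a made-up stand-in DoFn; the same structure applies in the Java SDK via Create.of(...) or GenerateSequence:

```python
import apache_beam as beam


class EmitEvents(beam.DoFn):
    # Hypothetical stand-in for the question's event-emitting DoFn.
    def process(self, _seed):
        for i in range(3):
            yield {'event_id': i}


with beam.Pipeline() as p:
    (p
     | 'Root' >> beam.Create([None])       # a root transform produces the first PCollection
     | 'Emit' >> beam.ParDo(EmitEvents())  # ParDo can now be applied to that PCollection
     | 'Print' >> beam.Map(print))
```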

How do you access the message id from Google Pub/Sub using Apache Beam?

Submitted by 眉间皱痕 on 2020-02-24 12:21:19
Question: I have been testing Apache Beam using the 2.13.0 SDK on Python 2.7.16, pulling simple messages from a Google Pub/Sub subscription in streaming mode and writing them to a Google BigQuery table. As part of this operation, I'm trying to use the Pub/Sub message ID for deduplication, but I can't seem to get it out at all. The documentation for the ReadFromPubSub method and the PubSubMessage type suggests that service-generated KVs such as id_label should be returned as part of the attributes property
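
For reference, reading with with_attributes=True yields PubsubMessage objects whose attributes dictionary can be inspected directly; whether the service-generated message ID ever appears there is exactly what the question is probing and depends on the SDK version. A minimal inspection sketch, with a placeholder subscription path:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


class InspectMessage(beam.DoFn):
    def process(self, msg):
        # msg is a PubsubMessage because with_attributes=True was set on the read.
        yield {'data': msg.data, 'attributes': dict(msg.attributes)}


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub',  # placeholder
         with_attributes=True)
     | 'Inspect' >> beam.ParDo(InspectMessage())
     | 'Print' >> beam.Map(print))
```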

How to use Pandas in Apache Beam?

Submitted by 痴心易碎 on 2020-02-24 10:16:49
Question: How do I use Pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not clearly laid out. I checked but couldn't find any kind of Pandas implementation in Apache Beam. Can anyone direct me to the desired link? Answer 1: There's some confusion going on here. pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you
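
As that answer begins to explain, pandas can be used as an ordinary library inside your own transforms. A minimal sketch of a multi-column left join done with pandas.merge, assuming both sides fit in memory; the column names and sample rows are invented for illustration:

```python
import apache_beam as beam
import pandas as pd


def left_join(left_rows, right_rows):
    # Build DataFrames from the collected rows and join on multiple columns.
    left_df = pd.DataFrame(left_rows)
    right_df = pd.DataFrame(right_rows)
    merged = left_df.merge(right_df, how='left', on=['key1', 'key2'])  # illustrative keys
    return merged.to_dict('records')


with beam.Pipeline() as p:
    left = p | 'Left' >> beam.Create([{'key1': 1, 'key2': 'a', 'x': 10}])
    right = p | 'Right' >> beam.Create([{'key1': 1, 'key2': 'a', 'y': 20}])
    (left
     | 'CollectLeft' >> beam.combiners.ToList()
     | 'Join' >> beam.FlatMap(left_join, right_rows=beam.pvalue.AsList(right))
     | 'Print' >> beam.Map(print))
```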

Schema update while writing to Avro files

Submitted by 空扰寡人 on 2020-02-06 08:47:09
Question: Context: We have a Dataflow job that transforms Pub/Sub messages into Avro GenericRecords and writes them into GCS as ".avro" files. The transformation between Pub/Sub messages and GenericRecords requires a schema. This schema changes weekly, with field additions only. We want to be able to update the fields without updating the Dataflow job. What we did: We took the advice from this post and created a Guava Cache that refreshes the content every minute. The refresh function will pull the schema from GCS.
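
The job in the question is Java with a Guava cache, but the refresh-on-read idea can be sketched language-agnostically; below is a Python analogue using Beam's FileSystems API, with a placeholder schema path and refresh interval:

```python
import json
import time

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class FormatWithCurrentSchema(beam.DoFn):
    """Re-reads the Avro schema from GCS at most once per refresh interval."""

    def __init__(self, schema_path='gs://my-bucket/schemas/current.avsc',  # placeholder
                 refresh_secs=60):
        self._schema_path = schema_path
        self._refresh_secs = refresh_secs
        self._schema = None
        self._loaded_at = 0.0

    def _maybe_refresh(self):
        if self._schema is None or time.time() - self._loaded_at > self._refresh_secs:
            with FileSystems.open(self._schema_path) as f:
                self._schema = json.loads(f.read())
            self._loaded_at = time.time()

    def process(self, element):
        self._maybe_refresh()
        # ... build a record that matches self._schema from the incoming message ...
        yield element
```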