apache-beam

Max and min for several fields inside a PCollection in Apache Beam with Python

Submitted by China☆狼群 on 2020-02-06 08:17:30
Question: I am using Apache Beam via the Python SDK and have the following problem: I have a PCollection with approximately 1 million entries; each entry looks like a list of 2-tuples [(key1,value1),(key2,value2),...] with length ~150. I need to find the max and min values across all entries of the PCollection for each key in order to normalize the values. Ideally, it would be good to obtain a PCollection with a list of tuples [(key,max_value,min_value),...] and then it will be easy to proceed with
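One way to approach this (a sketch, not taken from the original question or its answers): flatten each entry's list of pairs into individual (key, value) elements, then combine per key with a custom CombineFn that tracks the running minimum and maximum. The MinMaxFn name and the sample data below are illustrative.

```python
import apache_beam as beam


class MinMaxFn(beam.CombineFn):
    """Tracks a (min, max) pair of values for each key."""

    def create_accumulator(self):
        return (float('inf'), float('-inf'))

    def add_input(self, acc, value):
        lo, hi = acc
        return (min(lo, value), max(hi, value))

    def merge_accumulators(self, accumulators):
        los, his = zip(*accumulators)
        return (min(los), max(his))

    def extract_output(self, acc):
        return acc


with beam.Pipeline() as p:
    entries = p | beam.Create([
        [('key1', 1.0), ('key2', 5.0)],
        [('key1', 3.0), ('key2', -2.0)],
    ])
    min_max = (
        entries
        | 'FlattenPairs' >> beam.FlatMap(lambda entry: entry)   # -> (key, value)
        | 'MinMaxPerKey' >> beam.CombinePerKey(MinMaxFn())       # -> (key, (min, max))
        | 'Reshape' >> beam.Map(lambda kv: (kv[0], kv[1][1], kv[1][0]))  # (key, max, min)
        | 'Print' >> beam.Map(print)
    )
```

The resulting (key, max_value, min_value) PCollection can then be passed back as a side input to the normalization step.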

Sideload static data

Submitted by 北城余情 on 2020-02-05 02:04:53
Question: When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be what is called side loading? I read the documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and it mentions apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but there are zero results if I Google their usage and I can't find any example for Python. How can I read a file from storage from within a ParDo
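A minimal sketch of the side-input pattern the question is after (the gs:// path and the TagWithSchema DoFn are hypothetical): load the schema file once into a one-element PCollection, then pass it to ParDo via AsSingleton.

```python
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.pvalue import AsSingleton


def load_json_from_gcs(path):
    """Read a whole file (local or gs://) and parse it as JSON."""
    with FileSystems.open(path) as f:
        return json.loads(f.read().decode('utf-8'))


class TagWithSchema(beam.DoFn):
    """Hypothetical DoFn: receives the parsed schema dict as a side input."""

    def process(self, element, schema):
        yield (element, schema.get('title'))


with beam.Pipeline() as p:
    # Load the schema exactly once and make it a single-element PCollection.
    schema = (
        p
        | 'SchemaPath' >> beam.Create(['gs://my-bucket/schema.json'])
        | 'LoadSchema' >> beam.Map(load_json_from_gcs)
    )

    records = p | 'Records' >> beam.Create(['a', 'b', 'c'])

    tagged = records | 'TagWithSchema' >> beam.ParDo(
        TagWithSchema(), schema=AsSingleton(schema))
```

AsSingleton materializes the one-element schema PCollection and hands it to every invocation of process as a regular Python object.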

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

Submitted by 北战南征 on 2020-01-31 19:43:31
Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below). I wonder if this is supposed to work efficiently in Dataflow. If not
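For reference, a minimal sketch of the branching pattern being described (the SumPerKey/CountPerKey transforms are placeholders for the actual feature extraction steps): group once, then let several branches consume the same grouped PCollection.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | beam.Create([('a', 1), ('a', 2), ('b', 3)])

    # Group once; the resulting PCollection can be consumed by several branches.
    grouped = pairs | 'GroupOnce' >> beam.GroupByKey()

    sums = grouped | 'SumPerKey' >> beam.Map(
        lambda kv: (kv[0], sum(kv[1])))
    counts = grouped | 'CountPerKey' >> beam.Map(
        lambda kv: (kv[0], len(list(kv[1]))))
```

Each branch receives its own iterable over the grouped values, so the GroupByKey shuffle is only paid for once in the pipeline graph.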

Apache Beam 2.7.0 crashes in UTF-8 decoding of French characters

Submitted by 旧巷老猫 on 2020-01-30 05:24:40
Question: I am trying to write a CSV from a Google Cloud Platform bucket into Datastore; it contains French characters/accents, but I get an error message about decoding. After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually... The OS I am using has "ascii" encoding by default, and I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py". def setencoding(): """Set the string
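Rather than patching site.py, one option is to decode explicitly inside the pipeline. A sketch (the gs:// path is hypothetical, and it assumes the CSV really is UTF-8 encoded): decode each line defensively before any further processing, so the default "ascii" codec is never relied on.

```python
# -*- coding: utf-8 -*-
import apache_beam as beam


def decode_line(line):
    """Decode a raw CSV line to text, assuming the file is UTF-8 encoded.

    On Python 2 (the py27 env mentioned above) some sources may hand back
    byte strings; decoding explicitly avoids falling back to the default
    'ascii' codec.
    """
    if isinstance(line, bytes):
        return line.decode('utf-8')
    return line


with beam.Pipeline() as p:
    rows = (
        p
        | 'ReadCsv' >> beam.io.ReadFromText('gs://my-bucket/clients.csv')
        | 'DecodeUtf8' >> beam.Map(decode_line)
        | 'SplitColumns' >> beam.Map(lambda line: line.split(','))
    )
```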

Apache Beam 2.12.0 with Java 11 support?

Submitted by 人走茶凉 on 2020-01-25 07:37:09
Question: Does Apache Beam 2.12.0 support Java 11, or should I still stick with a stable Java 8 SDK for now? I see the documentation recommends Python 3.5 with Beam 2.12.0, compared to other higher Python versions. How compatible is it with Java 11 at this time? So, would the stable choice with Apache Beam 2.12.0 still be Java 8? I faced a few build issues when using Beam 2.12.0 with Java 11. Answer 1: Beam officially doesn't support Java 11; it has only experimental

SSLHandshakeException when running Apache Beam Pipeline in Dataflow

Submitted by 守給你的承諾、 on 2020-01-25 07:26:07
Question: I have an Apache Beam pipeline. In one of the DoFn steps it makes an HTTPS call (think REST API). All this works fine with the DirectRunner in my local environment. This is my local environment, Apache Beam 2.16.0: $ mvn -version Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T12:00:29-07:00) Maven home: /opt/apache-maven-3.6.1 Java version: 1.8.0_222, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre Default locale: en, platform encoding: UTF-8 OS name:

ZetaSQL Sample Using Apache beam

Submitted by 一曲冷凌霜 on 2020-01-25 06:46:05
Question: I am facing issues while using ZetaSQL in the Apache Beam framework (2.17.0-SNAPSHOT). After going through the Apache Beam documentation I am not able to find any sample for ZetaSQL. I tried to add the planner: options.setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner"); but I am still facing the issue. A snippet is added below for help: String sql = "SELECT CAST (1243 as INT64), " + "CAST ('2018-09-15 12:59:59.000000+00' as TIMESTAMP), " + "CAST ('string' as STRING);";

Apache Beam pipeline with PubSubIO error using Spark Runner PubsubUnboundedSource$PubsubReader.getWatermark(PubsubUnboundedSource.java:1030)

Submitted by 丶灬走出姿态 on 2020-01-25 06:42:08
Question: A Beam pipeline with PubSubIO runs fine on the Direct Runner and the Dataflow runner; however, when I run it on the Spark Runner (standalone Spark instance) I get a PubsubUnboundedSource error. This is the piece of code where I read from a GCP Pub/Sub subscription, parse the contents of the Pub/Sub message into an object with a DoFn, extract event time from the object, and window the resulting PCollection into 20-second windows: // Take input from pubsub and make pcollections of