apache-beam

Max and min for several fields inside a PCollection in Apache Beam with Python

Submitted by China☆狼群 on 2020-02-06 08:17:30
Question: I am using Apache Beam via the Python SDK and have the following problem: I have a PCollection with approximately 1 million entries; each entry looks like a list of 2-tuples [(key1,value1),(key2,value2),...] with length ~150. I need to find the max and min values across all entries of the PCollection for each key in order to normalize the values. Ideally, it would be good to obtain a PCollection with a list of tuples [(key,max_value,min_value),...] and then it will be easy to proceed with
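One way to approach this (a sketch, not taken from the original question or its answers): flatten each entry's list of pairs into individual (key, value) elements, then combine per key with a custom CombineFn that tracks the running minimum and maximum. The MinMaxFn name and the sample data below are illustrative.

```python
import apache_beam as beam


class MinMaxFn(beam.CombineFn):
    """Tracks a (min, max) pair of values for each key."""

    def create_accumulator(self):
        return (float('inf'), float('-inf'))

    def add_input(self, acc, value):
        lo, hi = acc
        return (min(lo, value), max(hi, value))

    def merge_accumulators(self, accumulators):
        los, his = zip(*accumulators)
        return (min(los), max(his))

    def extract_output(self, acc):
        return acc


with beam.Pipeline() as p:
    entries = p | beam.Create([
        [('key1', 1.0), ('key2', 5.0)],
        [('key1', 3.0), ('key2', -2.0)],
    ])
    min_max = (
        entries
        | 'FlattenPairs' >> beam.FlatMap(lambda entry: entry)   # -> (key, value)
        | 'MinMaxPerKey' >> beam.CombinePerKey(MinMaxFn())       # -> (key, (min, max))
        | 'Reshape' >> beam.Map(lambda kv: (kv[0], kv[1][1], kv[1][0]))  # (key, max, min)
        | 'Print' >> beam.Map(print)
    )
```

The resulting (key, max_value, min_value) PCollection can then be passed back as a side input to the normalization step.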

Sideload static data

Submitted by 北城余情 on 2020-02-05 02:04:53
Question: When processing my data in a ParDo I need to use a JSON schema stored on Google Cloud Storage. I think this may be what is called side loading? I read the documentation (https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.pvalue.html) and it mentions apache_beam.pvalue.AsSingleton and apache_beam.pvalue.AsSideInput, but there are zero results if I Google their usage and I can't find any example for Python. How can I read a file from storage from within a ParDo
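A minimal sketch of the side-input pattern the question is after (the gs:// path and the TagWithSchema DoFn are hypothetical): load the schema file once into a one-element PCollection, then pass it to ParDo via AsSingleton.

```python
import json

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.pvalue import AsSingleton


def load_json_from_gcs(path):
    """Read a whole file (local or gs://) and parse it as JSON."""
    with FileSystems.open(path) as f:
        return json.loads(f.read().decode('utf-8'))


class TagWithSchema(beam.DoFn):
    """Hypothetical DoFn: receives the parsed schema dict as a side input."""

    def process(self, element, schema):
        yield (element, schema.get('title'))


with beam.Pipeline() as p:
    # Load the schema exactly once and make it a single-element PCollection.
    schema = (
        p
        | 'SchemaPath' >> beam.Create(['gs://my-bucket/schema.json'])
        | 'LoadSchema' >> beam.Map(load_json_from_gcs)
    )

    records = p | 'Records' >> beam.Create(['a', 'b', 'c'])

    tagged = records | 'TagWithSchema' >> beam.ParDo(
        TagWithSchema(), schema=AsSingleton(schema))
```

AsSingleton materializes the one-element schema PCollection and hands it to every invocation of process as a regular Python object.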

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

Submitted by 北战南征 on 2020-01-31 19:43:31
Question: I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to other transformations (see sample code below). I wonder if this is supposed to work efficiently in Dataflow. If not
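For reference, a minimal sketch of the branching pattern being described (the SumPerKey/CountPerKey transforms are placeholders for the actual feature extraction steps): group once, then let several branches consume the same grouped PCollection.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    pairs = p | beam.Create([('a', 1), ('a', 2), ('b', 3)])

    # Group once; the resulting PCollection can be consumed by several branches.
    grouped = pairs | 'GroupOnce' >> beam.GroupByKey()

    sums = grouped | 'SumPerKey' >> beam.Map(
        lambda kv: (kv[0], sum(kv[1])))
    counts = grouped | 'CountPerKey' >> beam.Map(
        lambda kv: (kv[0], len(list(kv[1]))))
```

Each branch receives its own iterable over the grouped values, so the GroupByKey shuffle is only paid for once in the pipeline graph.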

Apache Beam 2.7.0 crashes in UTF-8 decoding of French characters

Submitted by 旧巷老猫 on 2020-01-30 05:24:40
Question: I am trying to write a CSV from a Google Cloud Platform bucket into Datastore; it contains French characters/accents, but I get an error message about decoding. After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually... The OS I am using has "ascii" encoding by default, and I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py". def setencoding(): """Set the string
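Rather than patching site.py, one option is to decode explicitly inside the pipeline. A sketch (the gs:// path is hypothetical, and it assumes the CSV really is UTF-8 encoded): decode each line defensively before any further processing, so the default "ascii" codec is never relied on.

```python
# -*- coding: utf-8 -*-
import apache_beam as beam


def decode_line(line):
    """Decode a raw CSV line to text, assuming the file is UTF-8 encoded.

    On Python 2 (the py27 env mentioned above) some sources may hand back
    byte strings; decoding explicitly avoids falling back to the default
    'ascii' codec.
    """
    if isinstance(line, bytes):
        return line.decode('utf-8')
    return line


with beam.Pipeline() as p:
    rows = (
        p
        | 'ReadCsv' >> beam.io.ReadFromText('gs://my-bucket/clients.csv')
        | 'DecodeUtf8' >> beam.Map(decode_line)
        | 'SplitColumns' >> beam.Map(lambda line: line.split(','))
    )
```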

Apache Beam 2.12.0 with Java 11 support?

Submitted by 人走茶凉 on 2020-01-25 07:37:09
Question: Does Apache Beam 2.12.0 support Java 11, or should I still stick with a stable Java 8 SDK for now? I see the documentation recommends Python 3.5 with Beam 2.12.0, compared to other higher Python versions. How compatible is it with Java 11 at this time? So, would the stable choice with Apache Beam 2.12.0 still be Java 8? I faced a few build issues when using Beam 2.12.0 with Java 11. Answer 1: Beam officially doesn't support Java 11; it has only experimental

SSLHandshakeException when running Apache Beam Pipeline in Dataflow

Submitted by 守給你的承諾、 on 2020-01-25 07:26:07
Question: I have an Apache Beam pipeline. In one of the DoFn steps it makes an HTTPS call (think REST API). All this works fine with the DirectRunner in my local environment. This is my local environment, Apache Beam 2.16.0: $ mvn -version Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T12:00:29-07:00) Maven home: /opt/apache-maven-3.6.1 Java version: 1.8.0_222, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre Default locale: en, platform encoding: UTF-8 OS name:

ZetaSQL Sample Using Apache beam

Submitted by 一曲冷凌霜 on 2020-01-25 06:46:05
Question: I am facing issues while using ZetaSQL in the Apache Beam framework (2.17.0-SNAPSHOT). After going through the Apache Beam documentation I am not able to find any sample for ZetaSQL. I tried to add the planner: options.setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner"); but I am still facing the issue. A snippet is added below for help: String sql = "SELECT CAST (1243 as INT64), " + "CAST ('2018-09-15 12:59:59.000000+00' as TIMESTAMP), " + "CAST ('string' as STRING);";

Apache Beam pipeline with PubSubIO error using Spark Runner PubsubUnboundedSource$PubsubReader.getWatermark(PubsubUnboundedSource.java:1030)

Submitted by 丶灬走出姿态 on 2020-01-25 06:42:08
Question: A Beam pipeline with PubSubIO runs fine on the Direct Runner and the Dataflow runner; however, when I run it on the Spark Runner (standalone Spark instance) I get a PubsubUnboundedSource error. This is the piece of code where I read from a GCP Pub/Sub subscription, parse the contents of the Pub/Sub message into an object with a DoFn, extract event time from the object, and window the resulting PCollection into 20-second windows: // Take input from pubsub and make pcollections of