apache-beam

Reading CSV header with Dataflow

Submitted by 半世苍凉 on 2019-12-04 06:57:26
I have a CSV file, and I don't know the column names ahead of time. I need to output the data as JSON after some transformations in Google Dataflow. What's the best way to take the header row and propagate the labels through all the rows? For example:

a,b,c
1,2,3
4,5,6

...becomes (approximately):

{a:1, b:2, c:3}
{a:4, b:5, c:6}

You should implement a custom FileBasedSource (similar to TextIO.TextSource) that reads the first line and stores the header data:

@Override
protected void startReading(final ReadableByteChannel channel) throws IOException {
    lineReader = new LineReader(channel);
    if ...
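The snippet above is the Java custom-source route. In the Python SDK, one simpler alternative (a minimal sketch only; the input path and output prefix below are hypothetical) is to read the header line once before constructing the pipeline and zip it with every data row:

import csv
import io
import json

import apache_beam as beam

def run(input_path='input.csv'):  # hypothetical local path; use GCS I/O for gs:// files
    # Read the header once, outside the pipeline, since the column names
    # are not known ahead of time.
    with io.open(input_path, 'r') as f:
        header = next(csv.reader(f))

    with beam.Pipeline() as p:
        (p
         | 'ReadRows' >> beam.io.ReadFromText(input_path, skip_header_lines=1)
         | 'ParseCsv' >> beam.Map(lambda line: next(csv.reader([line])))
         | 'ZipWithHeader' >> beam.Map(lambda fields: dict(zip(header, fields)))
         | 'ToJson' >> beam.Map(json.dumps)
         | 'Write' >> beam.io.WriteToText('rows', file_name_suffix='.json'))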

Apache Beam: PubsubReader fails with NPE

Submitted by 落爺英雄遲暮 on 2019-12-04 06:16:09
Question: I have a Beam pipeline that reads from Pub/Sub and writes to BigQuery after applying some transformations. The pipeline fails consistently with an NPE. I am using Beam SDK version 0.6.0. Any idea what I could be doing wrong? I am trying to run the pipeline with a DirectRunner.

java.lang.NullPointerException
    at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubReader.ackBatch(PubsubUnboundedSource.java:640)
    at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubCheckpoint.finalizeCheckpoint ...

Maven conflict in Java app with google-cloud-core-grpc dependency

Submitted by 别来无恙 on 2019-12-04 04:07:42
(I've also raised a GitHub issue for this: https://github.com/googleapis/google-cloud-java/issues/4095) I have the latest versions of the following two dependencies for Apache Beam:

Dependency 1: google-cloud-dataflow-java-sdk-all (a distribution of Apache Beam designed to simplify usage of Apache Beam on the Google Cloud Dataflow service, https://mvnrepository.com/artifact/com.google.cloud.dataflow/google-cloud-dataflow-java-sdk-all)

<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>2.5.0</version>
</dependency>

Forcing an empty pane/window in streaming in Apache Beam

Submitted by 人走茶凉 on 2019-12-04 02:03:41
Question: I am trying to implement a pipeline that takes in a stream of data and, every minute, outputs True if there is any element in that minute's interval and False if there is none. The pane (even with a repeating trigger) or window (fixed window) does not seem to fire if there is no element for the duration. One workaround I am considering is to put the stream into a global window, use a ValueState to keep a queue that accumulates the data, and use a timer as the trigger to examine the queue. I wonder if there is any ...
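Along the lines of that state-and-timer workaround, here is a minimal, heavily hedged Python-SDK sketch using per-key state and a looping processing-time timer. Support for user state and timers varies by runner and SDK version, and the timer can only start once the key has seen at least one element:

import time

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class EveryMinutePresenceFn(beam.DoFn):
    # Elements seen since the timer last fired, plus a one-shot flag that
    # records whether the looping timer has already been armed for this key.
    SEEN = BagStateSpec('seen', VarIntCoder())
    ARMED = BagStateSpec('armed', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)

    def process(self, element,
                seen=beam.DoFn.StateParam(SEEN),
                armed=beam.DoFn.StateParam(ARMED),
                flush=beam.DoFn.TimerParam(FLUSH)):
        seen.add(1)
        if not any(True for _ in armed.read()):
            armed.add(1)
            flush.set(time.time() + 60)  # arm the looping timer once

    @on_timer(FLUSH)
    def on_flush(self,
                 seen=beam.DoFn.StateParam(SEEN),
                 flush=beam.DoFn.TimerParam(FLUSH)):
        had_elements = any(True for _ in seen.read())
        seen.clear()
        flush.set(time.time() + 60)  # re-arm so the timer keeps firing
        yield had_elements  # True if anything arrived in the last minute

State and timers require a keyed input, e.g. stream | beam.Map(lambda x: ('all', x)) | beam.ParDo(EveryMinutePresenceFn()).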

Difference between Apache Beam and Apache NiFi

Submitted by 只愿长相守 on 2019-12-04 00:44:14
Question: What are the use cases for Apache Beam and Apache NiFi? They both seem to be data flow engines. If they have similar use cases, which of the two is better?

Answer 1: Apache Beam is an abstraction layer for stream-processing systems such as Apache Flink, Apache Spark (Streaming), Apache Apex, and Apache Storm. It lets you write your code against a standard API and then execute the code on any of the underlying platforms. So, theoretically, if you wrote your code against the Beam API, that ...
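To make the "write once, run on any runner" point concrete, here is a minimal, hedged Python sketch: the only thing that changes between platforms is the --runner pipeline option (for example DirectRunner, DataflowRunner, or a Flink/Spark runner), not the pipeline code itself.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    # The runner is chosen at launch time, e.g. --runner=DirectRunner or
    # --runner=DataflowRunner; the transforms below stay the same.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello', 'beam'])
         | 'Upper' >> beam.Map(str.upper)
         | 'Print' >> beam.Map(print))

if __name__ == '__main__':
    run()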

IllegalArgumentException: Unable to convert url (jar:file:/app.jar!/BOOT-INF/classes!/) to file

Submitted by 限于喜欢 on 2019-12-03 22:28:45
I built a Spring Boot 2.0.0.RC application with Google Dataflow and other services and deployed it with the Maven command mvn appengine:deploy. The build deploys successfully to App Engine and an instance is created. The problem is that the App Engine dashboard displays the following error:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke ...

Google Dataflow: running dynamic query with BigQuery+Pub/Sub in Python

Submitted by 一世执手 on 2019-12-03 21:51:46
What I would like to do in the pipeline:

1. Read from Pub/Sub (done)
2. Transform this data to a dictionary (done)
3. Take the value of a specified key from the dict (done)
4. Run a parameterized/dynamic query against BigQuery, where the WHERE clause should look like this:
   SELECT field1 FROM Table WHERE field2 = @valueFromP/S

The pipeline:

| 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='')
| 'String to dictionary' >> beam.Map(lambda s: data_ingestion.parse_method(s))
| 'BigQuery' >> <Here is where I'm not sure how to do it>

The normal way to read from BQ would be:

| 'Read' >> beam.io.Read(beam ...
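beam.io.Read with a BigQuery source takes a fixed query at pipeline-construction time, so one common workaround is to issue the parameterized query from inside a DoFn using the google-cloud-bigquery client. This is a hedged sketch only; the project, dataset, table, and field names are hypothetical:

import apache_beam as beam
from google.cloud import bigquery

class QueryBigQueryFn(beam.DoFn):
    def setup(self):
        # One client per DoFn instance, reused across bundles.
        self._client = bigquery.Client()

    def process(self, value_from_pubsub):
        query = ('SELECT field1 FROM `my-project.my_dataset.Table` '
                 'WHERE field2 = @valueFromPS')
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter('valueFromPS', 'STRING', value_from_pubsub),
        ])
        for row in self._client.query(query, job_config=job_config):
            yield row.field1

This DoFn would stand in for the missing step, e.g. | 'BigQuery' >> beam.ParDo(QueryBigQueryFn()).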

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

Submitted by 蹲街弑〆低调 on 2019-12-03 20:54:34
I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key. Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to the other transformations (see the sample code below). I wonder if this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or ...
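For reference, the branching itself is straightforward in the Python SDK: apply GroupByKey once, bind the result to a variable, and feed that same PCollection to each downstream transform. Whether Dataflow then reuses or re-executes the upstream stages is exactly what the question asks, so the sketch below (with made-up data) only shows the pipeline shape:

import apache_beam as beam

with beam.Pipeline() as p:
    grouped = (p
               | 'Create' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])
               | 'GroupOnce' >> beam.GroupByKey())

    # Both branches consume the same grouped PCollection.
    sums = grouped | 'SumPerKey' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
    counts = grouped | 'CountPerKey' >> beam.Map(lambda kv: (kv[0], len(list(kv[1]))))

    sums | 'PrintSums' >> beam.Map(print)
    counts | 'PrintCounts' >> beam.Map(print)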

What is the difference between DoFn.Setup and DoFn.StartBundle?

Submitted by 你说的曾经没有我的故事 on 2019-12-03 19:38:40
Question: What is the difference between these two annotations?

DoFn.Setup: "Annotation for the method to use to prepare an instance for processing bundles of elements." Uses the word "bundle" and takes zero arguments.

DoFn.StartBundle: "Annotation for the method to use to prepare an instance for processing a batch of elements." Uses the word "batch" and takes zero or one argument (StartBundleContext, a way to access PipelineOptions).

What I'm trying to do: I need to initialize a library within the DoFn instance, ...
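The question is about the Java annotations, but the same lifecycle exists in both SDKs: Setup runs once per DoFn instance before any bundle, which makes it the usual place for expensive, bundle-independent initialization such as creating a client or compiling a pattern, while StartBundle runs before every bundle. A minimal Python-SDK sketch of the two hooks, for illustration only:

import re

import apache_beam as beam

class ExtractWordsFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance, before any bundle is processed:
        # the place to initialize a library or other expensive resource.
        self._pattern = re.compile(r'\w+')

    def start_bundle(self):
        # Runs before each bundle: the place for per-bundle state.
        self._words_in_bundle = 0

    def process(self, element):
        for word in self._pattern.findall(element):
            self._words_in_bundle += 1
            yield word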

How to create a Dataflow pipeline from Pub/Sub to GCS in Python

Submitted by 谁说胖子不能爱 on 2019-12-03 18:08:06
Question: I want to use Dataflow to move data from Pub/Sub to GCS. Basically, I want Dataflow to accumulate messages for a fixed amount of time (15 minutes, for example), then write that data as a text file to GCS once that amount of time has passed. My final goal is to create a custom pipeline, so the "Pub/Sub to Cloud Storage" template is not enough for me, and I don't know Java at all, which made me start tweaking in Python. Here is what I have as of now (Apache Beam Python SDK 2.10.0): ...
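One commonly suggested shape for this (a hedged sketch only: the subscription, bucket path, and file naming below are placeholders, and streaming file writes have had varying levels of SDK support across versions) is to window the stream into fixed 15-minute windows, group each window's messages together, and write each group to GCS from a DoFn:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

class WriteBatchToGcsFn(beam.DoFn):
    def __init__(self, output_prefix):
        self._output_prefix = output_prefix

    def process(self, keyed_batch, window=beam.DoFn.WindowParam):
        # One file per 15-minute window, named after the window start.
        _, messages = keyed_batch
        suffix = window.start.to_utc_datetime().strftime('%Y%m%d-%H%M%S')
        writer = FileSystems.create('%s%s.txt' % (self._output_prefix, suffix))
        try:
            writer.write('\n'.join(messages).encode('utf-8'))
        finally:
            writer.close()

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
               subscription='projects/my-project/subscriptions/my-sub')
         | 'Decode' >> beam.Map(lambda b: b.decode('utf-8'))
         | 'Window15m' >> beam.WindowInto(FixedWindows(15 * 60))
         | 'DummyKey' >> beam.Map(lambda msg: (None, msg))
         | 'GroupPerWindow' >> beam.GroupByKey()
         | 'WriteToGcs' >> beam.ParDo(WriteBatchToGcsFn('gs://my-bucket/output/')))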