apache-beam

How to create groups of N elements from a PCollection Apache Beam Python

Submitted by 我们两清 on 2019-12-03 08:31:22
I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow. The answer in the above link is in Java, whereas the language I'm working with is Python, so I need help building a similar construction. Specifically, I have this: p = beam.Pipeline(options=pipeline_options) lines = p | 'File reading' >> ReadFromText(known_args.input) After this, I need to create another PCollection, but with a list of N rows of "lines", since my use case requires a group of rows; I cannot operate line by line. I tried a ParDo function using variables for count associating with the
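One way to approach this in the Python SDK is the built-in BatchElements transform, which groups an unkeyed PCollection into lists of up to N elements. A minimal sketch (not the asker's code; the input path and N = 100 are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: read lines, then group them into batches of (up to) 100 elements.
# Batch sizes are hints, so batches at bundle boundaries may be smaller.
with beam.Pipeline(options=PipelineOptions()) as p:
    batches = (
        p
        | 'File reading' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
        | 'Batch lines' >> beam.BatchElements(min_batch_size=100,
                                              max_batch_size=100)
        # Each element downstream is now a list of lines, not a single line.
        | 'Process batch' >> beam.Map(lambda batch: len(batch)))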

Apache Airflow or Apache Beam for data processing and job scheduling

Submitted by 只愿长相守 on 2019-12-03 08:28:02
Question: I'm trying to give useful information, but I am far from being a data engineer. I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month. I don't really know Beam or Airflow; I quickly read

Using Dataflow vs. Cloud Composer

Submitted by 前提是你 on 2019-12-03 03:52:59
I apologize for this naive question, but I'd like some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, as it wasn't clear to me from the Google documentation. Currently, I'm using Cloud Dataflow to read a non-standard CSV file, do some basic processing, and load it into BigQuery. Let me give a very basic example: # file.csv type\x01date house\x0112/27/1982 car\x0111/9/1889 From this file we detect the schema and create a BigQuery table, something like this: `table` type (STRING) date (DATE) And we also format our data to insert (in Python) into
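For context, a minimal sketch of this kind of Dataflow pipeline in the Python SDK (this is not the asker's actual code; the bucket, project, and table names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime

def parse_line(line):
    # Split on the non-standard \x01 delimiter; column order follows the header.
    record_type, raw_date = line.split('\x01')
    # Reformat MM/DD/YYYY to the YYYY-MM-DD form BigQuery's DATE type expects.
    date = datetime.strptime(raw_date, '%m/%d/%Y').date().isoformat()
    return {'type': record_type, 'date': date}

# Placeholder paths and table name, purely for illustration.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read CSV' >> beam.io.ReadFromText('gs://my-bucket/file.csv',
                                          skip_header_lines=1)
     | 'Parse' >> beam.Map(parse_line)
     | 'Write to BQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='type:STRING,date:DATE',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))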

AssertionError: assertion failed: copyAndReset must return a zero value copy

Submitted by 依然范特西╮ on 2019-12-02 19:17:27
Question: When I apply ParDo.of(new ParDoFn()) to the PCollection named textInput, the program throws this exception, but it runs normally when I delete .apply(ParDo.of(new ParDoFn())). //SparkRunner private static void testHadoop(Pipeline pipeline){ Class<? extends FileInputFormat<LongWritable, Text>> inputFormatClass = (Class<? extends FileInputFormat<LongWritable, Text>>) (Class<?>) TextInputFormat.class; @SuppressWarnings("unchecked") //hdfs://localhost:9000 HadoopIO.Read.Bound<LongWritable,

reading files and folders in order with apache beam

Submitted by 微笑、不失礼 on 2019-12-02 18:31:06
Question: I have a folder structure of the form year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder. Is it possible to do this with Apache Beam? Answer 1: So what I would
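For the "wait for new files" part, one option in newer Python SDKs is fileio.MatchContinuously, which re-evaluates a file pattern on an interval. A rough sketch (not the asker's setup; the bucket path and 5-minute interval are placeholders, and chronological ordering across folders still has to be enforced downstream, e.g. by assigning event timestamps derived from the folder names):

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: keep matching year/month/day/hour/* every 5 minutes and read newly
# arrived files as an unbounded source.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    contents = (
        p
        | 'Match continuously' >> fileio.MatchContinuously(
              'gs://my-bucket/*/*/*/*/*', interval=300)
        | 'Read matches' >> fileio.ReadMatches()
        # Each element is a ReadableFile; read its full contents as text.
        | 'To text' >> beam.Map(lambda f: f.read_utf8()))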

Unable to read XML File stored in GCS Bucket

Submitted by 和自甴很熟 on 2019-12-02 15:47:03
Question: I have tried to follow this documentation as precisely as I could: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html Please find my code below: public static void main(String args[]) { DataflowPipelineOptions options=PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setTempLocation("gs://balajee_test/stagging"); options.setProject("test-1-130106"); Pipeline p=Pipeline.create(options); PCollection<XMLFormatter> record= p

Apache Beam: PubsubReader fails with NPE

Submitted by 喜欢而已 on 2019-12-02 12:21:00
I have a Beam pipeline that reads from Pub/Sub and writes to BigQuery after applying some transformations. The pipeline fails consistently with an NPE. I am using Beam SDK version 0.6.0. Any idea what I could be doing wrong? I am trying to run the pipeline with the DirectRunner. java.lang.NullPointerException at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubReader.ackBatch(PubsubUnboundedSource.java:640) at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubCheckpoint.finalizeCheckpoint(PubsubUnboundedSource.java:313) at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory

Consuming unbounded data in windows with default trigger

Submitted by [亡魂溺海] on 2019-12-02 12:17:53
Question: I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow pipeline. I use a fixed window and write the aggregates to BigQuery. Reading and writing (without windowing and aggregation) work fine, but when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered, and thus the aggregates are never written. Here is my word publisher (it uses kinglear.txt from the examples as the input file):
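For reference, the windowed-count part of such a pipeline might look roughly like the Python sketch below (not the asker's code; the subscription, table, and 60-second window size are placeholders). With an unbounded source the default trigger only fires once the watermark passes the end of each window, so missing element timestamps or a stalled watermark can make the window appear to never trigger:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Sketch: count elements per 60-second fixed window from a Pub/Sub
# subscription and write one row per window to BigQuery, keeping the
# default (watermark) trigger.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/my-sub')
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     | 'Count' >> beam.combiners.Count.Globally().without_defaults()
     | 'To row' >> beam.Map(lambda n: {'count': n})
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.window_counts',
           schema='count:INTEGER'))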

Apache Beam -> BigQuery - insertId for deduplication not working

Submitted by 故事扮演 on 2019-12-02 12:11:34
Question: I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, which I found described in the Google docs. But even though inserts are happening within a few seconds of each other, I still see a lot of rows with the same insertId. Now I'm wondering whether I am perhaps not using the API correctly to take advantage of the deduplication mechanism for streaming inserts offered by BQ. My code in Beam for writing looks as follows:
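The asker's snippet is cut off in this listing. Purely for context (this is not the asker's code; the table name and schema are placeholders), a streaming insert configured with the Python SDK looks roughly like this; with STREAMING_INSERTS the sink generates per-row insertIds, and BigQuery uses them only for best-effort de-duplication of retried requests:

import apache_beam as beam

# Context-only sketch: a BigQuery streaming-insert sink transform.
write_to_bq = beam.io.WriteToBigQuery(
    'my-project:my_dataset.events',
    schema='id:STRING,payload:STRING',
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)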