apache-beam

How to create groups of N elements from a PCollection Apache Beam Python

Submitted by 我们两清 on 2019-12-03 08:31:22
I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow. The answer in the above link is in Java, whereas the language I'm working with is Python, so I need help building a similar construction. Specifically, I have this: p = beam.Pipeline(options=pipeline_options) lines = p | 'File reading' >> ReadFromText(known_args.input) After this, I need to create another PCollection, but with a list of N rows of "lines", since my use case requires a group of rows; I cannot operate line by line. I tried a ParDo function using variables for count associating with the
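One way to approach this in the Python SDK is the built-in BatchElements transform, which groups an unkeyed PCollection into lists of up to N elements. A minimal sketch (not the asker's code; the input path and N = 100 are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: read lines, then group them into batches of (up to) 100 elements.
# Batch sizes are hints, so batches at bundle boundaries may be smaller.
with beam.Pipeline(options=PipelineOptions()) as p:
    batches = (
        p
        | 'File reading' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
        | 'Batch lines' >> beam.BatchElements(min_batch_size=100,
                                              max_batch_size=100)
        # Each element downstream is now a list of lines, not a single line.
        | 'Process batch' >> beam.Map(lambda batch: len(batch)))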

Apache Airflow or Apache Beam for data processing and job scheduling

Submitted by 只愿长相守 on 2019-12-03 08:28:02
Question: I'm trying to give useful information, but I am far from being a data engineer. I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month. I don't really know Beam or Airflow; I quickly read

Using Dataflow vs. Cloud Composer

Submitted by 前提是你 on 2019-12-03 03:52:59
I apologize for this naive question, but I'd like some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, as it wasn't clear to me from the Google documentation. Currently, I'm using Cloud Dataflow to read a non-standard CSV file, do some basic processing, and load it into BigQuery. Let me give a very basic example: # file.csv type\x01date house\x0112/27/1982 car\x0111/9/1889 From this file we detect the schema and create a BigQuery table, something like this: `table` type (STRING) date (DATE) And we also format our data to insert (in Python) into
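For context, a minimal sketch of this kind of Dataflow pipeline in the Python SDK (this is not the asker's actual code; the bucket, project, and table names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime

def parse_line(line):
    # Split on the non-standard \x01 delimiter; column order follows the header.
    record_type, raw_date = line.split('\x01')
    # Reformat MM/DD/YYYY to the YYYY-MM-DD form BigQuery's DATE type expects.
    date = datetime.strptime(raw_date, '%m/%d/%Y').date().isoformat()
    return {'type': record_type, 'date': date}

# Placeholder paths and table name, purely for illustration.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read CSV' >> beam.io.ReadFromText('gs://my-bucket/file.csv',
                                          skip_header_lines=1)
     | 'Parse' >> beam.Map(parse_line)
     | 'Write to BQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='type:STRING,date:DATE',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))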

AssertionError: assertion failed: copyAndReset must return a zero value copy

Submitted by 依然范特西╮ on 2019-12-02 19:17:27
Question: When I apply ParDo.of(new ParDoFn()) to the PCollection named textInput, the program throws this exception, but it runs normally when I delete .apply(ParDo.of(new ParDoFn())). //SparkRunner private static void testHadoop(Pipeline pipeline){ Class<? extends FileInputFormat<LongWritable, Text>> inputFormatClass = (Class<? extends FileInputFormat<LongWritable, Text>>) (Class<?>) TextInputFormat.class; @SuppressWarnings("unchecked") //hdfs://localhost:9000 HadoopIO.Read.Bound<LongWritable,

reading files and folders in order with apache beam

Submitted by 微笑、不失礼 on 2019-12-02 18:31:06
Question: I have a folder structure of the form year/month/day/hour/*, and I'd like Beam to read this as an unbounded source in chronological order. Specifically, this means reading in all the files in the first hour on record and adding their contents for processing. Then, add the file contents of the next hour for processing, up until the current time, where it waits for new files to arrive in the latest year/month/day/hour folder. Is it possible to do this with Apache Beam? Answer 1: So what I would
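For the "wait for new files" part, one option in newer Python SDKs is fileio.MatchContinuously, which re-evaluates a file pattern on an interval. A rough sketch (not the asker's setup; the bucket path and 5-minute interval are placeholders, and chronological ordering across folders still has to be enforced downstream, e.g. by assigning event timestamps derived from the folder names):

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: keep matching year/month/day/hour/* every 5 minutes and read newly
# arrived files as an unbounded source.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    contents = (
        p
        | 'Match continuously' >> fileio.MatchContinuously(
              'gs://my-bucket/*/*/*/*/*', interval=300)
        | 'Read matches' >> fileio.ReadMatches()
        # Each element is a ReadableFile; read its full contents as text.
        | 'To text' >> beam.Map(lambda f: f.read_utf8()))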

Unable to read XML File stored in GCS Bucket

Submitted by 和自甴很熟 on 2019-12-02 15:47:03
Question: I have tried to follow this documentation as precisely as I could: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html Please find my code below: public static void main(String args[]) { DataflowPipelineOptions options=PipelineOptionsFactory.as(DataflowPipelineOptions.class); options.setTempLocation("gs://balajee_test/stagging"); options.setProject("test-1-130106"); Pipeline p=Pipeline.create(options); PCollection<XMLFormatter> record= p

Apache Beam: PubsubReader fails with NPE

Submitted by 喜欢而已 on 2019-12-02 12:21:00
I have a Beam pipeline that reads from Pub/Sub and writes to BigQuery after applying some transformations. The pipeline fails consistently with an NPE. I am using Beam SDK version 0.6.0. Any idea what I could be doing wrong? I am trying to run the pipeline with the DirectRunner. java.lang.NullPointerException at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubReader.ackBatch(PubsubUnboundedSource.java:640) at org.apache.beam.sdk.io.PubsubUnboundedSource$PubsubCheckpoint.finalizeCheckpoint(PubsubUnboundedSource.java:313) at org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory

Consuming unbounded data in windows with default trigger

Submitted by [亡魂溺海] on 2019-12-02 12:17:53
Question: I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow pipeline. I use a fixed window and write the aggregates to BigQuery. Reading and writing (without windowing and aggregation) work fine, but when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered, and thus the aggregates are never written. Here is my word publisher (it uses kinglear.txt from the examples as the input file):
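For reference, the windowed-count part of such a pipeline might look roughly like the Python sketch below (not the asker's code; the subscription, table, and 60-second window size are placeholders). With an unbounded source the default trigger only fires once the watermark passes the end of each window, so missing element timestamps or a stalled watermark can make the window appear to never trigger:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Sketch: count elements per 60-second fixed window from a Pub/Sub
# subscription and write one row per window to BigQuery, keeping the
# default (watermark) trigger.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/my-sub')
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     | 'Count' >> beam.combiners.Count.Globally().without_defaults()
     | 'To row' >> beam.Map(lambda n: {'count': n})
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.window_counts',
           schema='count:INTEGER'))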

Apache Beam -> BigQuery - insertId for deduplication not working

Submitted by 故事扮演 on 2019-12-02 12:11:34
Question: I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, which I found described in the Google docs. But even though inserts are happening within a few seconds of each other, I still see a lot of rows with the same insertId. Now I'm wondering whether I am perhaps not using the API correctly to take advantage of the deduplication mechanism for streaming inserts offered by BQ. My code in Beam for writing looks as follows:
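The asker's snippet is cut off in this listing. Purely for context (this is not the asker's code; the table name and schema are placeholders), a streaming insert configured with the Python SDK looks roughly like this; with STREAMING_INSERTS the sink generates per-row insertIds, and BigQuery uses them only for best-effort de-duplication of retried requests:

import apache_beam as beam

# Context-only sketch: a BigQuery streaming-insert sink transform.
write_to_bq = beam.io.WriteToBigQuery(
    'my-project:my_dataset.events',
    schema='id:STRING,payload:STRING',
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)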