apache-beam-io

Error creating dataflow template with TextIO and ValueProvider

烂漫一生 submitted on 2019-12-06 15:02:52
Question: I am trying to create a Google Dataflow template, but I can't seem to find a way to do it without producing the following exception:

WARNING: Size estimation of the source failed: RuntimeValueProvider{propertyName=inputFile, default=null}
java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=inputFile, default=null}
    at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.get(ValueProvider.java:234)
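The warning itself is expected when building a template: size estimation runs at graph-construction time, but a RuntimeValueProvider only gets its value when the job is launched, so the template should still be produced despite the WARNING. A minimal sketch of the usual pattern (the TemplateOptions interface and option name are assumptions, not from the original post):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class TemplatePipeline {
  /** Options whose values are only known when the template is launched. */
  public interface TemplateOptions extends PipelineOptions {
    @Description("Path of the file to read from")
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    TemplateOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TemplateOptions.class);
    Pipeline p = Pipeline.create(options);
    // At template creation time getInputFile() has no concrete value, which is
    // why the size-estimation WARNING is logged; the concrete path is supplied
    // as a runtime parameter when the template is launched.
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
    p.run();
  }
}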

Difference between beam.ParDo and beam.Map in the output type?

﹥>﹥吖頭↗ submitted on 2019-12-06 07:44:52
Question: I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am reading csv data; in the first case I pass it to a DoFn using beam.ParDo, which extracts the first element (the date) and then prints it. In the second case I use beam.Map directly to do the same thing, then print it.

class Printer(beam.DoFn):
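The underlying difference is in how each transform treats the user function's return value: beam.Map emits exactly one output element per input, while beam.ParDo iterates over whatever process() returns, so returning a plain string makes it emit one element per character. A small sketch of both branches, assuming comma-separated rows whose first field is the date (the sample data is made up):

import apache_beam as beam

class DateExtractor(beam.DoFn):
    def process(self, data_item):
        # process() must return an iterable; yielding the date makes ParDo
        # emit one element per row instead of one element per character.
        yield data_item.split(',')[0]

with beam.Pipeline() as p:
    rows = p | beam.Create(['2019-12-06,foo,1', '2019-12-07,bar,2'])

    # beam.Map wraps the single return value as one output element.
    dates_map = rows | 'map_dates' >> beam.Map(lambda row: row.split(',')[0])

    # beam.ParDo flattens the iterable returned by process(); with the yield
    # above, both branches produce the same elements.
    dates_pardo = rows | 'pardo_dates' >> beam.ParDo(DateExtractor())

    dates_map | 'print_map' >> beam.Map(print)
    dates_pardo | 'print_pardo' >> beam.Map(print)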

How can I improve performance of TextIO or AvroIO when reading a very large number of files?

可紊 submitted on 2019-12-05 14:24:29
TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a filepattern that expands into a very large number of files - for example, 1M files. How can I read such a large number of files efficiently? When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO. For example:

PCollection<String> lines = p.apply(TextIO.read(
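A sketch of how the hint is attached (the bucket and filepattern are placeholders):

PCollection<String> lines = p.apply(
    TextIO.read()
        .from("gs://my-bucket/path/to/*.txt")  // placeholder filepattern
        // Hints to the runner that the pattern expands into many files, so it
        // can choose a strategy that avoids per-file overhead.
        .withHintMatchesManyFiles());

If the filepatterns only become known inside the pipeline itself, reading them as a PCollection (for example via TextIO.readAll() in the releases that provide it) is another route.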

Difference between beam.ParDo and beam.Map in the output type?

回眸只為那壹抹淺笑 submitted on 2019-12-04 15:02:59
I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am reading csv data; in the first case I pass it to a DoFn using beam.ParDo, which extracts the first element (the date) and then prints it. In the second case I use beam.Map directly to do the same thing, then print it.

class Printer(beam.DoFn):
    def process(self, data_item):
        print data_item

class DateExtractor(beam.DoFn):
    def process(self, data_item

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

十年热恋 submitted on 2019-12-04 14:06:52
I'm setting up a slowly changing lookup Map in my Apache Beam pipeline, and it continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode, but it always hits this exception:

org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey

Is anything wrong with this snippet of code? If I use .discardingFiredPanes() instead, I lose information from the last emit.

pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
    .apply(Window.<Long>into
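View.asMap() requires at most one value per key in each pane, and with a global window in accumulating mode every firing re-emits earlier values for the same key, which is what produces the Duplicate values error. One possible direction, sketched below, is to collapse to a single latest value per key before building the view; ReadLookupFn and the KV<String, String> element shape are assumptions about the rest of the pipeline:

PCollectionView<Map<String, String>> lookup =
    pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        // Hypothetical DoFn that re-reads the source and emits KV<String, String>.
        .apply(ParDo.of(new ReadLookupFn()))
        // Keep only the newest value per key, so each pane has unique keys.
        .apply(Latest.perKey())
        .apply(View.asMap());

If you genuinely need several values per key, View.asMultimap() avoids the uniqueness requirement altogether.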

In Apache Beam, how to handle exceptions/errors at the pipeline IO level

梦想与她 submitted on 2019-12-01 11:49:16
Question: I am using the Spark runner as the pipeline runner in Apache Beam and ran into an error, which is what prompted this question. I know the error was caused by an incorrect column name in the SQL query, but my question is how to handle an error/exception at the IO level:

org.apache.beam.sdk.util.UserCodeException: java.sql.SQLSyntaxErrorException: Unknown column 'FIRST_NAME' in 'field list'
    at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
    at org.apache.beam.sdk.io.jdbc.JdbcIO
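As far as I know there is no hook inside JdbcIO itself to intercept a failure like this bad column name; the SQLSyntaxErrorException is thrown when the read executes and fails the bundle. What you can do is keep the user code around the IO defensive and route bad records to a dead-letter output with tagged outputs. A sketch of that pattern (parseRow and the element types are placeholders, not from the original post):

final TupleTag<KV<String, String>> parsedTag = new TupleTag<KV<String, String>>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = rows.apply("ParseRows",
    ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            try {
              // Hypothetical parsing/conversion step that may throw.
              c.output(parseRow(c.element()));
            } catch (Exception e) {
              // Send the failing element to a side output instead of
              // failing the whole bundle.
              c.output(deadLetterTag, c.element());
            }
          }
        })
        .withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

PCollection<KV<String, String>> parsed = results.get(parsedTag);
PCollection<String> failures = results.get(deadLetterTag);  // log or write these elsewhere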

How to specify insertId when streaming inserts to BigQuery using Apache Beam

天涯浪子 submitted on 2019-12-01 08:56:46
BigQuery supports de-duplication for streaming inserts. How can I use this feature with Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert
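As far as I can tell, the Beam Java SDK does not expose a setter for a custom insertId: for streaming inserts, BigQueryIO generates a random insertId per row internally and reuses it when a row is retried, so the best-effort de-duplication described above happens without extra configuration. A sketch of a plain streaming write (table spec and schema are placeholders):

rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withSchema(schema)
        // Beam attaches its own insertId to each row and reuses it on retries,
        // which is what enables BigQuery's best-effort de-duplication.
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));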

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

一个人想着一个人 submitted on 2019-11-27 05:28:39
I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but I'm unsure whether this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day-partitioned table one needs to pass in a "timePartitioning": { "type": "DAY" } option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference. I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda? Is anyone else having success with creating/writing
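One workaround that has been discussed for the 1.x SDK is exactly that: pre-create the day-partitioned table and then target partition decorators per window, using the overload of BigQueryIO.Write.to() that takes a table function (if your SDK version exposes it). A rough sketch, assuming day-aligned fixed windows and placeholder project/dataset/table names:

rows.apply(BigQueryIO.Write
    .to(new SerializableFunction<BoundedWindow, String>() {
      @Override
      public String apply(BoundedWindow window) {
        // Derive a "$yyyyMMdd" partition decorator from the window start.
        String day = DateTimeFormat.forPattern("yyyyMMdd")
            .withZoneUTC()
            .print(((IntervalWindow) window).start());
        return "my-project:my_dataset.my_table$" + day;
      }
    })
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));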

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

你。 submitted on 2019-11-26 11:35:06
Question: I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but I'm unsure whether this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day-partitioned table one needs to pass in a "timePartitioning": { "type": "DAY" } option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference. I thought that maybe I could pre-create the table, and sneak in a partition
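For completeness, newer Apache Beam releases (the org.apache.beam SDK rather than the 1.x Dataflow SDK) let you declare the partitioning directly on the sink; a sketch with placeholder names:

rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")  // placeholder table spec
    .withSchema(schema)
    // Creates the table as day-partitioned if it does not exist yet.
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));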