apache-beam-io

Error creating dataflow template with TextIO and ValueProvider

烂漫一生 submitted on 2019-12-06 15:02:52
Question: I am trying to create a Google Dataflow template, but I can't seem to find a way to do it without producing the following exception:

WARNING: Size estimation of the source failed: RuntimeValueProvider{propertyName=inputFile, default=null}
java.lang.IllegalStateException: Value only available at runtime, but accessed from a non-runtime context: RuntimeValueProvider{propertyName=inputFile, default=null}
    at org.apache.beam.sdk.options.ValueProvider$RuntimeValueProvider.get(ValueProvider.java:234)
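The warning itself is expected when building a template: size estimation runs at graph-construction time, but a RuntimeValueProvider only gets its value when the job is launched, so the template should still be produced despite the WARNING. A minimal sketch of the usual pattern (the TemplateOptions interface and option name are assumptions, not from the original post):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class TemplatePipeline {
  /** Options whose values are only known when the template is launched. */
  public interface TemplateOptions extends PipelineOptions {
    @Description("Path of the file to read from")
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    TemplateOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TemplateOptions.class);
    Pipeline p = Pipeline.create(options);
    // At template creation time getInputFile() has no concrete value, which is
    // why the size-estimation WARNING is logged; the concrete path is supplied
    // as a runtime parameter when the template is launched.
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
    p.run();
  }
}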

Difference between beam.ParDo and beam.Map in the output type?

﹥>﹥吖頭↗ submitted on 2019-12-06 07:44:52
Question: I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am reading csv data; in the first case I pass it to a DoFn using beam.ParDo, which extracts the first element (the date) and then prints it. In the second case I use beam.Map directly to do the same thing, then print it.

class Printer(beam.DoFn):
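The underlying difference is in how each transform treats the user function's return value: beam.Map emits exactly one output element per input, while beam.ParDo iterates over whatever process() returns, so returning a plain string makes it emit one element per character. A small sketch of both branches, assuming comma-separated rows whose first field is the date (the sample data is made up):

import apache_beam as beam

class DateExtractor(beam.DoFn):
    def process(self, data_item):
        # process() must return an iterable; yielding the date makes ParDo
        # emit one element per row instead of one element per character.
        yield data_item.split(',')[0]

with beam.Pipeline() as p:
    rows = p | beam.Create(['2019-12-06,foo,1', '2019-12-07,bar,2'])

    # beam.Map wraps the single return value as one output element.
    dates_map = rows | 'map_dates' >> beam.Map(lambda row: row.split(',')[0])

    # beam.ParDo flattens the iterable returned by process(); with the yield
    # above, both branches produce the same elements.
    dates_pardo = rows | 'pardo_dates' >> beam.ParDo(DateExtractor())

    dates_map | 'print_map' >> beam.Map(print)
    dates_pardo | 'print_pardo' >> beam.Map(print)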

How can I improve performance of TextIO or AvroIO when reading a very large number of files?

可紊 submitted on 2019-12-05 14:24:29
TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a filepattern that expands into a very large number of files - for example, 1M files. How can I read such a large number of files efficiently? When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO. For example:

PCollection<String> lines = p.apply(TextIO.read(
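A sketch of how the hint is attached (the bucket and filepattern are placeholders):

PCollection<String> lines = p.apply(
    TextIO.read()
        .from("gs://my-bucket/path/to/*.txt")  // placeholder filepattern
        // Hints to the runner that the pattern expands into many files, so it
        // can choose a strategy that avoids per-file overhead.
        .withHintMatchesManyFiles());

If the filepatterns only become known inside the pipeline itself, reading them as a PCollection (for example via TextIO.readAll() in the releases that provide it) is another route.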

Difference between beam.ParDo and beam.Map in the output type?

回眸只為那壹抹淺笑 submitted on 2019-12-04 15:02:59
I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am reading csv data; in the first case I pass it to a DoFn using beam.ParDo, which extracts the first element (the date) and then prints it. In the second case I use beam.Map directly to do the same thing, then print it.

class Printer(beam.DoFn):
    def process(self, data_item):
        print data_item

class DateExtractor(beam.DoFn):
    def process(self, data_item

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

十年热恋 submitted on 2019-12-04 14:06:52
I'm setting up a slowly changing lookup Map in my Apache Beam pipeline, and it continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode, but it always hits this exception:

org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey

Is anything wrong with this snippet of code? If I use .discardingFiredPanes() instead, I lose information from the last emit.

pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
    .apply(Window.<Long>into
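View.asMap() requires at most one value per key in each pane, and with a global window in accumulating mode every firing re-emits earlier values for the same key, which is what produces the Duplicate values error. One possible direction, sketched below, is to collapse to a single latest value per key before building the view; ReadLookupFn and the KV<String, String> element shape are assumptions about the rest of the pipeline:

PCollectionView<Map<String, String>> lookup =
    pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        // Hypothetical DoFn that re-reads the source and emits KV<String, String>.
        .apply(ParDo.of(new ReadLookupFn()))
        // Keep only the newest value per key, so each pane has unique keys.
        .apply(Latest.perKey())
        .apply(View.asMap());

If you genuinely need several values per key, View.asMultimap() avoids the uniqueness requirement altogether.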

In Apache Beam, how to handle exceptions/errors at the pipeline IO level

梦想与她 submitted on 2019-12-01 11:49:16
Question: I am using the Spark runner as the pipeline runner in Apache Beam and ran into an error, which is what prompted this question. I know the error was caused by an incorrect column name in the SQL query, but my question is how to handle an error/exception at the IO level:

org.apache.beam.sdk.util.UserCodeException: java.sql.SQLSyntaxErrorException: Unknown column 'FIRST_NAME' in 'field list'
    at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
    at org.apache.beam.sdk.io.jdbc.JdbcIO
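As far as I know there is no hook inside JdbcIO itself to intercept a failure like this bad column name; the SQLSyntaxErrorException is thrown when the read executes and fails the bundle. What you can do is keep the user code around the IO defensive and route bad records to a dead-letter output with tagged outputs. A sketch of that pattern (parseRow and the element types are placeholders, not from the original post):

final TupleTag<KV<String, String>> parsedTag = new TupleTag<KV<String, String>>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = rows.apply("ParseRows",
    ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            try {
              // Hypothetical parsing/conversion step that may throw.
              c.output(parseRow(c.element()));
            } catch (Exception e) {
              // Send the failing element to a side output instead of
              // failing the whole bundle.
              c.output(deadLetterTag, c.element());
            }
          }
        })
        .withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

PCollection<KV<String, String>> parsed = results.get(parsedTag);
PCollection<String> failures = results.get(deadLetterTag);  // log or write these elsewhere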

How to specify insertId when streaming inserts to BigQuery using Apache Beam

天涯浪子 submitted on 2019-12-01 08:56:46
BigQuery supports de-duplication for streaming inserts. How can I use this feature with Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert
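As far as I can tell, the Beam Java SDK does not expose a setter for a custom insertId: for streaming inserts, BigQueryIO generates a random insertId per row internally and reuses it when a row is retried, so the best-effort de-duplication described above happens without extra configuration. A sketch of a plain streaming write (table spec and schema are placeholders):

rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withSchema(schema)
        // Beam attaches its own insertId to each row and reuses it on retries,
        // which is what enables BigQuery's best-effort de-duplication.
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));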

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

一个人想着一个人 submitted on 2019-11-27 05:28:39
I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but I'm unsure whether this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day-partitioned table one needs to pass in a "timePartitioning": { "type": "DAY" } option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference. I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda? Is anyone else having success with creating/writing
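One workaround that has been discussed for the 1.x SDK is exactly that: pre-create the day-partitioned table and then target partition decorators per window, using the overload of BigQueryIO.Write.to() that takes a table function (if your SDK version exposes it). A rough sketch, assuming day-aligned fixed windows and placeholder project/dataset/table names:

rows.apply(BigQueryIO.Write
    .to(new SerializableFunction<BoundedWindow, String>() {
      @Override
      public String apply(BoundedWindow window) {
        // Derive a "$yyyyMMdd" partition decorator from the window start.
        String day = DateTimeFormat.forPattern("yyyyMMdd")
            .withZoneUTC()
            .print(((IntervalWindow) window).start());
        return "my-project:my_dataset.my_table$" + day;
      }
    })
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));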

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

你。 submitted on 2019-11-26 11:35:06
Question: I wanted to take advantage of the new BigQuery functionality of time-partitioned tables, but I'm unsure whether this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day-partitioned table one needs to pass in a "timePartitioning": { "type": "DAY" } option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference. I thought that maybe I could pre-create the table, and sneak in a partition
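For completeness, newer Apache Beam releases (the org.apache.beam SDK rather than the 1.x Dataflow SDK) let you declare the partitioning directly on the sink; a sketch with placeholder names:

rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")  // placeholder table spec
    .withSchema(schema)
    // Creates the table as day-partitioned if it does not exist yet.
    .withTimePartitioning(new TimePartitioning().setType("DAY"))
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));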