apache-beam

BigQuery writeTableRows Always writing to buffer

喜你入骨 submitted on 2019-12-19 11:45:10
Question: We are trying to write to BigQuery using Apache Beam and Avro. The following seems to work OK:

p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
 .apply("Transform", ParDo.of(new CustomTransformFunction()))
 .apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));

We then tried to use it in the following manner to get data from Google Pub/Sub:

p.begin()
 .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
 .apply(
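For context, one way to keep rows out of the streaming buffer entirely is to load them with batch file loads instead of streaming inserts. Below is a minimal sketch in the Python SDK (the question uses the Java SDK); the project, dataset, table, schema, and temp bucket are placeholders:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Input" >> beam.Create([{"id": 1, "name": "example"}])   # stand-in for the Avro/Pub/Sub source
     | "Load" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.my_table",                       # placeholder table
         schema="id:INTEGER,name:STRING",                        # placeholder schema
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS,       # batch load jobs, no streaming buffer
         custom_gcs_temp_location="gs://my-bucket/bq-tmp"))      # staging area for the load jobs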

Reading nested JSON in Google Dataflow / Apache Beam

核能气质少年 submitted on 2019-12-19 10:34:11
Question: It is possible to read unnested JSON files on Cloud Storage with Dataflow via:

p.apply("read logfiles", TextIO.Read.from("gs://bucket/*").withCoder(TableRowJsonCoder.of()));

If I just want to write those logs with minimal filtering to BigQuery, I can do so by using a DoFn like this one:

private static class Formatter extends DoFn<TableRow,TableRow> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        // .clone() since input is immutable
        TableRow output = c.element()
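For comparison, a minimal Python-SDK sketch (the question uses the Java SDK) that reads newline-delimited JSON, parses it with json.loads so nested objects become nested dicts, applies a trivial filter, and appends to an existing BigQuery table; the bucket, table, and the severity field are made up for illustration:

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "read logfiles" >> beam.io.ReadFromText("gs://bucket/*")   # one JSON object per line
     | "parse" >> beam.Map(json.loads)                            # nested objects become nested dicts
     | "filter" >> beam.Filter(lambda row: row.get("severity") == "ERROR")  # hypothetical field
     | "write" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.logs",                            # placeholder; table assumed to exist
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))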

How to specify insertId when streaming insert to BigQuery using Apache Beam

柔情痞子 submitted on 2019-12-19 09:25:05
Question: BigQuery supports de-duplication for streaming inserts. How can I use this feature with Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

"To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis."
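As background, this is roughly what supplying insertId looks like at the API level, outside of Beam, using the google-cloud-bigquery client; the table, row, and row id are invented for the sketch (how, or whether, BigQueryIO exposes this is exactly what the question asks):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.events")          # placeholder table
rows = [{"user_id": "u1", "action": "click"}]
# row_ids populates insertId per row; BigQuery de-duplicates on it on a best-effort basis.
errors = client.insert_rows_json(table, rows, row_ids=["u1-click-0001"])
if errors:
    raise RuntimeError(errors)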

Start CloudSQL Proxy on Python Dataflow / Apache Beam

可紊 submitted on 2019-12-18 17:01:49
Question: I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from App Engine using a cron job. I have a version which works locally using the DirectRunner. For that I use the CloudSQL (Postgres) proxy client so that I can connect to the database on 127.0.0.1. When using the DataflowRunner with custom commands to start
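One commonly suggested workaround is to launch the Cloud SQL proxy on each worker from the DoFn itself. A rough sketch, assuming a Beam version with DoFn.setup/teardown (on older SDKs start_bundle/finish_bundle play the same role); the proxy path, instance name, database, and credentials are all placeholders:

import subprocess
import apache_beam as beam

class QueryCloudSQL(beam.DoFn):
    def setup(self):
        # Start the Cloud SQL proxy on the worker; binary location and instance are placeholders.
        self._proxy = subprocess.Popen([
            "/usr/local/bin/cloud_sql_proxy",
            "-instances=my-project:europe-west1:my-instance=tcp:5432",
        ])

    def process(self, element):
        import psycopg2  # import inside the DoFn so workers resolve the dependency
        conn = psycopg2.connect(host="127.0.0.1", port=5432,
                                dbname="mydb", user="myuser", password="secret")
        with conn, conn.cursor() as cur:
            cur.execute("SELECT id, payload FROM events")
            for record in cur:
                yield {"id": record[0], "payload": record[1]}

    def teardown(self):
        self._proxy.terminate()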

Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing?

ⅰ亾dé卋堺 submitted on 2019-12-18 13:10:06
Question: We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows:

- Pull data from Pub/Sub
- Deserialize JSON into a Java object
- Window events with fixed windows of 5 seconds
- Using a custom CombineFn, combine each window of Events into a List<Event>
- For the sake of testing, simply output the resulting List<Event>

Pipeline code:

pipeline
    // Read from pubsub topic to create unbounded PCollection
    .apply(PubsubIO
        .<String>read()
        .topic(options.getTopic())
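A Python-SDK sketch of the same shape (the question uses the Java SDK): fixed 5-second windows with an explicit watermark trigger, then each window combined into a list. The topic name is a placeholder and the trigger/accumulation choice is just one common configuration:

import json
import apache_beam as beam
from apache_beam.transforms import window, trigger
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")  # placeholder topic
     | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
     | "Window" >> beam.WindowInto(
         window.FixedWindows(5),                            # 5-second fixed windows
         trigger=trigger.AfterWatermark(),                  # fire when the watermark passes the window
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | "ToList" >> beam.CombineGlobally(beam.combiners.ToListCombineFn()).without_defaults()
     | "Print" >> beam.Map(print))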

Is there a way to write one file for each record with Apache Beam FileIO?

点点圈 submitted on 2019-12-18 09:54:08
Question: I am learning Apache Beam and trying to implement something similar to distcp. I use FileIO.read().filepattern() to get the input files, but while writing with FileIO.write, the files sometimes get coalesced. Knowing the partition count before job execution is not possible.

PCollection<MatchResult.Metadata> pCollection = pipeline.apply(this.name(), FileIO.match().filepattern(path()))
    .apply(FileIO.readMatches())
    .apply(name(), FileIO.<FileIO.ReadableFile>write()
        .via(FileSink.create())
        .to
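One distcp-like alternative that guarantees one output file per input file is to skip FileIO.write and copy each matched file yourself from a Map/ParDo. A Python-SDK sketch (the question uses the Java SDK) with placeholder bucket paths:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems

SRC_PATTERN = "gs://source-bucket/data/*"    # placeholder
DEST_PREFIX = "gs://dest-bucket/copied/"     # placeholder

def copy_file(readable_file):
    # One output file per input file: derive the destination name from the source path.
    name = readable_file.metadata.path.split("/")[-1]
    out = FileSystems.create(DEST_PREFIX + name)
    out.write(readable_file.read())
    out.close()
    return readable_file.metadata.path

with beam.Pipeline() as p:
    (p
     | "Match" >> fileio.MatchFiles(SRC_PATTERN)
     | "Read" >> fileio.ReadMatches()
     | "Copy" >> beam.Map(copy_file))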

Google Dataflow read from Spanner

∥☆過路亽.° submitted on 2019-12-18 09:38:25
Question: I am trying to read a table from a Google Spanner database and write it to a text file to do a backup, using Google Dataflow with the Python SDK. I have written the following script:

from __future__ import absolute_import
import argparse
import itertools
import logging
import re
import time
import datetime as dt
import logging

import apache_beam as beam
from apache_beam.io import iobase
from apache_beam.io import WriteToText
from apache_beam.io.range_trackers import OffsetRangeTracker,
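A stripped-down sketch of the same idea using the google-cloud-spanner client inside the pipeline rather than a custom iobase source; instance, database, query, and output path are placeholders, and the whole query runs on a single worker:

import apache_beam as beam
from apache_beam.io import WriteToText

INSTANCE_ID = "my-instance"       # placeholder
DATABASE_ID = "my-database"       # placeholder
QUERY = "SELECT * FROM my_table"  # placeholder

def read_rows(query):
    # Query Spanner with the client library and emit one CSV-ish line per row.
    from google.cloud import spanner
    client = spanner.Client()
    database = client.instance(INSTANCE_ID).database(DATABASE_ID)
    with database.snapshot() as snapshot:
        for row in snapshot.execute_sql(query):
            yield ",".join(str(col) for col in row)

with beam.Pipeline() as p:
    (p
     | "Query" >> beam.Create([QUERY])
     | "ReadSpanner" >> beam.FlatMap(read_rows)
     | "Write" >> WriteToText("gs://my-bucket/spanner_backup"))   # placeholder output prefix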

Input of apache_beam.examples.wordcount

℡╲_俬逩灬. submitted on 2019-12-18 09:36:41
Question: I was trying to run the Beam Python-SDK example, but I had a problem reading the input. https://cwiki.apache.org/confluence/display/BEAM/Usage+Guide#UsageGuide-RunaPython-SDKPipeline

When I used gs://dataflow-samples/shakespeare/kinglear.txt as the input, the error was:

apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions {'gs://dataflow-samples/shakespeare/kinglear.txt': TypeError("__init__() got an unexpected keyword argument 'response_encoding'",)}

when I used my
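The example itself boils down to something like the sketch below (assuming apache_beam[gcp] is installed so gs:// paths resolve); the output prefix is a placeholder:

import re
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
    (p
     | "Read" >> ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
     | "Split" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
     | "Write" >> WriteToText("/tmp/wordcount_output"))   # placeholder output prefix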

How to get a list of elements out of a PCollection in Google Dataflow and use it in the pipeline to loop over Write transforms?

為{幸葍}努か submitted on 2019-12-18 09:09:37
Question: I am using Google Cloud Dataflow with the Python SDK. I would like to:

- Get a list of unique dates out of a master PCollection
- Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.

How can I get that list? After the following combine transform, I created a ListPCollectionView object but I cannot iterate over that object:

class ToUniqueList(beam.CombineFn):
    def
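A PCollection cannot be iterated at pipeline-construction time, but the same goal can often be reached by computing the destination per element instead. A sketch with WriteToBigQuery and a callable table that routes each row to its date partition via the $YYYYMMDD decorator; project, dataset, and the date field are placeholders, and the partitioned table is assumed to already exist:

import apache_beam as beam

def partitioned_table(row):
    # Route each row to the matching date partition of a placeholder table.
    return "my-project:my_dataset.events$" + row["date"].replace("-", "")

with beam.Pipeline() as p:
    (p
     | "Input" >> beam.Create([{"date": "2019-12-18", "value": 42}])
     | "Write" >> beam.io.WriteToBigQuery(
         table=partitioned_table,   # callable: one destination per element
         method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))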

Dataflow/Apache Beam - how to access the current filename when passing in a pattern?

醉酒当歌 submitted on 2019-12-17 20:22:15
Question: I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since Apache Beam added splittable DoFn functionality for Python. How would I access the filename of the current file being processed when passing a file pattern for a GCS bucket? I want to pass the filename into my transform function:

with beam.Pipeline(options=pipeline_options) as p:
    lines = p |
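With the newer fileio transforms, the match metadata travels with the file, so the name can be paired with the contents. A sketch with a placeholder pattern:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | "Match" >> fileio.MatchFiles("gs://my-bucket/logs/*.csv")   # placeholder pattern
     | "Read" >> fileio.ReadMatches()
     | "PairWithName" >> beam.Map(
         lambda f: (f.metadata.path, f.read_utf8()))               # (filename, file contents)
     | "Print" >> beam.Map(print))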