apache-beam

Cloud Dataflow - how does Dataflow do parallelism?

Submitted by 与世无争的帅哥 on 2019-12-06 08:07:39
My question is: behind the scenes, how does Cloud Dataflow parallelize the workload for an element-wise Beam DoFn (ParDo)? For example, in my ParDo I send one HTTP request to an external server per element, and I use 30 workers, each with 4 vCPUs. Does that mean there will be at most 4 threads on each worker? Does that mean only 4 HTTP connections per worker are necessary, or can be established, if I keep them alive to get the best performance? How can I adjust the level of parallelism other than using more cores or more workers? With my current setting (30 workers * 4 vCPUs), I can …
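As a rough illustration of the connection-reuse part of the question: a DoFn instance is reused across many elements on the same worker thread, so a client created in @Setup can be kept alive between elements. This is only a sketch; the java.net.http client and the example endpoint are placeholders for whatever the pipeline actually calls.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.apache.beam.sdk.transforms.DoFn;

    // One HttpClient per DoFn instance: @Setup runs once per instance, and the
    // instance is reused for many elements, so keep-alive connections are shared.
    class CallExternalServiceFn extends DoFn<String, String> {
      private transient HttpClient client;

      @Setup
      public void setup() {
        client = HttpClient.newHttpClient();
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/lookup?id=" + c.element()))  // placeholder endpoint
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        c.output(response.body());
      }
    }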

Difference between beam.ParDo and beam.Map in the output type?

Submitted by ﹥>﹥吖頭↗ on 2019-12-06 07:44:52
Question: I am using Apache Beam to run some data transformations, which include extracting data from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the following sample I read CSV data and, in the first case, pass it to a DoFn via beam.ParDo, which extracts the first element (the date) and prints it. In the second case, I use beam.Map directly to do the same thing, then print it.

    class Printer(beam.DoFn): …

TextIO. Read multiple files from GCS using pattern {}

Submitted by 天大地大妈咪最大 on 2019-12-06 07:28:02
Question: I tried using the following:

    TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")

That pattern didn't work; I get

    java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}

even though those two files do exist. I also tried a local file with a similar expression,

    TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")

and that did work just fine. I would've thought there would be support for …
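If the brace pattern isn't supported by the GCS matcher, one common workaround is to issue one read per file and Flatten the results. A minimal Java sketch of that workaround (using the newer TextIO.read() API; file names are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class ReadTwoDatedFiles {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // One explicit read per file (placeholder names), then Flatten them together.
        PCollection<String> day1 =
            p.apply("ReadDay1", TextIO.read().from("gs://xyz.abc/xxx_2017-06-06.csv"));
        PCollection<String> day2 =
            p.apply("ReadDay2", TextIO.read().from("gs://xyz.abc/xxx_2017-06-07.csv"));

        PCollection<String> all =
            PCollectionList.of(day1).and(day2).apply(Flatten.pCollections());

        // ... further transforms on `all` ...
        p.run();
      }
    }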

Early results from GroupByKey transform

Submitted by 戏子无情 on 2019-12-06 06:41:17
Question: How can I get GroupByKey to trigger early results rather than wait for all the data to arrive (which in my case takes a pretty long time)? I tried to split my input PCollection into windows with an early trigger, but it just doesn't work: it still waits for all the data to arrive before giving out the results.

    PCollection<List<String>> input = ...
    PCollection<KV<Integer, List<String>>> keyedInput = input.apply(ParDo.of(new AddArbitraryKey()));
    keyedInput.apply(Window.<KV<Integer, List<String>>>into( …
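For reference, a hedged Java sketch of a window configured with speculative (early) firings, which is the usual way to get partial GroupByKey results before all data arrives. Durations are placeholders and keyedInput is the collection from the snippet above; whether this is sufficient depends on the runner and the rest of the pipeline:

    import java.util.List;
    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // keyedInput: PCollection<KV<Integer, List<String>>> as in the question.
    PCollection<KV<Integer, Iterable<List<String>>>> grouped = keyedInput
        .apply(Window.<KV<Integer, List<String>>>into(FixedWindows.of(Duration.standardMinutes(10)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))   // speculative firings
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())                             // or discardingFiredPanes()
        .apply(GroupByKey.<Integer, List<String>>create());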

Processing with State and Timers

Submitted by 自闭症网瘾萝莉.ら on 2019-12-06 06:15:12
Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state or the frequency of updates, etc.? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage. Here is some general advice for your use case: aggregate multiple elements and then set a timer; please don't create a timer per element, which would be excessive. Try to aggregate state instead of accumulating a large amount of state, i.e. aggregate as a sum and …
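A minimal Java sketch of that advice: keep only a small aggregate per key in ValueState and a single processing-time timer per key. The state names, coders, and 5-minute flush delay are placeholders, not recommendations.

    import org.joda.time.Duration;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Keep a small aggregate (a sum) per key rather than buffering every element,
    // and keep exactly one timer per key (setting it again just moves it).
    class SumThenFlushFn extends DoFn<KV<String, Long>, KV<String, Long>> {

      @StateId("key")
      private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());

      @StateId("sum")
      private final StateSpec<ValueState<Long>> sumSpec = StateSpecs.value(VarLongCoder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void processElement(
          ProcessContext c,
          @StateId("key") ValueState<String> key,
          @StateId("sum") ValueState<Long> sum,
          @TimerId("flush") Timer flush) {
        key.write(c.element().getKey());
        long current = sum.read() == null ? 0L : sum.read();
        sum.write(current + c.element().getValue());
        flush.offset(Duration.standardMinutes(5)).setRelative();  // one timer per key
      }

      @OnTimer("flush")
      public void onFlush(
          OnTimerContext c,
          @StateId("key") ValueState<String> key,
          @StateId("sum") ValueState<Long> sum) {
        Long total = sum.read();
        if (total != null) {
          c.output(KV.of(key.read(), total));
          sum.clear();
        }
      }
    }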

Processing Total Ordering of Events By Key using Apache Beam

Submitted by 孤街浪徒 on 2019-12-06 06:02:29
Question: Problem context: I am trying to generate a total (linear) order of event items per key from a real-time stream, where the order is event time (derived from the event payload). Approach: I attempted to implement this with streaming as follows:
1) Set up non-overlapping, sequential windows, e.g. with a duration of 5 minutes.
2) Establish an allowed lateness; it is fine to discard late events.
3) Set the accumulation mode to retain all fired panes.
4) Use the AfterWatermark trigger.
5) When handling a …
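A rough Java sketch of steps 1-4, plus a per-pane sort by event time, under the assumption of a hypothetical Event type with a comparable getEventTime() accessor and an input already keyed as keyedEvents:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // keyedEvents: PCollection<KV<String, Event>>; Event is a stand-in type.
    PCollection<KV<String, Iterable<Event>>> grouped = keyedEvents
        .apply(Window.<KV<String, Event>>into(FixedWindows.of(Duration.standardMinutes(5)))  // 1) sequential windows
            .triggering(AfterWatermark.pastEndOfWindow())                                    // 4) AfterWatermark trigger
            .withAllowedLateness(Duration.standardMinutes(1))                                // 2) allowed lateness
            .accumulatingFiredPanes())                                                       // 3) retain fired panes
        .apply(GroupByKey.<String, Event>create());

    // Per key and window, sort the grouped events by their event time.
    PCollection<KV<String, List<Event>>> ordered = grouped.apply(ParDo.of(
        new DoFn<KV<String, Iterable<Event>>, KV<String, List<Event>>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            List<Event> events = new ArrayList<>();
            c.element().getValue().forEach(events::add);
            events.sort(Comparator.comparing(Event::getEventTime));  // hypothetical accessor
            c.output(KV.of(c.element().getKey(), events));
          }
        }));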

Joining two large PCollections has a performance issue

Submitted by 陌路散爱 on 2019-12-06 04:22:36
Joining two PCollections with the CoGroupByKey approach takes hours to execute for 8+ million records. I noted from another Stack Overflow post that when a "CoGbkResult has more than 10000 elements, reiteration (which may be slow) is required." Any suggestions to improve performance with this approach? Here is the code snippet:

    PCollection<TableRow> pc1 = ...;
    PCollection<TableRow> pc2 = ...;
    WithKeys<String, TableRow> withKeyValue =
        WithKeys.of((TableRow row) -> String.format("%s", row.get("KEYNAME")))
            .withKeyType …
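One alternative worth sketching: if one side of the join is small enough to fit in worker memory, a side-input (map-side) join avoids CoGroupByKey and its CoGbkResult reiteration entirely. This assumes keyedPc1 and keyedPc2 are the keyed versions of pc1 and pc2, and that keys on the pc2 side are unique per window:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;
    import com.google.api.services.bigquery.model.TableRow;

    // keyedPc2 is assumed small enough to broadcast to every worker as a Map.
    PCollectionView<Map<String, TableRow>> pc2View = keyedPc2.apply(View.asMap());

    PCollection<TableRow> joined = keyedPc1.apply("MapSideJoin", ParDo.of(
        new DoFn<KV<String, TableRow>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            TableRow right = c.sideInput(pc2View).get(c.element().getKey());
            if (right != null) {                 // inner join; emit the left row alone for an outer join
              TableRow merged = c.element().getValue().clone();
              merged.putAll(right);              // naive merge; resolve column clashes as needed
              c.output(merged);
            }
          }
        }).withSideInputs(pc2View));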

KafkaIO checkpoint - how to commit offsets to Kafka

Submitted by 天涯浪子 on 2019-12-06 03:40:17
Question: I'm running a job using the Beam KafkaIO source in Google Dataflow and cannot find an easy way to persist offsets across job restarts (the job update option is not enough; I need to restart the job). Comparing Beam's KafkaIO against PubsubIO (or, to be precise, comparing PubsubCheckpoint with KafkaCheckpointMark), I can see that checkpoint persistence is not implemented in KafkaIO (the KafkaCheckpointMark.finalizeCheckpoint method is empty), whereas it is implemented in PubsubCheckpoint.finalizeCheckpoint …
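For what it's worth, later Beam releases (around 2.4) added KafkaIO.Read.commitOffsetsInFinalize(), which commits consumed offsets back to Kafka when checkpoints are finalized, so a restarted job with the same group.id resumes close to where it left off. A sketch with placeholder broker, topic, and group names:

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.serialization.StringDeserializer;

    KafkaIO.Read<String, String> read = KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")                  // placeholder broker
        .withTopic("my-topic")                                   // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(
            ImmutableMap.<String, Object>of("group.id", "my-consumer-group"))  // placeholder group
        .commitOffsetsInFinalize();   // commit offsets when checkpoints finalize

Because offsets are only committed when checkpoints finalize, some reprocessing after a restart is still possible; this gives at-least-once resumption rather than exactly-once.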

Reading CSV header with Dataflow

Submitted by ☆樱花仙子☆ on 2019-12-06 01:14:13
Question: I have a CSV file, and I don't know the column names ahead of time. I need to output the data as JSON after some transformations in Google Dataflow. What's the best way to take the header row and propagate the labels through all the rows? For example:

    a,b,c
    1,2,3
    4,5,6

...becomes (approximately):

    {a:1, b:2, c:3}
    {a:4, b:5, c:6}

Answer 1: You should implement a custom FileBasedSource (similar to TextIO.TextSource) that will read the first line and store the header data. @Override protected void …
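Whichever way the header is captured (a custom source as described, or a separate read before pipeline construction), the per-row conversion itself can be a simple DoFn. A sketch assuming the header line is passed in at construction time:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;

    // Converts one CSV line into a column-name -> value map, given the header.
    class RowToRecordFn extends DoFn<String, Map<String, String>> {
      private final String[] columns;

      RowToRecordFn(String headerLine) {
        this.columns = headerLine.split(",");
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] values = c.element().split(",");   // naive split; real CSVs need a proper parser
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < values.length; i++) {
          record.put(columns[i], values[i]);
        }
        c.output(record);   // e.g. {a=1, b=2, c=3}; serialize to JSON downstream
      }
    }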

Autodetect BigQuery schema within Dataflow?

Submitted by 痞子三分冷 on 2019-12-05 20:06:47
Is it possible to use the equivalent of --autodetect in Dataflow? I.e., can we load data into a BigQuery table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect? (potentially related question) If you are using protocol buffers as the objects in your PCollections (which should perform very well on the Dataflow back end), you might be able to use a util I wrote in the past. It parses the schema of the protobuf into a BigQuery schema at runtime, based on inspection of the protobuf descriptor. I quickly uploaded it to GitHub; it's a WIP, but you …
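The general idea (a rough sketch, not the linked util) is to walk the protobuf Descriptor and map each FieldDescriptor to a BigQuery column; the type mapping below is deliberately simplified:

    import java.util.ArrayList;
    import java.util.List;
    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.protobuf.Descriptors.Descriptor;
    import com.google.protobuf.Descriptors.FieldDescriptor;

    // Walks a protobuf descriptor and builds a flat BigQuery schema from it.
    static TableSchema schemaFor(Descriptor descriptor) {
      List<TableFieldSchema> fields = new ArrayList<>();
      for (FieldDescriptor field : descriptor.getFields()) {
        String bqType;
        switch (field.getJavaType()) {
          case INT:
          case LONG:
            bqType = "INTEGER"; break;
          case FLOAT:
          case DOUBLE:
            bqType = "FLOAT"; break;
          case BOOLEAN:
            bqType = "BOOLEAN"; break;
          default:
            bqType = "STRING"; break;   // enums, bytes, nested messages simplified to STRING
        }
        fields.add(new TableFieldSchema()
            .setName(field.getName())
            .setType(bqType)
            .setMode(field.isRepeated() ? "REPEATED" : "NULLABLE"));
      }
      return new TableSchema().setFields(fields);
    }

The descriptor would come from the generated class, e.g. something like MyMessage.getDescriptor(), where MyMessage is a hypothetical generated protobuf message.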