apache-beam

Setting a Timer to the minimum timestamp seen

Question: I would like to set a timer in event time that fires based on the smallest timestamp seen in the elements within my DoFn.

Answer 1: For performance reasons the Timer API does not support a read() operation, which for the vast majority of use cases is not a required feature. In the small set of use cases where it is needed, for example when you need to set a timer in event time based on the smallest timestamp seen in the elements within a DoFn, we can make use of a State object to keep track of the minimum timestamp seen so far.
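A minimal sketch of the pattern the answer describes, assuming a keyed input of KV<String, String>; the class name, state/timer ids, and output string are illustrative:

    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Instant;

    // Keeps the event-time timer aligned with the smallest element
    // timestamp seen so far for each key.
    class MinTimestampTimerFn extends DoFn<KV<String, String>, String> {

      @StateId("minTimestamp")
      private final StateSpec<ValueState<Long>> minTimestampSpec = StateSpecs.value();

      @TimerId("minTimer")
      private final TimerSpec minTimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(
          ProcessContext ctx,
          @StateId("minTimestamp") ValueState<Long> minTimestamp,
          @TimerId("minTimer") Timer minTimer) {
        long elementTs = ctx.timestamp().getMillis();
        Long currentMin = minTimestamp.read();
        // Timer has no read(), so the state cell is our record of where the
        // timer points; re-set it only when a smaller timestamp arrives.
        if (currentMin == null || elementTs < currentMin) {
          minTimestamp.write(elementTs);
          minTimer.set(new Instant(elementTs));
        }
      }

      @OnTimer("minTimer")
      public void onMinTimer(
          OnTimerContext ctx,
          @StateId("minTimestamp") ValueState<Long> minTimestamp) {
        ctx.output("Timer fired at minimum timestamp: " + minTimestamp.read());
        minTimestamp.clear();
      }
    }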

JDBC Fetch from oracle with Beam

Question: The program below connects to Oracle 11g and fetches records. However, it throws a NullPointerException for the coder at pipeline.apply(). I have added ojdbc14.jar to the project dependencies.

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        p.apply(JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@hostdnsname:port
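A hedged sketch of a variant that avoids the NullPointerException: JdbcIO.read() needs a coder it can resolve for KV<Integer, String> plus a row mapper, so declaring both explicitly is the usual fix. The query, credentials, and connection string below are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;

    public class OracleReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        p.apply(JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@hostdnsname:1521:SID") // placeholder host/port/SID
                .withUsername("user")
                .withPassword("password"))
            .withQuery("SELECT id, name FROM some_table") // placeholder query
            // An explicit coder prevents the NPE thrown when Beam
            // cannot infer one for KV<Integer, String>.
            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
            .withRowMapper((JdbcIO.RowMapper<KV<Integer, String>>)
                rs -> KV.of(rs.getInt(1), rs.getString(2))));
        p.run().waitUntilFinish();
      }
    }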

Apache Beam per-user session windows are unmerged

Question: We have an app with users; each user uses our app for something like 10-40 minutes per visit, and I would like to count the distribution/occurrences of events happening per session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session"). (After this I'd like to count these higher-level events per day, but that's a separate question.) For this I've been looking into session windows; but all docs
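In the Java SDK, per-user session counting usually takes the shape sketched below; the 30-minute gap and the events variable are assumptions. A common source of "unmerged" confusion is that session windows merge per key only when a grouping transform actually runs:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // 'events' is assumed to be a PCollection<KV<String, String>> keyed by
    // user id, with event-time timestamps already attached to elements.
    PCollection<KV<String, Long>> perUserSessionCounts =
        events
            // A gap longer than 30 minutes closes the user's session.
            .apply(Window.<KV<String, String>>into(
                Sessions.withGapDuration(Duration.standardMinutes(30))))
            // Session windows merge per key when this grouping runs,
            // yielding one count per user per session.
            .apply(Count.perKey());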

Using CoGroupByKey with custom type ends up in a Coder error

Question: I want to join two PCollections (each from a different input) by following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key In my case, I want to join GeoIP "block" information and "location" information, so I defined Block and Location as custom classes and then wrote the following:

    final TupleTag<Block> t1 = new TupleTag<Block>();
    final TupleTag<Location> t2 = new TupleTag<Location>();
    PCollection<KV<Long,
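The coder error typically means Beam cannot infer a Coder for the custom classes. One common fix, sketched here with made-up fields, is to annotate the types so a coder becomes inferable; the TupleTag/CoGroupByKey code can then stay as written:

    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.coders.DefaultCoder;

    // Field names are illustrative; the point is the coder annotation.
    @DefaultCoder(AvroCoder.class)
    class Block {
      long geonameId;  // the Long join key used in PCollection<KV<Long, Block>>
      String network;
      Block() {}       // AvroCoder needs a no-arg constructor
    }

    @DefaultCoder(AvroCoder.class)
    class Location {
      long geonameId;
      String city;
      Location() {}
    }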

Using defaultNaming for dynamic windowed writes in Apache Beam

Question: I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far:

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(Event::getKey)
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withNaming(key -> defaultNaming(key, ".json")));
    }

But NetBeans warns me about a syntax error on the last line: FileNaming is not
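A hedged sketch of a configuration that typically compiles: writeDynamic() needs an explicit coder for the destination type, and windowed writes need an explicit shard count. The key-extraction lambda below is a placeholder for Event::getKey, since the input here is a PCollection<String>:

    import static org.apache.beam.sdk.io.FileIO.Write.defaultNaming;

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(line -> line.split(",")[0]) // placeholder for Event::getKey
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withNaming(key -> defaultNaming(key, ".json"))
              // Both of these are required for dynamic, windowed writes.
              .withDestinationCoder(StringUtf8Coder.of())
              .withNumShards(1));
    }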

Scio: groupByKey doesn't work when using Pub/Sub as collection source

Question: I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub and it is fetched properly, but none of the transformations after .groupByKey seem to work.

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize) // apply windowing logic
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { s => println("\n\n\n\n\n\n\n This
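A frequent cause with an unbounded Pub/Sub source is that the default trigger fires only when the watermark passes the end of the window, so downstream steps appear silent. A rough Java-SDK equivalent with explicit triggering is sketched below; the durations, subscription path, and pipeline variable p are assumptions, and the root cause in the original question may differ:

    import java.util.Arrays;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    PCollection<KV<String, Long>> counts =
        p.apply(PubsubIO.readStrings().fromSubscription("projects/p/subscriptions/s"))
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Early firings surface results before the watermark
                // closes the window, which helps when testing.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30))))
                .withAllowedLateness(Duration.standardMinutes(5))
                .discardingFiredPanes())
            .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
            .apply(Count.perElement());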

Apply Side input to BigQueryIO.read operation in Apache Beam

Question: Is there a way to apply a side input to a BigQueryIO.read() operation in Apache Beam? Say, for example, I have a value in a PCollection that I want to use in a query to fetch data from a BigQuery table. Is this possible using a side input, or should something else be used in such a case? I used NestedValueProvider in a similar case, but I guess we can use that only when a certain value depends on my runtime value. Or can I use the same thing here? Please correct me if I'm wrong. The code that I
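BigQueryIO.read() is expanded when the pipeline is constructed, so it cannot consume a side input computed at run time. A hedged workaround sketch: run the query inside a DoFn using the BigQuery client library, driven by the element (or a side input); the table, column, and parameter names below are made up:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.QueryParameterValue;
    import org.apache.beam.sdk.transforms.DoFn;

    class QueryPerValueFn extends DoFn<String, String> {
      private transient BigQuery bigquery;

      @Setup
      public void setup() {
        bigquery = BigQueryOptions.getDefaultInstance().getService();
      }

      @ProcessElement
      public void process(ProcessContext ctx) throws Exception {
        String value = ctx.element(); // the runtime value driving the query
        QueryJobConfiguration query = QueryJobConfiguration
            .newBuilder("SELECT name FROM `project.dataset.table` WHERE id = @id")
            .addNamedParameter("id", QueryParameterValue.string(value))
            .build();
        for (FieldValueList row : bigquery.query(query).iterateAll()) {
          ctx.output(row.get("name").getStringValue());
        }
      }
    }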

How to speedup bulk importing into google cloud datastore with multiple workers?

Question: I have an apache-beam-based Dataflow job that uses the vcf source to read from a single text file (stored in Google Cloud Storage), transforms the text lines into Datastore entities, and writes them into the Datastore sink. The workflow works fine, but the drawback I noticed is that the write speed into Datastore is at most around 25-30 entities per second. I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (see graph below:
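One thing worth trying, sketched below: Dataflow fuses consecutive steps onto the same workers, and a single-file read can pin the whole fused stage to one worker, so a shuffle between the transform and the sink breaks the fusion and lets writes spread out. The entities variable and project id are assumptions:

    import com.google.datastore.v1.Entity;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // 'entities' is assumed to be the PCollection<Entity> produced by the
    // existing text-line-to-Entity transform.
    PCollection<Entity> redistributed =
        entities.apply(Reshuffle.viaRandomKey());

    redistributed.apply(DatastoreIO.v1().write().withProjectId("my-project"));

Note that Datastore itself can still cap throughput (for example under contention on nearby keys), so the reshuffle only removes the worker-side bottleneck.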