apache-beam

Setting a Timer to the minimum timestamp seen

Question: I would like to set a timer in event time that fires based on the smallest timestamp seen in the elements within my DoFn.

Answer 1: For performance reasons the Timer API does not support a read() operation, which for the vast majority of use cases is not a required feature. In the small set of use cases where it is needed, for example when you need to set a timer in event time based on the smallest timestamp seen in the elements within a DoFn, we can make use of a State object to keep track of the minimum timestamp seen so far.
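A minimal sketch of the pattern the answer describes, assuming a keyed input of KV<String, String>; the class name, state/timer ids, and output string are illustrative:

    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Instant;

    // Keeps the event-time timer aligned with the smallest element
    // timestamp seen so far for each key.
    class MinTimestampTimerFn extends DoFn<KV<String, String>, String> {

      @StateId("minTimestamp")
      private final StateSpec<ValueState<Long>> minTimestampSpec = StateSpecs.value();

      @TimerId("minTimer")
      private final TimerSpec minTimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(
          ProcessContext ctx,
          @StateId("minTimestamp") ValueState<Long> minTimestamp,
          @TimerId("minTimer") Timer minTimer) {
        long elementTs = ctx.timestamp().getMillis();
        Long currentMin = minTimestamp.read();
        // Timer has no read(), so the state cell is our record of where the
        // timer points; re-set it only when a smaller timestamp arrives.
        if (currentMin == null || elementTs < currentMin) {
          minTimestamp.write(elementTs);
          minTimer.set(new Instant(elementTs));
        }
      }

      @OnTimer("minTimer")
      public void onMinTimer(
          OnTimerContext ctx,
          @StateId("minTimestamp") ValueState<Long> minTimestamp) {
        ctx.output("Timer fired at minimum timestamp: " + minTimestamp.read());
        minTimestamp.clear();
      }
    }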

JDBC Fetch from oracle with Beam

Question: The program below connects to Oracle 11g and fetches records. However, it throws a NullPointerException for the coder at pipeline.apply(). I have added ojdbc14.jar to the project dependencies.

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        p.apply(JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@hostdnsname:port
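A hedged sketch of a variant that avoids the NullPointerException: JdbcIO.read() needs a coder it can resolve for KV<Integer, String> plus a row mapper, so declaring both explicitly is the usual fix. The query, credentials, and connection string below are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;

    public class OracleReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        p.apply(JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@hostdnsname:1521:SID") // placeholder host/port/SID
                .withUsername("user")
                .withPassword("password"))
            .withQuery("SELECT id, name FROM some_table") // placeholder query
            // An explicit coder prevents the NPE thrown when Beam
            // cannot infer one for KV<Integer, String>.
            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
            .withRowMapper((JdbcIO.RowMapper<KV<Integer, String>>)
                rs -> KV.of(rs.getInt(1), rs.getString(2))));
        p.run().waitUntilFinish();
      }
    }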

Apache Beam per-user session windows are unmerged

Question: We have an app with users; each user uses our app for something like 10-40 minutes per visit, and I would like to count the distribution/occurrences of events happening per session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session"). (After this I'd like to count these higher-level events per day, but that's a separate question.) For this I've been looking into session windows; but all docs
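In the Java SDK, per-user session counting usually takes the shape sketched below; the 30-minute gap and the events variable are assumptions. A common source of "unmerged" confusion is that session windows merge per key only when a grouping transform actually runs:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // 'events' is assumed to be a PCollection<KV<String, String>> keyed by
    // user id, with event-time timestamps already attached to elements.
    PCollection<KV<String, Long>> perUserSessionCounts =
        events
            // A gap longer than 30 minutes closes the user's session.
            .apply(Window.<KV<String, String>>into(
                Sessions.withGapDuration(Duration.standardMinutes(30))))
            // Session windows merge per key when this grouping runs,
            // yielding one count per user per session.
            .apply(Count.perKey());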

Using CoGroupByKey with custom type ends up in a Coder error

Question: I want to join two PCollections (each from a different input) by following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key In my case, I want to join GeoIP "block" information and "location" information, so I defined Block and Location as custom classes and then wrote the following:

    final TupleTag<Block> t1 = new TupleTag<Block>();
    final TupleTag<Location> t2 = new TupleTag<Location>();
    PCollection<KV<Long,
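The coder error typically means Beam cannot infer a Coder for the custom classes. One common fix, sketched here with made-up fields, is to annotate the types so a coder becomes inferable; the TupleTag/CoGroupByKey code can then stay as written:

    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.coders.DefaultCoder;

    // Field names are illustrative; the point is the coder annotation.
    @DefaultCoder(AvroCoder.class)
    class Block {
      long geonameId;  // the Long join key used in PCollection<KV<Long, Block>>
      String network;
      Block() {}       // AvroCoder needs a no-arg constructor
    }

    @DefaultCoder(AvroCoder.class)
    class Location {
      long geonameId;
      String city;
      Location() {}
    }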

Using defaultNaming for dynamic windowed writes in Apache Beam

Question: I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far:

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(Event::getKey)
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withNaming(key -> defaultNaming(key, ".json")));
    }

But NetBeans warns me about a syntax error on the last line: FileNaming is not
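A hedged sketch of a configuration that typically compiles: writeDynamic() needs an explicit coder for the destination type, and windowed writes need an explicit shard count. The key-extraction lambda below is a placeholder for Event::getKey, since the input here is a PCollection<String>:

    import static org.apache.beam.sdk.io.FileIO.Write.defaultNaming;

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(line -> line.split(",")[0]) // placeholder for Event::getKey
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withNaming(key -> defaultNaming(key, ".json"))
              // Both of these are required for dynamic, windowed writes.
              .withDestinationCoder(StringUtf8Coder.of())
              .withNumShards(1));
    }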

Scio: groupByKey doesn't work when using Pub/Sub as collection source

Question: I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub and it is fetched properly, but none of the transformations after .groupByKey seem to work.

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize) // apply windowing logic
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { s => println("\n\n\n\n\n\n\n This
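A frequent cause with an unbounded Pub/Sub source is that the default trigger fires only when the watermark passes the end of the window, so downstream steps appear silent. A rough Java-SDK equivalent with explicit triggering is sketched below; the durations, subscription path, and pipeline variable p are assumptions, and the root cause in the original question may differ:

    import java.util.Arrays;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    PCollection<KV<String, Long>> counts =
        p.apply(PubsubIO.readStrings().fromSubscription("projects/p/subscriptions/s"))
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Early firings surface results before the watermark
                // closes the window, which helps when testing.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(30))))
                .withAllowedLateness(Duration.standardMinutes(5))
                .discardingFiredPanes())
            .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
            .apply(Count.perElement());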

Apply Side input to BigQueryIO.read operation in Apache Beam

Question: Is there a way to apply a side input to a BigQueryIO.read() operation in Apache Beam? Say, for example, I have a value in a PCollection that I want to use in a query to fetch data from a BigQuery table. Is this possible using a side input, or should something else be used in such a case? I used NestedValueProvider in a similar case, but I guess we can use that only when a certain value depends on my runtime value. Or can I use the same thing here? Please correct me if I'm wrong. The code that I
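BigQueryIO.read() is expanded when the pipeline is constructed, so it cannot consume a side input computed at run time. A hedged workaround sketch: run the query inside a DoFn using the BigQuery client library, driven by the element (or a side input); the table, column, and parameter names below are made up:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.QueryParameterValue;
    import org.apache.beam.sdk.transforms.DoFn;

    class QueryPerValueFn extends DoFn<String, String> {
      private transient BigQuery bigquery;

      @Setup
      public void setup() {
        bigquery = BigQueryOptions.getDefaultInstance().getService();
      }

      @ProcessElement
      public void process(ProcessContext ctx) throws Exception {
        String value = ctx.element(); // the runtime value driving the query
        QueryJobConfiguration query = QueryJobConfiguration
            .newBuilder("SELECT name FROM `project.dataset.table` WHERE id = @id")
            .addNamedParameter("id", QueryParameterValue.string(value))
            .build();
        for (FieldValueList row : bigquery.query(query).iterateAll()) {
          ctx.output(row.get("name").getStringValue());
        }
      }
    }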

How to speedup bulk importing into google cloud datastore with multiple workers?

Question: I have an apache-beam-based Dataflow job that uses the vcf source to read from a single text file (stored in Google Cloud Storage), transforms the text lines into Datastore entities, and writes them into the Datastore sink. The workflow works fine, but the drawback I noticed is that the write speed into Datastore is at most around 25-30 entities per second. I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (see graph below:
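One thing worth trying, sketched below: Dataflow fuses consecutive steps onto the same workers, and a single-file read can pin the whole fused stage to one worker, so a shuffle between the transform and the sink breaks the fusion and lets writes spread out. The entities variable and project id are assumptions:

    import com.google.datastore.v1.Entity;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // 'entities' is assumed to be the PCollection<Entity> produced by the
    // existing text-line-to-Entity transform.
    PCollection<Entity> redistributed =
        entities.apply(Reshuffle.viaRandomKey());

    redistributed.apply(DatastoreIO.v1().write().withProjectId("my-project"));

Note that Datastore itself can still cap throughput (for example under contention on nearby keys), so the reshuffle only removes the worker-side bottleneck.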