apache-flink

Flink BucketingSink with custom AvroParquetWriter creates empty files

一曲冷凌霜 submitted on 2019-12-03 17:34:19
I have created a custom writer for BucketingSink. The sink and writer run without errors, but when the writer writes Avro GenericRecord data to Parquet, the files move from in-progress through pending to completed, yet they are empty (0 bytes). Can anyone tell me what is wrong with the code? I have tried placing the initialization of the AvroParquetWriter in the open() method, but the result is still the same. When debugging the code, I confirmed that writer.write(element) does execute and that element contains the Avro GenericRecord data. Streaming data: BucketingSink<DataEventRecord> sink = new
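
For reference, a minimal sketch of what such a custom Writer can look like (class and field names here are illustrative, not taken from the question; getDataSize() is used only as an approximation). The important detail is that ParquetWriter buffers whole row groups in memory and only materializes them when it is closed, so if close() never reaches the underlying AvroParquetWriter the part files stay at 0 bytes, and flush()/getPos() cannot report real progress, which does not fit BucketingSink's valid-length/truncation logic on recovery.

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.streaming.connectors.fs.Writer;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ParquetSinkWriter implements Writer<GenericRecord> {

        private final String schemaString;                      // Schema itself is not serializable
        private transient ParquetWriter<GenericRecord> writer;  // created lazily on the task manager

        public ParquetSinkWriter(String schemaString) {
            this.schemaString = schemaString;
        }

        @Override
        public void open(FileSystem fs, Path path) throws IOException {
            Schema schema = new Schema.Parser().parse(schemaString);
            writer = AvroParquetWriter.<GenericRecord>builder(path)
                    .withSchema(schema)
                    .build();
        }

        @Override
        public void write(GenericRecord element) throws IOException {
            writer.write(element);   // buffered in memory until a row group is full or close()
        }

        @Override
        public long flush() throws IOException {
            // Parquet cannot flush partial row groups; nothing hits the file until close().
            return getPos();
        }

        @Override
        public long getPos() throws IOException {
            // Approximation only: the real file length is unknown before close().
            return writer == null ? 0L : writer.getDataSize();
        }

        @Override
        public void close() throws IOException {
            if (writer != null) {
                writer.close();      // this is where the data actually reaches the file
                writer = null;
            }
        }

        @Override
        public Writer<GenericRecord> duplicate() {
            return new ParquetSinkWriter(schemaString);
        }
    }

Even with a Writer shaped like this, BucketingSink's truncate-to-valid-length recovery does not really suit a block format like Parquet; later Flink versions added StreamingFileSink with bulk-encoded formats for exactly this case.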

Apache Flink streaming in cluster does not split jobs with workers

喜你入骨 submitted on 2019-12-03 14:37:41
My objective is to set up a high-throughput cluster using Kafka as the source and Flink as the stream processing engine. Here is what I have done. I have set up a 2-node cluster with the following configuration on the master and the slave.

Master flink-conf.yaml:
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100

Slave flink-conf.yaml:
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 512 #256
taskmanager.heap.mb: 1024 #512
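
Purely as an illustration (the class, topic and broker names below are made up, not from the question), a minimal job sketch showing where the parallelism actually comes from: taskmanager.numberOfTaskSlots caps how many parallel subtasks each node can host, parallelism.default only applies if the job does not set its own, and the source parallelism is additionally capped by the number of Kafka partitions.

    import java.util.Properties;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaToFlinkJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(100);                 // overrides parallelism.default for this job

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "<KAFKA_BROKER>:9092");
            props.setProperty("group.id", "flink-consumer");

            DataStream<String> stream = env
                    .addSource(new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props))
                    .setParallelism(10);             // effectively min(10, number of partitions of "events")

            stream
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String value) {
                            return value.toUpperCase();
                        }
                    })
                    .setParallelism(50)              // can spread across the slots of both task managers
                    .print();

            env.execute("kafka-to-flink");
        }
    }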

Combine two streams in Apache Flink regardless of window time

拜拜、爱过 submitted on 2019-12-03 08:56:07
I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other, and there are times when one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that arrives? The only solution I found is the join function, but you have to specify a common window in which to apply the join function. This window is not reached when one
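
A minimal sketch of one common workaround (all names and example values below are placeholders, not from the question): connect the two streams and remember the latest element of the low-frequency stream, so every element of the high-frequency stream can be combined with it immediately, without any window.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class LatestValueJoin {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<Long> fast = env.fromElements(1L, 2L, 3L, 4L);   // stands in for the high-frequency stream
            DataStream<String> slow = env.fromElements("config-A");     // stands in for the low-frequency stream

            fast.connect(slow)
                .flatMap(new CoFlatMapFunction<Long, String, String>() {
                    private String lastSlow;                            // latest element seen on the slow stream

                    @Override
                    public void flatMap1(Long value, Collector<String> out) {
                        if (lastSlow != null) {
                            out.collect(value + " joined with " + lastSlow);
                        }
                    }

                    @Override
                    public void flatMap2(String value, Collector<String> out) {
                        lastSlow = value;                               // only update, emit nothing
                    }
                })
                .print();

            env.execute("latest-value-join");
        }
    }

In a real job the slow stream would typically be broadcast (or both streams keyed on a common field) so that every parallel instance sees the latest value, and lastSlow would live in checkpointed state rather than a plain field.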

flink - using dagger injections - not serializable?

情到浓时终转凉″ submitted on 2019-12-03 06:21:34
I'm using Flink (latest via git) to stream from Kafka to Cassandra. To ease unit testing I'm adding dependency injection via Dagger. The ObjectGraph seems to be setting itself up properly, but the 'inner objects' are being flagged as 'not serializable' by Flink. If I include these objects directly they work, so what's the difference? The class in question implements MapFunction and @Injects a module for Cassandra and one for reading config files. Is there a way to build this so I can use late binding, or does Flink make this impossible? Edit: fwiw - dependency injection (via Dagger) and
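
A minimal sketch of one way this is often handled (the module and service names below are hypothetical, not from the question): keep the injected, non-serializable collaborators out of the serialized function by marking them transient and re-injecting them in open() of a RichMapFunction, which runs on the task manager after deserialization.

    import javax.inject.Inject;
    import dagger.ObjectGraph;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class EnrichFunction extends RichMapFunction<String, String> {

        @Inject transient CassandraService cassandra;   // hypothetical service; not serializable -> transient
        @Inject transient AppConfig config;             // hypothetical config holder; re-created on each worker

        @Override
        public void open(Configuration parameters) {
            // Build the graph lazily on the worker instead of shipping it from the client.
            ObjectGraph.create(new CassandraModule(), new ConfigModule()).inject(this);
        }

        @Override
        public String map(String value) {
            return cassandra.enrich(value, config);     // hypothetical call, just to show usage
        }
    }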

Flink Streaming: How to output one data stream to different outputs depending on the data?

烂漫一生 submitted on 2019-12-03 05:01:17
In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1<String>. The tuple can have an arbitrary value in its value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite, but I don't know the full set beforehand (so there could be a 'P362'). I want to write each tuple to a certain output location depending on the value inside the tuple. So e.g. I would like to have the following file structure: /output/P1, /output/P2. In the documentation I only found possibilities to write to locations that I know beforehand (e.g. stream.writeCsv("/output/somewhere")),
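
A minimal sketch of the bucket-by-value approach, assuming the BucketingSink/Bucketer API from flink-connector-filesystem (the 1.3/1.4-era API; the class name and example values are made up): a custom Bucketer derives the output sub-directory from the value inside the tuple, so previously unknown values such as 'P362' simply create a new folder.

    import org.apache.flink.api.java.tuple.Tuple1;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.hadoop.fs.Path;

    public class ValueBucketingJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Tuple1<String>> stream =
                    env.fromElements(Tuple1.of("P1"), Tuple1.of("P2"), Tuple1.of("P362"));

            BucketingSink<Tuple1<String>> sink = new BucketingSink<>("/output");
            sink.setBucketer(new Bucketer<Tuple1<String>>() {
                @Override
                public Path getBucketPath(Clock clock, Path basePath, Tuple1<String> element) {
                    // e.g. /output/P1, /output/P2, /output/P362 -- new values just create new folders
                    return new Path(basePath, element.f0);
                }
            });

            stream.addSink(sink);
            env.execute("bucket-by-value");
        }
    }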

What is the difference between mini-batch vs real time streaming in practice (not theory)?

只谈情不闲聊 submitted on 2019-12-02 23:31:28
What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand that mini-batch processing batches data within a given time frame, whereas real-time streaming processes data as it arrives. But my biggest question is: why not have a mini-batch with an epsilon time frame (say, one millisecond)? I would like to understand why one would be a more effective solution than the other. I recently came across an example where mini-batch (Apache Spark) is used for fraud detection and real-time streaming (Apache Flink) is used for fraud prevention.

Is it possible to use Riak CS with Apache Flink?

荒凉一梦 submitted on 2019-12-02 18:05:46
I want to configure the filesystem state backend and ZooKeeper recovery mode:

state.backend: filesystem
state.backend.fs.checkpointdir: ???
recovery.mode: zookeeper
recovery.zookeeper.storageDir: ???

As you can see, I should specify the checkpointdir and storageDir parameters, but I don't have any of the file systems supported by Apache Flink (like HDFS or Amazon S3). However, I have installed a Riak CS cluster (it seems to be S3-compatible). So, can I use Riak CS together with Apache Flink? If it is possible:
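
Not an authoritative answer, but a rough configuration sketch under the assumption that Riak CS is reachable through its S3-compatible endpoint via Hadoop's s3a filesystem, and that the hadoop-aws (and matching AWS SDK) jars are on Flink's classpath. The s3a endpoint and credentials (fs.s3a.endpoint, fs.s3a.access.key, fs.s3a.secret.key, and usually fs.s3a.path.style.access=true) would go into the Hadoop core-site.xml that Flink is pointed at; bucket names and paths below are placeholders.

    # flink-conf.yaml (sketch)
    fs.hdfs.hadoopconf: /path/to/hadoop/conf     # directory containing the core-site.xml with the s3a settings
    state.backend: filesystem
    state.backend.fs.checkpointdir: s3a://flink-bucket/checkpoints
    recovery.mode: zookeeper
    recovery.zookeeper.storageDir: s3a://flink-bucket/recovery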

Unable to execute CEP pattern in Flink dashboard version 1.3.2 due to ClassNotFoundException

这一生的挚爱 submitted on 2019-12-02 17:26:58
I have written a simple pattern like this:

    Pattern<JoinedEvent, ?> pattern = Pattern.<JoinedEvent>begin("start")
        .where(new SimpleCondition<JoinedEvent>() {
            @Override
            public boolean filter(JoinedEvent streamEvent) throws Exception {
                return streamEvent.getRRInterval() >= 10;
            }
        }).within(Time.milliseconds(WindowLength));

and it executes well in IntelliJ IDEA. I am using Flink 1.3.2 both in the dashboard and in IntelliJ IDEA. While I was building Flink from source, I have seen a lot of warning

flink 1.3.1 elasticsearch 5.5.1. ElasticsearchSinkFunction fails with java.lang.NoSuchMethodError

半世苍凉 submitted on 2019-12-02 16:46:32
I'm going through the following samples using Scala / sbt: flink / elasticsearch / kibana, and the Flink tutorial. My build.sbt includes the following versions:

    libraryDependencies ++= Seq(
      "org.apache.flink" %% "flink-scala" % "1.3.1" % "provided",
      "org.apache.flink" %% "flink-streaming-scala" % "1.3.1" % "provided",
      "org.apache.flink" %% "flink-clients" % "1.3.1" % "provided",
      "joda-time" % "joda-time" % "2.9.9",
      "com.google.guava" % "guava" % "22.0",
      "com.typesafe" % "config" % "1.3.0",
      "org.apache.flink"