apache-flink

Flink BucketingSink with custom AvroParquetWriter creates empty files

一曲冷凌霜 submitted on 2019-12-03 17:34:19
I have created a custom writer for BucketingSink. The sink and writer run without errors, but when the writer writes Avro GenericRecord data to Parquet, the files move from in-progress through pending to completed, yet they are empty (0 bytes). Can anyone tell me what is wrong with the code? I have tried placing the initialization of the AvroParquetWriter in the open() method, but the result is still the same. When debugging the code, I confirmed that writer.write(element) does execute and that element contains the Avro GenericRecord data. Streaming data: BucketingSink<DataEventRecord> sink = new
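
For reference, a minimal sketch of what such a custom Writer can look like (class and field names here are illustrative, not taken from the question; getDataSize() is used only as an approximation). The important detail is that ParquetWriter buffers whole row groups in memory and only materializes them when it is closed, so if close() never reaches the underlying AvroParquetWriter the part files stay at 0 bytes, and flush()/getPos() cannot report real progress, which does not fit BucketingSink's valid-length/truncation logic on recovery.

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.streaming.connectors.fs.Writer;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ParquetSinkWriter implements Writer<GenericRecord> {

        private final String schemaString;                      // Schema itself is not serializable
        private transient ParquetWriter<GenericRecord> writer;  // created lazily on the task manager

        public ParquetSinkWriter(String schemaString) {
            this.schemaString = schemaString;
        }

        @Override
        public void open(FileSystem fs, Path path) throws IOException {
            Schema schema = new Schema.Parser().parse(schemaString);
            writer = AvroParquetWriter.<GenericRecord>builder(path)
                    .withSchema(schema)
                    .build();
        }

        @Override
        public void write(GenericRecord element) throws IOException {
            writer.write(element);   // buffered in memory until a row group is full or close()
        }

        @Override
        public long flush() throws IOException {
            // Parquet cannot flush partial row groups; nothing hits the file until close().
            return getPos();
        }

        @Override
        public long getPos() throws IOException {
            // Approximation only: the real file length is unknown before close().
            return writer == null ? 0L : writer.getDataSize();
        }

        @Override
        public void close() throws IOException {
            if (writer != null) {
                writer.close();      // this is where the data actually reaches the file
                writer = null;
            }
        }

        @Override
        public Writer<GenericRecord> duplicate() {
            return new ParquetSinkWriter(schemaString);
        }
    }

Even with a Writer shaped like this, BucketingSink's truncate-to-valid-length recovery does not really suit a block format like Parquet; later Flink versions added StreamingFileSink with bulk-encoded formats for exactly this case.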

Apache Flink streaming in cluster does not split jobs with workers

喜你入骨 submitted on 2019-12-03 14:37:41
My objective is to set up a high-throughput cluster using Kafka as the source and Flink as the stream processing engine. Here is what I have done. I have set up a 2-node cluster with the following configuration on the master and the slave.

Master flink-conf.yaml:
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100

Slave flink-conf.yaml:
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 512 #256
taskmanager.heap.mb: 1024 #512
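
Purely as an illustration (the class, topic and broker names below are made up, not from the question), a minimal job sketch showing where the parallelism actually comes from: taskmanager.numberOfTaskSlots caps how many parallel subtasks each node can host, parallelism.default only applies if the job does not set its own, and the source parallelism is additionally capped by the number of Kafka partitions.

    import java.util.Properties;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaToFlinkJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(100);                 // overrides parallelism.default for this job

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "<KAFKA_BROKER>:9092");
            props.setProperty("group.id", "flink-consumer");

            DataStream<String> stream = env
                    .addSource(new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props))
                    .setParallelism(10);             // effectively min(10, number of partitions of "events")

            stream
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String value) {
                            return value.toUpperCase();
                        }
                    })
                    .setParallelism(50)              // can spread across the slots of both task managers
                    .print();

            env.execute("kafka-to-flink");
        }
    }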

Combine two streams in Apache Flink regardless of window time

拜拜、爱过 submitted on 2019-12-03 08:56:07
I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other, and there are times when one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that arrives? The only solution I found is the join function, but you have to specify a common window in which to apply the join function. This window is not reached when one
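
A minimal sketch of one common workaround (all names and example values below are placeholders, not from the question): connect the two streams and remember the latest element of the low-frequency stream, so every element of the high-frequency stream can be combined with it immediately, without any window.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class LatestValueJoin {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<Long> fast = env.fromElements(1L, 2L, 3L, 4L);   // stands in for the high-frequency stream
            DataStream<String> slow = env.fromElements("config-A");     // stands in for the low-frequency stream

            fast.connect(slow)
                .flatMap(new CoFlatMapFunction<Long, String, String>() {
                    private String lastSlow;                            // latest element seen on the slow stream

                    @Override
                    public void flatMap1(Long value, Collector<String> out) {
                        if (lastSlow != null) {
                            out.collect(value + " joined with " + lastSlow);
                        }
                    }

                    @Override
                    public void flatMap2(String value, Collector<String> out) {
                        lastSlow = value;                               // only update, emit nothing
                    }
                })
                .print();

            env.execute("latest-value-join");
        }
    }

In a real job the slow stream would typically be broadcast (or both streams keyed on a common field) so that every parallel instance sees the latest value, and lastSlow would live in checkpointed state rather than a plain field.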

flink - using dagger injections - not serializable?

情到浓时终转凉″ submitted on 2019-12-03 06:21:34
I'm using Flink (latest via git) to stream from Kafka to Cassandra. To ease unit testing I'm adding dependency injection via Dagger. The ObjectGraph seems to be setting itself up properly, but the 'inner objects' are being flagged as 'not serializable' by Flink. If I include these objects directly they work, so what's the difference? The class in question implements MapFunction and @Injects a module for Cassandra and one for reading config files. Is there a way to build this so I can use late binding, or does Flink make this impossible? Edit: fwiw - dependency injection (via Dagger) and
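
A minimal sketch of one way this is often handled (the module and service names below are hypothetical, not from the question): keep the injected, non-serializable collaborators out of the serialized function by marking them transient and re-injecting them in open() of a RichMapFunction, which runs on the task manager after deserialization.

    import javax.inject.Inject;
    import dagger.ObjectGraph;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class EnrichFunction extends RichMapFunction<String, String> {

        @Inject transient CassandraService cassandra;   // hypothetical service; not serializable -> transient
        @Inject transient AppConfig config;             // hypothetical config holder; re-created on each worker

        @Override
        public void open(Configuration parameters) {
            // Build the graph lazily on the worker instead of shipping it from the client.
            ObjectGraph.create(new CassandraModule(), new ConfigModule()).inject(this);
        }

        @Override
        public String map(String value) {
            return cassandra.enrich(value, config);     // hypothetical call, just to show usage
        }
    }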

Flink Streaming: How to output one data stream to different outputs depending on the data?

烂漫一生 submitted on 2019-12-03 05:01:17
In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1<String>. The tuple can have an arbitrary value in its value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite, but I don't know the full set beforehand (so there could be a 'P362'). I want to write each tuple to a certain output location depending on the value inside the tuple. So e.g. I would like to have the following file structure: /output/P1, /output/P2. In the documentation I only found possibilities to write to locations that I know beforehand (e.g. stream.writeCsv("/output/somewhere")),
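
A minimal sketch of the bucket-by-value approach, assuming the BucketingSink/Bucketer API from flink-connector-filesystem (the 1.3/1.4-era API; the class name and example values are made up): a custom Bucketer derives the output sub-directory from the value inside the tuple, so previously unknown values such as 'P362' simply create a new folder.

    import org.apache.flink.api.java.tuple.Tuple1;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.hadoop.fs.Path;

    public class ValueBucketingJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Tuple1<String>> stream =
                    env.fromElements(Tuple1.of("P1"), Tuple1.of("P2"), Tuple1.of("P362"));

            BucketingSink<Tuple1<String>> sink = new BucketingSink<>("/output");
            sink.setBucketer(new Bucketer<Tuple1<String>>() {
                @Override
                public Path getBucketPath(Clock clock, Path basePath, Tuple1<String> element) {
                    // e.g. /output/P1, /output/P2, /output/P362 -- new values just create new folders
                    return new Path(basePath, element.f0);
                }
            });

            stream.addSink(sink);
            env.execute("bucket-by-value");
        }
    }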

What is the difference between mini-batch vs real time streaming in practice (not theory)?

只谈情不闲聊 submitted on 2019-12-02 23:31:28
What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand that mini-batch processing batches data within a given time frame, whereas real-time streaming processes data as it arrives. But my biggest question is: why not have a mini-batch with an epsilon time frame (say, one millisecond)? I would like to understand why one would be a more effective solution than the other. I recently came across an example where mini-batch (Apache Spark) is used for fraud detection and real-time streaming (Apache Flink) is used for fraud prevention.

Is it possible to use Riak CS with Apache Flink?

荒凉一梦 submitted on 2019-12-02 18:05:46
I want to configure the filesystem state backend and ZooKeeper recovery mode:

state.backend: filesystem
state.backend.fs.checkpointdir: ???
recovery.mode: zookeeper
recovery.zookeeper.storageDir: ???

As you can see, I should specify the checkpointdir and storageDir parameters, but I don't have any of the file systems supported by Apache Flink (like HDFS or Amazon S3). However, I have installed a Riak CS cluster (it seems to be S3-compatible). So, can I use Riak CS together with Apache Flink? If it is possible:
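
Not an authoritative answer, but a rough configuration sketch under the assumption that Riak CS is reachable through its S3-compatible endpoint via Hadoop's s3a filesystem, and that the hadoop-aws (and matching AWS SDK) jars are on Flink's classpath. The s3a endpoint and credentials (fs.s3a.endpoint, fs.s3a.access.key, fs.s3a.secret.key, and usually fs.s3a.path.style.access=true) would go into the Hadoop core-site.xml that Flink is pointed at; bucket names and paths below are placeholders.

    # flink-conf.yaml (sketch)
    fs.hdfs.hadoopconf: /path/to/hadoop/conf     # directory containing the core-site.xml with the s3a settings
    state.backend: filesystem
    state.backend.fs.checkpointdir: s3a://flink-bucket/checkpoints
    recovery.mode: zookeeper
    recovery.zookeeper.storageDir: s3a://flink-bucket/recovery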

Unable to execute CEP pattern in Flink dashboard version 1.3.2 due to ClassNotFoundException

这一生的挚爱 submitted on 2019-12-02 17:26:58
I have written a simple pattern like this:

    Pattern<JoinedEvent, ?> pattern = Pattern.<JoinedEvent>begin("start")
        .where(new SimpleCondition<JoinedEvent>() {
            @Override
            public boolean filter(JoinedEvent streamEvent) throws Exception {
                return streamEvent.getRRInterval() >= 10;
            }
        }).within(Time.milliseconds(WindowLength));

and it executes well in IntelliJ IDEA. I am using Flink 1.3.2 both in the dashboard and in IntelliJ IDEA. While I was building Flink from source, I have seen a lot of warning

flink 1.3.1 elasticsearch 5.5.1. ElasticsearchSinkFunction fails with java.lang.NoSuchMethodError

半世苍凉 submitted on 2019-12-02 16:46:32
I'm going through the following samples using Scala / sbt: flink / elasticsearch / kibana, and the Flink tutorial. My build.sbt includes the following versions:

    libraryDependencies ++= Seq(
      "org.apache.flink" %% "flink-scala" % "1.3.1" % "provided",
      "org.apache.flink" %% "flink-streaming-scala" % "1.3.1" % "provided",
      "org.apache.flink" %% "flink-clients" % "1.3.1" % "provided",
      "joda-time" % "joda-time" % "2.9.9",
      "com.google.guava" % "guava" % "22.0",
      "com.typesafe" % "config" % "1.3.0",
      "org.apache.flink"