apache-flink

Why is the parallel execution of an Apache Flink application slower than the sequential execution?

烂漫一生 submitted on 2019-12-13 03:59:59
Question: I have an Apache Flink setup with one TaskManager and two processing slots. When I execute an application with parallelism set to 1, the job takes around 33 seconds; when I increase the parallelism to 2, the job takes 45 seconds to complete. I am using Flink on my Windows machine, which has 10 compute cores (4C + 6G). I want to achieve better results with 2 slots. What can I do? Answer 1: Distributed systems like Apache Flink are designed to run in data centers on hundreds…
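A likely explanation is that a ~33-second job is too small for the parallel speedup to outweigh the fixed costs: job startup, task scheduling, and serializing and shipping records between slots. A toy cost model (all constants invented for illustration, not measured Flink numbers) shows the shape of the problem:

```python
def modeled_runtime(work_s, parallelism, startup_s=5.0, shuffle_s_per_slot=12.0):
    # Toy model, all constants hypothetical: total runtime = fixed job
    # startup + the parallel share of the useful work + serialization /
    # network overhead that only appears once records move between slots.
    compute = work_s / parallelism
    shuffle = 0.0 if parallelism == 1 else shuffle_s_per_slot * parallelism
    return startup_s + compute + shuffle

# A small job is dominated by overhead, so parallelism 2 comes out slower:
sequential = modeled_runtime(28.0, 1)  # 33.0 seconds
parallel = modeled_runtime(28.0, 2)    # 43.0 seconds
```

With a much larger `work_s`, the compute term dominates and parallelism 2 wins, which matches the usual advice: parallelism pays off once per-record work, not coordination, is the bottleneck.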

Flink: error running the WordCount example on a remote cluster

怎甘沉沦 submitted on 2019-12-13 03:58:12
Question: I have a Flink cluster on VirtualBox consisting of three nodes, 1 master and 2 slaves. I customized the WordCount example and created a fat JAR to run it on the remote VirtualBox Flink cluster, but I hit an error. Note: I imported the dependencies into the project manually (using IntelliJ IDEA) and did not use Maven as a dependency provider. I tested my code on my local machine and it worked fine. More details follow. Here is my Java code: import org.apache.flink.api.common.functions.FlatMapFunction; import…

Unable to run a python flink application on cluster

允我心安 submitted on 2019-12-13 03:57:40
Question: I am trying to run a Python Flink application on a standalone Flink cluster. The application works fine on a single-node cluster, but it throws the following error on a multi-node cluster: java.lang.Exception: The user defined 'open()' method caused an exception: An error occurred while copying the file . Please help me resolve this problem. Thank you. The application I am trying to execute has the following code: from flink.plan.Environment import get_environment from flink.plan.Constants…

Interrupted while joining ioThread / Error during disposal of stream operator in flink application

泪湿孤枕 submitted on 2019-12-13 03:57:07
Question: I have a Flink-based streaming application that uses Apache Kafka sources and sinks. For some days I have been getting exceptions at random times during development, and I have no clue where they are coming from. I am running the app within IntelliJ using the mainRunner class and feeding it messages via Kafka. Sometimes the first message triggers the errors; sometimes it happens only after a few messages. This is how it looks: 16:31:01.935 ERROR o.a.k.c.producer.KafkaProducer -…

Apache flink broadcast state gets flushed

孤街醉人 submitted on 2019-12-13 03:54:00
Question: I am using the broadcast pattern to connect two streams and read data from one into the other. The code looks like this: case class Broadcast() extends BroadcastProcessFunction[MyObject, (String, Double), MyObject] { override def processBroadcastElement(in2: (String, Double), context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context, collector: Collector[MyObject]): Unit = { context.getBroadcastState(broadcastStateDescriptor).put(in2._1, in2._2) } override def processElement(obj:…
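For reference, the broadcast pattern boils down to the sketch below (plain Python standing in for the Scala/Flink types; the `rules` dict plays the role of `ctx.getBroadcastState(descriptor)`). One thing worth remembering when state seems to "get flushed": each parallel instance of the function holds its own copy of the broadcast state, and both process methods must use the same descriptor.

```python
class BroadcastJoin:
    """Plain-Python sketch of Flink's broadcast pattern: the control stream
    updates a rules map, the data stream enriches its elements from it."""

    def __init__(self):
        self.rules = {}  # stands in for ctx.getBroadcastState(descriptor)

    def process_broadcast_element(self, key, value):
        # Control-stream side: store/overwrite the broadcast entry.
        self.rules[key] = value

    def process_element(self, obj):
        # Data-stream side: read-only lookup into the broadcast state.
        return (obj, self.rules.get(obj))

j = BroadcastJoin()
j.process_broadcast_element("price", 9.5)
```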

Effect of increasing parallelism on throughput

百般思念 submitted on 2019-12-13 03:53:57
Question: I ran a job first with parallelism 1 and then with parallelism 3. With parallelism = 1, the Kafka source was reading records at a rate of ~500 records per second. With parallelism = 3, the throughput got divided among the three parallel instances, each reading approximately ~150 records per second. Note that the source is publishing records at a much higher rate (~1000 records per second). Is this expected? I would expect the throughput to increase with parallelism, but it remains the same. I checked…
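One thing worth checking is the Kafka topic's partition count: a source subtask can only read the partitions assigned to it, so total source throughput is bounded by what the partitions can deliver and by what downstream operators can absorb, regardless of source parallelism. A sketch of a modulo-style assignment (similar in spirit to what the Flink Kafka connector does, not its exact code):

```python
def assign_partitions(num_partitions, parallelism):
    """Spread Kafka topic partitions round-robin over source subtasks.
    If the topic has fewer partitions than the source parallelism, the
    extra subtasks sit idle, and adding parallelism cannot raise total
    throughput."""
    assignment = {i: [] for i in range(parallelism)}
    for p in range(num_partitions):
        assignment[p % parallelism].append(p)
    return assignment
```

With 3 partitions and parallelism 3 each subtask reads exactly one partition; if the ~500 records/s cap persists, the bottleneck is elsewhere (broker, network, or backpressure from a downstream operator).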

How to let Flink flush last line to sink when producer(kafka) does not produce new line

≯℡__Kan透↙ submitted on 2019-12-13 03:49:51
Question: When my Flink program is in event-time mode, the sink will not get the last line (say, line A). If I feed a new line (line B) to Flink, I get line A, but I still can't get line B. val env = StreamExecutionEnvironment.getExecutionEnvironment env.setParallelism(1) env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE) env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) val properties = new Properties() properties.setProperty("bootstrap.servers", "localhost:9092") properties…
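This is the expected behaviour of event time: a window only fires once the watermark passes its end, and a typical bounded-out-of-orderness watermark trails the latest event and only advances when a new event arrives. So the last record sits in its window until something later pushes the watermark forward (common workarounds are an idleness timeout on the source or periodically advancing the watermark from processing time). A minimal sketch of why line A waits for line B:

```python
class BoundedOutOfOrderness:
    """Sketch of an event-time watermark generator: the watermark trails
    the highest timestamp seen by a fixed delay and only moves when new
    events arrive. A window [start, end) fires once watermark >= end."""

    def __init__(self, max_delay_ms):
        self.max_delay = max_delay_ms
        self.max_ts = float("-inf")

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)
        return self.max_ts - self.max_delay  # current watermark

wm = BoundedOutOfOrderness(1000)
# Line A at t=5000 only advances the watermark to 4000: its window
# (ending at 6000) cannot fire yet. Line B at t=7000 pushes the
# watermark to 6000, which releases line A — but B now waits in turn.
```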

How to instantiate a MapStateDescriptor in Flink to compute multiple streaming-average queries?

安稳与你 submitted on 2019-12-13 03:45:33
Question: I am trying to compute the average temperature of 3 different rooms, where each room has 3 temperature sensors. I am using Flink (in Java). First I key the sensors by room (A, B, or C), and then I create a RichFlatMapFunction which holds a MapState to save the temperatures until I have 3 measurements. After three measurements I compute the average. In order to use the MapState I need a MapStateDescriptor, which I don't know how to instantiate properly. Can someone…
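For the descriptor itself, Flink's Java API offers constructors such as `new MapStateDescriptor<>("temps", String.class, Double.class)` (a name plus the key and value classes). The surrounding averaging logic can be sketched in plain Python, with a dict standing in for the MapState (the class and method names below are illustrative, not Flink API):

```python
class RoomAverager:
    """Sketch of the keyed flatMap logic: buffer readings per room until
    enough have arrived, then emit the average and clear the buffer (the
    dict stands in for Flink's MapState)."""

    def __init__(self, readings_needed=3):
        self.state = {}            # room -> list of temperatures
        self.needed = readings_needed

    def flat_map(self, room, temperature):
        self.state.setdefault(room, []).append(temperature)
        readings = self.state[room]
        if len(readings) == self.needed:
            self.state[room] = []  # like mapState.clear() for this key
            return sum(readings) / self.needed  # collector.collect(avg)
        return None                # not enough readings yet
```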

Flink session window with onEventTime trigger?

只谈情不闲聊 submitted on 2019-12-13 02:57:55
Question: I want to create an event-time based session window in Flink that triggers when the event time of a new message is more than 180 seconds greater than the event time of the message that created the window. For example: t1 (0 seconds): msg1 <-- the first message, which causes the session window to be created t2 (13 seconds): msg2 t3 (39 seconds): msg3 ... t7 (190 seconds): msg7 <-- the event time t7 is more than 180 seconds after t1 (t7 - t1 = 190), so the window should be…
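Note that Flink's built-in EventTimeSessionWindows measures the gap from the last element of the session, while the rule described here measures it from the first, so a custom window assigner or trigger would be needed. The intended semantics, sketched in plain Python (class name hypothetical):

```python
class FirstElementSession:
    """A session closes when a new event's timestamp is more than `gap`
    seconds after the FIRST event of the session -- unlike Flink's
    built-in session windows, which measure from the last event."""

    def __init__(self, gap_s=180):
        self.gap = gap_s
        self.start = None   # event time of the message that opened the window
        self.buffer = []

    def on_event(self, ts, msg):
        if self.start is not None and ts - self.start > self.gap:
            # Fire the old session; the new message opens the next one.
            fired, self.buffer, self.start = self.buffer, [msg], ts
            return fired
        if self.start is None:
            self.start = ts
        self.buffer.append(msg)
        return None
```

With the example timeline, msg1 (t=0) through msg3 (t=39) accumulate, and msg7 at t=190 fires the session because 190 - 0 > 180.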

Can I use a custom partitioner with group by?

喜欢而已 submitted on 2019-12-13 02:44:10
Question: Let's say that I know my dataset is unbalanced and I know the distribution of the keys. I'd like to leverage this to write a custom partitioner and get the most out of the operator instances. I know about DataStream#partitionCustom. However, if my stream is keyed, will it still work properly? My job would look something like: KeyedDataStream afterCustomPartition = keyedStream.partitionCustom(new MyPartitioner(), new MyPartitionKeySelector()) DataStreamUtils.reinterpretAsKeyedStream…
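Regarding the partitioner itself: a skew-aware scheme typically pins each known hot key to a dedicated channel and hashes the remaining keys over the rest. A sketch (the `hot_key` list and key names are hypothetical; in practice they would come from the known key distribution mentioned above). DataStreamUtils.reinterpretAsKeyedStream exists for exactly this kind of follow-up, with the caveat that the custom partitioning must route records consistently with the key selector or keyed state breaks.

```python
def skew_aware_partition(key, num_partitions, hot_keys=("hot_key",)):
    """Custom partitioner sketch for a skewed key distribution: known hot
    keys each get a dedicated partition; everything else is hashed over
    the remaining partitions."""
    if key in hot_keys:
        return hot_keys.index(key)          # dedicated slot per hot key
    n_cold = num_partitions - len(hot_keys)
    return len(hot_keys) + hash(key) % n_cold
```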