apache-flink

Why is the parallel execution of an Apache Flink application slower than the sequential execution?

烂漫一生 submitted on 2019-12-13 03:59:59
Question: I have an Apache Flink setup with one TaskManager and two processing slots. When I execute an application with parallelism set to 1, the job takes around 33 seconds; when I increase the parallelism to 2, the job takes 45 seconds to complete. I am using Flink on my Windows machine, which has 10 compute cores (4C + 6G). I want to achieve better results with 2 slots. What can I do? Answer 1: Distributed systems like Apache Flink are designed to run in data centers on hundreds…
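A likely explanation is that a ~33-second job is too small for the parallel speedup to outweigh the fixed costs: job startup, task scheduling, and serializing and shipping records between slots. A toy cost model (all constants invented for illustration, not measured Flink numbers) shows the shape of the problem:

```python
def modeled_runtime(work_s, parallelism, startup_s=5.0, shuffle_s_per_slot=12.0):
    # Toy model, all constants hypothetical: total runtime = fixed job
    # startup + the parallel share of the useful work + serialization /
    # network overhead that only appears once records move between slots.
    compute = work_s / parallelism
    shuffle = 0.0 if parallelism == 1 else shuffle_s_per_slot * parallelism
    return startup_s + compute + shuffle

# A small job is dominated by overhead, so parallelism 2 comes out slower:
sequential = modeled_runtime(28.0, 1)  # 33.0 seconds
parallel = modeled_runtime(28.0, 2)    # 43.0 seconds
```

With a much larger `work_s`, the compute term dominates and parallelism 2 wins, which matches the usual advice: parallelism pays off once per-record work, not coordination, is the bottleneck.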

Flink: error running the WordCount example on a remote cluster

怎甘沉沦 submitted on 2019-12-13 03:58:12
Question: I have a Flink cluster on VirtualBox consisting of three nodes, 1 master and 2 slaves. I customized the WordCount example and created a fat JAR to run it on the remote VirtualBox Flink cluster, but I hit an error. Note: I imported the dependencies into the project manually (using IntelliJ IDEA) and did not use Maven as a dependency provider. I tested my code on my local machine and it worked fine. More details follow. Here is my Java code: import org.apache.flink.api.common.functions.FlatMapFunction; import…

Unable to run a python flink application on cluster

允我心安 submitted on 2019-12-13 03:57:40
Question: I am trying to run a Python Flink application on a standalone Flink cluster. The application works fine on a single-node cluster, but it throws the following error on a multi-node cluster: java.lang.Exception: The user defined 'open()' method caused an exception: An error occurred while copying the file . Please help me resolve this problem. Thank you. The application I am trying to execute has the following code: from flink.plan.Environment import get_environment from flink.plan.Constants…

Interrupted while joining ioThread / Error during disposal of stream operator in flink application

泪湿孤枕 submitted on 2019-12-13 03:57:07
Question: I have a Flink-based streaming application that uses Apache Kafka sources and sinks. For some days I have been getting exceptions at random times during development, and I have no clue where they are coming from. I am running the app within IntelliJ using the mainRunner class and feeding it messages via Kafka. Sometimes the first message triggers the errors; sometimes it happens only after a few messages. This is how it looks: 16:31:01.935 ERROR o.a.k.c.producer.KafkaProducer -…

Apache flink broadcast state gets flushed

孤街醉人 submitted on 2019-12-13 03:54:00
Question: I am using the broadcast pattern to connect two streams and read data from one into the other. The code looks like this: case class Broadcast() extends BroadcastProcessFunction[MyObject, (String, Double), MyObject] { override def processBroadcastElement(in2: (String, Double), context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context, collector: Collector[MyObject]): Unit = { context.getBroadcastState(broadcastStateDescriptor).put(in2._1, in2._2) } override def processElement(obj:…
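For reference, the broadcast pattern boils down to the sketch below (plain Python standing in for the Scala/Flink types; the `rules` dict plays the role of `ctx.getBroadcastState(descriptor)`). One thing worth remembering when state seems to "get flushed": each parallel instance of the function holds its own copy of the broadcast state, and both process methods must use the same descriptor.

```python
class BroadcastJoin:
    """Plain-Python sketch of Flink's broadcast pattern: the control stream
    updates a rules map, the data stream enriches its elements from it."""

    def __init__(self):
        self.rules = {}  # stands in for ctx.getBroadcastState(descriptor)

    def process_broadcast_element(self, key, value):
        # Control-stream side: store/overwrite the broadcast entry.
        self.rules[key] = value

    def process_element(self, obj):
        # Data-stream side: read-only lookup into the broadcast state.
        return (obj, self.rules.get(obj))

j = BroadcastJoin()
j.process_broadcast_element("price", 9.5)
```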

Effect of increasing parallelism on throughput

百般思念 submitted on 2019-12-13 03:53:57
Question: I ran a job first with parallelism 1 and then with parallelism 3. With parallelism = 1, the Kafka source was reading records at a rate of ~500 records per second. With parallelism = 3, the throughput got divided among the three parallel instances, each reading approximately ~150 records per second. Note that the source is publishing records at a much higher rate (~1000 records per second). Is this expected? I would expect the throughput to increase with parallelism, but it remains the same. I checked…
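One thing worth checking is the Kafka topic's partition count: a source subtask can only read the partitions assigned to it, so total source throughput is bounded by what the partitions can deliver and by what downstream operators can absorb, regardless of source parallelism. A sketch of a modulo-style assignment (similar in spirit to what the Flink Kafka connector does, not its exact code):

```python
def assign_partitions(num_partitions, parallelism):
    """Spread Kafka topic partitions round-robin over source subtasks.
    If the topic has fewer partitions than the source parallelism, the
    extra subtasks sit idle, and adding parallelism cannot raise total
    throughput."""
    assignment = {i: [] for i in range(parallelism)}
    for p in range(num_partitions):
        assignment[p % parallelism].append(p)
    return assignment
```

With 3 partitions and parallelism 3 each subtask reads exactly one partition; if the ~500 records/s cap persists, the bottleneck is elsewhere (broker, network, or backpressure from a downstream operator).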

How to let Flink flush last line to sink when producer(kafka) does not produce new line

≯℡__Kan透↙ submitted on 2019-12-13 03:49:51
Question: When my Flink program is in event-time mode, the sink will not get the last line (say, line A). If I feed a new line (line B) to Flink, I get line A, but I still can't get line B. val env = StreamExecutionEnvironment.getExecutionEnvironment env.setParallelism(1) env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE) env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) val properties = new Properties() properties.setProperty("bootstrap.servers", "localhost:9092") properties…
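This is the expected behaviour of event time: a window only fires once the watermark passes its end, and a typical bounded-out-of-orderness watermark trails the latest event and only advances when a new event arrives. So the last record sits in its window until something later pushes the watermark forward (common workarounds are an idleness timeout on the source or periodically advancing the watermark from processing time). A minimal sketch of why line A waits for line B:

```python
class BoundedOutOfOrderness:
    """Sketch of an event-time watermark generator: the watermark trails
    the highest timestamp seen by a fixed delay and only moves when new
    events arrive. A window [start, end) fires once watermark >= end."""

    def __init__(self, max_delay_ms):
        self.max_delay = max_delay_ms
        self.max_ts = float("-inf")

    def on_event(self, ts):
        self.max_ts = max(self.max_ts, ts)
        return self.max_ts - self.max_delay  # current watermark

wm = BoundedOutOfOrderness(1000)
# Line A at t=5000 only advances the watermark to 4000: its window
# (ending at 6000) cannot fire yet. Line B at t=7000 pushes the
# watermark to 6000, which releases line A — but B now waits in turn.
```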

How to instantiate a MapStateDescriptor in Flink to compute multiple streaming-average queries?

安稳与你 submitted on 2019-12-13 03:45:33
Question: I am trying to compute the average temperature of 3 different rooms, where each room has 3 temperature sensors. I am using Flink (in Java). First I key the sensors by room (A, B, or C), and then I create a RichFlatMapFunction which holds a MapState to save the temperatures until I have 3 measurements. After three measurements I compute the average. In order to use the MapState I need a MapStateDescriptor, which I don't know how to instantiate properly. Can someone…
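For the descriptor itself, Flink's Java API offers constructors such as `new MapStateDescriptor<>("temps", String.class, Double.class)` (a name plus the key and value classes). The surrounding averaging logic can be sketched in plain Python, with a dict standing in for the MapState (the class and method names below are illustrative, not Flink API):

```python
class RoomAverager:
    """Sketch of the keyed flatMap logic: buffer readings per room until
    enough have arrived, then emit the average and clear the buffer (the
    dict stands in for Flink's MapState)."""

    def __init__(self, readings_needed=3):
        self.state = {}            # room -> list of temperatures
        self.needed = readings_needed

    def flat_map(self, room, temperature):
        self.state.setdefault(room, []).append(temperature)
        readings = self.state[room]
        if len(readings) == self.needed:
            self.state[room] = []  # like mapState.clear() for this key
            return sum(readings) / self.needed  # collector.collect(avg)
        return None                # not enough readings yet
```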

Flink session window with onEventTime trigger?

只谈情不闲聊 submitted on 2019-12-13 02:57:55
Question: I want to create an event-time based session window in Flink that triggers when the event time of a new message is more than 180 seconds greater than the event time of the message that created the window. For example: t1 (0 seconds): msg1 <-- the first message, which causes the session window to be created t2 (13 seconds): msg2 t3 (39 seconds): msg3 ... t7 (190 seconds): msg7 <-- the event time t7 is more than 180 seconds after t1 (t7 - t1 = 190), so the window should be…
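Note that Flink's built-in EventTimeSessionWindows measures the gap from the last element of the session, while the rule described here measures it from the first, so a custom window assigner or trigger would be needed. The intended semantics, sketched in plain Python (class name hypothetical):

```python
class FirstElementSession:
    """A session closes when a new event's timestamp is more than `gap`
    seconds after the FIRST event of the session -- unlike Flink's
    built-in session windows, which measure from the last event."""

    def __init__(self, gap_s=180):
        self.gap = gap_s
        self.start = None   # event time of the message that opened the window
        self.buffer = []

    def on_event(self, ts, msg):
        if self.start is not None and ts - self.start > self.gap:
            # Fire the old session; the new message opens the next one.
            fired, self.buffer, self.start = self.buffer, [msg], ts
            return fired
        if self.start is None:
            self.start = ts
        self.buffer.append(msg)
        return None
```

With the example timeline, msg1 (t=0) through msg3 (t=39) accumulate, and msg7 at t=190 fires the session because 190 - 0 > 180.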

Can I use a custom partitioner with group by?

喜欢而已 submitted on 2019-12-13 02:44:10
Question: Let's say that I know my dataset is unbalanced and I know the distribution of the keys. I'd like to leverage this to write a custom partitioner and get the most out of the operator instances. I know about DataStream#partitionCustom. However, if my stream is keyed, will it still work properly? My job would look something like: KeyedDataStream afterCustomPartition = keyedStream.partitionCustom(new MyPartitioner(), new MyPartitionKeySelector()) DataStreamUtils.reinterpretAsKeyedStream…
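Regarding the partitioner itself: a skew-aware scheme typically pins each known hot key to a dedicated channel and hashes the remaining keys over the rest. A sketch (the `hot_key` list and key names are hypothetical; in practice they would come from the known key distribution mentioned above). DataStreamUtils.reinterpretAsKeyedStream exists for exactly this kind of follow-up, with the caveat that the custom partitioning must route records consistently with the key selector or keyed state breaks.

```python
def skew_aware_partition(key, num_partitions, hot_keys=("hot_key",)):
    """Custom partitioner sketch for a skewed key distribution: known hot
    keys each get a dedicated partition; everything else is hashed over
    the remaining partitions."""
    if key in hot_keys:
        return hot_keys.index(key)          # dedicated slot per hot key
    n_cold = num_partitions - len(hot_keys)
    return len(hot_keys) + hash(key) % n_cold
```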