apache-flink

How does Apache Flink deal with skewed data?

Submitted by 我们两清 on 2019-12-04 11:13:56
For example, I have a big stream of words and want to count each word. The problem is that these words are skewed: the frequency of some words is very high, while that of most other words is low. In Storm, we could solve this by first doing a shuffle grouping on the stream, counting words locally in each node within a time window, and finally merging the local counts into cumulative results. From another question of mine, I know that Flink only supports windows on a keyed stream; otherwise the window operation will not be parallel. My question is: is there a good way to solve this kind of skew in Flink?
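
A two-phase ("salted") aggregation is one common way to express the Storm pattern in Flink: pre-aggregate on an artificial (word, salt) key so that hot words are spread over several parallel windows, then combine the partial counts keyed by the word alone. The Scala sketch below is only an illustration of the idea, not code from the question; the source, the window size, and the salt range of 8 are arbitrary placeholders.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import scala.util.Random

object SkewedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // placeholder source; replace with the real word stream
    val words: DataStream[String] = env.fromElements("the", "the", "flink", "the")

    val partialCounts = words
      .map(w => (w, Random.nextInt(8), 1L))          // attach a random salt to spread hot keys
      .keyBy(t => (t._1, t._2))                      // key by (word, salt): parallel pre-aggregation
      .timeWindow(Time.seconds(10))
      .reduce((a, b) => (a._1, a._2, a._3 + b._3))   // partial count per (word, salt) and window

    val totals = partialCounts
      .map(t => (t._1, t._3))
      .keyBy(_._1)                                   // key by word only; far fewer records arrive here
      .sum(1)                                        // running cumulative count per word

    totals.print()
    env.execute("skewed word count")
  }
}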

Canceling Apache Flink job from the code

Submitted by 元气小坏坏 on 2019-12-04 11:07:36
I am in a situation where I want to stop/cancel the Flink job from the code. This is in my integration test, where I submit a task to my Flink job and check the result. Since the job runs asynchronously, it doesn't stop even when the test fails/passes. I want the job to stop after the test is over. I tried a few things, which I am listing below: get the JobManager actor, get the running jobs, and for each running job send a cancel request to the JobManager. This, of course, is not working, but I am not sure whether the JobManager ActorRef is wrong or something else is missing. The error I get is :
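
For reference, newer Flink versions (1.10+) expose a much simpler handle for this than talking to the JobManager actor directly: executeAsync() returns a JobClient that can cancel the job. The sketch below shows that approach under those assumptions; it is not a fix for the ActorRef error above, and the pipeline is just a placeholder for the job under test.

import org.apache.flink.streaming.api.scala._

object CancelFromTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // placeholder pipeline that keeps running long enough to be cancelled;
    // a real test would build the job under test here
    env.fromElements(1, 2, 3)
      .map { x => Thread.sleep(10000); x }
      .print()

    // executeAsync() submits the job and returns immediately with a handle
    val jobClient = env.executeAsync("integration-test-job")

    // ... run test assertions while the job is running ...

    // ask the cluster to cancel the job and block until the request completes
    jobClient.cancel().get()
  }
}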

flink kafka consumer groupId not working

Submitted by 生来就可爱ヽ(ⅴ&lt;●) on 2019-12-04 08:37:25
I am using Kafka with Flink. In a simple program, I used Flink's FlinkKafkaConsumer09 and assigned a group id to it. According to Kafka's behavior, when I run 2 consumers on the same topic with the same group.id, it should work like a message queue. I think it's supposed to work like this: if 2 messages are sent to Kafka, the two Flink programs together would process the 2 messages exactly twice in total (let's say 2 lines of output in total). But the actual result is that each program receives both messages. I have tried to use the consumer client that came with the Kafka server download. It worked in

How to support multiple KeyBy in Flink

Submitted by 拥有回忆 on 2019-12-04 07:34:38
In the code sample below, I am trying to get a stream of employee records {Country, Employer, Name, Salary, Age} and output the highest-paid employee in every country. Unfortunately, keying by multiple fields doesn't work: only KeyBy(Employer) takes effect, so I don't get the correct result. What am I missing? StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<Employee> streamEmployee = env.addSource( new FlinkKafkaConsumer010<ObjectNode>("flink-demo", new JSONDeserializationSchema(), properties)) .map(new MapFunction<ObjectNode, Employee>() { private static final
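
For what it's worth, a composite key can be expressed as a single keyBy with a key selector, and for this particular goal it is usually enough to key by country alone and take the running maximum by salary. The sketch below is in Scala (the question's own code is Java) and the field names and sample records are assumptions:

import org.apache.flink.streaming.api.scala._

case class Employee(country: String, employer: String, name: String, salary: Double, age: Int)

object HighestPaid {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val employees: DataStream[Employee] = env.fromElements(
      Employee("US", "Acme", "Alice", 120000, 40),
      Employee("US", "Globex", "Bob", 150000, 35),
      Employee("DE", "Acme", "Carol", 90000, 30))

    // a composite key: key by country AND employer together via one key selector
    val byCountryAndEmployer = employees.keyBy(e => (e.country, e.employer))

    // for "highest paid employee per country", key by country only and track the max salary
    val topPerCountry = employees
      .keyBy(_.country)
      .maxBy("salary")

    topPerCountry.print()
    env.execute("highest paid per country")
  }
}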

Kafka & Flink duplicate messages on restart

Submitted by 馋奶兔 on 2019-12-04 07:20:13
First of all, this is very similar to "Kafka consuming the latest message again when I rerun the Flink consumer", but it's not the same. The answer to that question does NOT appear to solve my problem; if I missed something in that answer, then please rephrase it, as I clearly missed something. The problem is exactly the same, though: Flink (the Kafka connector) re-runs the last 3-9 messages it saw before it was shut down. My versions: Flink 1.1.2, Kafka 0.9.0.1, Scala 2.11.7, Java 1.8.0_91. My code: import java.util.Properties import org.apache.flink.streaming.api.windowing.time.Time import
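
One thing worth ruling out, hedged because the full code above is cut off: with this connector, offsets are committed to Kafka on completed checkpoints when checkpointing is enabled, and otherwise only via Kafka's periodic auto-commit, so a shutdown between commits re-reads the last few messages on restart unless the job is resumed from a checkpoint or savepoint. A minimal sketch of enabling checkpointing (interval and job body are placeholders):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedKafkaJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // checkpoint every 5 seconds; the Kafka connector commits its offsets as
    // part of each completed checkpoint, and a restore from a checkpoint or
    // savepoint resumes from exactly those offsets
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)

    // ... build the Kafka source and the rest of the pipeline here ...
    env.fromElements("placeholder").print() // stand-in so the sketch runs

    env.execute("checkpointed kafka job")
  }
}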

Read data from Cassandra for processing in Flink

Submitted by 依然范特西╮ on 2019-12-04 03:51:24
Question: I have to process data streams from Kafka using Flink as the streaming engine. To do the analysis on the data, I need to query some tables in Cassandra. What is the best way to do this? I have been looking for Scala examples for such cases, but I couldn't find any. How can data from Cassandra be read in Flink using Scala as the programming language? "Read & write data into cassandra using apache flink Java API" is another question along the same lines; it has multiple approaches mentioned in
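
One straightforward pattern (a sketch, not taken from the linked question) is to open a Cassandra session per parallel task in a RichMapFunction and look rows up as records flow past. It assumes the DataStax Java driver is on the classpath, and the contact point, keyspace, table, and column names below are made up:

import com.datastax.driver.core.{Cluster, Session}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// enrich each incoming key with a value looked up in Cassandra
class CassandraLookup extends RichMapFunction[String, (String, String)] {
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _

  override def open(parameters: Configuration): Unit = {
    // one connection per parallel task, opened when the task starts
    cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    session = cluster.connect("my_keyspace")
  }

  override def map(key: String): (String, String) = {
    val row = session.execute("SELECT value FROM lookup_table WHERE id = ?", key).one()
    (key, if (row != null) row.getString("value") else null)
  }

  override def close(): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}

The function would be attached with something like kafkaStream.map(new CassandraLookup); for higher throughput, Flink's async I/O API (AsyncDataStream) combined with the driver's executeAsync is the usual refinement, and for batch-style reads the flink-connector-cassandra module's CassandraInputFormat can be used instead.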

How to use Flink's KafkaSource in Scala?

Submitted by 不问归期 on 2019-12-04 03:43:55
Question: I'm trying to run a simple test program with Flink's KafkaSource. I'm using the following: Flink 0.9, Scala 2.10.4, Kafka 0.8.2.1. I followed the docs to test KafkaSource (added the dependency and bundled the Kafka connector flink-connector-kafka in the plugin) as described here and here. Below is my simple test program: import org.apache.flink.streaming.api.scala._ import org.apache.flink.streaming.connectors.kafka object TestKafka { def main(args: Array[String]) { val env = StreamExecutionEnvironment
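
The connector class names changed across releases, so as a hedge: the sketch below uses the FlinkKafkaConsumer08 connector from the later Flink 1.x line (which matches Kafka 0.8.2.1) rather than the 0.9-era KafkaSource class from the question, and the topic, group id, and addresses are placeholders.

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object TestKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("zookeeper.connect", "localhost:2181") // required by the 0.8 consumer
    // group.id is used for committing offsets; Flink assigns partitions to its
    // own parallel subtasks instead of using Kafka's group coordination
    props.setProperty("group.id", "test-group")

    val stream = env.addSource(
      new FlinkKafkaConsumer08[String]("test-topic", new SimpleStringSchema(), props))

    stream.print()
    env.execute("kafka test")
  }
}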

How to count unique words in a stream?

Submitted by 做~自己de王妃 on 2019-12-04 00:18:21
Question: Is there a way to count the number of unique words in a stream with Flink Streaming? The result would be a stream of numbers that keeps increasing. Answer 1: You can solve the problem by storing all words which you've already seen. With this knowledge you can filter out all duplicate words. The rest can then be counted by a map operator with parallelism 1. The following code snippet does exactly that. val env = StreamExecutionEnvironment.getExecutionEnvironment val inputStream = env
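
Since the snippet above is cut off, here is a hedged reconstruction of the described approach in Scala: keyed state remembers which words have already been seen, duplicates are filtered out, and a single parallelism-1 map keeps the running count (left un-checkpointed to keep the sketch short).

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

object UniqueWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[String] = env.fromElements("to", "be", "or", "not", "to", "be")

    // step 1: per-word state, so each word is emitted only the first time it is seen
    val firstOccurrences = words
      .keyBy(w => w)
      .flatMapWithState[String, Boolean] { (word, seen: Option[Boolean]) =>
        if (seen.isDefined) (Seq.empty, seen)   // duplicate: drop it
        else (Seq(word), Some(true))            // first occurrence: emit and remember
      }

    // step 2: count the surviving (unique) words with a single parallelism-1 counter
    val uniqueCount = firstOccurrences
      .map(new RichMapFunction[String, Long] {
        private var count = 0L                  // not checkpointed in this sketch
        override def map(word: String): Long = { count += 1; count }
      })
      .setParallelism(1)

    uniqueCount.print()
    env.execute("unique word count")
  }
}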

Apache Flink: Where do State Backends keep the state?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-03 21:54:33
I came across the statement below: "Depending on your state backend, Flink can also manage the state for the application, meaning Flink deals with the memory management (possibly spilling to disk if necessary) to allow applications to hold very large state." https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state_backends.html Does it mean that only when the state backend is configured to RocksDBStateBackend will the state be kept in memory and possibly spilled to disk if necessary? And that if it is configured to MemoryStateBackend or FsStateBackend, the state is only kept in memory and
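
To make the distinction concrete: the heap-based backends (MemoryStateBackend, FsStateBackend) keep working state as objects on the TaskManager JVM heap and do not spill it to disk, while RocksDBStateBackend keeps working state in an embedded RocksDB instance backed by local disk, which is what allows state larger than memory. A minimal configuration sketch, with a placeholder checkpoint path:

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object StateBackendConfig {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // heap-based backend: working state stays on the TaskManager JVM heap,
    // checkpoints are written to the given path
    env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"))

    // alternative (needs the flink-statebackend-rocksdb dependency):
    // org.apache.flink.contrib.streaming.state.RocksDBStateBackend keeps
    // working state in embedded RocksDB (memory + local disk), so it can grow
    // beyond the heap; checkpoints still go to a file system path, e.g.
    // env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"))

    env.fromElements(1, 2, 3).print() // placeholder pipeline
    env.execute("state backend demo")
  }
}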

Share state among operators in Flink

Submitted by 自闭症网瘾萝莉.ら on 2019-12-03 21:20:07
Question: I wonder whether it is possible in Flink to share state among operators. Say, for instance, that I have partitioning by key on an operator and I need a piece of state of partition A inside partition C (for any reason) (fig 1.a), or I need the state of operator C in downstream operator F (fig 1.b). I know it is possible to broadcast records to all partitions, so if you include the internal state of an operator inside the records, you can share your internal state with downstream operators.
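
Since Flink 1.5 this "ship state inside the records" workaround has a first-class form, broadcast state: one stream is broadcast to every parallel instance of a downstream operator, and each instance keeps an identical copy of a map state that it can read while processing the main stream. A hedged Scala sketch with made-up element types and names:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object SharedStateSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events: DataStream[String] = env.fromElements("a", "b", "c")   // main data
    val updates: DataStream[(String, String)] =                        // "state" to share
      env.fromElements(("a", "rule-1"), ("b", "rule-2"))

    val descriptor = new MapStateDescriptor[String, String](
      "shared-rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

    // every parallel instance of the downstream operator receives every update,
    // so they all hold the same copy of this map state
    val broadcastUpdates = updates.broadcast(descriptor)

    val enriched = events
      .connect(broadcastUpdates)
      .process(new BroadcastProcessFunction[String, (String, String), String] {

        // read-only access to the broadcast state on the main stream
        override def processElement(value: String, ctx: ReadOnlyContext,
                                    out: Collector[String]): Unit = {
          val rule = ctx.getBroadcastState(descriptor).get(value)
          out.collect(s"$value -> $rule")
        }

        // broadcast side updates the shared map state on every instance
        override def processBroadcastElement(update: (String, String), ctx: Context,
                                             out: Collector[String]): Unit = {
          ctx.getBroadcastState(descriptor).put(update._1, update._2)
        }
      })

    enriched.print()
    env.execute("broadcast state demo")
  }
}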