apache-flink

How does Apache Flink deal with skewed data?

Submitted by 我们两清 on 2019-12-04 11:13:56
For example, I have a big stream of words and want to count each word. The problem is that these words are skewed: the frequency of some words is very high, while that of most other words is low. In Storm, we could solve this by first doing a shuffle grouping on the stream, counting words locally in each node within a time window, and finally merging the local counts into cumulative results. From another question of mine, I know that Flink only supports windows on a keyed stream; otherwise the window operation will not be parallel. My question is: is there a good way to solve this kind of skew in Flink?
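
A two-phase ("salted") aggregation is one common way to express the Storm pattern in Flink: pre-aggregate on an artificial (word, salt) key so that hot words are spread over several parallel windows, then combine the partial counts keyed by the word alone. The Scala sketch below is only an illustration of the idea, not code from the question; the source, the window size, and the salt range of 8 are arbitrary placeholders.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import scala.util.Random

object SkewedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // placeholder source; replace with the real word stream
    val words: DataStream[String] = env.fromElements("the", "the", "flink", "the")

    val partialCounts = words
      .map(w => (w, Random.nextInt(8), 1L))          // attach a random salt to spread hot keys
      .keyBy(t => (t._1, t._2))                      // key by (word, salt): parallel pre-aggregation
      .timeWindow(Time.seconds(10))
      .reduce((a, b) => (a._1, a._2, a._3 + b._3))   // partial count per (word, salt) and window

    val totals = partialCounts
      .map(t => (t._1, t._3))
      .keyBy(_._1)                                   // key by word only; far fewer records arrive here
      .sum(1)                                        // running cumulative count per word

    totals.print()
    env.execute("skewed word count")
  }
}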

Canceling Apache Flink job from the code

Submitted by 元气小坏坏 on 2019-12-04 11:07:36
I am in a situation where I want to stop/cancel the Flink job from the code. This is in my integration test, where I submit a task to my Flink job and check the result. Since the job runs asynchronously, it doesn't stop even when the test fails/passes. I want the job to stop after the test is over. I tried a few things, which I am listing below: get the JobManager actor, get the running jobs, and for each running job send a cancel request to the JobManager. This, of course, is not working, but I am not sure whether the JobManager ActorRef is wrong or something else is missing. The error I get is :
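
For reference, newer Flink versions (1.10+) expose a much simpler handle for this than talking to the JobManager actor directly: executeAsync() returns a JobClient that can cancel the job. The sketch below shows that approach under those assumptions; it is not a fix for the ActorRef error above, and the pipeline is just a placeholder for the job under test.

import org.apache.flink.streaming.api.scala._

object CancelFromTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // placeholder pipeline that keeps running long enough to be cancelled;
    // a real test would build the job under test here
    env.fromElements(1, 2, 3)
      .map { x => Thread.sleep(10000); x }
      .print()

    // executeAsync() submits the job and returns immediately with a handle
    val jobClient = env.executeAsync("integration-test-job")

    // ... run test assertions while the job is running ...

    // ask the cluster to cancel the job and block until the request completes
    jobClient.cancel().get()
  }
}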

flink kafka consumer groupId not working

Submitted by 生来就可爱ヽ(ⅴ&lt;●) on 2019-12-04 08:37:25
I am using Kafka with Flink. In a simple program, I used Flink's FlinkKafkaConsumer09 and assigned a group id to it. According to Kafka's behavior, when I run 2 consumers on the same topic with the same group.id, it should work like a message queue. I think it's supposed to work like this: if 2 messages are sent to Kafka, the two Flink programs together would process the 2 messages exactly twice in total (let's say 2 lines of output in total). But the actual result is that each program receives both messages. I have tried to use the consumer client that came with the Kafka server download. It worked in

How to support multiple KeyBy in Flink

Submitted by 拥有回忆 on 2019-12-04 07:34:38
In the code sample below, I am trying to get a stream of employee records {Country, Employer, Name, Salary, Age} and output the highest-paid employee in every country. Unfortunately, keying by multiple fields doesn't work: only KeyBy(Employer) takes effect, so I don't get the correct result. What am I missing? StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<Employee> streamEmployee = env.addSource( new FlinkKafkaConsumer010<ObjectNode>("flink-demo", new JSONDeserializationSchema(), properties)) .map(new MapFunction<ObjectNode, Employee>() { private static final
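
For what it's worth, a composite key can be expressed as a single keyBy with a key selector, and for this particular goal it is usually enough to key by country alone and take the running maximum by salary. The sketch below is in Scala (the question's own code is Java) and the field names and sample records are assumptions:

import org.apache.flink.streaming.api.scala._

case class Employee(country: String, employer: String, name: String, salary: Double, age: Int)

object HighestPaid {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val employees: DataStream[Employee] = env.fromElements(
      Employee("US", "Acme", "Alice", 120000, 40),
      Employee("US", "Globex", "Bob", 150000, 35),
      Employee("DE", "Acme", "Carol", 90000, 30))

    // a composite key: key by country AND employer together via one key selector
    val byCountryAndEmployer = employees.keyBy(e => (e.country, e.employer))

    // for "highest paid employee per country", key by country only and track the max salary
    val topPerCountry = employees
      .keyBy(_.country)
      .maxBy("salary")

    topPerCountry.print()
    env.execute("highest paid per country")
  }
}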

Kafka & Flink duplicate messages on restart

Submitted by 馋奶兔 on 2019-12-04 07:20:13
First of all, this is very similar to "Kafka consuming the latest message again when I rerun the Flink consumer", but it's not the same. The answer to that question does NOT appear to solve my problem; if I missed something in that answer, then please rephrase it, as I clearly missed something. The problem is exactly the same, though: Flink (the Kafka connector) re-runs the last 3-9 messages it saw before it was shut down. My versions: Flink 1.1.2, Kafka 0.9.0.1, Scala 2.11.7, Java 1.8.0_91. My code: import java.util.Properties import org.apache.flink.streaming.api.windowing.time.Time import
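
One thing worth ruling out, hedged because the full code above is cut off: with this connector, offsets are committed to Kafka on completed checkpoints when checkpointing is enabled, and otherwise only via Kafka's periodic auto-commit, so a shutdown between commits re-reads the last few messages on restart unless the job is resumed from a checkpoint or savepoint. A minimal sketch of enabling checkpointing (interval and job body are placeholders):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedKafkaJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // checkpoint every 5 seconds; the Kafka connector commits its offsets as
    // part of each completed checkpoint, and a restore from a checkpoint or
    // savepoint resumes from exactly those offsets
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)

    // ... build the Kafka source and the rest of the pipeline here ...
    env.fromElements("placeholder").print() // stand-in so the sketch runs

    env.execute("checkpointed kafka job")
  }
}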

Read data from Cassandra for processing in Flink

Submitted by 依然范特西╮ on 2019-12-04 03:51:24
Question: I have to process data streams from Kafka using Flink as the streaming engine. To do the analysis on the data, I need to query some tables in Cassandra. What is the best way to do this? I have been looking for Scala examples for such cases, but I couldn't find any. How can data from Cassandra be read in Flink using Scala as the programming language? "Read & write data into cassandra using apache flink Java API" is another question along the same lines; it has multiple approaches mentioned in
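
One straightforward pattern (a sketch, not taken from the linked question) is to open a Cassandra session per parallel task in a RichMapFunction and look rows up as records flow past. It assumes the DataStax Java driver is on the classpath, and the contact point, keyspace, table, and column names below are made up:

import com.datastax.driver.core.{Cluster, Session}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// enrich each incoming key with a value looked up in Cassandra
class CassandraLookup extends RichMapFunction[String, (String, String)] {
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _

  override def open(parameters: Configuration): Unit = {
    // one connection per parallel task, opened when the task starts
    cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    session = cluster.connect("my_keyspace")
  }

  override def map(key: String): (String, String) = {
    val row = session.execute("SELECT value FROM lookup_table WHERE id = ?", key).one()
    (key, if (row != null) row.getString("value") else null)
  }

  override def close(): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}

The function would be attached with something like kafkaStream.map(new CassandraLookup); for higher throughput, Flink's async I/O API (AsyncDataStream) combined with the driver's executeAsync is the usual refinement, and for batch-style reads the flink-connector-cassandra module's CassandraInputFormat can be used instead.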

How to use Flink's KafkaSource in Scala?

Submitted by 不问归期 on 2019-12-04 03:43:55
Question: I'm trying to run a simple test program with Flink's KafkaSource. I'm using the following: Flink 0.9, Scala 2.10.4, Kafka 0.8.2.1. I followed the docs to test KafkaSource (added the dependency and bundled the Kafka connector flink-connector-kafka in the plugin) as described here and here. Below is my simple test program: import org.apache.flink.streaming.api.scala._ import org.apache.flink.streaming.connectors.kafka object TestKafka { def main(args: Array[String]) { val env = StreamExecutionEnvironment
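
The connector class names changed across releases, so as a hedge: the sketch below uses the FlinkKafkaConsumer08 connector from the later Flink 1.x line (which matches Kafka 0.8.2.1) rather than the 0.9-era KafkaSource class from the question, and the topic, group id, and addresses are placeholders.

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object TestKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("zookeeper.connect", "localhost:2181") // required by the 0.8 consumer
    // group.id is used for committing offsets; Flink assigns partitions to its
    // own parallel subtasks instead of using Kafka's group coordination
    props.setProperty("group.id", "test-group")

    val stream = env.addSource(
      new FlinkKafkaConsumer08[String]("test-topic", new SimpleStringSchema(), props))

    stream.print()
    env.execute("kafka test")
  }
}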

How to count unique words in a stream?

Submitted by 做~自己de王妃 on 2019-12-04 00:18:21
Question: Is there a way to count the number of unique words in a stream with Flink Streaming? The result would be a stream of numbers that keeps increasing. Answer 1: You can solve the problem by storing all words which you've already seen. With this knowledge you can filter out all duplicate words. The rest can then be counted by a map operator with parallelism 1. The following code snippet does exactly that. val env = StreamExecutionEnvironment.getExecutionEnvironment val inputStream = env
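
Since the snippet above is cut off, here is a hedged reconstruction of the described approach in Scala: keyed state remembers which words have already been seen, duplicates are filtered out, and a single parallelism-1 map keeps the running count (left un-checkpointed to keep the sketch short).

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.streaming.api.scala._

object UniqueWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[String] = env.fromElements("to", "be", "or", "not", "to", "be")

    // step 1: per-word state, so each word is emitted only the first time it is seen
    val firstOccurrences = words
      .keyBy(w => w)
      .flatMapWithState[String, Boolean] { (word, seen: Option[Boolean]) =>
        if (seen.isDefined) (Seq.empty, seen)   // duplicate: drop it
        else (Seq(word), Some(true))            // first occurrence: emit and remember
      }

    // step 2: count the surviving (unique) words with a single parallelism-1 counter
    val uniqueCount = firstOccurrences
      .map(new RichMapFunction[String, Long] {
        private var count = 0L                  // not checkpointed in this sketch
        override def map(word: String): Long = { count += 1; count }
      })
      .setParallelism(1)

    uniqueCount.print()
    env.execute("unique word count")
  }
}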

Apache Flink: Where do State Backends keep the state?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-03 21:54:33
I came across the statement below: "Depending on your state backend, Flink can also manage the state for the application, meaning Flink deals with the memory management (possibly spilling to disk if necessary) to allow applications to hold very large state." https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state_backends.html Does it mean that only when the state backend is configured to RocksDBStateBackend will the state be kept in memory and possibly spilled to disk if necessary? And that if it is configured to MemoryStateBackend or FsStateBackend, the state is only kept in memory and
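
To make the distinction concrete: the heap-based backends (MemoryStateBackend, FsStateBackend) keep working state as objects on the TaskManager JVM heap and do not spill it to disk, while RocksDBStateBackend keeps working state in an embedded RocksDB instance backed by local disk, which is what allows state larger than memory. A minimal configuration sketch, with a placeholder checkpoint path:

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object StateBackendConfig {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // heap-based backend: working state stays on the TaskManager JVM heap,
    // checkpoints are written to the given path
    env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"))

    // alternative (needs the flink-statebackend-rocksdb dependency):
    // org.apache.flink.contrib.streaming.state.RocksDBStateBackend keeps
    // working state in embedded RocksDB (memory + local disk), so it can grow
    // beyond the heap; checkpoints still go to a file system path, e.g.
    // env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"))

    env.fromElements(1, 2, 3).print() // placeholder pipeline
    env.execute("state backend demo")
  }
}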

Share state among operators in Flink

Submitted by 自闭症网瘾萝莉.ら on 2019-12-03 21:20:07
Question: I wonder whether it is possible in Flink to share state among operators. Say, for instance, that I have partitioning by key on an operator and I need a piece of state of partition A inside partition C (for any reason) (fig 1.a), or I need the state of operator C in downstream operator F (fig 1.b). I know it is possible to broadcast records to all partitions, so if you include the internal state of an operator inside the records, you can share your internal state with downstream operators.
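
Since Flink 1.5 this "ship state inside the records" workaround has a first-class form, broadcast state: one stream is broadcast to every parallel instance of a downstream operator, and each instance keeps an identical copy of a map state that it can read while processing the main stream. A hedged Scala sketch with made-up element types and names:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object SharedStateSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events: DataStream[String] = env.fromElements("a", "b", "c")   // main data
    val updates: DataStream[(String, String)] =                        // "state" to share
      env.fromElements(("a", "rule-1"), ("b", "rule-2"))

    val descriptor = new MapStateDescriptor[String, String](
      "shared-rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

    // every parallel instance of the downstream operator receives every update,
    // so they all hold the same copy of this map state
    val broadcastUpdates = updates.broadcast(descriptor)

    val enriched = events
      .connect(broadcastUpdates)
      .process(new BroadcastProcessFunction[String, (String, String), String] {

        // read-only access to the broadcast state on the main stream
        override def processElement(value: String, ctx: ReadOnlyContext,
                                    out: Collector[String]): Unit = {
          val rule = ctx.getBroadcastState(descriptor).get(value)
          out.collect(s"$value -> $rule")
        }

        // broadcast side updates the shared map state on every instance
        override def processBroadcastElement(update: (String, String), ctx: Context,
                                             out: Collector[String]): Unit = {
          ctx.getBroadcastState(descriptor).put(update._1, update._2)
        }
      })

    enriched.print()
    env.execute("broadcast state demo")
  }
}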