apache-flink

Apache Flink - Send event if no data was received for x minutes

风流意气都作罢 submitted on 2019-12-01 06:20:46
How can I implement an operator with Flink's DataStream API that sends an event when no data was received from a stream for a certain amount of time?

Such an operator can be implemented using a ProcessFunction.

DataStream<Long> input = env.fromElements(1L, 2L, 3L, 4L);

input
    // use keyBy to have keyed state.
    // NullByteKeySelector will move all data to one task. You can also use other keys.
    .keyBy(new NullByteKeySelector())
    // use the process function with a 60 second timeout
    .process(new TimeOutFunction(60 * 1000));

The TimeOutFunction is defined as follows. In this example it uses processing time
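The rest of the definition is cut off above. As an illustration only (not the original answer's code), a minimal sketch of such a timeout function, written as a KeyedProcessFunction that emits a placeholder value of -1L when no element arrived within the timeout, could look roughly like this; the marker value and state name are assumptions:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeOutFunction extends KeyedProcessFunction<Byte, Long, Long> {

    private final long timeOut;
    // timestamp of the currently registered processing-time timer, kept in keyed state
    private transient ValueState<Long> lastTimer;

    public TimeOutFunction(long timeOut) {
        this.timeOut = timeOut;
    }

    @Override
    public void open(Configuration parameters) {
        lastTimer = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastTimer", Long.class));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        // an element arrived: cancel the previous timer and start a new one
        Long previous = lastTimer.value();
        if (previous != null) {
            ctx.timerService().deleteProcessingTimeTimer(previous);
        }
        long nextTimer = ctx.timerService().currentProcessingTime() + timeOut;
        ctx.timerService().registerProcessingTimeTimer(nextTimer);
        lastTimer.update(nextTimer);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) {
        // this only fires if no element arrived for timeOut milliseconds
        out.collect(-1L); // placeholder "no data received" event
    }
}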

Local Flink config running standalone from IDE

ⅰ亾dé卋堺 submitted on 2019-12-01 03:36:02
Question: If I'd like to run a Flink app locally, directly from within IntelliJ, but I need to specify config params (like fs.hdfs.hdfssite to set up S3 access), is there any other way to provide those config params apart from ExecutionEnvironment.createLocalEnvironment(conf)? What if I want to use StreamExecutionEnvironment.getExecutionEnvironment? Can I have a Flink config in my project and point the local app to it? Is this the proper way to do it? Or would you set up your IDE to submit the app to
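For illustration (this is my own sketch, not the accepted answer): one way to hand such parameters to a local run is to build an org.apache.flink.configuration.Configuration in code and pass it to createLocalEnvironment. The config value below is a placeholder:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalConfigExample {
    public static void main(String[] args) throws Exception {
        // build the config programmatically instead of relying on an external flink-conf.yaml
        Configuration conf = new Configuration();
        conf.setString("fs.hdfs.hdfssite", "/path/to/hdfs-site.xml"); // hypothetical path

        // local environment that picks up the custom configuration
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(1, conf);

        env.fromElements(1L, 2L, 3L).print();

        env.execute("local-config-example");
    }
}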

How to count unique words in a stream?

我与影子孤独终老i submitted on 2019-12-01 03:07:50
Is there a way to count the number of unique words in a stream with Flink Streaming? The result would be a stream of numbers which keeps increasing.

You can solve the problem by storing all words which you've already seen. Having this knowledge you can filter out all duplicate words. The rest can then be counted by a map operator with parallelism 1. The following code snippet does exactly that.

val env = StreamExecutionEnvironment.getExecutionEnvironment
val inputStream = env.fromElements("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")

// filter words out which we have already
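Since the snippet above is cut off, here is a rough Java sketch of the same idea (the original is Scala): keyed state drops words that were already seen, and a single-parallelism map keeps the running count. The class and state names are invented for illustration:

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UniqueWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")
            // key by the word itself so each word's "seen" flag lives in keyed state
            .keyBy(word -> word)
            // let a word pass only the first time it is seen
            .filter(new RichFilterFunction<String>() {
                private transient ValueState<Boolean> seen;

                @Override
                public void open(Configuration parameters) {
                    seen = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("seen", Boolean.class));
                }

                @Override
                public boolean filter(String word) throws Exception {
                    if (seen.value() == null) {
                        seen.update(true);
                        return true;
                    }
                    return false;
                }
            })
            // count the distinct words in a single task so the counter is global
            .map(new RichMapFunction<String, Long>() {
                private long count = 0;

                @Override
                public Long map(String word) {
                    return ++count;
                }
            }).setParallelism(1)
            .print();

        env.execute("unique-word-count");
    }
}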

Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

可紊 submitted on 2019-11-30 23:38:23
I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files. How can I do that?

Fabian Hueske: You can add as many data sinks to a DataSet program as you need. For example, in a program like this:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long, Long>> data = env.readFromCsv(...);

// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");

// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");

You read a DataSet data from a CSV file.

Share state among operators in Flink

核能气质少年 submitted on 2019-11-30 21:57:26
I wonder whether it is possible in Flink to share state among operators. Say, for instance, that an operator is partitioned by key and I need a piece of the state of partition A inside partition C (for whatever reason) (fig 1.a), or that I need the state of operator C in the downstream operator F (fig 1.b). I know it is possible to broadcast records to all partitions. So, if you include the internal state of an operator inside the records, you can share your internal state with downstream operators. However, this could be an expensive operation instead of simply letting op1 specifically ask for op2 state
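As a tiny sketch of the record-broadcasting approach described above (all names and data are invented), broadcast() replicates every record to all parallel instances of the downstream operator, which is exactly the potentially expensive sharing mechanism in question:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BroadcastStateSharingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // pretend these records carry an operator's internal state
        DataStream<String> stateAsRecords = env.fromElements("stateOfPartitionA", "stateOfOperatorC");

        // every downstream parallel instance receives every record
        stateAsRecords
            .broadcast()
            .map(s -> "received shared state: " + s)
            .print();

        env.execute("broadcast-state-sharing-sketch");
    }
}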

Apache Flink Rest-Client Jar-Upload not working

醉酒当歌 submitted on 2019-11-30 15:07:47
I am struggling to automatically deploy new Flink jobs within our CI/CD workflows by using the Flink REST API (which may be found here in the Flink GitHub repository). The documentation only says that the jar upload may be achieved by using /jars/upload, but not how exactly a valid REST request has to be built (which headers, which body type, which authorization, which method, and so on). So I took a look at the Flink dashboard code of the flink/flink-runtime-web project on GitHub and searched for the implementation they used to upload a jar and - Yippie! It's implemented by calling the REST API I
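Since the explanation is cut off, here is a rough, unverified sketch of what such an upload could look like from Java: the Flink documentation describes /jars/upload as a multipart/form-data POST with a jarfile part, so something along these lines should work. The host, port, and jar path are placeholders:

import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlinkJarUpload {
    public static void main(String[] args) throws Exception {
        Path jar = Path.of("target/my-flink-job.jar"); // hypothetical jar location
        String boundary = "flink-upload-" + System.currentTimeMillis();

        // build a minimal multipart/form-data body with a single "jarfile" part
        String head = "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"jarfile\"; filename=\"" + jar.getFileName() + "\"\r\n"
            + "Content-Type: application/x-java-archive\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";

        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(head.getBytes(StandardCharsets.UTF_8));
        body.write(Files.readAllBytes(jar));
        body.write(tail.getBytes(StandardCharsets.UTF_8));

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/jars/upload"))
            .header("Content-Type", "multipart/form-data; boundary=" + boundary)
            .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}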

Apache flink on Kubernetes - Resume job if jobmanager crashes

三世轮回 submitted on 2019-11-30 09:55:27
I want to run a Flink job on Kubernetes, using a (persistent) state backend. It seems like crashing TaskManagers are no issue, as they can ask the JobManager which checkpoint they need to recover from, if I understand correctly. A crashing JobManager seems to be a bit more difficult. On this FLIP-6 page I read that ZooKeeper is needed to know which checkpoint the JobManager needs to use to recover, and for leader election. Seeing as Kubernetes will restart the JobManager whenever it crashes, is there a way for the new JobManager to resume the job without having to set up a ZooKeeper cluster?

apache flink 0.10 how to get the first occurrence of a composite key from an unbounded input dataStream?

爷，独闯天下 submitted on 2019-11-30 09:43:05
I am a newbie with Apache Flink. I have an unbounded data stream in my input (fed into Flink 0.10 via Kafka). I want to get the first occurrence of each primary key (the primary key is the contract_num and the event_dt). These "duplicates" occur nearly immediately after each other. The source system cannot filter this for me, so Flink has to do it.

Here is my input data:

contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C

Here is the output data I want:

A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016
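The question is cut off here. Purely as an illustrative sketch (keyed state in this form did not exist yet in Flink 0.10, so this uses a newer API), one way to keep only the first event per (contract_num, event_dt) is a stateful filter over a composite key; all class and state names below are made up:

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FirstOccurrenceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                Tuple3.of("A1", "2016-02-24 10:25:08", "X"),
                Tuple3.of("A1", "2016-02-24 10:25:08", "Y"),
                Tuple3.of("A1", "2016-02-24 10:25:09", "Z"),
                Tuple3.of("A2", "2016-02-24 10:25:10", "C"))
            // composite key: contract_num plus event_dt
            .keyBy(e -> e.f0 + "|" + e.f1)
            // emit only the first event seen for each key and drop later duplicates
            .filter(new RichFilterFunction<Tuple3<String, String, String>>() {
                private transient ValueState<Boolean> seen;

                @Override
                public void open(Configuration parameters) {
                    seen = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("seen", Boolean.class));
                }

                @Override
                public boolean filter(Tuple3<String, String, String> event) throws Exception {
                    if (seen.value() == null) {
                        seen.update(true);
                        return true;
                    }
                    return false;
                }
            })
            .print();

        env.execute("first-occurrence-sketch");
    }
}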

Read & write data into cassandra using apache flink Java API

时光怂恿深爱的人放手 submitted on 2019-11-30 09:19:33
Question: I intend to use Apache Flink to read/write data into Cassandra. I was hoping to use flink-connector-cassandra, but I don't find good documentation/examples for the connector. Can you please point me to the right way to read and write data from Cassandra using Apache Flink? I only see sink examples, which are purely for writing. Is Apache Flink meant for reading data from Cassandra too, similar to Apache Spark?

Answer 1: I had the same question, and this is what I was looking for. I don't
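Since the answer is cut off, here is a minimal write-side sketch along the lines of the CassandraSink builder shown in the Flink documentation; the keyspace, table, host, and sample data are assumptions, and flink-connector-cassandra must be on the classpath:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

public class CassandraSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // hypothetical data to write; a real job would read from Kafka, files, etc.
        DataStream<Tuple2<String, Long>> wordCounts =
            env.fromElements(Tuple2.of("foo", 1L), Tuple2.of("bar", 2L));

        // write each tuple with the given INSERT statement; keyspace and table are assumptions
        CassandraSink.addSink(wordCounts)
            .setQuery("INSERT INTO example.wordcount (word, count) VALUES (?, ?);")
            .setHost("127.0.0.1")
            .build();

        env.execute("cassandra-sink-sketch");
    }
}

For batch reads, the connector also ships a CassandraInputFormat for the DataSet API; on the streaming side it is, as far as I know, sink-only.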