apache-flink

Flink: how to store state and use in another stream?

Submitted by 不羁岁月 on 2019-12-10 07:06:14
Question: I have a use case for Flink where I need to read information from a file, store each line, and then use this state to filter another stream. I have all of this working right now with the connect operator and a RichCoFlatMapFunction, but it feels overly complicated. Also, I'm concerned that flatMap2 could begin executing before all of the state is loaded from the file:

fileStream
    .connect(partRecordStream.keyBy((KeySelector<PartRecord, String>) partRecord -> partRecord.getPartId()))
    .keyBy(
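
The ordering concern can be handled inside the RichCoFlatMapFunction itself. Below is a minimal sketch of that pattern, assuming hypothetical FileLine and PartRecord types that share a partId key: records arriving on the second input before the matching file line are parked in keyed ListState and flushed once the file side catches up.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<FileLine, PartRecord, PartRecord> {

    private ValueState<FileLine> referenceState;   // file line for this partId, if seen yet
    private ListState<PartRecord> bufferedRecords; // records that arrived before the file line

    @Override
    public void open(Configuration parameters) {
        referenceState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", FileLine.class));
        bufferedRecords = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered", PartRecord.class));
    }

    @Override
    public void flatMap1(FileLine line, Collector<PartRecord> out) throws Exception {
        referenceState.update(line);
        // the file side has arrived: flush anything that was waiting for it
        for (PartRecord record : bufferedRecords.get()) {
            out.collect(record);
        }
        bufferedRecords.clear();
    }

    @Override
    public void flatMap2(PartRecord record, Collector<PartRecord> out) throws Exception {
        if (referenceState.value() != null) {
            out.collect(record);         // state already loaded: emit immediately
        } else {
            bufferedRecords.add(record); // state not loaded yet: buffer
        }
    }
}

An alternative worth noting is the broadcast state pattern discussed further down this page, which avoids the hand-rolled buffering.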

How to handle errors in custom MapFunction correctly?

Submitted by 陌路散爱 on 2019-12-10 03:05:23
Question: I have implemented a MapFunction for my Apache Flink flow. It parses incoming elements and converts them to another format, but sometimes an error can occur (i.e. the incoming data is not valid). I see two possible ways to handle it:
1. Ignore invalid elements, but it seems I can't ignore errors, because for every incoming element I must provide an outgoing element.
2. Split the incoming elements into valid and invalid, but it seems I would need another function for this.
So, I have two questions: How to handle
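
The standard way around MapFunction's one-in/one-out constraint is a ProcessFunction with a side output for the invalid elements. A minimal sketch, assuming a hypothetical Parsed result type and parse() helper:

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SafeParser extends ProcessFunction<String, Parsed> {

    // anonymous subclass so Flink can capture the element type at runtime
    public static final OutputTag<String> INVALID = new OutputTag<String>("invalid-records") {};

    @Override
    public void processElement(String raw, Context ctx, Collector<Parsed> out) {
        try {
            out.collect(parse(raw));   // valid element: emit downstream
        } catch (Exception e) {
            ctx.output(INVALID, raw);  // invalid element: route to the side output
        }
    }

    private Parsed parse(String raw) throws Exception {
        // hypothetical parsing/conversion logic that throws on invalid input
        return new Parsed(raw);
    }
}

The rejected raw elements are then available via getSideOutput(SafeParser.INVALID) on the resulting stream.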

How to support multiple KeyBy in Flink

Submitted by 筅森魡賤 on 2019-12-09 18:22:21
Question: In the code sample below, I am trying to take a stream of employee records {Country, Employer, Name, Salary, Age} and emit the highest-paid employee in every country. Unfortunately, keying by multiple fields doesn't work: only keyBy(Employer) takes effect, so I don't get the correct result. What am I missing?

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Employee> streamEmployee = env.addSource(
        new FlinkKafkaConsumer010<ObjectNode>("flink-demo", new
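
Chained keyBy calls do not compose: each keyBy re-partitions the stream and replaces the previous key, so only the last one is in effect. The usual fix is a single keyBy with a composite key. A sketch, assuming an Employee POJO with the getters used below; note that for the highest-paid employee per country specifically, keying by country alone and taking maxBy("salary") is sufficient.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

streamEmployee
        // one keyBy carrying both fields as a composite key
        .keyBy(new KeySelector<Employee, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> getKey(Employee e) {
                return Tuple2.of(e.getCountry(), e.getEmployer());
            }
        })
        .maxBy("salary")
        .print();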

Apache Flink: Where do State Backends keep the state?

Submitted by 馋奶兔 on 2019-12-09 14:16:20
Question: I came across the statement below: "Depending on your state backend, Flink can also manage the state for the application, meaning Flink deals with the memory management (possibly spilling to disk if necessary) to allow applications to hold very large state." https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state_backends.html Does it mean that only when the state backend is configured as RocksDBStateBackend, the state is kept in memory and possibly spilled to disk if
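
For context, the three built-in backends differ mainly in where live state lives: MemoryStateBackend and FsStateBackend both keep working state as objects on the JVM heap (they differ only in where checkpoints are written), while RocksDBStateBackend keeps working state in local RocksDB files on disk and is therefore the one that can hold state larger than memory. A sketch of selecting it programmatically; the checkpoint URI is a placeholder:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // working state lives in local RocksDB files; checkpoints go to the URI below
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));
        // ... define sources/operators and call env.execute()
    }
}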

Tensorflow transform on beams with flink runner

Submitted by 戏子无情 on 2019-12-08 13:08:32
It may seem stupid, but this is my very first post here, so sorry if I'm doing anything wrong. I am currently building a simple ML pipeline with TFX 0.11 (i.e. tfdv-tft-tfserving) and tensorflow 1.11, using python2.7. I have an apache-flink cluster and I want to use it for TFX. I know the framework behind TFX is apache-beam 2.8, which currently supports flink from the python SDK through a portable runner layer. But the problem is how I can code in TFX (tfdv-tft) using apache-beam with the flink runner through this portable runner concept, as TFX currently seems to only support

Why “broadcast state” can store the dynamic rules however broadcast() operator cannot?

Submitted by 久未见 on 2019-12-08 12:28:38
Question: I was confused about the difference between "broadcast state" and the broadcast() operator, and finally got help from a Flink expert in the following thread: What does it mean that "broadcast state" unblocks the implementation of the "dynamic patterns" feature for Flink's CEP library? In the end the conclusion seems to be that "broadcast state" can store the dynamic rules in the keyed stream via RichCoFlatMap, whereas the broadcast() operator cannot, so may I know how "broadcast state" stores the
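
The difference in one line: broadcast() alone only replicates elements across all parallel instances, while the broadcast state pattern (Flink 1.5+) additionally gives the receiving operator a checkpointed, writable state in which those elements are kept. A sketch, assuming ruleStream and eventStream DataStreams and hypothetical Rule, Event, and Alert types:

import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

MapStateDescriptor<String, Rule> ruleStateDescriptor =
        new MapStateDescriptor<>("rules", String.class, Rule.class);

BroadcastStream<Rule> ruleBroadcastStream = ruleStream.broadcast(ruleStateDescriptor);

eventStream
        .keyBy(event -> event.getKey())
        .connect(ruleBroadcastStream)
        .process(new KeyedBroadcastProcessFunction<String, Event, Rule, Alert>() {
            @Override
            public void processBroadcastElement(Rule rule, Context ctx, Collector<Alert> out)
                    throws Exception {
                // broadcast side: writable state, so dynamic rules can be stored/updated
                ctx.getBroadcastState(ruleStateDescriptor).put(rule.getId(), rule);
            }

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Alert> out)
                    throws Exception {
                // keyed side: read-only view of the same rules
                for (Map.Entry<String, Rule> entry :
                        ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                    // ... match event against entry.getValue() and emit alerts
                }
            }
        });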

Reading from multiple broker kafka with flink

Submitted by 淺唱寂寞╮ on 2019-12-08 11:04:42
Question: I want to read from multiple Kafka brokers in Flink. I have a cluster of 3 machines for Kafka, with the following topic:

Topic:myTopic   PartitionCount:3   ReplicationFactor:1   Configs:
    Topic: myTopic   Partition: 0   Leader: 2   Replicas: 2   Isr: 2
    Topic: myTopic   Partition: 1   Leader: 0   Replicas: 0   Isr: 0
    Topic: myTopic   Partition: 2   Leader: 1   Replicas: 1   Isr: 1

From Flink I execute the following code:

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "x.x.x.x:9092,x.x.x.x:9092,x
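
A consumer needs only a reachable subset of brokers in bootstrap.servers to discover the whole cluster, though listing all three is common practice. A sketch of a FlinkKafkaConsumer010 reading the 3-partition topic; host names are placeholders, and with source parallelism 3 each subtask is assigned one partition:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
properties.setProperty("group.id", "flink-consumer");

DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer010<>("myTopic", new SimpleStringSchema(), properties))
        .setParallelism(3);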

Calculate totals and emit periodically in flink

Submitted by 血红的双手。 on 2019-12-08 09:52:08
Question: I have a stream of events about resources that looks like this:

id, type, count
1, view, 1
1, download, 3
2, view, 1
3, view, 1
1, download, 2
3, view, 1

I am trying to produce stats (totals) per resource, so for a stream like the one above the result should be:

id, views, downloads
1, 1, 5
2, 1, 0
3, 2, 0

Now I wrote a ProcessFunction that calculates the totals like this:

public class CountTotals extends ProcessFunction<Event, ResourceTotals> {
    private ValueState<ResourceTotals> totalsState;
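
One way to complete this: keep the running totals in keyed state and use processing-time timers to emit them periodically. The sketch below reconstructs CountTotals as a KeyedProcessFunction under assumed Event and ResourceTotals types; their constructors and accessors are guesses, not the question's actual code.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CountTotals extends KeyedProcessFunction<String, Event, ResourceTotals> {

    private static final long EMIT_INTERVAL_MS = 60_000; // emit once a minute

    private ValueState<ResourceTotals> totalsState;

    @Override
    public void open(Configuration parameters) {
        totalsState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("totals", ResourceTotals.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<ResourceTotals> out)
            throws Exception {
        ResourceTotals totals = totalsState.value();
        if (totals == null) {
            totals = new ResourceTotals(ctx.getCurrentKey());
            // first event for this key: schedule the first periodic emission
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + EMIT_INTERVAL_MS);
        }
        totals.add(event.getType(), event.getCount());
        totalsState.update(totals);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ResourceTotals> out)
            throws Exception {
        out.collect(totalsState.value());
        // re-register so the totals keep being emitted periodically
        ctx.timerService().registerProcessingTimeTimer(timestamp + EMIT_INTERVAL_MS);
    }
}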

How to sort the union datastream of flink without watermark

Submitted by 霸气de小男生 on 2019-12-08 08:01:42
Question: My Flink flow has multiple data streams, which I merge with the org.apache.flink.streaming.api.datastream.DataStream#union method. The problem is that the resulting data stream is out of order, and I cannot set a window to sort the data in the stream. From "Sorting union of streams to identify user sessions in Apache Flink" I got the answer, but com.liam.learn.flink.example.union.UnionStreamDemo.SortFunction#onTimer is never invoked. Environment info: Flink version 1.7.0. In general, I
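
Event-time timers only fire when the watermark passes their timestamp, and a union of streams that never had timestamps and watermarks assigned never advances the watermark, so onTimer stays silent. A sketch of the usual fix, assuming a hypothetical MyEvent type with a getEventTime() accessor:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// assign timestamps and watermarks after the union so the merged stream
// carries a watermark that can trigger event-time timers downstream
DataStream<MyEvent> merged = streamA.union(streamB, streamC)
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(MyEvent event) {
                        return event.getEventTime();
                    }
                });

With the watermark advancing, the SortFunction's timers fire and the buffered elements can be emitted in order.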

Flink: join file with kafka stream

Submitted by 别等时光非礼了梦想. on 2019-12-08 07:54:47
Question: I have a problem I can't really figure out. I have a kafka stream that contains data like this:

{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}

I want to replace 'adId' with another value, 'bookingId'. This value is located in a csv file, but I can't really figure out how to get it working. Here is my mapping csv file:

9001;8
9002;10

So my output would ideally be something like:

{"bookingId":"8", "eventAction":"start",
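
When the mapping file is small and static, the simplest approach is to load it into a map once in open() and rewrite every event in a RichMapFunction. A sketch using Jackson's ObjectNode for the events; the file path is a placeholder:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class AdIdToBookingId extends RichMapFunction<ObjectNode, ObjectNode> {

    private transient Map<String, String> adToBooking;

    @Override
    public void open(Configuration parameters) throws Exception {
        // load the adId -> bookingId mapping once per task instance
        adToBooking = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/path/to/mapping.csv"))) {
            String[] parts = line.split(";"); // e.g. "9001;8"
            adToBooking.put(parts[0], parts[1]);
        }
    }

    @Override
    public ObjectNode map(ObjectNode event) {
        String bookingId = adToBooking.get(event.get("adId").asText());
        event.remove("adId");
        event.put("bookingId", bookingId);
        return event;
    }
}

If the mapping can change over time, the broadcast state pattern from the earlier question on this page is the more robust alternative.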